Harvesting of Data in URBANITE

The significance and heterogeneous nature of the data sources encountered in URBANITE have been outlined in a previous blog post[1]. Each pilot city provides a variety of mobility-related data, which form the basis of the visualizations and recommendations presented to policy makers. While these data sources share a common domain, their format, content, and scope vary widely. Additionally, some data sources are not tied to an individual city but offer data on a global scale. Examples include OpenStreetMap[2] and OpenWeatherMap[3], as well as OpenAQ[4], which provides data on air pollution. As their names suggest, these services offer (at least partially) Open Data. Regardless of their nature and origin, all these different kinds of data must be homogenized to be of any use. Furthermore, a distinction must be made between the actual data and its metadata. Harvesting (i.e. fetching, pre-processing, and exporting data for further use) is therefore a non-trivial problem.

For data, the Smart Data Models[5] by FIWARE have been chosen as a common baseline, with particular relevance given to the models from the Smart Cities domain. Data are serialized as NGSI-LD[6], a Linked Data format. For metadata, DCAT-AP[7] is the model of choice, again a concept based on Linked Data. These models were chosen for their Open Source licensing and their coherence: for the data models in particular, it was considered important that all models are developed by a single organization. To summarize, depending on the data source, some or all of the following steps are required before the recommendation engine can start its work (a sketch of the target format follows the list):

  • Data Import
  • Data Preparation
  • Data Transformation
  • Data Export
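
To make the target format concrete, the sketch below shows what a harvested traffic measurement could look like once serialized as NGSI-LD. The entity loosely follows the FIWARE TrafficFlowObserved Smart Data Model; the identifier and all values are invented for illustration.

```typescript
// Hypothetical NGSI-LD entity for a traffic measurement, loosely modelled
// on the FIWARE "TrafficFlowObserved" Smart Data Model. The URN, the
// coordinates, and the measured values are invented for this example.
const trafficObservation = {
  "id": "urn:ngsi-ld:TrafficFlowObserved:bilbao-001", // hypothetical ID
  "type": "TrafficFlowObserved",
  "dateObserved": {
    "type": "Property",
    "value": { "@type": "DateTime", "@value": "2021-06-01T08:00:00Z" },
  },
  "intensity": {
    "type": "Property",
    "value": 197, // vehicles counted during the observation period
  },
  "location": {
    "type": "GeoProperty",
    "value": { "type": "Point", "coordinates": [-2.935, 43.263] },
  },
  "@context": [
    "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld",
  ],
};
```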

One approach to this task would be a monolithic application that incorporates all of these steps and is extended with each new data source. However, this would not scale particularly well and would quickly become convoluted, especially since most of these steps must be tailored to each data source. The solution proposed for URBANITE is therefore a pipeline: one pipeline is set up per data source, and each step is handled by a largely generic module. This process is shown below:

[Figure: Harvesting pipeline — the Scheduler triggers the Importer, which feeds the Preparator, the Transformer, and finally the Exporter]

Each pipeline begins with the automated retrieval of data. Since this data can change over time, a Scheduler triggers each pipeline at set intervals. Because the APIs encountered are so heterogeneous, a dedicated Importer is required for each type. All it does is download the data and pass it on to the next module. The task of the Preparator is then to clean and homogenize the data. This can involve removing invalid values or enriching the data by filling in missing ones. Metadata can also be extracted from the data if the data source does not provide it directly. Next, the Transformer converts the data into the applicable NGSI-LD format and the metadata into the DCAT-AP format; for JSON and XML data this can be achieved with JavaScript and XSLT scripts, respectively. Finally, the Exporter takes the homogenized information passed on by the Transformer and uploads it into the data store and the metadata catalogue.
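
As an illustration of the Transformer step, the sketch below maps a hypothetical raw JSON record (as an Importer might fetch it from a city's traffic API) onto the NGSI-LD shape shown earlier. The input fields and the function name are assumptions made for this post; the actual scripts used in URBANITE are specific to each data source.

```typescript
// Hypothetical raw record as delivered by a city's traffic API.
interface RawTrafficRecord {
  sensorId: string;
  timestamp: string; // ISO 8601
  vehicleCount: number;
  lon: number;
  lat: number;
}

// Transformer sketch: maps the raw record onto an NGSI-LD entity.
// The target model and @context are real; the mapping is illustrative.
function toNgsiLd(record: RawTrafficRecord): object {
  return {
    id: `urn:ngsi-ld:TrafficFlowObserved:${record.sensorId}`,
    type: "TrafficFlowObserved",
    dateObserved: {
      type: "Property",
      value: { "@type": "DateTime", "@value": record.timestamp },
    },
    intensity: { type: "Property", value: record.vehicleCount },
    location: {
      type: "GeoProperty",
      value: { type: "Point", coordinates: [record.lon, record.lat] },
    },
    "@context": ["https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld"],
  };
}

// Example usage with an invented record:
console.log(toNgsiLd({
  sensorId: "sensor-42",
  timestamp: "2021-06-01T08:00:00Z",
  vehicleCount: 197,
  lon: -2.935,
  lat: 43.263,
}));
```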

Each module in the pipeline is agnostic to the others. This ensures loose coupling of the components and allows for flexible orchestration of the pipelines. If a source already provides NGSI-LD-conformant data and DCAT-AP-conformant metadata, the pipeline consists of only an Importer and an Exporter. Likewise, a complex transformation can be split in two by simply running two Transformers one after the other, as sketched below.
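
A pipeline can be pictured as an ordered list of module invocations. The descriptor format below is a simplification invented for this post (Piveau Consus's actual configuration format differs), but it illustrates how loose coupling makes pipelines freely composable.

```typescript
// Simplified pipeline descriptor, invented for illustration. Each segment
// names a module and carries its source-specific configuration.
interface PipelineSegment {
  module: "importer" | "preparator" | "transformer" | "exporter";
  config: Record<string, unknown>;
}

// A source that already delivers NGSI-LD needs only import and export:
const minimalPipeline: PipelineSegment[] = [
  { module: "importer", config: { url: "https://example.org/ngsi-feed" } }, // hypothetical URL
  { module: "exporter", config: { target: "data-store" } },
];

// A complex transformation split across two Transformers run in sequence:
const twoStepPipeline: PipelineSegment[] = [
  { module: "importer",    config: { url: "https://example.org/raw-feed" } },
  { module: "preparator",  config: { dropInvalid: true } },
  { module: "transformer", config: { script: "normalize.js" } },   // hypothetical scripts
  { module: "transformer", config: { script: "to-ngsi-ld.js" } },
  { module: "exporter",    config: { target: "data-store" } },
];
```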

Another key benefit of this approach is scalability, since the number of instances launched can be configured individually for each module. If a transformation turns out to be very complex (and thus a potential bottleneck), multiple Transformers can run in parallel without the other modules having to be scaled up as well.
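
A minimal sketch of what such per-module scaling could look like in a deployment configuration; the field names are invented, but the idea of scaling only the bottleneck module is the point.

```typescript
// Invented scaling configuration: only the Transformer, assumed to be the
// bottleneck for this source, is scaled out; the other modules keep a
// single instance each.
const instancesPerModule: Record<string, number> = {
  importer: 1,
  preparator: 1,
  transformer: 4, // four parallel workers for the expensive transformation
  exporter: 1,
};
```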

Overall, the pipeline approach offers the flexibility and scalability required for handling the heterogeneous data sources encountered today and in the future. The concept is implemented by the Piveau Consus[8] software stack, named after the ancient Roman god of the harvest. In contrast to traditional ETL[9] frameworks, Piveau Consus supports Linked Data, a key requirement for URBANITE. Furthermore, thanks to the loose coupling of its modules, Piveau Consus is a very lightweight solution that is easily adapted to the task at hand.



[1] https://urbanite-project.eu/content/data-sources-urban-mobility-urbanite-project

[2] https://www.openstreetmap.org/

[3] https://openweathermap.org/

[4] https://openaq.org/

[5] https://www.fiware.org/developers/smart-data-models/

[6] https://en.wikipedia.org/wiki/NGSI-LD

[7] https://op.europa.eu/en/web/eu-vocabularies/dcat-ap

[8] https://www.piveau.de/en/

[9] Extract Transform Load