Integrating nested data into knowledge graphs with RML fields Thomas Delva, Dylan Van Assche, Pieter Heyvaert, Ben De Meester, and Anastasia Dimou In Proceedings of the 2nd International Workshop on Knowledge Graph Construction 2021
To support business decisions or improve operational efficiency, heterogeneous data is often integrated into a knowledge graph. This integration can be achieved with one of the existing declarative mapping languages. However, current mapping languages cannot always integrate data with nested structure, such as JSON or XML files or JSON documents stored in a database column. We designed a backwards-compatible extension of the RDF Mapping Language (RML) that enables it to integrate nested data: RML fields. In this paper, we introduce RML fields, compare the extension with the state of the art in mapping languages, and validate it on mapping challenges formulated by the Knowledge Graph Construction W3C community group. Our extension addresses several of the challenges related to nested data that previously could not be solved. With RML fields, even more datasets can be integrated into knowledge graphs, with all the advantages of using a language specially designed for that purpose. Our extension is currently intended to integrate multiple datasets independently, but some use cases require joins or other operations during knowledge graph generation, which we will investigate in the future.
Leveraging Web of Things W3C recommendations for knowledge graphs generation Dylan Van Assche, Gerald Haesendonck, Gertjan De Mulder, Thomas Delva, Pieter Heyvaert, Ben De Meester, and Anastasia Dimou In Proceedings of the 21st International Conference on Web Engineering 2021
Constructing a knowledge graph with mapping languages, such as RML or SPARQL-Generate, allows heterogeneous data to be integrated seamlessly by providing access-specific definitions for, e.g., databases or files. However, such mapping languages have limited support for describing Web APIs and no support for describing data with varying velocities, as needed for, e.g., streams, neither for the input data nor for the output RDF. This hampers the smooth and reproducible generation of knowledge graphs from heterogeneous data and their continuous integration for consumption, since each implementation provides its own extensions. Recently, the Web of Things (WoT) Working Group released a set of recommendations to provide a machine-readable description of metadata and network-facing interfaces for Web APIs and streams. In this paper, we investigated (i) how mapping languages can be aligned with the newly specified recommendations to describe and handle heterogeneous data with varying velocities and Web APIs, and (ii) how such descriptions can be used to indicate how the generated knowledge graph should be exported. We extended RML’s Logical Source to support WoT descriptions of Web APIs and streams, and introduced RML’s Logical Target to describe the generated knowledge graph reusing the same descriptions. We implemented these extensions in the RMLMapper and RMLStreamer, and validated our approach in two use cases. Mapping languages are now able to use the same descriptions to define not only the input data but also the output RDF. This way, our work paves the way towards more reproducible workflows for knowledge graph generation.
Efficient Live Public Transport Data Sharing for Route Planning on the Web Julián Rojas Meléndez, Dylan Van Assche, Harm Delva, Pieter Colpaert, and Ruben Verborgh In Proceedings of the 20th International Conference on Web Engineering 2020
Web-based information services transformed how we interact with public transport. Discovering alternatives to reach destinations and obtaining live updates about them is necessary to optimize journeys and improve the quality of travellers’ experience. However, keeping travellers updated with timely information is demanding. Traditional Web APIs for live public transport data follow a polling approach and allocate all data processing to either data providers, lowering data accessibility, or data consumers, increasing the costs of innovative solutions. Moreover, data processing load increases further because previously obtained route plans are fully recalculated when live updates occur. In-between solutions that share processing load between clients and servers, as well as alternative Web API architectures, have not yet been thoroughly investigated. We study performance trade-offs of polling and push-based Web architectures to efficiently publish and consume live public transport data. We implement (i) alternative architectures that allow sharing data processing load between clients and servers, and evaluate their performance following polling- and push-based approaches; (ii) a rollback mechanism that extends the Connection Scan Algorithm to avoid unnecessary full route plan recalculations upon live updates. Evaluations show that polling is the more efficient alternative in terms of CPU and RAM, but hint towards push-based alternatives when bandwidth is a concern. Clients update route plan results 8–10 times faster with our rollback approach. Smarter API design combining polling and push-based Web interfaces for live public transport data reduces the intrinsic costs of data sharing by equitably distributing the processing load between clients and servers. Future work can investigate more complex multimodal transport scenarios.
Publiceren van updates over openbaarvervoersdata (Publishing updates about public transport data) Dylan Van Assche 2019
Publishing real-time public transport data can be a heavy task for Open Data publishers (RPC APIs) and consumers (data dumps). With Linked Connections we try to find a compromise between both worlds. Since Linked Connections is not yet optimised for real-time data, we investigate how the Linked Connections server can publish real-time data in a cost-efficient way. At the same time, we examine how we can optimise the algorithms that use Linked Connections on the clients. We propose a list of changes to these algorithms to reduce the processing time and computing resources. By using a separate real-time resource on the Linked Connections server, only the updates to the data must be transferred. We want to further reduce the resources that Linked Connections requires on both the client and the server. We achieve this goal by modifying the Connection Scan Algorithm (CSA) and liveboard algorithms on the client. We reduced the processing time of a journey by a factor of 2 to 3 using the CSA. A liveboard could be updated 10 times faster than in the original implementation. Network bandwidth usage is reduced by 15 % for the CSA and by 21 % for the liveboard algorithm. Based on these results, we conclude that a publish-subscribe approach (Server-Sent Events) is more efficient than a polling approach (HTTP polling) for real-time Open Data. Using a real-time resource, the efficiency of the client’s algorithms is significantly improved.