zazuko / knowledge-graph-forum

Knowledge Graph Forum, Switzerland

Semantic and Transparent ETL pipelines #19

Closed katsi closed 1 year ago

katsi commented 1 year ago

What would you like to talk about

At Inter IKEA Systems B.V. we find the traditional approaches of transforming non-graph data sources to RDF, such as RML and R2RML, insufficient for defining data transformations. This is for two reasons:

  1. RML does not provide a way to semantically describe the data source and attach context to it
  2. the data transformation that can be defined in RML is too simple

We have created two ontologies: one for describing data sources and another for defining data transformation functions and mappings. This work builds on previous work by Katariina Kari. One benefit of this approach is that data transformation functions can be defined centrally and reused throughout ETL pipelines.
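To sketch the idea (all `ex:` names below are invented for illustration and are not the actual Inter IKEA ontologies), one ontology describes the source and its context, while the other defines reusable functions and the mappings that apply them:

```turtle
@prefix ex: <http://example.org/etl/> .

# Describe the data source itself, with context RML cannot express:
ex:productFeed a ex:DataSource ;
    ex:format "json" ;
    ex:owner "Product Information Team" ;
    ex:refreshFrequency "daily" .

# A centrally defined, reusable transformation function:
ex:normalizeCountryCode a ex:TransformationFunction ;
    ex:description "Uppercases a two-letter country code" .

# A mapping that applies the function to a field of the source:
ex:countryMapping a ex:Mapping ;
    ex:source ex:productFeed ;
    ex:appliesFunction ex:normalizeCountryCode ;
    ex:targetProperty ex:countryCode .
```

Because `ex:normalizeCountryCode` is a resource of its own, any number of pipelines can reference the same function definition rather than restating the logic per mapping.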

Who might be interested in that (technical audience, management, data scientists, etc.)

This is intended for technical audiences who want to implement their ETL pipelines in a performant and scalable way.

What would you like to know from other participants, what feedback are you looking for

We are looking for other teams that are developing a similar approach. We want to publish this as an open-source project so we are also looking for contributions.

What is your background (name, affiliation, etc)

Katariina Kari, Lead Ontologist, Inter IKEA Systems B.V.

Would you be open to have the session filmed & published later?

This might or might not happen, and it does not influence whether your session is selected. YES/NO answer. YES

VladimirAlexiev commented 1 year ago

Hi @katsi! What you describe in https://github.com/katsi/rml-generator strongly reminds me of the problem I tried to solve with my tool https://github.com/VladimirAlexiev/rdf2rml. It takes this approach: the RDF schema is a simple RDF example, and the source tables/fields are embedded in this example.
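A rough sketch of that "RDF by Example" idea (the class, property, and placeholder names here are illustrative; see the rdf2rml repo for the real conventions): the model is ordinary Turtle describing one example instance, with source column names embedded as `{...}` placeholders, from which the tooling generates the far more verbose R2RML.

```turtle
@prefix schema: <http://schema.org/> .

# An ordinary Turtle instance in which source columns appear as
# placeholders; rdf2rml-style tooling reads this one example and
# emits the corresponding R2RML TriplesMap.
<http://example.org/person/{PERSON_ID}> a schema:Person ;
    schema:name      "{NAME}" ;
    schema:birthDate "{BIRTH_DATE}" ;
    schema:worksFor  <http://example.org/org/{ORG_ID}> .
```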

Indeed RML and R2RML are very verbose (YARRRML partially solves this for RML, but is not visual). To make this one node: image

It takes 15 nodes in R2RML and probably a bit more in RML: image

I have tools that make the diagrams and transform the model to an R2RML script.

These images are from https://rawgit2.com/VladimirAlexiev/my/master/pres/20161128-rdfpuml-rdf2rml/index-full.html#orgac87b06. The full reference is:

@InProceedings{Alexiev2016-rdfpuml-rdf2rml,
  author       = {Vladimir Alexiev},
  title        = {{RDF by Example: rdfpuml for True RDF Diagrams, rdf2rml for R2RML Generation}},
  booktitle    = {Semantic Web in Libraries 2016 (SWIB 2016)},
  year         = 2016,
  month        = nov,
  address      = {Bonn, Germany},
  url_Slides   = {http://rawgit2.com/VladimirAlexiev/my/master/pres/20161128-rdfpuml-rdf2rml/index.html},
  url_HTML     = {http://rawgit2.com/VladimirAlexiev/my/master/pres/20161128-rdfpuml-rdf2rml/index-full.html},
  keywords     = {RDF, visualization, PlantUML, cultural heritage, NLP, NIF, EHRI, R2RML, generation, model-driven, RDF by Example, rdfpuml, rdf2rml},
  url_Video    = {https://youtu.be/4WoYlaGF6DE},
  howpublished = {presentation},
  abstract     = {RDF is a graph data model, so the best way to understand RDF data schemas 
(ontologies, application profiles, RDF shapes) is with a diagram. 
Many RDF visualization tools exist, but they either focus on large graphs 
(where the details are not easily visible), 
or the visualization results are not satisfactory, or manual tweaking of the diagrams is required. 
We describe a tool *rdfpuml* that makes true diagrams directly from Turtle examples 
using PlantUML and GraphViz. 
Diagram readability is of prime concern, and rdfpuml introduces various diagram control mechanisms 
using triples in the puml: namespace. 
Special attention is paid to inlining and visualizing various Reification mechanisms (described with PRV). 
We give examples from Getty CONA, Getty Museum, AAC (mappings of museum data to CIDOC CRM), 
Multisensor (NIF and FrameNet), EHRI (Holocaust Research into Jewish social networks), 
Duraspace (Portland Common Data Model for holding metadata in institutional repositories), 
Video annotation. 

If the example instances include SQL queries and embedded field names, 
they can describe a mapping precisely. 
Another tool *rdf2rml* generates R2RML transformations from such examples, saving about 15x in complexity.},
}

I've submitted a paper "Generation of Declarative Transformations from Semantic Models" to a conference, which describes more recent developments of this tool. It can now generate SPARQL transformations for tabular data (for TARQL and OntoRefine).

Ping me if you'd like to chat.

katsi commented 1 year ago

What I wanted to demonstrate with the RML generator is that taking a different type of mapping approach as an alternative to RML would be more beneficial. On Thursday I will show a new approach that does the following:

...and from those can generate RDF. No need to generate any R2RML or RML.

VladimirAlexiev commented 1 year ago

@katsi Can you send me your presentation, and maybe the video later? It seems the event is only in person: https://eventfrog.ch/de/p/wissenschaft-und-technik/knowledge-graph-forum-2022-en-6972880506569257209.html

Cheers!

ktk commented 1 year ago

@VladimirAlexiev how about simply joining? You are probably in Basel that week anyway.

VladimirAlexiev commented 1 year ago

No, Ilian Uzunov is (life science)

dpriskorn commented 1 year ago

I have a question. What is the goal here?

You have a bunch of JSON data that is strings, not things. Mapping it like in the prior work you linked to does not add a lot of value from my point of view.

You cannot infer much knowledge after doing that mapping, as far as I can see. 🤷‍♂️

In fact, why not reconcile against some semantically rich source instead? In the example you could reconcile BE to https://www.wikidata.org/wiki/Q31, and by doing that you get the continent, the name of the country in many languages, the fact that it is one of the world's few kingdoms, etc.
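To make the reconciliation point concrete, here is a simplified, heavily truncated excerpt of the statements one gets "for free" once the string "BE" is linked to the Wikidata item for Belgium:

```turtle
@prefix wd:   <http://www.wikidata.org/entity/> .
@prefix wdt:  <http://www.wikidata.org/prop/direct/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# A few of the statements attached to wd:Q31 (Belgium):
wd:Q31 wdt:P31  wd:Q3624078 ;   # instance of: sovereign state
       wdt:P30  wd:Q46 ;        # continent: Europe
       wdt:P297 "BE" ;          # ISO 3166-1 alpha-2 code
       rdfs:label "Belgium"@en , "Belgique"@fr , "België"@nl .
```

The `wdt:P297` value is exactly the "BE" string from the source data, which is what makes this particular reconciliation straightforward.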

There is really no need to store all that semantic data locally, from my point of view.

WDYT?

katsi commented 1 year ago

> You have a bunch of json data that is strings not things. Mapping it like in the prior work you linked to does not add a lot of value from my point of view.

The prior work I did was a POC showing that, by describing a data source, we can generate RML. It is just to prove that the alternative approach is as expressive as RML. It has nothing to do with the semantic transformation of the data.

This new approach – the slides for which are now up in this repo – covers the actual semantic data transformation, which is done via centrally stored functions. Please have a look at the slides, and if you have more questions, I am happy to answer them.

ktk commented 1 year ago

I'm locking the issue; this was meant as a CfP, not as a board for discussion by people who did not join the event.