Derive shapes from maps

tpluscode commented 1 year ago

I would like to propose a new feature where minimal SHACL shapes are generated from the mappings. The purpose is to generate a starting point for defining more specific constraints over the output data. For example, given the mapping shown in the language reference:

map AirportMapping from airport {
    subject template "http://airport.example.com/{0}" with id;

    graphs
        template "http://airport.example.com/graph/stop/{0}" with id;
        constant "http://www.w3.org/ns/r2rml#defaultGraph";

    types transit.Stop

    properties
        transit.route from stop with datatype xsd.integer;
        wgs84_pos.lat from latitude;
        wgs84_pos.long from longitude;
}

One would be able to produce a shape with minimal constraints.

<AirportMappingShape>
  a sh:NodeShape ;
  sh:targetClass transit:Stop ;
  sh:property 
    <AirportMappingShape/transit:route> ,
    <AirportMappingShape/wgs84_pos:lat> ,
    <AirportMappingShape/wgs84_pos:long> ;
.

<AirportMappingShape/transit:route>
  sh:path transit:route ;
  sh:datatype xsd:integer ;
  sh:nodeKind sh:Literal ;
.

<AirportMappingShape/wgs84_pos:lat>
  sh:path wgs84_pos:lat ;
  sh:nodeKind sh:Literal ;
.

<AirportMappingShape/wgs84_pos:long>
  sh:path wgs84_pos:long ;
  sh:nodeKind sh:Literal ;
.
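
A minimal sketch of this derivation, for illustration only. It is not the XRM generator: the `Mapping` and `PropertyMapping` classes and the `derive_shapes` function are hypothetical names, prefix declarations are omitted as in the example above, and every property is assumed to be literal-valued.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PropertyMapping:
    predicate: str                   # prefixed name, e.g. "transit:route"
    datatype: Optional[str] = None   # prefixed datatype, if the mapping declares one


@dataclass
class Mapping:
    name: str
    types: List[str]
    properties: List[PropertyMapping] = field(default_factory=list)


def derive_shapes(mapping: Mapping) -> str:
    """Render a node shape plus one named property shape per mapped property."""
    shape = f"{mapping.name}Shape"
    lines = [f"<{shape}>", "  a sh:NodeShape ;"]
    lines += [f"  sh:targetClass {cls} ;" for cls in mapping.types]
    if mapping.properties:
        refs = " ,\n".join(f"    <{shape}/{p.predicate}>" for p in mapping.properties)
        lines.append(f"  sh:property\n{refs} ;")
    lines.append(".")
    for p in mapping.properties:
        lines += ["", f"<{shape}/{p.predicate}>", f"  sh:path {p.predicate} ;"]
        if p.datatype:
            lines.append(f"  sh:datatype {p.datatype} ;")
        # Assumption: every property is column-based and therefore literal-valued.
        lines += ["  sh:nodeKind sh:Literal ;", "."]
    return "\n".join(lines)


airport = Mapping(
    name="AirportMapping",
    types=["transit:Stop"],
    properties=[
        PropertyMapping("transit:route", "xsd:integer"),
        PropertyMapping("wgs84_pos:lat"),
        PropertyMapping("wgs84_pos:long"),
    ],
)
print(derive_shapes(airport))
```

Naming each property shape after the mapping and the predicate keeps the identifiers stable between runs, which is what allows a separate document to attach further constraints to them.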

It's important that property shapes are named nodes, so that they can be extended by adding properties in a separate document and merging the two. Multiple mappings for the same predicate might require sh:or or a different node kind such as sh:IRIOrLiteral.
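
As a rough sketch of the node-kind part of that idea: SHACL defines six node kinds (sh:BlankNode, sh:IRI, sh:Literal, sh:BlankNodeOrIRI, sh:BlankNodeOrLiteral, sh:IRIOrLiteral), so the kinds contributed by several mappings of one predicate can be combined by taking the union of what each allows. The helper name `merge_node_kinds` is hypothetical.

```python
# Each SHACL node kind, expressed as the set of term types it allows.
ALLOWED = {
    "sh:BlankNode": {"blank"},
    "sh:IRI": {"iri"},
    "sh:Literal": {"literal"},
    "sh:BlankNodeOrIRI": {"blank", "iri"},
    "sh:BlankNodeOrLiteral": {"blank", "literal"},
    "sh:IRIOrLiteral": {"iri", "literal"},
}
BY_ALLOWED = {frozenset(terms): kind for kind, terms in ALLOWED.items()}


def merge_node_kinds(kinds):
    """Return the narrowest node kind covering all inputs, or None if any term fits."""
    combined = frozenset().union(*(ALLOWED[kind] for kind in kinds))
    # No single node kind allows all three term types, so the constraint is dropped.
    return BY_ALLOWED.get(combined)


print(merge_node_kinds(["sh:Literal", "sh:IRI"]))
```

When the result is `None`, the derived property shape would simply carry no sh:nodeKind. Differences in datatype or class between the mappings are not covered here and would still need something like sh:or.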

To implement this feature, I propose to slightly adapt (and also simplify) the feature proposed in #115. I will create a draft PR to illustrate.

mchlrch commented 1 year ago

Shapes derived from the mapping don't necessarily describe the output graph of the pipeline, often there are post-processing steps after the mapping.

Nevertheless, there are likely cases for which shapes derived from the mapping are useful (maybe also for troubleshooting pipelines or the mapping itself by validating intermediate results).

Some things to consider, if shapes are derived from the mapping (in general, not related to the proposal in PR https://github.com/zazuko/rdf-mapping-dsl/pull/126 ... more of a "notes-to-self"):

- The mapping might be overspecified and not representative of the resulting data graph (e.g. using an XPath expression that doesn't match anything).
- A mapping block declaring multiple types would result in a shape targeting multiple classes.
- One graph resource can be populated from multiple mapping blocks. In this case, only the sum of the constraints from the resulting NodeShapes would describe the resource (and the derived NodeShapes could not be sh:closed individually).
- Mapping blocks are aligned to input blocks (e.g. a table). One input block can have multiple mapping blocks.
- In the mapping block, we don't have an alias for the property, so the property name would have to be used verbatim. This could become an issue if the generated shapes are extended with statements from a separate document and the schema changes.
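
To illustrate the point about one resource being populated from several mapping blocks, here is a small sketch that unions the types and predicates of all blocks sharing a subject template. Treating an identical subject template as "same resource" is an assumption made for the example, and the second block (`AirportLocationMapping`) is hypothetical.

```python
from collections import defaultdict


def combine_by_subject(blocks):
    """Union types and predicates of all blocks that share a subject template."""
    combined = defaultdict(lambda: {"types": set(), "predicates": set()})
    for block in blocks:
        entry = combined[block["subject"]]
        entry["types"].update(block["types"])
        entry["predicates"].update(block["predicates"])
    return dict(combined)


blocks = [
    {"name": "AirportMapping", "subject": "http://airport.example.com/{0}",
     "types": ["transit:Stop"], "predicates": ["transit:route"]},
    # Hypothetical second block populating the same resources.
    {"name": "AirportLocationMapping", "subject": "http://airport.example.com/{0}",
     "types": [], "predicates": ["wgs84_pos:lat", "wgs84_pos:long"]},
]
combined = combine_by_subject(blocks)
print(combined)
```

Only the combined predicate set could safely back a closed shape. A shape derived from either block alone would reject the properties contributed by the other.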

(Unrelated to this feature request, but related to the last point of the above list) Decoupling the mapping from the schema by pointing from the mapping to shape elements, rather than schema elements, could be an option to facilitate handling schema changes (shape-first, shape-as-contract).

My plan is to make xrm more hackable, in order to unlock possibilities for toolchain improvements outside of the xrm editor itself, like #127 and #128.

mchlrch commented 1 year ago

For one-time scaffolding, introspecting the shapes from the output graph of the pipeline might be an alternative.

Here's a query to illustrate this, based on the construct query that SPEX is running in "introspection" mode. I used this in a customer project.

Note: The query depends on spif: functions, which GraphDB has built in. They need to be replaced to run the query on other stores.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mobi: <https://schema.mobicorp.ch/>
PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 
PREFIX schema: <http://schema.org/>
PREFIX spif: <http://spinrdf.org/spif#>
CONSTRUCT {
    ?nodeShape a sh:NodeShape .
    ?nodeShape sh:targetClass ?cls .
    ?nodeShape sh:property ?propertyShape .
    ?propertyShape a sh:PropertyShape .
    ?propertyShape sh:path ?property .
    ?propertyShape sh:class ?linktype .
    ?propertyShape sh:datatype ?datatype .
} WHERE {
    VALUES ?cls {
        #            mobi:Table
        #            mobi:Column
        mobi:Mitarbeiter
        mobi:Organisationseinheit
    }
    ?subject a ?cls .
    ?subject ?property ?object .
    OPTIONAL {
        ?object a ?linktype .
    }    
    MINUS {
        # --- blacklist ---
        VALUES ?cls {
            rdf:Property
            owl:TransitiveProperty
            owl:SymmetricProperty
            rdf:List
            rdfs:Class
            rdfs:Datatype
            rdfs:ContainerMembershipProperty
            # -------------
            mobi:ArchitektursichtElement
            mobi:OrganisationsElement
            mobi:ProzessElement
            mobi:FunktionsElement
            mobi:IntegrationsElement
            mobi:InformationsElement
            # -------------
            mobi:Informationsobjekt
            mobi:Informationsobjektbeziehung
            mobi:Informationsattribut
            mobi:Rollenbesetzung
            # -------------
            mobi:edc\/UiView
            mobi:edc\/Link
            sh:PropertyShape
            skos:ConceptScheme
            skos:Concept
        } 
        ?subject a ?cls .
    } 
    BIND(DATATYPE(?object) AS ?datatype)
    BIND(spif:buildURI("<urn:NodeShape:{?1}>", spif:encodeURL(str(?cls))) AS ?nodeShape)
    BIND(spif:buildURI("<urn:PropertyShape:{?1}/{?2}>", spif:encodeURL(str(?cls)), spif:encodeURL(str(?property))) AS ?propertyShape)
}
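
For readers without a store that provides spif:, the same introspection idea can be sketched in plain Python over an in-memory list of triples. This is an illustration under assumptions, not a port of the query: the `introspect` function and the `ex:` sample data are hypothetical, literals are modelled as `(lexical, datatype)` tuples, the blacklist is left out, and `urllib.parse.quote` only roughly stands in for spif:encodeURL.

```python
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def introspect(triples, classes):
    """Derive shape descriptions from a list of (s, p, o) triples.

    An object is either an IRI string or a (lexical, datatype) tuple for a literal.
    """
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)

    shapes = {}
    for s, p, o in triples:
        for cls in types.get(s, ()):
            if cls not in classes:
                continue
            cls_id, prop_id = quote(cls, safe=""), quote(p, safe="")
            node = shapes.setdefault(
                f"urn:NodeShape:{cls_id}", {"targetClass": cls, "properties": {}}
            )
            prop = node["properties"].setdefault(
                f"urn:PropertyShape:{cls_id}/{prop_id}",
                {"path": p, "class": set(), "datatype": set()},
            )
            if isinstance(o, tuple):
                prop["datatype"].add(o[1])          # literal: record its datatype
            else:
                prop["class"].update(types.get(o, ()))  # IRI: record its types
    return shapes


triples = [
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:alice", "ex:name", ("Alice", "xsd:string")),
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:acme", RDF_TYPE, "ex:Org"),
]
shapes = introspect(triples, {"ex:Person"})
```

As in the query, a property observed with several datatypes or linked classes accumulates all of them, so the result is a starting point to review rather than a finished set of constraints.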

tpluscode commented 1 year ago

> Shapes derived from the mapping don't necessarily describe the output graph of the pipeline, often there are post-processing steps after the mapping.

Yes, I realised that too while thinking about my proposal. In museumplus it is just like that. The XRM is only a temporary representation and has nothing in common with the final representation.

Maybe I did not mention that precisely, but my idea was that shapes defined in XRM could also be unrelated to the mapping itself.

-node-shape PersonNodeShape from PersonMapping {
+node-shape PersonNodeShape {
}

That way one could take advantage of a simpler syntax, although that would be slightly incomplete without nice support for vocabularies (re #14).

> My plan is to make xrm more hackable

I cannot really comment on that, but I'm intrigued by how hackability helps. Let's discuss that.

mchlrch commented 7 months ago

Paper: RML2SHACL: RDF Generation Is Shaping Up https://lirias.kuleuven.be/retrieve/641696

CC @BenjaminHofstetter

zazuko / xrm

Derive shapes from maps #125