zazuko / rdf-validate-shacl

Validate RDF data purely in JavaScript. An implementation of the W3C SHACL specification on top of the RDFJS stack.
MIT License

How to validate literals based on their datatype IRI? #46

Closed wouterbeek closed 3 years ago

wouterbeek commented 3 years ago

I do not understand how literals should be validated based on their datatype IRI. I make the following observations:

  1. For some literals specifying the datatype IRI with sh:datatype seems to suffice in order to also check their lexical form. An example of this is xsd:boolean, where lexical form "-false" is currently not accepted because the minus sign is not part of the syntax for Boolean lexical forms.

  2. For some literals specifying the datatype IRI with sh:datatype does not seem sufficient, since incorrect lexical forms are still accepted. An example of this is xsd:double for which "--1.1e0" is accepted, even though the double occurrence of the hyphen is not supported by the floating-point syntax.

  3. At the same time, it is also not clear how regular expressions could be specified manually to compensate for the missing lexical form validation (see #44 for generic issues with the way in which regular expressions are currently supported). For example, specifying the regular expression sh:pattern "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)", copied from the XSD standard, alongside sh:datatype xsd:double still validates literals like "--1.1e1"^^xsd:double as ok, even though they violate both the datatype IRI and the regular expression.

At the moment it is difficult for me to determine what is intended behavior and what is a bug. It would be great if SHACL could be used to validate literals, but I am not sure whether (1) such validation is indeed intended by the SHACL standard, and whether (2) it is technologically feasible to implement such validation with contemporary technology.

tpluscode commented 3 years ago

Could you provide the above cases complete with shapes and sample data?

Also, please check the SHACL Playground to see what the results are there.

wouterbeek commented 3 years ago

@tpluscode I have not done anything complicated yet. I think that even the most simple things like the XSD literals do not work. I can still share my files of course :-)

This is my data file:

prefix xsd: <http://www.w3.org/2001/XMLSchema#>
[ a <C>;
  <p> "-false"^^xsd:boolean; # This will not validate when `sh:datatype xsd:boolean` is used.
  <r> "--1.1e0"^^xsd:double ]. # This will validate when `sh:datatype xsd:double` is used.

And this is my patterns file:

prefix sh: <http://www.w3.org/ns/shacl#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

[ sh:property
    [ sh:datatype xsd:boolean;
      sh:path <p> ],
    [ sh:datatype xsd:double;
      sh:path <r>;
      sh:pattern "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)" ]; # This does not do anything at all IIUC.
  sh:targetClass <C> ].

wouterbeek commented 3 years ago

I have added a couple more examples. This is mostly a copy/paste from the XSD standard. I have replaced backward slashes with double backward slashes, since this seems to be required. Since I do not know the Regex grammar, I do not know whether the Regexes are valid (the library does not give feedback when a Regex cannot be processed).
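As a rough illustration of the two points above, the following JavaScript sketch shows why the backslashes need doubling inside a quoted string and how one could check up front whether a pattern compiles. The helper name `checkPattern` is hypothetical and not part of rdf-validate-shacl. JavaScript's regex dialect also differs in places from the XSD and XPath dialects, so a pattern that compiles here is not guaranteed to mean the same thing elsewhere.

```javascript
// In a JavaScript string literal "\." collapses to ".", so a literal
// backslash must be written as "\\" (Turtle strings also treat "\" as
// the start of an escape sequence).
const single = "\.[0-9]+";   // string value: .[0-9]+
const doubled = "\\.[0-9]+"; // string value: \.[0-9]+

// With the backslash lost, "." matches any character.
const singleMatchesX5 = new RegExp(single).test("x5");
// With the backslash kept, a literal dot is required.
const doubledMatchesX5 = new RegExp(doubled).test("x5");

// Hypothetical helper: try to compile a pattern and report the outcome
// instead of failing silently.
function checkPattern(source) {
  try {
    new RegExp(source);
    return { ok: true, error: null };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}

const good = checkPattern("false|true|0|1");
const bad = checkPattern("(\\+|-)?([0-9]+"); // unbalanced parenthesis
```

Here `good.ok` is true, while `bad.ok` is false and `bad.error` carries the engine's message about the unterminated group.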

This is my patterns file:

prefix sh: <http://www.w3.org/ns/shacl#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

[ sh:property
    [ sh:datatype xsd:boolean;
      sh:path <boolean>;
      sh:pattern "false|true|0|1" ],
    [ sh:datatype xsd:date;
      sh:path <date>;
      sh:pattern "-?([1-9][0-9]{3,}|0[0-9]{3})-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" ],
    [ sh:datatype xsd:dateTime;
      sh:path <dateTime>;
      sh:pattern """
-?([1-9][0-9]{3,}|0[0-9]{3})
-(0[1-9]|1[0-2])
-(0[1-9]|[12][0-9]|3[01])
T(([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\\.[0-9]+)?|(24:00:00(\\.0+)?))
(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?""" ],
    [ sh:datatype xsd:decimal;
      sh:path <decimal>;
      sh:pattern "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)" ],
    [ sh:datatype xsd:double;
      sh:path <double>;
      sh:pattern "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)([Ee](\\+|-)?[0-9]+)? |(\\+|-)?INF|NaN" ],
    [ sh:datatype xsd:duration;
      sh:path <duration>;
      sh:pattern """
-?P( ( ( [0-9]+Y([0-9]+M)?([0-9]+D)?
       | ([0-9]+M)([0-9]+D)?
       | ([0-9]+D)
       )
       (T ( ([0-9]+H)([0-9]+M)?([0-9]+(\\.[0-9]+)?S)?
          | ([0-9]+M)([0-9]+(\\.[0-9]+)?S)?
          | ([0-9]+(\\.[0-9]+)?S)
          )
       )?
    )
  | (T ( ([0-9]+H)([0-9]+M)?([0-9]+(\\.[0-9]+)?S)?
       | ([0-9]+M)([0-9]+(\\.[0-9]+)?S)?
       | ([0-9]+(\\.[0-9]+)?S)
       )
    )
  )""" ],
    [ sh:datatype xsd:gMonth;
      sh:path <gMonth>;
      sh:pattern "--(0[1-9]|1[0-2])(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" ],
    [ sh:datatype xsd:gYear;
      sh:path <gYear>;
      sh:pattern "-?([1-9][0-9]{3,}|0[0-9]{3})(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" ],
    [ sh:datatype xsd:gYearMonth;
      sh:path <gYearMonth>;
      sh:pattern "-?([1-9][0-9]{3,}|0[0-9]{3})-(0[1-9]|1[0-2])(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" ],
    [ sh:datatype xsd:string;
      sh:path <string>;
      sh:pattern "\\S" ],
    [ sh:datatype xsd:time;
      sh:path <time>;
      sh:pattern "(([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\\.[0-9]+)?|(24:00:00(\\.0+)?))(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" ];
  sh:targetClass <C> ].

This is my data file:

prefix xsd: <http://www.w3.org/2001/XMLSchema#>
<i>
  a <C>;
  <boolean> false, "0"^^xsd:boolean;
  <date> "-1-01-01"^^xsd:date;
  <dateTime> "-1-01-01T00:00:00-00:00"^^xsd:dateTime;
  <decimal> -01.10, "-02.20"^^xsd:decimal;
  <double> -1.1e+0, "-2.2e+0"^^xsd:double;
  <duration> "-1-01-01T00:00:00-00:00"^^xsd:duration;
  <gMonth> "--01"^^xsd:gMonth;
  <gYear> "-1"^^xsd:gYear, "111111"^^xsd:gYear;
  <gYearMonth> "-1-01Z"^^xsd:gYear, "111111-01Z"^^xsd:gYear;
  <string> "😺", "😺"^^xsd:string;
  <time> "00:00:00-00:00"^^xsd:time.

Since Regex is a crude approach for validating lexical forms, it would be better if lexical forms could also be validated by specifying the datatype IRI (sh:datatype). If that is not feasible, then having proper Regex support would at least allow us to add sh:pattern triples based on the presence of sh:datatype triples.

tpluscode commented 3 years ago

After looking at your examples in the SHACL playground and the spec I have a few observations:

  1. Boolean handling is wrong: the library evaluates the truthiness of the literal, so 0 becomes false and pretty much anything else becomes true. We probably inherited that issue too.
  2. Did you get those regexes from W3C XML Schema? I think the whitespace is a problem in some of them. For example, the double expression has a space before the |(\\+|-)?INF|NaN patterns. Remove that space and it will work.
  3. Otherwise you will need to add the start/end of line symbols ^ and $. Without them you risk matching only a portion of the literal.
  4. Strangely, decimal actually gets validated by the datatype constraint alone.
  5. The regex created by the library probably needs a u flag to handle Unicode correctly.
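Points 2, 3 and 5 can be reproduced with plain JavaScript regular expressions. This is only a sketch using the built-in `RegExp`, with the decimal and double patterns taken from the shapes file above. It does not show how the library builds its expressions internally.

```javascript
// Patterns from the shapes file, as they look after string unescaping.
const decimalSource = "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)";
const doubleWithSpace =
  "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)([Ee](\\+|-)?[0-9]+)? |(\\+|-)?INF|NaN";

// Point 3: without ^ and $ the substring "-1.1" is enough for a match.
const unanchoredAccepts = new RegExp(decimalSource).test("--1.1e1");
const anchoredAccepts = new RegExp(`^(${decimalSource})$`).test("--1.1e1");

// Point 2: the stray space makes the first alternative require a trailing
// space, so even a valid double fails until the whitespace is removed.
const doubleNoSpace = doubleWithSpace.replace(/\s+/g, "");
const withSpaceAccepts = new RegExp(doubleWithSpace).test("-2.2e+0");
const noSpaceAccepts = new RegExp(`^(${doubleNoSpace})$`).test("-2.2e+0");

// Point 5: without the u flag an emoji counts as two UTF-16 code units,
// so a single-character pattern cannot span the whole string.
const emojiWithoutU = /^\S$/.test("😺");
const emojiWithU = /^\S$/u.test("😺");

console.log({
  unanchoredAccepts, // true
  anchoredAccepts,   // false
  withSpaceAccepts,  // false
  noSpaceAccepts,    // true
  emojiWithoutU,     // false
  emojiWithU,        // true
});
```

Wrapping the source in `^( ... )$` rather than just adding `^` and `$` at the ends keeps the anchors from binding to only the first and last alternatives of a pattern such as `...|INF|NaN`.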

Now, while the spec does not mention checking the lexical correctness of literals, it could be added as an option to the library. What do you think @martinmaillard ?

martinmaillard commented 3 years ago

This library already uses rdf-validate-datatype to validate the lexical correctness of literals. So if something gets validated wrong, it's probably a bug.
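For readers unfamiliar with the idea, lexical validation by datatype IRI can be pictured as a lookup from IRI to an anchored pattern. The sketch below is an assumption-laden illustration, not the API or implementation of rdf-validate-datatype: the table `lexicalForms` and the function `isValidLexicalForm` are hypothetical names, and the patterns are the XSD ones quoted earlier in this thread.

```javascript
const XSD = "http://www.w3.org/2001/XMLSchema#";

// Hypothetical table: datatype IRI -> anchored regex for its lexical forms.
const lexicalForms = new Map([
  [`${XSD}boolean`, /^(true|false|0|1)$/],
  [`${XSD}decimal`, /^(\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)$/],
  [
    `${XSD}double`,
    /^((\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)([Ee](\+|-)?[0-9]+)?|(\+|-)?INF|NaN)$/,
  ],
]);

// Returns true or false for known datatypes, and undefined when the
// datatype is not in the table (meaning "not checked").
function isValidLexicalForm(value, datatypeIri) {
  const pattern = lexicalForms.get(datatypeIri);
  return pattern === undefined ? undefined : pattern.test(value);
}

console.log(isValidLexicalForm("-false", `${XSD}boolean`));  // false
console.log(isValidLexicalForm("--1.1e0", `${XSD}double`));  // false
console.log(isValidLexicalForm("-2.2e+0", `${XSD}double`));  // true
```

Under this model both literals from the original report, "-false" and "--1.1e0", would be rejected, which is the behaviour the datatype constraint is expected to give.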