zazuko / rdf-validate-shacl

Validate RDF data purely in JavaScript. An implementation of the W3C SHACL specification on top of the RDFJS stack.
MIT License

Regular expressions with Unicode #44

Open kad-beekw opened 3 years ago

kad-beekw commented 3 years ago

Thanks for maintaining this great library!

Observation

I'm unable to properly validate place names using sh:pattern. Place names may include spaces, single quotes, hyphens, and some non-ASCII Unicode characters. Examples of place names that should succeed are 's-Gravenhage, The Hague, and Köln.

If I understand the somewhat cryptic XSD standard correctly, this should be expressible as follows:

prefix sh: <http://www.w3.org/ns/shacl#>
[ sh:property
    [ sh:pattern "\\p{S}+";
      sh:path <label> ];
  sh:targetClass <C> ].

But the following data does not validate:

[ a <C>;
  <label> "Köln" ].

Since many natural languages include characters that do not occur in simple ASCII ranges like [A-Za-z], and because natural language information is very common in RDF data, support for validating Unicode strings in sh:pattern is useful in many cases.
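
For context, here is a minimal sketch of how a JavaScript regex engine treats the category escape from the shape above. It assumes the validator hands the pattern to `RegExp` without the `u` flag, which this thread does not confirm. It also shows that `\p{S}` names the Unicode category Symbol, whereas letters such as "ö" fall under `\p{L}`.

```javascript
// The Turtle literal "\\p{S}+" carries the regex source \p{S}+ (one backslash).
const source = "\\p{S}+";

// Assumption: the pattern is compiled without the `u` flag.
// In that mode \p is just a literal "p", so the regex looks for the text "p{S}".
const legacy = new RegExp(source);

// With the `u` flag, \p{S} is the Unicode general category Symbol.
const unicode = new RegExp(source, "u");

// Letters are category L, shown here for comparison.
const letters = new RegExp("^\\p{L}+$", "u");

console.log(legacy.test("Köln"));   // false
console.log(legacy.test("p{S}"));   // true
console.log(unicode.test("Köln"));  // false: no Symbol characters
console.log(unicode.test("€"));     // true: currency symbols are in S
console.log(letters.test("Köln"));  // true
```

If the assumption holds, the shape would reject "Köln" for two independent reasons: the escape is not interpreted as a category, and even when it is, the category S does not cover letters.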

Expected

The ability to use category escapes in sh:pattern, specifically for natural language content for which simple ranges are difficult/impossible to express.

tpluscode commented 3 years ago

The pattern is but a simple escaped regex. You need not look into the XSD escaping rules, which I do not see mentioned by SHACL spec.

This totally works:

[ sh:pattern "\\S+" ]
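
A rough sketch of what this pattern accepts, assuming plain JavaScript `RegExp` semantics (the variable names are illustrative only). The Turtle literal "\\S+" denotes the regex `\S+`, and an unanchored match succeeds as soon as any part of the value matches, so a label that contains spaces still passes unless the pattern is anchored.

```javascript
// "\\S+" in Turtle (and in a JavaScript string) is the regex source \S+ .
const unanchored = new RegExp("\\S+");
// Anchored variant: the whole value must consist of non-whitespace characters.
const anchored = new RegExp("^\\S+$");

const labels = ["Köln", "'s-Gravenhage", "The Hague", "   "];

// Collect the outcome of both patterns for each label.
const results = labels.map((label) => ({
  label,
  unanchored: unanchored.test(label),
  anchored: anchored.test(label),
}));

console.table(results);
// "Köln" and "'s-Gravenhage" pass both patterns.
// "The Hague" passes only the unanchored pattern.
// "   " fails both.
```

So `\S+` does handle non-ASCII letters here, but on its own it only checks that the label contains at least one non-whitespace character.
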

kad-beekw commented 3 years ago

@tpluscode Thanks for your response!

You need not look into the XSD escaping rules, which I do not see mentioned by SHACL spec

I see the following trail when I look through the standard:

  1. The SHACL standard refers to SPARQL 1.1 for the regular expression functionality: https://www.w3.org/TR/shacl/#PatternConstraintComponent
  2. The SPARQL 1.1 standard refers to XPath 3.1: https://www.w3.org/TR/sparql11-query/#func-regex
  3. The XPath 3.1 standard refers to XSD 1.0: https://www.w3.org/TR/xpath-functions/#regex-syntax
  4. But I think that the XSD 1.1 standard supersedes version 1.0: https://www.w3.org/TR/xmlschema11-2/#cces

The main discussion I think is whether XSD 1.0 or XSD 1.1 should be used.

To be honest, I like your regex notation better, since it is a bit simpler :-). However, I can imagine that there is benefit from following the specification. There may be cases in which a regular expression stored in SHACL can be matched and reused in a SPARQL query. (I'm not sure whether this is a good use case, but what I'm getting at is that when the same regex notation is used across SHACL, SPARQL and XSD this may facilitate cross over use cases.)

tpluscode commented 3 years ago

I must admit I am a little confused myself, not having dug deep before.

You seem correct about how you followed your nose from the SHACL to the XSD specs. Section 7.1 of XPath seems to suggest that XSD 1.1 should be used, doesn't it?

That said, the examples in the SHACL spec seem to use the simple escaping (it's pretty much just the backslash). And FWIW the section for sh:pattern says

The values of sh:pattern in a shape are valid pattern arguments for the SPARQL REGEX function.

This is definitely valid SPARQL :)

filter ( regex( ?name, "^\\S+" ) )

kad-beekw commented 3 years ago

@tpluscode Thanks, XSD 1.1 indeed seems to be the intended standard for regex in SHACL (and SPARQL). I do not have enough knowledge of XSD to determine whether \S is also valid. When I look at the XSD 1.1 standard, I can only find the charProp grammar rule, which is used within the \p{...} and \P{...} notation:

[85] catEsc   ::= '\p{' charProp '}'
[86] complEsc ::= '\P{' charProp '}'
[87] charProp ::= IsCategory | IsBlock

Maybe \p{S} is commonly written as \S in SPARQL? If so, this may be a de facto extension of the XSD 1.1 syntax?

Whatever the case may be, some regex strings that seem to be valid in XSD 1.1 do not seem to be supported by this SHACL library. Maybe this is not so bad: the XSD 1.1 standard is sufficiently unreadable to prevent large groups of users from picking up the regex grammar described in it. Maybe the de facto way of writing regex is more popular.
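
On the question of whether \p{S} and \S are the same thing: they are not. XSD lists \S among its multi-character escapes, meaning "any non-whitespace character", while \p{S} is the Unicode category Symbol. The sketch below shows the difference in JavaScript (with the `u` flag, needed there for category escapes) and adds a hypothetical place-name pattern, `placeName`, built from the requirements in the original report. Whether the validator accepts such a pattern depends on how it compiles the regex, which this thread leaves open.

```javascript
// \S: any non-whitespace character.
const nonWhitespace = /^\S+$/u;
// \p{S}: characters in the Unicode general category Symbol.
const symbols = /^\p{S}+$/u;

console.log(nonWhitespace.test("Köln")); // true
console.log(symbols.test("Köln"));       // false
console.log(nonWhitespace.test("+="));   // true
console.log(symbols.test("+="));         // true: math symbols are in S

// Hypothetical pattern: letters, combining marks, single quotes,
// spaces, and hyphens, applied to the whole value.
const placeName = /^[\p{L}\p{M}' -]+$/u;

console.log(placeName.test("'s-Gravenhage")); // true
console.log(placeName.test("The Hague"));     // true
console.log(placeName.test("Köln"));          // true
console.log(placeName.test("Köln 4"));        // false: digits are not allowed
```

In a Turtle shape, the same pattern would need doubled backslashes, for example "^[\\p{L}\\p{M}' -]+$".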