Observation

I'm unable to properly validate place names using sh:pattern. Place names may include spaces, single quotes, hyphens, and some non-ASCII Unicode characters. Examples of place names that should succeed are 's-Gravenhage, The Hague, and Köln.

If I understand the somewhat cryptic XSD standard (link), then this should be expressible in the following way:

prefix sh: <http://www.w3.org/ns/shacl#>
[ sh:property
    [ sh:pattern "\\p{S}+";
      sh:path <label> ];
  sh:targetClass <C> ].

But the following data does not validate:

[ a <C>;
  <label> "Köln" ].
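
One thing worth checking independently of the library: in the XSD grammar, \p{S} names the Unicode Symbol category (currency signs, math operators, and so on), whereas letters fall under \p{L}. The Python sketch below uses only the standard unicodedata module to show the general category of each character in "Köln". The helper names are made up for illustration, and the sketch says nothing about how rdf-validate-shacl itself evaluates patterns.

```python
import unicodedata

def general_categories(text):
    """Return (character, Unicode general category) pairs for text."""
    return [(ch, unicodedata.category(ch)) for ch in text]

def all_in_major_category(text, major):
    """True if text is non-empty and every character's category starts with `major`."""
    return bool(text) and all(
        unicodedata.category(ch).startswith(major) for ch in text
    )

koeln = "K\u00f6ln"  # "Köln" with a precomposed o-umlaut

print(general_categories(koeln))            # every character is Lu or Ll (a letter)
print(all_in_major_category(koeln, "L"))    # True: all letters
print(all_in_major_category(koeln, "S"))    # False: no symbols at all
print(all_in_major_category("\u20ac+", "S"))  # True: euro sign and plus are symbols
```

Under that reading, a pattern built on the Symbol category would not be expected to match "Köln" in any engine, while one built on the Letter category would.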

Since many natural languages include characters that do not occur in simple ASCII ranges like [A-Za-z], and because natural language information is very common in RDF data, support for validating Unicode strings in sh:pattern is useful in many cases.

Expected

The ability to use category escapes in sh:pattern, specifically for natural language content for which simple ranges are difficult/impossible to express.
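
As a rough illustration of the kind of check being asked for, the following Python sketch accepts letters and combining marks from any script, plus spaces, apostrophes, and hyphens. It corresponds loosely to a category-based pattern such as ^[\p{L}\p{M}' -]+$. That pattern and the helper name is_place_name are assumptions for illustration only; unicodedata stands in for category escapes because Python's built-in re module does not support \p{...}.

```python
import unicodedata

# Punctuation explicitly allowed in place names (an assumption for this sketch).
ALLOWED_PUNCTUATION = {" ", "'", "-"}

def is_place_name(value):
    """Accept non-empty strings made of letters, marks, spaces, ' and -."""
    if not value:
        return False
    for ch in value:
        major = unicodedata.category(ch)[0]  # first letter of e.g. "Lu", "Mn", "Nd"
        if major in ("L", "M"):
            continue
        if ch in ALLOWED_PUNCTUATION:
            continue
        return False
    return True

for name in ["'s-Gravenhage", "The Hague", "K\u00f6ln", "Area 51", "x@y"]:
    print(name, is_place_name(name))
```

The first three names pass; "Area 51" fails on the digits and "x@y" fails on the at sign. A typographic apostrophe (U+2019) would also be rejected unless it is added to the allowed set.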

tpluscode commented 3 years ago

The pattern is but a simple escaped regex. You need not look into the XSD escaping rules, which I do not see mentioned by SHACL spec

This totally works:

[ sh:pattern "\\S+" ]
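
For comparison, \S is the non-whitespace class, which is a different thing from the \p{S} category escape. The Python sketch below uses re.search as a stand-in for SPARQL REGEX matching, which succeeds when the pattern matches anywhere in the string. Under that assumption, \S+ accepts any value containing at least one non-whitespace character, so it constrains very little unless it is anchored. The helper name matches_anywhere is hypothetical.

```python
import re

def matches_anywhere(pattern, value):
    """Unanchored match: True if pattern matches any substring of value."""
    return re.search(pattern, value) is not None

print(matches_anywhere(r"\S+", "K\u00f6ln"))     # True
print(matches_anywhere(r"\S+", "The Hague"))     # True: "The" already matches
print(matches_anywhere(r"\S+", "   "))           # False: whitespace only
print(matches_anywhere(r"^\S+$", "The Hague"))   # False: the space breaks the anchored match
```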

kad-beekw commented 3 years ago

@tpluscode Thanks for your response!

"You need not look into the XSD escaping rules, which I do not see mentioned by SHACL spec"

I see the following trail when I look through the standard:

The SHACL standard refers to SPARQL 1.1 for the regular expression functionality: https://www.w3.org/TR/shacl/#PatternConstraintComponent
The SPARQL 1.1 standard refers to XPath 3.1: https://www.w3.org/TR/sparql11-query/#func-regex
The XPath 3.1 standard refers to XSD 1.0: https://www.w3.org/TR/xpath-functions/#regex-syntax
But I think that the XSD 1.1 standard supersedes version 1.0: https://www.w3.org/TR/xmlschema11-2/#cces

The main question, I think, is whether XSD 1.0 or XSD 1.1 should be used.

To be honest, I like your regex notation better, since it is a bit simpler :-). However, I can imagine that there is a benefit to following the specification. There may be cases in which a regular expression stored in SHACL can be matched and reused in a SPARQL query. (I'm not sure whether this is a good use case, but what I'm getting at is that using the same regex notation across SHACL, SPARQL and XSD may facilitate crossover use cases.)

tpluscode commented 3 years ago

I must admit I am a little confused myself, not having dug deep before.

You seem correct about how you followed your nose from the SHACL to the XSD specs. Section 7.1 of XPath seems to suggest that XSD 1.1 should be used, doesn't it?

That said, the examples in the SHACL spec use the simple escaping (it's pretty much just the backslash). And FWIW, the section for sh:pattern says:

"The values of sh:pattern in a shape are valid pattern arguments for the SPARQL REGEX function."

This is definitely valid SPARQL :)

filter ( regex( ?name, "^\\S+" ) )
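
The doubled backslash in these snippets is string-literal escaping rather than part of the regular expression: in Turtle and SPARQL, the literal "\\S+" denotes the three-character pattern \S+. Python string literals happen to treat \\ the same way, so the following sketch can illustrate the unescaping step. It says nothing about which regex dialect then interprets the result.

```python
# The escaped literal as it would be typed between double quotes.
escaped_non_space = "\\S+"
escaped_letters = "\\p{L}+"

# After unescaping, each "\\" has become a single backslash.
print(escaped_non_space)                # \S+
print(len(escaped_non_space))           # 3 characters: backslash, S, plus
print(escaped_non_space == r"\S+")      # True: same string as the raw form
print(escaped_letters == r"\p{L}+")     # True
```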

kad-beekw commented 3 years ago

@tpluscode Thanks, XSD 1.1 indeed seems to be the intended standard for regex in SHACL (and SPARQL). I do not have enough knowledge of XSD to determine whether \S is also valid. When I look at the XSD 1.1 standard, I can only find the charProp grammar rule used within the \p{...} or \P{...} notation:

[85] catEsc   ::= '\p{' charProp '}'
[86] complEsc ::= '\P{' charProp '}'
[87] charProp ::= IsCategory | IsBlock

Maybe \p{S} is commonly written as \S in SPARQL? If so, this may be a de facto extension of the XSD 1.1 syntax?

Whatever the case may be, some regex strings that seem to be valid in XSD 1.1 do not seem to be supported by this SHACL library. Maybe this is not so bad: the XSD 1.1 standard is sufficiently unreadable to prevent large groups of users from picking up the regex grammar described in it. Maybe the de facto way of writing regex is more popular.

zazuko / rdf-validate-shacl

Regular expressions with Unicode #44
