yago-naga / yago4

Yago 4 - the next version of Yago
https://yago-knowledge.org/downloads/yago-4
GNU General Public License v3.0

Invalid XMLSchema date #3

Closed elad-shaked closed 4 years ago

elad-shaked commented 4 years ago

In yago-wd-facts.nt there are statements whose object is "0000"^^<http://www.w3.org/2001/XMLSchema#gYear>

For example, grep output:

yago-wd-facts.nt:3511416:<http://yago-knowledge.org/resource/Dong_Xian> <http://schema.org/deathDate>   "0000"^^<http://www.w3.org/2001/XMLSchema#gYear>    .
yago-wd-facts.nt:4012715:<http://yago-knowledge.org/resource/Tryphon>   <http://schema.org/deathDate>   "0000"^^<http://www.w3.org/2001/XMLSchema#gYear>    .
yago-wd-facts.nt:4023877:<http://yago-knowledge.org/resource/Emperor_Ai_of_Han> <http://schema.org/deathDate>   "0000"^^<http://www.w3.org/2001/XMLSchema#gYear>    .
yago-wd-facts.nt:4088637:<http://yago-knowledge.org/resource/Granius_Licinianus>    <http://schema.org/deathDate>   "0000"^^<http://www.w3.org/2001/XMLSchema#gYear>    .
yago-wd-facts.nt:18885804:<http://yago-knowledge.org/resource/Lucius_Arruntius_Camillus_Scribonianus>   <http://schema.org/birthDate>   "0000"^^<http://www.w3.org/2001/XMLSchema#gYear>    .

According to XMLSchema "0000" is not a valid gYear: https://www.w3.org/TR/xmlschema-2/#noYearZero
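
The grep above can be reproduced in a few lines of Python. This is only a minimal sketch: the function name `find_year_zero` and the second sample triple (an `example.org` resource) are made up for illustration, and the pattern only looks for the exact literal "0000" typed as xsd:gYear.

```python
import re

# Matches an object literal "0000" typed as xsd:gYear.
YEAR_ZERO = re.compile(r'"0000"\^\^<http://www\.w3\.org/2001/XMLSchema#gYear>')


def find_year_zero(lines):
    """Yield (line_number, line) for each line holding a year-zero gYear literal."""
    for number, line in enumerate(lines, start=1):
        if YEAR_ZERO.search(line):
            yield number, line.rstrip("\n")


# Two sample triples: the first is from the report, the second is hypothetical.
sample = [
    '<http://yago-knowledge.org/resource/Tryphon>\t<http://schema.org/deathDate>\t'
    '"0000"^^<http://www.w3.org/2001/XMLSchema#gYear>\t.',
    '<http://example.org/someone>\t<http://schema.org/birthDate>\t'
    '"0044"^^<http://www.w3.org/2001/XMLSchema#gYear>\t.',
]
hits = list(find_year_zero(sample))
```

To scan the real dump, the same function can be given an open file handle, for example `find_year_zero(open("yago-wd-facts.nt", encoding="utf-8"))`.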

Tpt commented 4 years ago

Thank you for this bug report.

This problem is rooted in the XML Schema specifications. XML Schema 1.0 states that year 0 is not valid, but XML Schema 1.1 reversed this and states that "0000"^^xsd:gYear denotes the year 1 BC (cf. the xsd:dateTime "value space" note in the XML Schema 1.1 Part 2 specification). The recent XPath specifications, which the SPARQL built-in functions rely on, say the same.

We have chosen here to follow the recent XML Schema and XPath specifications. This also has the advantage of making computations easier (no special case needed for BC dates). So, I don't think we will change this. Sorry if it makes some data usage harder, but we had to pick a side. The third option would have been to not introduce BC dates in YAGO, but I'm not sure that would have been better.
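
To illustrate the "easier computations" point, here is a small sketch, assuming years are held as plain integers in the XML Schema 1.1 numbering, where 0 is 1 BC and -1 is 2 BC. The helper names `era_label`, `years_between`, and `years_between_no_zero` are invented for this example.

```python
def era_label(year):
    """Turn an XML Schema 1.1 year value into a conventional BC/AD label."""
    if year > 0:
        return f"{year} AD"
    # 0 is 1 BC, -1 is 2 BC, and so on.
    return f"{1 - year} BC"


def years_between(start, end):
    """Elapsed years with a year zero: plain subtraction is enough."""
    return end - start


def years_between_no_zero(start, end):
    """Same computation when -1 means 1 BC and there is no year zero.

    A special case is needed whenever the interval crosses the BC/AD boundary.
    """
    difference = end - start
    if start < 0 < end:
        difference -= 1  # skip the non-existent year zero
    return difference
```

With these definitions, `era_label(0)` gives "1 BC", and the span from 1 BC to 1 AD is `years_between(0, 1)`, which is 1, whereas the no-year-zero variant has to correct `1 - (-1)` down to 1.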

Side note: following the XML Schema specifications, all dates use the Gregorian calendar.

elad-shaked commented 4 years ago

I understand. I encountered this when I was using RDF4J, since it uses org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl::newXMLGregorianCalendar. I recently contributed some PRs to RDF4J and am willing to do so again. How do you suggest I approach this?

Tpt commented 4 years ago

It would indeed be amazing to update RDF4J. Thank you! Maybe by having an XML Schema 1.1 vs. 1.0 option.

After some investigation, Xerces seems to have some support for this. I am not sure how to use it properly (I am not very familiar with Xerces). Jena seems to have simply copied and pasted these classes, but that is not a very nice way to go.
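
The idea of a 1.1 vs. 1.0 option can be sketched independently of any library. The following is not RDF4J, Xerces, or Jena code: `parse_gyear` and its `schema_version` parameter are hypothetical, and the sketch only shows how the two readings of a gYear literal differ. It returns the year in the 1.1 numbering (0 is 1 BC) in both modes and ignores any timezone suffix.

```python
import re

# Optional sign, at least four digits, optional timezone suffix.
_GYEAR = re.compile(r"^(-?)(\d{4,})(Z|[+-]\d{2}:\d{2})?$")


def parse_gyear(lexical, schema_version="1.1"):
    """Return the year of a gYear literal in 1.1 numbering (0 = 1 BC, -1 = 2 BC).

    `schema_version` is a hypothetical switch, not an existing library option.
    """
    match = _GYEAR.match(lexical)
    if match is None:
        raise ValueError(f"not a gYear lexical form: {lexical!r}")
    sign, digits, _timezone = match.groups()
    magnitude = int(digits)
    if schema_version == "1.1":
        # 1.1: "0000" is 1 BC and "-0001" is 2 BC, so the value is used as-is.
        return -magnitude if sign else magnitude
    if schema_version == "1.0":
        if magnitude == 0:
            raise ValueError("XML Schema 1.0 has no year zero")
        # 1.0: "-0001" is 1 BC, so negative years shift up by one.
        return 1 - magnitude if sign else magnitude
    raise ValueError(f"unknown schema version: {schema_version!r}")
```

Under this sketch, `parse_gyear("0000")` and `parse_gyear("-0001", "1.0")` both return 0, that is 1 BC, while `parse_gyear("0000", "1.0")` raises `ValueError`.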

elad-shaked commented 4 years ago

I opened this issue over at RDF4J's repo. Given the response, I don't think this will be resolved (at least, I was convinced).