wikipathways / GPML2RDF

GPML2RDF converter
Apache License 2.0

Test for spaces #85

Closed DeniseSl22 closed 3 years ago

DeniseSl22 commented 3 years ago

Testing the WP.ttl in Blazegraph gave the following error:

IRI included an unencoded space: '32' [line 70194]

The same message appeared later for [line 112936] and [line 113894] (same mistake in the ID), [line 178154] and [line 178738] (HMDB IDs combined with letters), and [line 290761] (EC code as ID with a space). More are still showing up, but you get the idea, right @egonw? Lines 70186-70194:

<http://identifiers.org/ec-code/tryptophan_hydroxylase>
        a                   wp:DataNode , wp:Protein ;
        rdfs:label          "TPH"^^xsd:string ;
        dc:identifier       <http://identifiers.org/ec-code/tryptophan_hydroxylase> ;
        dc:source           "Enzyme Nomenclature"^^xsd:string ;
        dcterms:identifier  "tryptophan hydroxylase"^^xsd:string ;
        dcterms:isPartOf    <http://identifiers.org/wikipathways/WP4156_r111627> , <http://rdf.wikipathways.org/Pathway/WP4156_r111627/WP/Interaction/id10629ddb> ;
        wp:isAbout          <http://rdf.wikipathways.org/Pathway/WP4156_r111627/DataNode/eb2b3> ;
        foaf:page           <http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=tryptophan hydroxylase> .

Lines 112936-112939:

<http://identifiers.org/pubmed/; 23567973>
        a                 wp:PublicationReference ;
        dcterms:isPartOf  <http://identifiers.org/wikipathways/WP4847_r111515> ;
        foaf:page         <http://www.ncbi.nlm.nih.gov/pubmed/; 23567973> .

Lines 178146-178156 (the next occurrence is not shown, because it is similar):

<http://identifiers.org/hmdb/HMDB00linic_acid>
        a                              wp:DataNode , wp:Metabolite ;
        rdfs:label                     "Betulinic acid"^^xsd:string ;
        dc:identifier                  <http://identifiers.org/hmdb/HMDB00linic_acid> ;
        dc:source                      "HMDB"^^xsd:string ;
        dcterms:bibliographicCitation  <http://identifiers.org/pubmed/26750873> ;
        dcterms:identifier             "HMDB00linic acid"^^xsd:string ;
        dcterms:isPartOf               <http://identifiers.org/wikipathways/WP4874_r110018> , <http://rdf.wikipathways.org/Pathway/WP4874_r110018/WP/Interaction/id99b4adfe> ;
        wp:bdbHmdb                     <http://identifiers.org/hmdb/Betulinic acid> ;
        wp:isAbout                     <http://rdf.wikipathways.org/Pathway/WP4874_r110018/DataNode/f76a7> ;
        foaf:page                      <http://www.hmdb.ca/metabolites/Betulinic acid> .

Lines 290752-290761:

<http://identifiers.org/brenda/EC_2.4.1.17>
        a                              wp:Protein , wp:DataNode ;
        rdfs:label                     "UDP-glucuronyltransferase"^^xsd:string ;
        dc:identifier                  <http://identifiers.org/brenda/EC_2.4.1.17> ;
        dc:source                      "BRENDA"^^xsd:string ;
        dcterms:bibliographicCitation  <http://identifiers.org/pubmed/22254058> ;
        dcterms:identifier             "EC 2.4.1.17"^^xsd:string ;
        dcterms:isPartOf               <http://identifiers.org/wikipathways/WP4238_r107164> ;
        wp:isAbout                     <http://rdf.wikipathways.org/Pathway/WP4238_r107164/DataNode/fb0cb> , <http://rdf.wikipathways.org/Pathway/WP4238_r107164/DataNode/acf65> ;
        foaf:page                      <http://www.brenda-enzymes.org/php/result_flat.php4?ecno=EC 2.4.1.17> .

Also check:

I've fixed these in the .ttl file first, then Blazegraph could upload the data :). I'll also fix it on WP in the PWs themselves, but this is a lot of manual work, which I don't want to do every time I need to test something in Blazegraph.

I'm not sure about the fix: update GPML2RDF, create a unit test, or add warning messages in PV (or all of these)...
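
To avoid repeating the manual check before every Blazegraph test, the Turtle file could be scanned up front. The following is only a minimal sketch, not part of GPML2RDF: the function name `find_iris_with_whitespace` and the regular expression are assumptions made for illustration, and the pattern is a rough approximation of an IRI in angle brackets, not a full Turtle parser.

```python
import re

# Rough approximation: a scheme, a colon, then anything up to the closing
# angle bracket. This is an assumption for illustration, not the Turtle grammar.
IRI_PATTERN = re.compile(r'<([A-Za-z][A-Za-z0-9+.-]*:[^<>"]*)>')


def find_iris_with_whitespace(lines):
    """Yield (line_number, iri) for every bracketed IRI containing whitespace."""
    for line_number, line in enumerate(lines, start=1):
        for match in IRI_PATTERN.finditer(line):
            iri = match.group(1)
            if any(ch.isspace() for ch in iri):
                yield line_number, iri


# Sample lines copied from the PubMed example in this issue.
sample = [
    '<http://identifiers.org/pubmed/; 23567973>',
    '        a                 wp:PublicationReference ;',
    '        dcterms:isPartOf  <http://identifiers.org/wikipathways/WP4847_r111515> ;',
    '        foaf:page         <http://www.ncbi.nlm.nih.gov/pubmed/; 23567973> .',
]

problems = list(find_iris_with_whitespace(sample))
for line_number, iri in problems:
    print(f"line {line_number}: {iri!r}")
```

For a real file, the same function could be fed an open file handle, for example `find_iris_with_whitespace(open("wp.ttl", encoding="utf-8"))`, where the file name is whatever the export is called locally.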

DeniseSl22 commented 3 years ago

After some more debugging, I believe the main problems arise when identifiers with spaces are added (I had this problem once before, with a "\n" in a reference). So quick fix strategy for now would be:

  1. Do not convert a line in GPML for IDs with a space or line break in them in the WP RDF. (in GPML2RDF).
  2. Create a Unit test for the GPML RDF, so we can find mistakes and fix them and the correct info will end up in WP RDF. --> WP Curator
  3. Add a warning in PV when people add spaces, line breaks etc. -->PV code
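
Steps 1 and 2 above could be sketched roughly as follows. This is a hypothetical illustration in Python, not the actual GPML2RDF code (which is a separate code base): the helper names `is_safe_identifier` and `split_by_safety` are made up here, and "2.4.1.17" and "26750873\n" are invented inputs showing a clean value and a trailing line break.

```python
def is_safe_identifier(identifier):
    """Return True if the identifier is non-empty and contains no whitespace.

    str.isspace() also covers tabs, line breaks and Unicode spaces,
    not just the plain ASCII space.
    """
    return bool(identifier) and not any(ch.isspace() for ch in identifier)


def split_by_safety(identifiers):
    """Separate identifiers that can be converted from those to report."""
    safe, rejected = [], []
    for identifier in identifiers:
        if is_safe_identifier(identifier):
            safe.append(identifier)
        else:
            rejected.append(identifier)
    return safe, rejected


# Two values taken from this issue, plus two hypothetical ones.
safe, rejected = split_by_safety(
    ["2.4.1.17", "EC 2.4.1.17", "tryptophan hydroxylase", "26750873\n"]
)
print("convert:", safe)
print("report to curator:", rejected)
```

The rejected list is what a unit test or curation report would assert to be empty, so that the mistakes get fixed at the source instead of being silently dropped.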

DeniseSl22 commented 3 years ago

Okay, checked and improved all PWs now, so RDF_All should load into Blazegraph directly (will test this next).

egonw commented 3 years ago

  1. Do not convert a line in GPML for IDs with a space or line break in them in the WP RDF. (in GPML2RDF).

Easier said than done. Often the newlines are some unicode and not easy to detect, actually. Like &#10; or &#0Al.
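
To illustrate why these are easy to miss: a character reference looks harmless in the raw text and only becomes a line break once decoded. The sketch below is an assumption-laden illustration, not the check used in GPML2RDF. It uses Python's `html.unescape` as a stand-in for the decoding an XML parser would normally do, and the raw values (including the hexadecimal forms `&#x0A;` and `&#x2028;`) are made-up examples.

```python
import html


def has_hidden_whitespace(raw):
    """Decode character references, then look for any whitespace character."""
    decoded = html.unescape(raw)
    return any(ch.isspace() for ch in decoded)


# Hypothetical raw attribute values as they might appear in a pathway file.
raw_values = [
    "23567973&#10;",     # decimal reference to a line feed
    "23567973&#x0A;",    # hexadecimal reference to the same character
    "23567973&#x2028;",  # Unicode line separator
    "23567973",          # clean value
]

flags = [has_hidden_whitespace(value) for value in raw_values]
for value, flag in zip(raw_values, flags):
    print(f"{value!r}: {'whitespace found' if flag else 'ok'}")
```

Checking the decoded value with `str.isspace()` catches the Unicode variants as well, which a search for a literal space or "\n" in the raw text would not.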

egonw commented 3 years ago

3. Add a warning in PV when people add spaces, line breaks etc.

Yes, it would be nice if PV applied the format checks.

egonw commented 3 years ago

Another weird thing, there is actually a check for PubMed identifiers that are not numbers :( No idea why these do not show up :/

egonw commented 3 years ago

Okay, I pushed some patches to the GPML2RDF repo and the curation repo.

egonw commented 3 years ago

Okay, I can load the latest wp.ttl RDF in Blazegraph. I created unit tests for the situations that cannot easily be caught in the RDF generation at this moment.

egonw commented 3 years ago

@DeniseSl22, can you let me know if this issue can be closed or not?

DeniseSl22 commented 3 years ago

Okay, I just checked the All_RDF from today, and it loaded without issues in Blazegraph :D . Thanks for fixing this!