Closed: DeniseSl22 closed this issue 3 years ago
After some more debugging, I believe the main problems arise when identifiers with spaces are added (I had this problem once before, with a "\n" in a reference). So the quick-fix strategy for now would be:
Okay, checked and improved all PWs now, so RDF_All should load into Blazegraph directly (will test this next).
- In GPML2RDF, do not convert a GPML line into WP RDF when its ID contains a space or line break.
Easier said than done. Often the newlines are some Unicode character and not easy to detect, actually.
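To illustrate why plain checks for " " or "\n" miss some cases, here is a minimal Python sketch (not the GPML2RDF code, which this thread does not show) that flags any character whose Unicode category is a separator, control, or format character. The function names `find_suspect_chars` and `is_clean_identifier` are made up for this example.

```python
import unicodedata

# Unicode general categories that should not appear inside an identifier:
# Zs/Zl/Zp = space, line and paragraph separators; Cc = control characters
# (includes "\n", "\r", "\t"); Cf = format characters such as zero-width space.
SUSPECT_CATEGORIES = {"Zs", "Zl", "Zp", "Cc", "Cf"}


def find_suspect_chars(identifier):
    """Return (position, code point) pairs for every suspicious character."""
    return [
        (index, f"U+{ord(char):04X}")
        for index, char in enumerate(identifier)
        if unicodedata.category(char) in SUSPECT_CATEGORIES
    ]


def is_clean_identifier(identifier):
    """True if the identifier is non-empty and has no suspicious characters."""
    return bool(identifier) and not find_suspect_chars(identifier)


if __name__ == "__main__":
    # Hypothetical identifiers, only for illustration.
    for candidate in ["HMDB00001", "1.1.1 1", "12345\u2028", "12345\n"]:
        print(repr(candidate), find_suspect_chars(candidate))
```

Reporting the code point rather than the character itself makes invisible separators such as U+2028 or U+00A0 visible in a log, which could help when tracking them down in a pathway.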
3. Add a warning in PV when people add spaces, line breaks etc.
Yes, it would be nice if PV applied the format checks.
Another weird thing, there is actually a check for PubMed identifiers that are not numbers :( No idea why these do not show up :/
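For reference, a strict numeric check could look like the Python sketch below. It is not the existing curation check (that code is not shown in this thread), and `is_valid_pubmed_id` is a hypothetical name. The sketch assumes a PubMed identifier is a positive integer without leading zeros.

```python
import re

# Assumption for this sketch: a PubMed identifier consists of ASCII digits
# only and does not start with 0. [0-9] is used instead of \d because \d
# also matches non-ASCII digits in Python 3.
PUBMED_ID = re.compile(r"[1-9][0-9]*")


def is_valid_pubmed_id(value):
    """True only if the whole string is a plain ASCII number."""
    return PUBMED_ID.fullmatch(value) is not None


if __name__ == "__main__":
    # Hypothetical values, only for illustration.
    for candidate in ["12345678", "12345678 ", "PMC12345", "12345678\n"]:
        print(repr(candidate), is_valid_pubmed_id(candidate))
```

One Python-specific pitfall worth knowing: `re.match(r"^\d+$", value)` accepts a value with a trailing "\n", because `$` also matches just before a final newline, whereas `fullmatch` does not. Whether something similar explains the missing warnings here is unknown.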
Okay, I pushed some patches to the GPMLRDF repo and the curation repo.
Okay, I can now load the latest wp.ttl RDF in Blazegraph. I created unit tests for the situations that cannot easily be caught in the RDF generation at this moment.
@DeniseSl22, can you let me know if this issue can be closed or not?
Okay, just checked the All_RDF from today, and loaded without issues in Blazegraph :D . Thanks for fixing this!
Testing the WP.ttl in Blazegraph gave the following error:
Later same message for: [line 112936] + [line 113894] (same mistake in ID); [line 178154] + [line 178738] (HMDB IDs combined with letters) + [line 290761] (EC-code as ID with space) (and still more are showing up, but you get the idea, right @egonw?).
Lines 70186-70194:
Lines 112936-112939:
Lines 178146-178156 (not including the next example, which is similar):
Lines 290752-290761:
Also check:
I've fixed these in the .ttl file first, then Blazegraph could upload the data :). I'll also fix it on WP in the PWs themselves, but this is a lot of manual work, which I don't want to do every time I need to test something in Blazegraph.
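To avoid finding these one error at a time during the Blazegraph upload, a rough pre-check could list all suspicious IRIs with their line numbers in one pass. The Python sketch below is only an approximation, not a Turtle parser: it looks at `<...>` tokens line by line, does not handle `\u` escapes, may give false positives for angle brackets inside string literals, and misses IRIs broken across lines by a real "\n". The name `find_bad_iris` and the example.org IRIs are made up for this example.

```python
import re

# Anything between < and > on a single line is treated as an IRI token.
IRI_TOKEN = re.compile(r"<([^<>\n]*)>")

# Subset of the characters the Turtle grammar forbids inside an IRI:
# control characters and space (U+0000 to U+0020), plus " { } | ^ `
ILLEGAL_IN_IRI = re.compile(r'[\x00-\x20"{}|^`]')


def find_bad_iris(turtle_text):
    """Return (line number, IRI) pairs for IRIs with illegal characters."""
    problems = []
    # Split on "\n" only, so line numbers agree with what a parser
    # counting "\n" would report.
    for line_no, line in enumerate(turtle_text.split("\n"), start=1):
        for match in IRI_TOKEN.finditer(line):
            iri = match.group(1)
            if ILLEGAL_IN_IRI.search(iri):
                problems.append((line_no, iri))
    return problems


if __name__ == "__main__":
    # Hypothetical Turtle snippet, only for illustration.
    sample = (
        "<http://example.org/a> <http://example.org/p> <http://example.org/ok> .\n"
        "<http://example.org/b> <http://example.org/p> <http://example.org/ec/1.1.1 1> .\n"
    )
    for line_no, iri in find_bad_iris(sample):
        print(f"line {line_no}: {iri!r}")
```

Running something like this over the generated file before loading it would give the full list of pathways to fix at once, instead of fixing the .ttl by hand after each failed upload.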
I'm not sure about the fix: update the GPML2RDF, create a Unit test, or warning messages in PV (or all of these)....