wikipathways / GPML2RDF

GPML2RDF converter
Apache License 2.0

RDF has confusing datanodes for WP4657 #110

Closed marvinm2 closed 3 years ago

marvinm2 commented 3 years ago

Freddie asked me about this issue. WP4657 on WikiPathways has no histone genes but the RDF gives some: https://bit.ly/39a0jCu

Also, the original ttl file does not have them (attached: WP4657.txt).

Somehow these histone genes show up. What could be the cause?

marvinm2 commented 3 years ago

All the histone genes seem to be linked to a list of ncbi genes and ensembl genes: https://bit.ly/3AkinFM

But none of these appear in the WP.

marvinm2 commented 3 years ago

Found the cause in the WP (not gpml) rdf file for this pathway (attached) WP4657.txt

The protein annotated with https://identifiers.org/uniprot/P62805 has all these names and IDs.
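A minimal sketch of how a single annotation can pull in many identifiers, assuming the RDF export expands each node's cross-reference through a one-to-many mapping table. The mapping table, the `expand_node` helper, and the gene identifiers are hypothetical placeholders, not the actual converter code or the real mappings for P62805.

```python
# Hypothetical mapping table: one UniProt accession -> several gene identifiers.
# The gene identifiers are placeholders, not the real mappings.
UNIPROT_TO_GENES = {
    "P62805": ["gene:histone_1", "gene:histone_2", "gene:histone_3"],
}


def expand_node(label, xref, mapping):
    """Return a node record holding its own xref plus every mapped identifier."""
    ids = {xref}
    ids.update(mapping.get(xref, []))
    return {"label": label, "ids": ids}


# A single protein node drawn in the pathway...
node = expand_node("protein node", "P62805", UNIPROT_TO_GENES)

# ...ends up carrying the UniProt accession and all mapped gene identifiers.
for identifier in sorted(node["ids"]):
    print(identifier)
```

Under this assumption, the pathway drawing holds one protein, while the exported record lists every gene the mapping returns for it.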

fehrhart commented 3 years ago

@marvinm2 @egonw @mkutmon We now know where it comes from, but we may still want to solve this mapping issue. It's not one pathway, but a few hundred that have these extensive histone gene mappings. And, according to the WP RDF, these histones are the MOST ABUNDANT genes in WikiPathways:

image

Maybe with an updated BridgeDb for gene/gene products? Or, if it comes from the source, check with UniProt whether that is really intended?
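To illustrate why such mappings can dominate a "most abundant genes" ranking, here is a rough sketch that counts in how many pathways each gene appears after one-to-many expansion. The pathway IDs, gene identifiers, and the `gene_pathway_counts` helper are all made up for illustration; this is not the query behind the figure above.

```python
from collections import Counter

# Hypothetical one-to-many mapping (placeholder gene identifiers).
MAPPING = {
    "P62805": ["gene:histone_1", "gene:histone_2", "gene:histone_3"],
}

# Hypothetical pathways, each listing the xrefs annotated on its nodes.
PATHWAYS = {
    "WP_A": ["P62805", "gene:x"],
    "WP_B": ["P62805", "gene:y"],
    "WP_C": ["P62805"],
}


def gene_pathway_counts(pathways, mapping):
    """Count, per gene, the number of pathways it appears in after expansion."""
    counts = Counter()
    for xrefs in pathways.values():
        genes = set()
        for xref in xrefs:
            # Unmapped xrefs are kept as they are.
            genes.update(mapping.get(xref, [xref]))
        counts.update(genes)  # each gene counted once per pathway
    return counts


counts = gene_pathway_counts(PATHWAYS, MAPPING)
for gene, n in counts.most_common():
    print(gene, n)
```

In this toy data every gene behind the single protein annotation is counted in all three pathways, so the mapped genes outrank genes that were annotated directly.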

egonw commented 3 years ago

> Found the cause in the WP (not gpml) rdf file for this pathway (attached) WP4657.txt

This is something essential to realize: in the WikiPathways RDF world, WPRDF is not the "RDF of a pathway". That is the GPMLRDF. WPRDF is the full biological knowledge in WikiPathways.

egonw commented 3 years ago

"Debugged" it and the problem is the P62805 UniProt identifier in the pathways:

image

The multiple gene mappings come originally from Ensembl/UniProt:

image

egonw commented 3 years ago

@fehrhart, there are 34 pathways with that UniProt identifier: https://bit.ly/39v8BFj

I have created a unit test for it.

egonw commented 3 years ago

Of these, 33 are Reactome pathways: https://bit.ly/3lD7GrK
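For anyone who wants to rerun this kind of lookup, below is a sketch of a query builder for "which pathways contain a node mapped to a given UniProt accession". It is not the query behind the short links above. It assumes that the WPRDF links nodes to UniProt with `wp:bdbUniprot` and to their pathway with `dcterms:isPartOf`; check both against the current WikiPathways vocabulary before relying on it. The function name is hypothetical.

```python
# Assumed prefixes for the WikiPathways RDF; verify against the live endpoint.
WP_PREFIXES = (
    "PREFIX wp: <http://vocabularies.wikipathways.org/wp#>\n"
    "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
)


def pathways_with_uniprot_query(accession):
    """Build a SPARQL query listing pathways with a node mapped to `accession`."""
    # Only allow plain alphanumeric accessions so nothing odd ends up in the IRI.
    if not accession.isalnum():
        raise ValueError(f"unexpected accession: {accession!r}")
    iri = f"https://identifiers.org/uniprot/{accession}"
    return (
        WP_PREFIXES
        + "SELECT DISTINCT ?pathway WHERE {\n"
        + f"  ?node wp:bdbUniprot <{iri}> ;\n"  # assumed predicate
        + "        dcterms:isPartOf ?pathway .\n"  # assumed predicate
        + "  ?pathway a wp:Pathway .\n"
        + "}\n"
    )


query = pathways_with_uniprot_query("P62805")
print(query)
```

The printed query can be pasted into a SPARQL endpoint that serves the WikiPathways RDF; the result count depends on the data release, so it may differ from the numbers reported in this thread.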