wikipathways / SARS-CoV-2-WikiPathways

Temporary repository of RDF of selected pathways from WikiPathways, supporting the Wikidata bot.
https://wikipathways.github.io/SARS-CoV-2-WikiPathways/
Creative Commons Zero v1.0 Universal

Lack of interactions in RDF files? [Export to BEL via PathMe] #2

Open ddomingof opened 4 years ago

ddomingof commented 4 years ago

Dear @egonw,

As discussed on Twitter, we find almost no relationships in the files we export via PathMe. Our parser runs the SPARQL query below to get the relationships between RDF "nodes", but it comes back empty for almost all files.

Are these files generated with a different RDF exporter than the usual WikiPathways one? We probably have to change something in this query, because the same thing often happens with regular pathways where "cartoons" are drawn (e.g., https://www.wikipathways.org/index.php/Pathway:WP107).

Example that comes empty from the COVID19 pathways: https://www.wikipathways.org/index.php/Pathway:WP4862

"""
SELECT DISTINCT
    (?source_entry AS ?source)
    (?dc_source AS ?source)
    (?target_entry AS ?target)
    (?dc_target AS ?target)
    ?uri_id
    (STRAFTER(STR(?uri_id), "/Interaction/") AS ?identifier)
    (STRAFTER(STR(?uri_type), str(wp:)) AS ?interaction_types)
    (STRAFTER(STR(?ncbigene_source), str(ncbigene:)) AS ?source )
    (STRAFTER(STR(?ncbigene_target), str(ncbigene:)) AS ?target )
WHERE {
   ?pathway a wp:Pathway .
   ?uri_id dcterms:isPartOf ?pathway .
   ?uri_id a wp:DirectedInteraction .
   ?uri_id rdf:type ?uri_type .
   ?uri_id wp:source ?source_entry .
   ?uri_id wp:target ?target_entry .
   optional {?source_entry dcterms:identifier ?dc_source .}
   optional {?target_entry dcterms:identifier ?dc_target .}
   optional {?source_entry wp:bdbEntrezGene ?ncbigene_source .}
   optional {?target_entry wp:bdbEntrezGene ?ncbigene_target .}
}
"""

ddomingof commented 4 years ago

Probably we have to change the SPARQL query @cthoyt

cthoyt commented 4 years ago

this query looks like it has all sorts of smells in it...

cthoyt commented 4 years ago

Here's a new query for getting all directed interactions. Caveats: it only extracts Entrez identifiers/labels when the node is a gene, and it does not differentiate between genes and gene products. It takes about 20-25 seconds to run.

*Note: I'm not sure how to infer polarity. Is the "increases" or "decreases" stored in the interaction somewhere? I updated the query with a GROUP_CONCAT on the interaction type; I guess this is the way to infer it.

SELECT DISTINCT
    ?pathwayIdentifier
    ?pathwayTitle
    ?interaction
    (GROUP_CONCAT(DISTINCT ?interactionType; separator=", ") AS ?interactionTypes)
    ?sourceNamespace
    ?sourceIdentifier
    ?sourceEntrezIdentifier
    ?sourceEntrezLabel
    ?targetNamespace
    ?targetIdentifier
    ?targetEntrezIdentifier
    ?targetEntrezLabel
WHERE {
   ?pathway a wp:Pathway .
   OPTIONAL { ?pathway dcterms:identifier ?pathwayIdentifier . }
   OPTIONAL { ?pathway dc:title ?pathwayTitle . }
   ?interaction dcterms:isPartOf ?pathway . 
   ?interaction a wp:DirectedInteraction .
   ?interaction a ?interactionType .
   ?interaction wp:source ?source .
   ?interaction wp:target ?target .
   OPTIONAL { 
        ?source wp:bdbEntrezGene ?sourceEntrez . 
        ?sourceEntrez dcterms:identifier ?sourceEntrezIdentifier .    
        ?sourceEntrez rdfs:label ?sourceEntrezLabel .
    }
   OPTIONAL { 
        ?target wp:bdbEntrezGene ?targetEntrez . 
        ?targetEntrez dcterms:identifier ?targetEntrezIdentifier .  
        ?targetEntrez rdfs:label ?targetEntrezLabel .  
   }
   OPTIONAL { ?source dc:source ?sourceNamespace . }
   OPTIONAL { ?target dc:source ?targetNamespace . }
   OPTIONAL { ?source dcterms:identifier ?sourceIdentifier . }
   OPTIONAL { ?target dcterms:identifier ?targetIdentifier . }
}

Example Python code for getting this into a pandas DataFrame (@egonw, I was going to add this to https://www.wikipathways.org/index.php/Help:WikiPathways_Sparql_queries#Code_examples but we never got my WikiPathways account recovered...)

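from io import StringIO

import pandas as pd
import requests

# Public WikiPathways SPARQL endpoint.
URL = 'http://sparql.wikipathways.org/sparql'

# `query=...` is a placeholder: pass the SPARQL query above as a string.
res = requests.get(URL, params=dict(query=..., format='text/csv'))
res.raise_for_status()  # stop on an HTTP error instead of parsing an error page

# The endpoint returns CSV text, which pandas reads from an in-memory buffer.
df = pd.read_csv(StringIO(res.text))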
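
On the polarity question, here is a minimal sketch of one way to read it off the concatenated `interactionTypes` value returned by the query. It assumes each type is a URI ending in `#ClassName`, and that the `wp:Stimulation` and `wp:Inhibition` classes stand for "increases" and "decreases". The `POLARITY` table and the `infer_polarity` helper are hypothetical names, not part of PathMe or WikiPathways.

```python
# Assumed mapping from WikiPathways interaction classes to a polarity label.
# The labels are illustrative; extend the table for other classes as needed.
POLARITY = {
    "Stimulation": "increases",
    "Inhibition": "decreases",
}


def infer_polarity(interaction_types: str) -> str:
    """Infer polarity from one GROUP_CONCAT-ed ?interactionTypes value.

    The input is a ", "-separated string of type URIs, each assumed to end
    in "#ClassName".
    """
    # Keep only the class name after the last "#" of each URI.
    names = {t.strip().rsplit("#", 1)[-1] for t in interaction_types.split(",")}
    found = {POLARITY[name] for name in names if name in POLARITY}
    if len(found) == 1:
        return found.pop()
    # Both polarities present is a conflict; none present means no information.
    return "ambiguous" if found else "unknown"
```

With the result in a DataFrame, this could be applied as `df["interactionTypes"].map(infer_polarity)`. Interactions typed only as `wp:DirectedInteraction` come out as "unknown".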

Chris-Evelo commented 4 years ago

There are a lot of different issues in this thread, some known, some not. Let me start with a simple one:

Example that comes empty from the COVID19 pathways: https://www.wikipathways.org/index.php/Pathway:WP4862

I checked the actual pathway. I suppose it is still under development. While it has many genes, it doesn't have any interactions. So it makes sense that the RDF doesn't have any either.

(There is one exception: two genes are in a complex, and you could argue that the RDF should have that relationship. That might be something to look into in the future.)

Chris-Evelo commented 4 years ago

Pathway https://www.wikipathways.org/index.php/Pathway:WP4862 has no interactions, so it makes sense that it returns none. There is a caveat, since two genes form a complex and you could argue that that is an interaction. But complex participation is not currently interpreted in the semantic part of the RDF, AFAIK.

You are right about pathway https://www.wikipathways.org/index.php/Pathway:WP107 too. The interactions that we currently capture are only those between gene products of any kind and metabolites. That pathway contains only one of these. Many of the other "interactions/arrows" are indeed more cartoon-like representations of processes, described with text labels or not really connected to anything. A problematic aspect here is the groups, which is something we haven't solved yet. If you draw an interaction between two groups, we do not really know what that means. Is that a reaction between "any of" and "any of", or "all"? And is that the same on both sides of the arrow? For now, as far as I know, there is no semantic interpretation of an interaction that involves a group (or a complex, for that matter, though that might be easier to solve).
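
To make the group ambiguity concrete, here is a small sketch of the two readings of an arrow drawn between two groups. The function names and the edge representation are hypothetical, and neither reading is something the RDF export currently does.

```python
from itertools import product


def expand_any_of(sources, targets):
    """'Any of' reading: each source member may act on each target member."""
    return list(product(sources, targets))


def expand_all(sources, targets):
    """'All' reading: each group acts as a single unit, giving one edge."""
    return [(frozenset(sources), frozenset(targets))]


# The same drawn arrow yields very different edge sets under each reading.
pairwise = expand_any_of(["A", "B"], ["C", "D", "E"])  # 2 x 3 member pairs
grouped = expand_all(["A", "B"], ["C", "D", "E"])      # one group-level edge
```

A mixed case, "any of" on one side and "all" on the other, is also conceivable, which is part of why the meaning cannot be fixed from the drawing alone.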

ddomingof commented 4 years ago

Thanks for the explanations. We just wanted to make sure you hadn't changed the RDF schema. We will wait until the relationships are curated, and will certainly help with that.

Chris-Evelo commented 4 years ago

There are other, more complex issues too. Typically we allow curators to use an ID from their own favorite database for gene products, metabolites, and interactions. Since WikiPathways and PathVisio have BridgeDb built in, that is not a problem. For the RDF, however, it is. That is why we "normalize" the semantic part of the RDF and explicitly add UniProt, ENSEMBL and a few more. So we basically use BridgeDb during the creation.

Now, for the COVID genes there are still two issues. 1) Until yesterday we did not have a BridgeDb database for the coronaviruses, so none was used to create this first RDF. 2) These pathways are special since they contain gene products from two different species. We have had that problem in the past, for malaria, but we have never completely solved it. We will have to check whether the RDF generation process will really use two different BridgeDb databases. At this stage, it probably doesn't.