x-atlas-consortia / ubkg-neo4j

A container implementation to serve the Unified Biomedical Knowledge Graph in Neo4j
MIT License

Add relationship indexes on SAB property to allow identification of all data from a set of assertions #14

Closed AlanSimmons closed 1 year ago

AlanSimmons commented 1 year ago

Statement of problem

The Data Distillery team needs the ability to identify data from a particular set of assertions--e.g., everything asserted from IDG.

Proposed solution

Every predicate in a SAB's edge file is translated to a relationship with a SAB property. We should be able to create an index on that relationship property after import, in set_constraints.cypher.

We also need indexes on elements specific to Data Distillery:

e.g.,

CREATE INDEX FOR ()-[r]-() ON (r.SAB);
CREATE INDEX FOR ()-[r]-() ON (r.evidence_class);
CREATE INDEX FOR (n:Code) ON (n.value);
CREATE INDEX FOR (n:Code) ON (n.lowerbound);
CREATE INDEX FOR (n:Code) ON (n.upperbound);
CREATE INDEX FOR (n:Code) ON (n.unit);

AlanSimmons commented 1 year ago

Correct Cypher statements:

CREATE INDEX FOR (n:Code) ON (n.value);
CREATE INDEX FOR (n:Code) ON (n.lowerbound);
CREATE INDEX FOR (n:Code) ON (n.upperbound);
CREATE INDEX FOR (n:Code) ON (n.unit);
CREATE INDEX FOR (n:Concept)-[r]-(m:Concept) ON (r.SAB);
CREATE INDEX FOR (n:Concept)-[r]-(m:Concept) ON (r.evidence_class);
CREATE INDEX FOR (c:Code)-[r]-(t:Term) ON (c.SAB);

AlanSimmons commented 1 year ago

Using a local Docker image (built with the build_local.sh script and then started with run.sh and the -t local flag), I was able to start a Docker container running a neo4j instance with the desired indexes.

The performance of queries that use the relationship indexes varies with the SAB. I think that this is because SABs differ in how many distinct relationship types they contain: SABs with only a few relationship types result in faster queries.

Test query:

PROFILE
MATCH (c1:Code)<-[:CODE]-(p1:Concept)-[r]->(p2:Concept)-[:CODE]->(c2:Code)
WHERE r.SAB = 'X'
RETURN c1.CodeID AS subject, type(r) AS predicate, c2.CodeID AS object
LIMIT 1000

Some results:

| SAB | Execution time (ms) |
| --- | --- |
| LINCS | 79 |
| MS | 324 |
| KFPT | 45373 |
| ERCCRBP | I got tired of waiting |

Queries might be faster if we knew the range of SABs for the codes in a set of assertions and could include IN clauses.
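A minimal sketch of what such a query could look like, reusing the test query above. The SAB values in the IN lists ('A', 'B') are placeholders, not real SABs from this thread, and the sketch assumes that Code nodes carry a SAB property. Whether the planner actually benefits from the extra predicates is untested here; the PROFILE output would show it.

```cypher
// Sketch only: 'X', 'A', and 'B' are placeholder SAB values.
// Assumes Code nodes have a SAB property that can be filtered on.
PROFILE
MATCH (c1:Code)<-[:CODE]-(p1:Concept)-[r]->(p2:Concept)-[:CODE]->(c2:Code)
WHERE r.SAB = 'X'
  AND c1.SAB IN ['A', 'B']   // restrict subject codes to the known SABs
  AND c2.SAB IN ['A', 'B']   // restrict object codes to the known SABs
RETURN c1.CodeID AS subject, type(r) AS predicate, c2.CodeID AS object
LIMIT 1000
```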

AlanSimmons commented 1 year ago

Relationship property indexes are only available in neo4j v4.3+.
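For reference, a hedged sketch of the relationship property index form in the Neo4j 4.3+ documentation, which names a relationship type in the pattern. The relationship type `isa` and the index name `rel_isa_sab` are hypothetical, chosen only for illustration. Because each such index covers one relationship type, a SAB with many relationship types would need one statement per type.

```cypher
// Sketch only: 'isa' and 'rel_isa_sab' are hypothetical names.
// One index per relationship type, on the SAB property of that type.
CREATE INDEX rel_isa_sab IF NOT EXISTS
FOR ()-[r:isa]-()
ON (r.SAB);

// List the indexes to confirm what was created.
SHOW INDEXES;
```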

AlanSimmons commented 1 year ago

Will be addressed in #29. Closing.