x-atlas-consortia / ubkg-etl

A framework that combines data from the UMLS with assertions from other data sources into a set of CSV files that can be imported into neo4j to build a Unified Biomedical Knowledge Graph (UBKG)
MIT License

Use script to generate edge and nodes files for HRA via hra-ubkg-exporter instead of downloading from repo #120

AlanSimmons closed 4 months ago

AlanSimmons commented 6 months ago

FYI @bherr2

Statement of issue

A subset of the HRA ontology is exported into UBKG edges/nodes format via code in the hra-ubkg-exporter repo. The current version of the edges and nodes files can be found in the repo in the HRA folder.

Although the UBKG ETL for HRA currently imports edges and nodes files that are downloaded from the repo, there is a risk that these files may not be current. It may be better to execute the hra-ubkg-exporter scripts directly to generate the latest UBKG files.

Proposed solution

Install Node on the development machine. Modify the ETL so that it calls the hra-ubkg-exporter script to generate edges and nodes files.

Potential challenge

The ETL is a Python script, so it would have to execute the Node script somehow. If this is not feasible, I can run the script manually and store the output files locally.

bherr2 commented 6 months ago

Not sure about your environment, but it's pretty easy to get Node in a Python environment by using the Python package nodeenv. Once Node 18+ is installed, you can run the exporter as a command-line program like this:

npx github:x-atlas-consortia/hra-ubkg-exporter --version v2.3.0 <output dir>

The program takes about 5 minutes (at some point I'll optimize it further) to run.
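Since the ETL is a Python script, one way to call the exporter from it is through `subprocess`. The following is only a rough sketch, not the ETL's actual code: the function names `build_exporter_command` and `run_exporter` are hypothetical, and it assumes `npx` (Node 18+) is already on the PATH, for example after setting up nodeenv.

```python
import shutil
import subprocess
from pathlib import Path

# Package spec taken from the npx command quoted above.
EXPORTER = "github:x-atlas-consortia/hra-ubkg-exporter"


def build_exporter_command(version, output_dir, npx="npx"):
    """Return the argument list for the exporter (hypothetical helper)."""
    return [npx, EXPORTER, "--version", version, str(output_dir)]


def run_exporter(version, output_dir):
    """Run the exporter and raise if Node is missing or the run fails.

    Hypothetical wrapper: assumes npx is on the PATH.
    """
    npx_path = shutil.which("npx")
    if npx_path is None:
        raise RuntimeError("npx not found on PATH; install Node 18+ first")
    # Create the output directory up front in case the exporter does not.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError on a non-zero exit code,
    # so the ETL does not continue with missing or stale files.
    subprocess.run(build_exporter_command(version, output_dir, npx_path), check=True)


# Example call (the run takes several minutes and needs network access):
# run_exporter("v2.3.0", "hra_output")
```

Passing the arguments as a list avoids shell quoting problems, and resolving `npx` with `shutil.which` also covers Windows, where the executable is `npx.cmd`.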

AlanSimmons commented 4 months ago

After further review, we think that it makes more sense to download extracted edge/node files directly from GitHub, as we can be sure that these files are supported.

bherr2 commented 4 months ago

Yep, makes sense! It only changes once every six months, so we'll compile the files then and notify you. What is the best way to notify you? A PR, a GitHub issue, or do you just check when you rebuild anyway (if you do that frequently)?

AlanSimmons commented 4 months ago

I think that the best way is to notify me directly after you update the repo. That way, I only ingest the version that you're happy with. I will work with a downloaded copy of the files between updates.

bherr2 commented 4 months ago

Will do!