x-atlas-consortia / ubkg-etl

A framework that combines data from the UMLS with assertions from other data sources into a set of CSV files that can be imported into neo4j to build a Unified Biomedical Knowledge Graph (UBKG)
MIT License

Ingest RefSeq summary #73

Closed. AlanSimmons closed this issue 7 months ago

AlanSimmons commented 1 year ago

Request

Add to the UBKG the summary for each gene in HGNC shown in the mockup below.

Summary information is maintained by RefSeq.

RefSeq information is available as an FTP download of the source, as described here. The NCBI eUtils REST API also provides summary information. Sample link: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=604,1&retmode=json
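
For illustration, a minimal Python sketch of calling the esummary endpoint in the sample link might look like the following. The function names (`build_esummary_url`, `extract_summaries`, `fetch_summaries`) are hypothetical and not part of ubkg-etl. The sketch assumes the JSON response carries a `result` object with a `uids` list and one record per UID that includes a `summary` field; that layout should be checked against a live response before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the sample link in this issue.
EUTILS_ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def build_esummary_url(gene_ids):
    """Build an esummary URL for a batch of NCBI Gene IDs (hypothetical helper)."""
    params = {
        "db": "gene",
        "id": ",".join(str(gene_id) for gene_id in gene_ids),
        "retmode": "json",
    }
    # safe="," keeps the comma-separated ID list readable, as in the sample link.
    return f"{EUTILS_ESUMMARY}?{urlencode(params, safe=',')}"


def extract_summaries(payload):
    """Map each UID to its summary text, skipping records with no summary.

    Assumes payload["result"]["uids"] lists the UIDs and payload["result"][uid]
    holds the record for each one.
    """
    result = payload.get("result", {})
    summaries = {}
    for uid in result.get("uids", []):
        summary = result.get(uid, {}).get("summary", "")
        if summary:
            summaries[uid] = summary
    return summaries


def fetch_summaries(gene_ids, timeout=30):
    """Fetch and parse summaries for one batch of gene IDs (performs a network call)."""
    with urlopen(build_esummary_url(gene_ids), timeout=timeout) as response:
        return extract_summaries(json.load(response))
```

Calling `fetch_summaries([604, 1])` would request the same URL as the sample link. Splitting URL construction and parsing from the network call keeps the first two functions testable offline.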

[Image: mockup of the gene summary]

AlanSimmons commented 1 year ago

Definitions and separate process

This ingestion will run as a separate process. It will not be part of the generation framework driven by build_csv.py.

The RefSeq data will be added as a set of Definition nodes joined to HGNC IDs. This is a straightforward addition to the DEF.csv and DEFrel.csv files. The bulk of the work will be connecting to the NCBI eUtils API to obtain information on a large number of genes.
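
As a rough sketch of the two pieces described above, the Python below batches a large list of IDs and appends one definition row and one relationship row per gene. Everything here is an assumption made for illustration: the helper names, the `REFSEQ` source abbreviation, the definition ID format, and the column order (definition ID, source, text for DEF.csv; definition ID, HGNC ID for DEFrel.csv). The actual file layouts used by the UBKG should be taken from the existing CSVs.

```python
import csv


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_definition_rows(summaries_by_hgnc, def_file, defrel_file, sab="REFSEQ"):
    """Append one definition row and one relationship row per HGNC ID.

    Assumed (hypothetical) layouts:
      DEF.csv    -> definition ID, source abbreviation, definition text
      DEFrel.csv -> definition ID, HGNC ID the definition is joined to
    """
    def_writer = csv.writer(def_file)
    defrel_writer = csv.writer(defrel_file)
    for hgnc_id, summary in sorted(summaries_by_hgnc.items()):
        def_id = f"{sab} {hgnc_id}"  # hypothetical ID scheme
        def_writer.writerow([def_id, sab, summary])
        defrel_writer.writerow([def_id, hgnc_id])
```

In use, the gene list would be split with `batched(ids, size)` so that each eUtils request stays small, and the two files would be opened in append mode with `newline=""` so the `csv` module controls line endings. NCBI limits request rates for clients without an API key, so a pause between batches is likely needed.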

AlanSimmons commented 1 year ago

Results

It is now possible to ingest RefSeq summary information for genes.

Image

AlanSimmons commented 1 year ago

Script code complete. Dev UBKG instance ready for upload to Globus. There is currently a problem with Globus.

AlanSimmons commented 1 year ago

Globus problem resolved.

Dependency on https://github.com/x-atlas-consortia/hs-ontology-api/issues/14