x-atlas-consortia / ubkg-etl

A framework that combines data from the UMLS with assertions from other data sources into a set of CSV files that can be imported into neo4j to build a Unified Biomedical Knowledge Graph (UBKG)
MIT License

Replace SUI with term #94

Closed AlanSimmons closed 7 months ago

AlanSimmons commented 1 year ago

Statement of issue

The UMLS uses SUIs as IDs for term strings. To replicate the SUI for terms from non-UMLS SABs, the generation script simply uses the base64-encoded version of the string as the SUI.

In the UBKG, terms must be unique. The uniqueness requirement means that a separate SUI is not necessary in general, and a SUI that is just a base64 encoding of the term is not necessary in particular. In addition, large term strings result in large SUIs, which add to the size of the neo4j database.
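To illustrate the size concern, here is a minimal sketch of a base64-derived SUI. The helper name `synthetic_sui` and the UTF-8 encoding are assumptions for illustration; the actual generation script may differ in its details.

```python
import base64


def synthetic_sui(term: str) -> str:
    """Hypothetical helper: base64-encode a term string to use as its SUI.

    UTF-8 is an assumption; the real generation script may encode differently.
    """
    return base64.b64encode(term.encode("utf-8")).decode("ascii")


short_term = "heart"
long_term = "an illustrative, deliberately long term string " * 4

for term in (short_term, long_term):
    sui = synthetic_sui(term)
    # base64 emits 4 characters for every 3 input bytes, so the SUI is
    # always longer than the term it encodes.
    print(len(term), len(sui))
```

Because the encoding is reversible and one-to-one, the SUI carries no information beyond the term itself, which is why the term can serve as the identifier directly.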

To do

  1. Pre-processing UMLS CSVs:
     a. In SUIs.csv, remove the SUI:ID column.
     b. In CODE-SUIs.csv, replace the SUI in :END_ID with the actual term.
     c. In CUI-SUIs.csv, replace the SUI in :END_ID with the actual term.
  2. Ingestion:
     a. Write only new terms to SUIs.csv.
     b. Write the term to the :END_ID column in CODE-SUIs.csv.
     c. Write the term to the :END_ID column in CUI-SUIs.csv.
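
The steps above could be sketched roughly as follows. This is not the code in the repository: the `name` and `:START_ID` column names, the function names, and all row values are assumptions for illustration. Only `SUI:ID`, `:END_ID`, and `:TYPE` come from the issue text.

```python
import csv
import io

# Tiny in-memory stand-ins for the UMLS CSVs (all values are made up).
SUIS_CSV = "SUI:ID,name\nS0000001,heart\nS0000002,cardiac structure\n"
CODE_SUIS_CSV = (
    ":START_ID,:END_ID,:TYPE\n"
    "DEMO:1,S0000001,PT\n"
    "DEMO:1,S0000002,SY\n"
)


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def replace_sui_with_term(sui_rows, rel_rows):
    """Pre-processing: drop SUI:ID and point :END_ID at the term itself."""
    sui_to_term = {row["SUI:ID"]: row["name"] for row in sui_rows}
    terms_only = [{"name": row["name"]} for row in sui_rows]
    relinked = [
        {**row, ":END_ID": sui_to_term[row[":END_ID"]]} for row in rel_rows
    ]
    return terms_only, relinked


def new_terms_only(existing_terms, incoming_terms):
    """Ingestion: keep only terms not already present, preserving order."""
    seen = set(existing_terms)
    fresh = []
    for term in incoming_terms:
        if term not in seen:
            seen.add(term)
            fresh.append(term)
    return fresh


terms, code_terms = replace_sui_with_term(
    read_rows(SUIS_CSV), read_rows(CODE_SUIS_CSV)
)
for row in code_terms:
    print(row)
print(new_terms_only([t["name"] for t in terms], ["heart", "aorta", "aorta"]))
```

The same relinking would apply to CUI-SUIs.csv, since both files reference terms through :END_ID.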
AlanSimmons commented 1 year ago

UMLS CSV pre-processing

Script: umls-init.py

SUIs removed from:

  1. CODE-SUIs.csv
  2. CUI-SUIs.csv
  3. SUIs.csv
AlanSimmons commented 1 year ago

Bug: Term Type of NA in CODE-SUIs.csv

During the work to replace SUI with name, I discovered that HGNC codes in CODE-SUIs.csv had terms with a term type of NA, which corresponds to "name alias". By default, Pandas interprets the string "NA" as a missing value (np.nan), which resulted in the value of the :TYPE column in CODE-SUIs.csv changing to an empty string. The empty strings caused import failures later.

I am not sure why this issue had not been encountered previously. It is likely that a dropna statement removed rows with empty values of :TYPE at some point, and that the new conversion logic to remove SUIs exposed the problem.

In any case, the script now addresses the issue by appending "UBKG" to "NA". This indicates both that the :TYPE value is not a missing value (np.nan) and that it is a synthetic term type that was not in the original UMLS.
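
The behavior can be reproduced with a small sketch. The sample rows and the `:START_ID` column are made up, and the plain concatenation `"NA" + "UBKG"` is an assumption about how the suffix is applied; the exact logic in umls-init.py is not shown in this thread.

```python
import io

import pandas as pd

# Made-up rows; only the :END_ID and :TYPE column names come from the issue.
CSV_TEXT = (
    ":START_ID,:END_ID,:TYPE\n"
    "DEMO:1,example alias,NA\n"
    "DEMO:1,example name,PT\n"
)

# Default parsing: "NA" is one of pandas' default missing-value markers,
# so the first row's :TYPE is read as NaN rather than as a string.
df_default = pd.read_csv(io.StringIO(CSV_TEXT))
print(df_default[":TYPE"].isna().tolist())

# Written back out, NaN becomes an empty field (na_rep defaults to "").
print(df_default.to_csv(index=False))

# A dropna on :TYPE would silently discard the row, which matches the
# hypothesis above about why the problem went unnoticed.
print(len(df_default.dropna(subset=[":TYPE"])))

# Reading with keep_default_na=False keeps "NA" as a literal string.
df_literal = pd.read_csv(io.StringIO(CSV_TEXT), keep_default_na=False)

# Assumed form of the workaround: append "UBKG" so later reads cannot
# mistake the term type for a missing value.
df_literal.loc[df_literal[":TYPE"] == "NA", ":TYPE"] = "NA" + "UBKG"
print(df_literal[":TYPE"].tolist())
```

Renaming the value, rather than relying on `keep_default_na=False` alone, protects against any later read of the file that uses pandas defaults.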

AlanSimmons commented 1 year ago

Example

Code for HGNC:3686, showing all terms, after imports of UBERON and MONDO.

Note that the selected term no longer has a SUI property.

[Image: the code node for HGNC:3686 with its terms]

AlanSimmons commented 1 year ago

Regression test of UBKG API

Before a neo4j instance with this architecture can be deployed to production, I will need to regression test the endpoints in the ubkg-api and hs-ontology-api.