Closed AlanSimmons closed 7 months ago
Script: umls-init.py
SUIs removed from:
During the work to replace SUI with name, I discovered that HGNC codes in CODE-SUIs.csv had terms with a term type of NA, which corresponds to "name alias". By default, Pandas interprets the string "NA" as a missing value (NaN) when it reads a CSV, so the value in the :TYPE column of CODE-SUIs.csv changed to an empty string when the file was written back out. The empty strings caused import failures later.
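A minimal sketch of the behavior, not the code in umls-init.py. The rows and the :END_ID and :START_ID columns are made up for illustration; only the :TYPE column and the NA term type come from the issue. It shows the default parse turning "NA" into a missing value, the empty field that results on write, and the standard `keep_default_na=False` option of `read_csv` that keeps the literal string.

```python
import io

import pandas as pd

# Illustrative rows only: the term strings and the non-:TYPE columns are assumptions.
CSV_TEXT = (
    ":END_ID,:START_ID,:TYPE\n"
    "example alias,HGNC:3686,NA\n"
    "example name,HGNC:3686,PT\n"
)

# Default parsing: "NA" is in Pandas' built-in list of missing-value markers.
default_df = pd.read_csv(io.StringIO(CSV_TEXT))
print(default_df[":TYPE"].isna().tolist())

# Writing the frame back out turns the missing value into an empty field.
round_trip_row = default_df.to_csv(index=False).splitlines()[1]
print(repr(round_trip_row))

# keep_default_na=False disables the built-in markers, so "NA" stays a string.
literal_df = pd.read_csv(io.StringIO(CSV_TEXT), keep_default_na=False)
print(literal_df[":TYPE"].tolist())
```

Reading with `keep_default_na=False` is one way to avoid the conversion at the source; it also means genuinely empty fields arrive as empty strings rather than NaN, so any downstream dropna logic would need to account for that.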
I am not sure why this issue had not been encountered previously. It is likely that a dropna statement removed rows with empty values of :TYPE at some point, and that the new conversion logic to remove SUIs exposed the problem.
In any case, the script now addresses the issue by appending "UBKG" to "NA". This indicates both that the :TYPE value is not a missing value (NaN) and that the term type is synthetic, not one from the original UMLS.
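The relabeling might look roughly like the sketch below. This is an assumption about the approach, not the actual code in umls-init.py: the helper name `relabel_na_term_type` is hypothetical, the rows are made up, and the exact form of the resulting label (for example, whether a separator is used) is not stated in this issue, so plain concatenation is used here.

```python
import io

import pandas as pd

SYNTHETIC_SUFFIX = "UBKG"  # assumption: appended directly, with no separator


def relabel_na_term_type(df: pd.DataFrame, column: str = ":TYPE") -> pd.DataFrame:
    """Return a copy in which the literal term type "NA" carries the suffix."""
    out = df.copy()
    out[column] = out[column].replace("NA", "NA" + SYNTHETIC_SUFFIX)
    return out


# Illustrative rows only; read with keep_default_na=False so "NA" survives parsing.
csv_text = ":START_ID,:TYPE\nHGNC:3686,NA\nHGNC:3686,PT\n"
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
fixed = relabel_na_term_type(raw)

# The relabeled value is not a missing-value marker, so a default re-read keeps it.
reread = pd.read_csv(io.StringIO(fixed.to_csv(index=False)))
print(reread[":TYPE"].tolist())
```

Because the relabeled value is no longer one of the strings Pandas treats as missing, later reads of the file with default settings leave it intact.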
Code for HGNC:3686, showing all terms, after imports of UBERON and MONDO.
Note that the selected term no longer has a SUI property.
Before a neo4j instance with this architecture can be deployed to production, I will need to regression test the endpoints in the ubkg-api and hs-ontology-api.
Statement of issue
The UMLS uses SUIs as IDs for term strings. To replicate the SUI for terms from non-UMLS SABs, the generation script simply uses the base64-encoded version of the string as the SUI.
In the UBKG, terms must be unique. The uniqueness requirement means that a separate SUI is not necessary in general, and a SUI that is just a base64 encoding of the term is not necessary in particular. In addition, large term strings result in large SUIs, which add to the size of the neo4j database.
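To make the size argument concrete, here is a small sketch of the encoding described above. The function name `synthetic_sui` and the choice of UTF-8 are assumptions for illustration, not taken from the generation script. Base64 emits 4 characters for every 3 input bytes, so the SUI is always about a third longer than the term it duplicates.

```python
import base64


def synthetic_sui(term: str) -> str:
    """Base64-encode a term string, as a stand-in for a UMLS-style SUI."""
    return base64.b64encode(term.encode("utf-8")).decode("ascii")


# The encoding is reversible, so the SUI carries no information beyond the term.
term = "example term string"
sui = synthetic_sui(term)
assert base64.b64decode(sui).decode("utf-8") == term

# The SUI grows with the term: 4 output characters per 3 input bytes.
for length in (30, 300, 3000):
    print(length, len(synthetic_sui("a" * length)))
```

Since the SUI is a deterministic, reversible function of the term, and terms are already unique, storing it as a separate property only adds bytes.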
To do