x-atlas-consortia / ubkg-etl

A framework that combines data from the UMLS with assertions from other data sources into a set of CSV files that can be imported into neo4j to build a Unified Biomedical Knowledge Graph (UBKG)
MIT License

GTEX ingest: duplicate node files #106

Closed AlanSimmons closed 3 months ago

AlanSimmons commented 9 months ago

FYI @computationdoc

Jonathan noticed duplicate GTEX nodes in the CODES.csv file. For some reason, the generation script is not dropping duplicates when appending codes. This may be a general problem, although we've only seen it with GTEX.

AlanSimmons commented 9 months ago

Example: GTEXEXP:ENSG00000264614-1-Cervix-Endocervix

AlanSimmons commented 9 months ago

This appears to be the result of the script allowing multiple ingests of the same SAB. The script has been updated so that it deduplicates all CSV files before terminating.
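
For anyone following along, here is a minimal sketch of what a final deduplication pass over the output CSVs could look like. It is not the actual change made to the generation script: the function names `dedupe_csv` and `dedupe_all` are hypothetical, and it assumes each file has a single header row and that duplicates are exact, whole-row matches.

```python
import csv
from pathlib import Path


def dedupe_csv(path):
    """Rewrite one CSV in place, keeping the first occurrence of each row.

    Hypothetical helper: assumes a single header row and exact-match duplicates.
    Returns the number of duplicate rows dropped.
    """
    path = Path(path)
    with path.open(newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    if not rows:
        return 0  # empty file, nothing to do

    header, body = rows[0], rows[1:]
    seen = set()
    unique = []
    for row in body:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(row)  # original row order is preserved

    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(unique)
    return len(body) - len(unique)


def dedupe_all(directory):
    """Deduplicate every *.csv in a directory; returns {filename: rows dropped}."""
    return {p.name: dedupe_csv(p) for p in sorted(Path(directory).glob("*.csv"))}
```

Running the pass once at the end, rather than at each append, means a second ingest of the same SAB can still append its rows without leaving duplicates in the final files. For very large files, holding every row in memory may not be practical, so treat this only as an illustration of the idea.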

AlanSimmons commented 9 months ago

Will close this after generating the next Data Distillery context.