x-atlas-consortia / ubkg-etl

A framework that combines data from the UMLS with assertions from other data sources into a set of CSV files that can be imported into neo4j to build a Unified Biomedical Knowledge Graph (UBKG)
MIT License

Enhancements to UniProtKB ingestion #63

Closed AlanSimmons closed 7 months ago

AlanSimmons commented 1 year ago

Request:

Add the "Function" field to the data downloaded from UniProtKB.

AlanSimmons commented 1 year ago

Filtering to SwissProt curation only

Currently, UBKG only contains HGNC->UNIPROTKB mappings for proteins for which curation has been reviewed (i.e., SwissProt). TrEMBL curations are not ingested.
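
As a rough illustration of this filter, the sketch below keeps only reviewed (SwissProt) rows from a tab-delimited UniProtKB download. It assumes the download includes a `Reviewed` column whose values are `reviewed` or `unreviewed`; the function name `swissprot_rows` and the sample accessions are hypothetical and are not taken from the ubkg-etl script.

```python
import csv
import io


def swissprot_rows(tsv_text: str) -> list[dict]:
    """Return only the rows whose curation has been reviewed (SwissProt).

    Assumes a tab-delimited download with a 'Reviewed' column; the column
    name and its values are assumptions about the download format.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row for row in reader
        if row["Reviewed"].strip().lower() == "reviewed"
    ]


# Made-up sample rows, for illustration only.
sample = (
    "Entry\tReviewed\tEntry Name\n"
    "P00001\treviewed\tEXAMPLE1_HUMAN\n"
    "A0A000\tunreviewed\tEXAMPLE2_HUMAN\n"
)

kept = swissprot_rows(sample)
print([row["Entry"] for row in kept])
```

With this filter, the TrEMBL (unreviewed) row is dropped before any HGNC->UNIPROTKB mapping is built.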

AlanSimmons commented 1 year ago

@computationdoc

Challenge: where to put function

Statement of problem

The current UNIPROTKB ETL maps as follows:

| UniProt field | UBKG entity:property |
| --- | --- |
| Entry | Code node with CodeID=UNIPROTKB:Entry |
| Entry Name | Term node with name=Entry Name and relationship PT |
| Protein Names | Definition node with DEF=Protein Names, linked to the Code's associated Concept node |

The function is actually a type of definition; however, the Definition node is already being used for the Protein Names field.

Options:

1. Make the function a synonym

The UNIPROTKB ETL currently does not assign synonyms, i.e., Term nodes linked to the Code by the relationship SYN and built from the values in the node_synonyms field of the node_metadata file. The script could export the function field to the node_synonyms field.

2. Build a term of different semantic type--e.g., FUNCTION

3. Assert new relationships between UNIPROTKB nodes and "function" nodes

4. Make the function another Definition

Recommendation

Option 1. Currently, only UNIPROTKB codes have both definitions and functions in UBKG.

AlanSimmons commented 1 year ago

Result of assigning function to synonym

Example: the function appears as the Term of type SY.

Image

AlanSimmons commented 1 year ago

New decision: assign function to definition; short name to synonym

The image is an annotated screen capture of a UniProtKB detail page, describing the changes to make in UBKG.

Image

AlanSimmons commented 1 year ago

Synonym arrangement

UniProtKB stores names and synonyms for proteins in the Protein Names field of the downloaded file. The field is delimited with parentheses--e.g., Approved Name (1st synonym)(2nd synonym) etc.

The script:

In addition, the script uses the value of the Entry Name field from the download as a synonym.
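
The parenthesis-delimited format described above could be split roughly as follows. This is a minimal sketch, not the actual ubkg-etl code: the function names `split_protein_names` and `collect_synonyms` are hypothetical, and the sketch assumes that everything outside top-level parentheses is the approved name and that each top-level parenthesized group is one synonym. Nested parentheses are kept inside their enclosing group.

```python
def split_protein_names(field: str) -> tuple[str, list[str]]:
    """Split 'Approved Name (syn 1)(syn 2)' into (approved name, synonyms).

    Hypothetical helper: text outside parentheses is treated as the approved
    name; each top-level parenthesized group is treated as one synonym.
    """
    outside = []   # characters that are not inside any parentheses
    current = []   # characters of the group currently being read
    groups = []    # completed top-level groups
    depth = 0      # current parenthesis nesting depth
    for ch in field:
        if ch == "(":
            if depth > 0:
                current.append(ch)  # nested paren stays part of the group
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                groups.append("".join(current).strip())
                current = []
            else:
                current.append(ch)
        elif depth > 0:
            current.append(ch)
        else:
            outside.append(ch)
    approved = " ".join("".join(outside).split())  # normalize whitespace
    return approved, [g for g in groups if g]


def collect_synonyms(entry_name: str, protein_names: str) -> list[str]:
    """Combine the parsed synonyms with the Entry Name, without duplicates."""
    _, synonyms = split_protein_names(protein_names)
    result = []
    for name in synonyms + [entry_name]:
        if name and name not in result:
            result.append(name)
    return result


approved, synonyms = split_protein_names(
    "Approved Name (1st synonym)(2nd synonym)"
)
print(approved)
print(synonyms)
```

Real Protein Names values may contain other bracketed segments that this sketch does not interpret, so the actual script may need additional handling.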

AlanSimmons commented 1 year ago

Result

Example showing the approved name, all synonyms, and the definition.

Image

AlanSimmons commented 1 year ago

Dependency on https://github.com/x-atlas-consortia/hs-ontology-api/issues/14