x-atlas-consortia / ubkg-etl

A framework that combines data from the UMLS with assertions from other data sources into a set of CSV files that can be imported into neo4j to build a Unified Biomedical Knowledge Graph (UBKG)
MIT License

UNIPROTKB ETL incorrectly parses synonyms that contain parentheses #118

Closed AlanSimmons closed 3 months ago

AlanSimmons commented 7 months ago

Statement of Problem

The UniProtKB REST API returns all of a protein's names as a single string in which parentheses serve two purposes: they delimit the individual names, and they also appear as characters within some names.

For example, the name for the protein with UniProtKB ID Q7L0Y3 is

tRNA methyltransferase 10 homolog C (HBV pre-S2 trans-regulated protein 2) (Mitochondrial ribonuclease P protein 1) (Mitochondrial RNase P protein 1) (RNA (guanine-9-)-methyltransferase domain-containing protein 1) (Renal carcinoma antigen NY-REN-49) (mRNA methyladenosine-N(1)-methyltransferase) (EC 2.1.1.-) (tRNA (adenine(9)-N(1))-methyltransferase) (EC 2.1.1.218) (tRNA (guanine(9)-N(1))-methyltransferase) (EC 2.1.1.221)

The corresponding UniProtKB entry:

[image: UniProtKB entry for Q7L0Y3]

The UNIPROTKB ETL treats every parenthesis character as a delimiter, splitting the string with a simple regex:

    # Split on the pair of parentheses.
    protein_names = row['Protein names']
    protein_names = re.split(r'[()]', protein_names)

If a synonym string includes parentheses, the ETL treats the nested parentheses as delimiters. For the example above, this means that the synonym mRNA methyladenosine-N(1)-methyltransferase is split into the fragments "mRNA methyladenosine-N", "1", and "-methyltransferase".
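The split can be reproduced in isolation on a shortened version of the Q7L0Y3 name. This is only a sketch: `naive_split` is a hypothetical helper that wraps the same regex, and the stripping of whitespace and empty strings is an assumption made here for readability, not necessarily what the ETL does afterwards.

```python
import re
from typing import List


def naive_split(protein_names: str) -> List[str]:
    """Split on every parenthesis character, as the regex in the ETL does.

    Hypothetical helper for illustration; dropping empty fragments and
    surrounding whitespace is an assumption, not taken from the ETL.
    """
    fragments = re.split(r'[()]', protein_names)
    return [f.strip() for f in fragments if f.strip()]


# Shortened version of the Q7L0Y3 name string from this issue.
example = (
    "tRNA methyltransferase 10 homolog C "
    "(mRNA methyladenosine-N(1)-methyltransferase) (EC 2.1.1.-)"
)

print(naive_split(example))
# ['tRNA methyltransferase 10 homolog C', 'mRNA methyladenosine-N', '1',
#  '-methyltransferase', 'EC 2.1.1.-']
```

The synonym comes back as three fragments instead of one name.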

Correction

The ETL needs to parse nested parentheses in synonym names, treating only the outermost (top-level) pairs of parentheses as delimiters.

This likely involves using something like the parenthetic_contents function described in this post.
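One possible shape for this, as a minimal sketch rather than the fix that was actually merged: track the nesting depth while scanning the string, and only end a name when the depth returns to zero. The function name `split_top_level_names` is hypothetical, and the handling of unbalanced parentheses (keeping a stray `)` as a literal character, and emitting whatever was collected if a `(` is never closed) is an assumption.

```python
from typing import List


def split_top_level_names(protein_names: str) -> List[str]:
    """Split a UniProtKB name string on top-level parentheses only.

    Hypothetical helper: parentheses nested inside a synonym are kept
    as part of that synonym.
    """
    names: List[str] = []
    buffer: List[str] = []
    depth = 0  # current parenthesis nesting level

    def flush() -> None:
        # Emit the collected characters as one name, skipping blanks.
        name = "".join(buffer).strip()
        if name:
            names.append(name)
        buffer.clear()

    for ch in protein_names:
        if ch == "(":
            if depth == 0:
                flush()            # a top-level "(" ends the preceding name
            else:
                buffer.append(ch)  # a nested "(" belongs to the name
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                flush()            # a top-level ")" ends the synonym
            else:
                buffer.append(ch)  # a nested ")" belongs to the name
        else:
            buffer.append(ch)      # ordinary character, or a stray ")"
    flush()                        # emit anything left after the last ")"
    return names


names = split_top_level_names(
    "tRNA methyltransferase 10 homolog C "
    "(RNA (guanine-9-)-methyltransferase domain-containing protein 1) "
    "(mRNA methyladenosine-N(1)-methyltransferase) (EC 2.1.1.-) "
    "(tRNA (adenine(9)-N(1))-methyltransferase)"
)
for name in names:
    print(name)
```

Note that this only follows the rule stated above. If the first (recommended) name itself contained parentheses, they would sit at the top level and still be treated as delimiters, so that case would need separate handling.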