Statement of Problem

The UniProtKB REST API returns all of a protein's names as a single string in which parentheses serve two purposes: they delimit the individual names, and they also appear as characters inside some of those names.

For example, the name for the protein with UniProtKB ID Q7L0Y3 is

tRNA methyltransferase 10 homolog C (HBV pre-S2 trans-regulated protein 2) (Mitochondrial ribonuclease P protein 1) (Mitochondrial RNase P protein 1) (RNA (guanine-9-)-methyltransferase domain-containing protein 1) (Renal carcinoma antigen NY-REN-49) (mRNA methyladenosine-N(1)-methyltransferase) (EC 2.1.1.-) (tRNA (adenine(9)-N(1))-methyltransferase) (EC 2.1.1.218) (tRNA (guanine(9)-N(1))-methyltransferase) (EC 2.1.1.221)

The corresponding UniProtKB entry:

The UNIPROTKB ETL treats every parenthesis as a delimiter, splitting the string with a simple regex that matches any opening or closing parenthesis.

    # Split on the pair of parentheses.
    protein_names = row['Protein names']
    protein_names = re.split(r'[()]', protein_names)

If a synonym itself contains parentheses, the ETL also treats those nested parentheses as delimiters. For the example above, this means that the single synonym mRNA methyladenosine-N(1)-methyltransferase is split into three separate terms:

mRNA methyladenosine-N
1
methyltransferase
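
The split can be reproduced outside the ETL. The sketch below applies the same pattern to a shortened version of the Q7L0Y3 string. The whitespace cleanup step is an assumption added for readability and is not taken from the ETL code. The raw split leaves the hyphen attached to the last fragment of the synonym.

```python
import re

# Shortened version of the Q7L0Y3 "Protein names" string from the example above.
protein_names = (
    "tRNA methyltransferase 10 homolog C "
    "(mRNA methyladenosine-N(1)-methyltransferase) (EC 2.1.1.-)"
)

# Same pattern as the ETL excerpt: every '(' or ')' is a split point,
# regardless of nesting depth.
fragments = re.split(r'[()]', protein_names)

# Assumption (not from the ETL): drop empty and whitespace-only pieces.
fragments = [f.strip() for f in fragments if f.strip()]

for fragment in fragments:
    print(repr(fragment))
```

The one synonym comes back as three fragments, so the nesting information is lost at the split itself and cannot be recovered afterwards.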

Correction

The ETL needs to account for nested parentheses in synonym names, treating only the outermost (top-level) pairs of parentheses as delimiters and keeping any inner parentheses as part of the name.

This likely involves using something like the parenthetic_contents function described in this post.
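
One possible shape for the fix is a single pass over the string that tracks nesting depth. This is a minimal sketch, not the ETL's implementation: the function name split_top_level_names and its helper _flush are hypothetical, and the sketch assumes that the first name sits outside any parentheses and that every later name is wrapped in exactly one top-level pair.

```python
def _flush(names, current):
    """Append the collected characters as one name, skipping blank pieces."""
    text = ''.join(current).strip()
    if text:
        names.append(text)
    current.clear()


def split_top_level_names(protein_names):
    """Split a 'Protein names' string on top-level parentheses only.

    Hypothetical helper: parentheses at nesting depth 1 or deeper are kept
    as part of the name instead of being treated as delimiters.
    """
    names = []
    current = []
    depth = 0
    for ch in protein_names:
        if ch == '(':
            if depth == 0:
                # Top-level '(' ends the preceding name.
                _flush(names, current)
            else:
                # Nested '(' belongs to the name itself.
                current.append(ch)
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
            if depth == 0:
                # Matching top-level ')' ends the current name.
                _flush(names, current)
            else:
                current.append(ch)
        else:
            # Ordinary character, or an unmatched ')' kept literally.
            current.append(ch)
    _flush(names, current)
    return names


example = (
    "tRNA methyltransferase 10 homolog C "
    "(mRNA methyladenosine-N(1)-methyltransferase) (EC 2.1.1.-) "
    "(tRNA (adenine(9)-N(1))-methyltransferase) (EC 2.1.1.218)"
)
for name in split_top_level_names(example):
    print(name)
```

A known limitation of this sketch: if the first (unwrapped) name itself contains parentheses, those parentheses sit at depth 0 and would still be treated as delimiters. Handling that case would need an extra rule, for example requiring whitespace before a delimiting opening parenthesis and after its matching close.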

x-atlas-consortia / ubkg-etl

UNIPROTKB ETL incorrectly parses synonyms that contain parentheses #118
