A framework that combines data from the UMLS with assertions from other data sources into a set of CSV files that can be imported into neo4j to build a Unified Biomedical Knowledge Graph (UBKG)
UNIPROTKB ETL incorrectly parses synonyms that contain parentheses #118
The UniProtKB REST API returns protein name information for proteins with a string in which parentheses are used both in the names and as delimiters.
For example, the name for the protein with UniProtKB ID Q7L0Y3 is
tRNA methyltransferase 10 homolog C (HBV pre-S2 trans-regulated protein 2) (Mitochondrial ribonuclease P protein 1) (Mitochondrial RNase P protein 1) (RNA (guanine-9-)-methyltransferase domain-containing protein 1) (Renal carcinoma antigen NY-REN-49) (mRNA methyladenosine-N(1)-methyltransferase) (EC 2.1.1.-) (tRNA (adenine(9)-N(1))-methyltransferase) (EC 2.1.1.218) (tRNA (guanine(9)-N(1))-methyltransferase) (EC 2.1.1.221)
The UNIPROTKB ETL treats every parenthesis as a delimiter, splitting the name string with a simple regex:
# Split on the pair of parentheses.
protein_names = row['Protein names']
protein_names = re.split(r'[()]', protein_names)
If a synonym itself contains parentheses, the ETL treats those inner parentheses as delimiters too. In the example above, the synonym mRNA methyladenosine-N(1)-methyltransferase is split into three separate terms:
mRNA methyladenosine-N
1
methyltransferase
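The behavior can be reproduced outside the ETL with a short script. This is only a sketch: the input is a shortened fragment of the Q7L0Y3 name string, and the `terms` clean-up step is an assumption added for readability, not part of the ETL code quoted above.

```python
import re

# Shortened fragment of the Q7L0Y3 name string quoted in this issue.
protein_names = (
    "tRNA methyltransferase 10 homolog C "
    "(mRNA methyladenosine-N(1)-methyltransferase) (EC 2.1.1.-)"
)

# Same split as the ETL snippet: every '(' or ')' acts as a delimiter.
pieces = re.split(r'[()]', protein_names)

# Assumed clean-up for display: drop empty and whitespace-only pieces.
terms = [p.strip() for p in pieces if p.strip()]

for term in terms:
    print(term)
```

With this input, the synonym is broken into `mRNA methyladenosine-N`, `1`, and `-methyltransferase`. The raw split keeps the leading hyphen on the last piece; any later normalization in the ETL that would remove it is not shown here.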
Correction
The ETL needs to handle nested parentheses in synonym names, treating only the outermost (top-level) pairs of parentheses as delimiters.
This likely involves using something like the parenthetic_contents function described in this post.
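One possible approach is to track parenthesis depth while scanning the string, so that only depth-0 pairs delimit terms. The sketch below is illustrative only: `split_top_level` is a hypothetical helper name, not the `parenthetic_contents` function mentioned above and not existing ETL code.

```python
from typing import List


def split_top_level(names: str) -> List[str]:
    """Split a name string on outermost parentheses only (hypothetical helper)."""
    terms: List[str] = []
    current: List[str] = []
    depth = 0

    def flush() -> None:
        # Store the text collected so far, ignoring whitespace-only runs.
        text = ''.join(current).strip()
        if text:
            terms.append(text)
        current.clear()

    for ch in names:
        if ch == '(':
            if depth == 0:
                flush()             # a top-level group starts here
            else:
                current.append(ch)  # nested parenthesis is part of the name
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
            if depth == 0:
                flush()             # the top-level group ends here
            else:
                current.append(ch)
        else:
            current.append(ch)      # ordinary character (or a stray ')')

    flush()  # trailing text after the last group, if any
    return terms


example = (
    "tRNA methyltransferase 10 homolog C "
    "(RNA (guanine-9-)-methyltransferase domain-containing protein 1) "
    "(mRNA methyladenosine-N(1)-methyltransferase) (EC 2.1.1.-) "
    "(tRNA (adenine(9)-N(1))-methyltransferase) (EC 2.1.1.218)"
)
for term in split_top_level(example):
    print(term)
```

A known limitation of this sketch: if the first (recommended) name itself contains parentheses, they sit at depth 0 and would still be treated as delimiters, so that case would need separate handling.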