x-atlas-consortia / ubkg-etl

A framework that combines data from the UMLS with assertions from other data sources into a set of CSV files that can be imported into neo4j to build a Unified Biomedical Knowledge Graph (UBKG)
MIT License

ingest UniProtKB Gene Ontology annotations #155

Open AlanSimmons opened 1 month ago

AlanSimmons commented 1 month ago

Statement of request

From an email of Jonathan's:

I've been seeing in a number of places people using GO annotations to classify general types of genes/proteins.

We have of course HGNC, and the Uniprot protein connection to it, and the GO ontology.

However, which protein is known to be in which GO type of pathway is in Uniprot but we didn't incorporate it.

I think an appropriate expansion of Uniprot ingest we "left in the backlog per se" some long time back would be good to have...

Basically it is in the form of something like to be added to the uniprot ingest (not just gene and protein, but what "categories" the protein is in):

Protein ----- has_annotation ---> GO

(There may of course be a more appropriate - or set of - predicates - in RO)...

This has no "priority" but I am confident will be highly used in time...

AlanSimmons commented 1 month ago

Initial investigation

  1. UniProt-GO annotations are maintained in QuickGO, as shown on the detail page of a protein in UniProtKB.
  2. QuickGO offers complete sets of annotation data by species via FTP, including:

    • human
    • rat
    • mouse
  3. The file of annotation data for human proteins is currently an 83 MB text file with 662K rows.

  4. The "Qualifier" column (3rd) in the file corresponds to GO relation terms. These can be resolved to RO via ro.json. Example:

    UniProtKB   A0A024RBG1  enables     GO:0000298

    Other qualifiers include:

  5. Annotation data includes information on references, e.g., PMID.
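
A minimal sketch of how the findings above might translate into ingest code. It assumes the FTP file is tab-delimited with the database, accession, qualifier, GO ID, and pipe-separated references in the first five columns, that a qualifier may carry a `NOT|` prefix, and that ro.json follows the OBO Graphs layout (`graphs`, `nodes`, `id`, `lbl`, `type`). The function names, the sample rows, the `XXXXXX` accession, and the `PMID:12345` reference are hypothetical placeholders, not part of ubkg-etl or the real data.

```python
import csv
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Annotation(NamedTuple):
    protein: str      # "DB:accession", e.g. "UniProtKB:A0A024RBG1"
    negated: bool     # True when the qualifier carries a NOT flag
    qualifier: str    # GO relation term, e.g. "enables"
    go_id: str        # e.g. "GO:0000298"
    pmids: List[str]  # PubMed references only; other reference types are dropped


def build_label_index(ro_graph: dict) -> Dict[str, str]:
    """Index relation IRIs by label, with spaces turned into underscores."""
    index: Dict[str, str] = {}
    for graph in ro_graph.get("graphs", []):
        for node in graph.get("nodes", []):
            label = node.get("lbl")
            if label and node.get("type") == "PROPERTY":
                index[label.lower().replace(" ", "_")] = node["id"]
    return index


def parse_annotations(lines: Iterable[str]) -> Iterator[Annotation]:
    """Stream annotations row by row so the whole file never sits in memory."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, "!" header lines, and rows that are too short.
        if not row or row[0].startswith("!") or len(row) < 5:
            continue
        flags = row[2].split("|")
        relations = [flag for flag in flags if flag != "NOT"]
        if not relations:
            continue
        pmids = [ref for ref in row[4].split("|") if ref.startswith("PMID:")]
        yield Annotation(
            protein=f"{row[0]}:{row[1]}",
            negated="NOT" in flags,
            qualifier=relations[-1],
            go_id=row[3],
            pmids=pmids,
        )


def to_edge(ann: Annotation, label_index: Dict[str, str]) -> Tuple[str, str, str]:
    """Build a (subject, predicate, object) triple, falling back to the raw qualifier."""
    predicate = label_index.get(ann.qualifier, ann.qualifier)
    return (ann.protein, predicate, ann.go_id)


# Hypothetical stand-in for ro.json, reduced to two relation nodes.
SAMPLE_RO = {
    "graphs": [{
        "nodes": [
            {"id": "http://purl.obolibrary.org/obo/RO_0002327",
             "lbl": "enables", "type": "PROPERTY"},
            {"id": "http://purl.obolibrary.org/obo/RO_0002331",
             "lbl": "involved in", "type": "PROPERTY"},
        ]
    }]
}

# Hypothetical rows; reference and evidence values are placeholders, not real citations.
SAMPLE_ROWS = [
    "!gpa-version: 1.1",
    "UniProtKB\tA0A024RBG1\tenables\tGO:0000298\tPMID:12345|GO_REF:0000002\tECO:0000256",
    "UniProtKB\tXXXXXX\tNOT|involved_in\tGO:0006915\tGO_REF:0000024\tECO:0000250",
]

if __name__ == "__main__":
    labels = build_label_index(SAMPLE_RO)
    for annotation in parse_annotations(SAMPLE_ROWS):
        if annotation.negated:
            continue  # a NOT annotation should not become a positive edge
        print(to_edge(annotation, labels), annotation.pmids)
```

Against the real download, `parse_annotations` could be handed an open file object in place of `SAMPLE_ROWS`. Whether negated annotations should be skipped, as in this demo loop, or ingested under a separate predicate is a design decision left open here.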