zwdzwd / transvar

TransVar - multiway annotator for precision medicine

missing and inconsistent protein annotation usage #35

Open git-jemiller opened 4 years ago

git-jemiller commented 4 years ago

I'm trying to annotate proteins with their genomic coordinates using TransVar. For most proteins it works fine, but sometimes nothing is returned except the output header. How should I interpret this result? Am I doing something wrong?

transvar panno --ensembl --idmap uniprot -i 'W5XKT8'
input   transcript  gene    strand  coordinates(gDNA/cDNA/protein)  region  info

Also, why do some proteins need their isoform to get any output and others do not?

Here's an example:


#returns output
transvar panno -i 'Q6N069-1' --uniprot --ensembl
input   transcript  gene    strand  coordinates(gDNA/cDNA/protein)  region  info
Q6N069-1    ENST00000379406 (protein_coding)    NAA16   +   chr13:g.41885341_41951166/c.1_2592/p.M1_I864    whole_transcript    promoter=chr13:41884341_41885341;#exons=20;cds=chr13:41885665_41949735

#no output
transvar panno -i 'Q6N069' --uniprot --ensembl
input   transcript  gene    strand  coordinates(gDNA/cDNA/protein)  region  info

#returns output without providing isoform number
transvar panno -i 'Q9H1K6' --uniprot --ensembl
input   transcript  gene    strand  coordinates(gDNA/cDNA/protein)  region  info
Q9H1K6  ENST00000267984 (protein_coding)    MESDC1  +   chr15:g.81293295_81296342/c.1_1086/p.M1_N362    whole_transcript    promoter=chr15:81292295_81293295;#exons=1;cds=chr15:81294613_81295698

Thanks!

zwdzwd commented 4 years ago

Hi,

Sorry for the late response. TransVar uses the ID mapping from UniProt, specifically this file: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping.dat.gz

Therefore, if your identifier isn't linked to any transcript ID in this file, TransVar cannot locate the transcript definition. That's what happened with W5XKT8 and Q6N069. The transcript ID in the ID mapping file also has to match a transcript ID in the transcript definition you are using. If you know how to map UniProt IDs to transcript IDs (Ensembl, RefSeq, etc.), you can also supply a custom ID mapping. This is done with

transvar index --idmap <idmapping file> -o <output_idx>

The idmapping file has two columns: the first is the UniProt ID, the second is the transcript ID. Once the index is built, you can use something like

transvar panno --idmap <output_idx>

as usual.
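
For anyone building such a two-column file, here is a minimal sketch in Python. It rests on several assumptions that are not confirmed in this thread: that the UniProt idmapping.dat file is tab-separated with three columns (accession, ID type, mapped ID), that Ensembl transcript links carry the ID type `Ensembl_TRS`, and that TransVar accepts a tab-separated two-column file. The function names and file paths are hypothetical. The optional `strip_version` switch is there because the transcript IDs must match the transcript definition in use, which may or may not include version suffixes.

```python
import gzip
from typing import Iterable, Iterator, Tuple


def transcript_pairs(
    lines: Iterable[str],
    id_type: str = "Ensembl_TRS",  # assumed ID-type label for Ensembl transcripts
    strip_version: bool = False,
) -> Iterator[Tuple[str, str]]:
    """Yield (uniprot_accession, transcript_id) pairs for one ID type."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not have the assumed three columns
        accession, kind, target = fields
        if kind != id_type:
            continue
        if strip_version:
            # Drop a trailing ".N" version suffix, if any.
            target = target.split(".", 1)[0]
        yield accession, target


def write_idmap(
    src_path: str,
    dst_path: str,
    id_type: str = "Ensembl_TRS",
    strip_version: bool = False,
) -> int:
    """Write a two-column, tab-separated mapping file; return the row count."""
    count = 0
    with gzip.open(src_path, "rt") as src, open(dst_path, "w") as dst:
        for accession, transcript in transcript_pairs(src, id_type, strip_version):
            dst.write(f"{accession}\t{transcript}\n")  # tab delimiter is an assumption
            count += 1
    return count
```

Calling something like `write_idmap("HUMAN_9606_idmapping.dat.gz", "uniprot2enst.tsv")` (both paths hypothetical) would produce a file that can be passed to `transvar index --idmap`. Filtering the same file for one accession is also a quick way to check whether an identifier such as Q6N069 has any transcript link at all, or only isoform-specific ones.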

Let me know if you know a better way to map these IDs. Thanks!