yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License

[possible bug] codons with two changes are not properly handled #315

Closed rneher closed 1 year ago

rneher commented 1 year ago

When running matUtils as follows

matUtils summary -T 1 -i data/tree.pb -t results/all/amino_acid_mutations.tsv -m results/all/mutations.tsv -g config/ncbiGenes.gtf -f config/reference_seq.fasta

on the protobuf obtained from ucsc.edu, I find lines like the following in the translations file (amino_acid_mutations.tsv in my case):

IMS-10122-CVDP-A7C126A5-5FA5-49CD-95A9-4822D82331B0|OV664019.1|2021-12-11   ORF1ab:Q4583L;M:I217Y;M:V221D   A14012T;A27171T,T27172A;T27184A,A27185C CAA>CTA;ATT>TTT;GTA>GAA 1

The two mutations in M are associated with two nucleotide changes each. But the associated codons only seem to reflect one change and don't match the amino acids.

I might be missing something, but it seems like a bug to me.

jmcbroome commented 1 year ago

Yes, the codon reporting appears to be bugged. Notably, the actual translation is correct.

In the case of the second entry in this line, there is an A>T mutation in the first codon position and a T>A mutation in the second position, occurring simultaneously (if not in reality, then with respect to this tree). Together these produce the codon change ATT>TAT, which causes the single amino acid change I>Y (M:I217Y). This is consistent with the standard codon table. The same applies to the third entry, which is GTA>GAC, leading to V>D.

What appears to be buggy is the reporting of the codon changes in the second-to-last column: it reports ATT>TTT, disregarding the second mutation. The third entry appears to be affected in the same way.
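To make the discrepancy concrete, here is a minimal Python sketch of the codon arithmetic described above. It is not matUtils code: the helper names `codon_start` and `apply_mutations` are hypothetical, the codon table holds only the entries needed here, and the M gene start coordinate (26523, 1-based) is an assumption taken from the SARS-CoV-2 reference NC_045512.2 rather than from this thread.

```python
# Only the standard-genetic-code entries needed for this example.
CODON_TO_AA = {
    "ATT": "I", "TAT": "Y", "TTT": "F",
    "GTA": "V", "GAC": "D", "GAA": "E",
}

# Assumption: 1-based start of the M gene in NC_045512.2.
M_START = 26523


def codon_start(aa_index, gene_start=M_START):
    """Genome coordinate (1-based) of the first base of codon `aa_index`."""
    return gene_start + 3 * (aa_index - 1)


def apply_mutations(ref_codon, start, mutations):
    """Apply nucleotide changes such as 'A27171T' to a reference codon."""
    bases = list(ref_codon)
    for mut in mutations:
        ref, pos, alt = mut[0], int(mut[1:-1]), mut[-1]
        offset = pos - start
        if not 0 <= offset < 3:
            raise ValueError(f"{mut} lies outside the codon starting at {start}")
        if ref_codon[offset] != ref:
            raise ValueError(f"{mut} does not match reference codon {ref_codon}")
        bases[offset] = alt
    return "".join(bases)


# The two M entries from the line quoted in the issue.
entries = [
    (217, "ATT", ["A27171T", "T27172A"]),
    (221, "GTA", ["T27184A", "A27185C"]),
]

for aa_index, ref_codon, muts in entries:
    start = codon_start(aa_index)
    full = apply_mutations(ref_codon, start, muts)         # all changes applied
    partial = apply_mutations(ref_codon, start, muts[:1])  # first change only
    print(
        f"M:{aa_index} all changes: {ref_codon}>{full} ({CODON_TO_AA[full]}); "
        f"first change only: {ref_codon}>{partial} ({CODON_TO_AA[partial]})"
    )
```

Under these assumptions, applying both changes gives ATT>TAT (Y) and GTA>GAC (D), which agree with the reported amino acids. Applying only the first change gives ATT>TTT and GTA>GAA, which are the codon strings that appear in the output column.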

We will look into the incorrect codon reporting. Thank you for raising this issue.