yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License

several issues #345

Open shay671 opened 1 year ago

shay671 commented 1 year ago

Hi, here are several issues I found with the designation and topology of several variants:

DV.8 seems to have evolved out of DV.1 and not from its parent CH.1.1.1. https://nextstrain.org/fetch/genome.ucsc.edu/trash/ct/subtreeAuspice1_genome_137c4_93a780.json?label=id:node_7292141

The mutational path of EG.5 is missing some mutations that seem to be included in the phylogenetic tree, e.g. C22480T and C29625T. https://nextstrain.org/fetch/genome.ucsc.edu/trash/ct/subtreeAuspice1_genome_36315_9452d0.json?label=id:node_6647034

XBB.1.16.8: The mutational path shows A19326G, and so does the phylogenetic tree. But when analyzing the samples designated as this lineage in UShER, only 3.8% of them have this mutation. Another 2.7% of samples have N at that position and 0.5% have R (an ambiguous nucleotide).

BQ.1.1.78: All sequences have C16887T, but the position is masked in the tree. https://nextstrain.org/fetch/genome.ucsc.edu/trash/ct/subtreeAuspice1_genome_3e6e3_92a380.json?c=gt-nuc_16887&label=id:node_8703807

Thank you for the amazing work you do!

AngieHinrichs commented 12 months ago

Thanks for the reports @shay671!

DV.8 seems to have evolved out of DV.1 and not from its parent CH.1.1.1.

I hope to have this fixed in the 2023-07-06 build.

The mutational path of EG.5 is missing some mutations that seem to be included in the phylogenetic tree, e.g. C22480T and C29625T.

There aren't a lot of sequences to establish the order of the final five mutations found in almost all EG.5 sequences. pango-designation/lineage_notes.txt lists 3 mutations as defining: S:F456L (T22930A), ORF1a:A690V (C2334T), and ORF1a:A3143V (C9693T). lineage_notes.txt omits non-coding C22480T and C29625T. In the past I've had problems with the order of mutations shuffling around as more sequences are added, and then Cornelius says "why is mutation X, that wasn't mentioned in lineage_notes.txt, required for the lineage and cutting off some sequences that don't have it?" At the time EG.5 was designated, C22480T was early in the path so I made it an N in pango.clade-mutations.tsv, and C29625T came after all mutations mentioned in lineage_notes.txt so I omitted it despite most sequences having it. The order of mutations happens not to have shuffled around since then, so I could have used C22480T instead of C22480N in pango.clade-mutations.tsv, but it's hard to predict how things will go after lineage designation.
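To illustrate the difference between listing C22480T and C22480N, here is a minimal sketch. It assumes, purely for illustration, that an N in the alternate-allele slot means "do not require any particular allele at this position"; the actual semantics and file layout of pango.clade-mutations.tsv may differ. The names `parse_mutation` and `path_matches` are hypothetical and not part of UShER or matUtils.

```python
import re

# Mutation strings of the form <ref><1-based position><alt>, e.g. "C22480T".
MUTATION_RE = re.compile(r"^([ACGTN])(\d+)([ACGTN])$")


def parse_mutation(text):
    """Split a mutation string into (ref, position, alt)."""
    match = MUTATION_RE.match(text)
    if match is None:
        raise ValueError(f"not a nucleotide mutation: {text!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def path_matches(required, observed):
    """Return True if every required mutation is present in `observed`.

    ASSUMPTION: a required mutation whose alt allele is N is treated as a
    wildcard and never disqualifies a sample.
    """
    observed_alt = {}
    for mutation in observed:
        _, pos, alt = parse_mutation(mutation)
        observed_alt[pos] = alt

    for mutation in required:
        _, pos, alt = parse_mutation(mutation)
        if alt == "N":
            continue  # wildcard position: nothing is required here
        if observed_alt.get(pos) != alt:
            return False
    return True
```

Under that assumption, a sample carrying only T22930A satisfies a path of `["C22480N", "T22930A"]` but not `["C22480T", "T22930A"]`, which is the "cutting off some sequences" effect described above.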

XBB.1.16.8: The mutational path shows A19326G, and so does the phylogenetic tree. But when analyzing the samples designated as this lineage in UShER, only 3.8% of them have this mutation. Another 2.7% of samples have N at that position and 0.5% have R (an ambiguous nucleotide).

Ugh, this is a case where I might have made a bad call in branch-specific masking as defined in branchSpecificMask.yml. At some point when I went looking for noisy bases in XBB, I saw what I thought were false reversions (G19326A) so I masked that in all of XBB -- but I will need to take a closer look. Please ignore for now and thanks again for letting me know!
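For anyone who wants to reproduce the kind of per-position tally quoted in the report, a minimal sketch follows. It assumes the sequences are already aligned to the reference so that string index `position - 1` corresponds to the reference coordinate; the function name `allele_summary` is hypothetical. In IUPAC notation, R stands for A or G, so it shows up as its own category here rather than being counted as either base.

```python
from collections import Counter


def allele_summary(sequences, position):
    """Percentage of each character observed at a 1-based position.

    ASSUMPTION: every sequence is aligned to the reference, so the same
    index means the same genome coordinate in all of them. Sequences that
    are too short to cover the position are skipped.
    """
    counts = Counter(
        seq[position - 1].upper()
        for seq in sequences
        if len(seq) >= position
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: 100.0 * n / total for base, n in counts.items()}
```

Calling `allele_summary(aligned_sequences, 19326)` on the lineage's sequences would give the share of A, G, N, R and any other characters at that site.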

BQ.1.1.78: All sequences have C16887T, but the position is masked in the tree.

Yes, 16887 is in the Problematic Sites set so it is masked in all sequences before they are added to the tree.
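As a rough picture of what masking a site before placement means, here is a minimal sketch that replaces the base at each listed position with N. This is not the actual UShER build pipeline; the function name `mask_sites` and the use of plain strings are assumptions for illustration only.

```python
def mask_sites(sequence, sites, mask_char="N"):
    """Return a copy of `sequence` with the given 1-based sites masked.

    ASSUMPTION: `sequence` is aligned to the reference, so 1-based
    positions map directly to string indices. Positions outside the
    sequence are ignored.
    """
    bases = list(sequence)
    for pos in sites:
        if 1 <= pos <= len(bases):
            bases[pos - 1] = mask_char
    return "".join(bases)
```

With a site such as 16887 in the masked set, every sequence carries N there when it is added, so a real substitution like C16887T cannot appear on the tree even if all sequences in a lineage share it.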

AngieHinrichs commented 12 months ago

Sorry, I accidentally hit the 'Comment' button before I was done composing the previous note, so if you're reading these in email, please see the updated comment above: https://github.com/yatisht/usher/issues/345#issuecomment-1624097309