Open joshuailevy opened 2 years ago
I'm not quite sure what you mean by missing mutation. Can you explain?
Should 28881 be T or A? Right now the path is for A, it's reverted (tree building artefact).
Ah, sorry... Usually if there's a reversion I would expect to see G28881A > (some mutations) > A28881G, but we don't see that here. Instead, we have G28881A > (some mutations) > T28881A, but there was never a mutation (or series of mutations ) that led to a T at 28881, so it's unclear how we can have a T28881A mutation.
Ah you're right! I didn't catch that detail 🙈 Thanks for explaining. @AngieHinrichs may know why this happens and whether it's a problem.
Instead, we have G28881A > (some mutations) > T28881A, but there was never a mutation (or series of mutations ) that led to a T at 28881, so it's unclear how we can have a T28881A mutation.
Yes, that is very odd, and I don't know off the top how that would happen. @yatisht @jmcbroome ?
@joshuailevy can you share the protobuf file that gave this result from matUtils, and how the protobuf file was generated? Was it a UCSC public tree from https://hgdownload.gi.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ ? Were there any matUtils processing steps that would modify the tree before your matUtils extract -i ... -C ...
command above? Thanks!
Yep! I pulled the most recent protobuf file from that exact link (too big to attach here, but can provide a dropbox link if needed), and then processed the tree (with no modifications) using the extract
command, as shown here: https://github.com/andersen-lab/Freyja/blob/main/freyja/updates.py
Josh
OK, thanks Josh! The public tree is a pruned-down version of a larger tree that incorporates GISAID sequences in addition to GenBank/COG-UK. I think some mutations are being removed from nodes during the pruning, but the downstream parsimonious alleles (like the T in T28881A) are not being updated. So when I take the -C clade-paths file from the GISAID+public tree, I see a more complete path that has A28881G > ... G28881T > ... T28881A:
AT.1 node_1109926 > C14408T > C241T > C3037T > A23403G > G28881A > G28883C > G28882A > C9778T > C1392T,C10029T,A11451G,G19180T,C21588T,C25000T,A28271G > T12620A,C16985A,G23012A,C26568A > A3542G,A5176G,T9070C,G17562T,T19180G,C20759T,A28881G,A28882G,C28883G > T1392C,C24370T,C25603T,G26211T,G28079T,G28881T > G3281T,C7005A,G19180T,G23900A,T25675A > C1392T,T28881A,G28882A > G28883C > A22296C
But from the public-only tree I only see A28881G > ... T28881A like you're seeing:
AT.1 node_581346 > C14408T > C241T > C3037T > A23403G > G28881A > G28882A,G28883C > C9778T > G3281T,A3542G,A5176G,C7005A,T9070C,C10029T,A11451G,T12620A,C16985A,G17562T,G19180T,C20759T,C21588T,G23012A,G23900A,C24370T,C25000T,C25603T,T25675A,G26211T,C26568A,G28079T,A28271G,A28882G,C28883G > C1392T,A22206G,A22296C,T28881A,G28882A,G28883C
My guess is that a very small number of samples in GISAID have 28881T, and all of the GenBank/COG-UK samples have either A or G at 28881. When all samples that have 28881T are removed from the tree, the node with 28881T is collapsed out, but matUtils is not checking all the downstream nodes to see if their parsimonious alleles need to be updated (that might be slow?). @jmcbroome any thoughts about this? (matUtils extract
pruning out GISAID samples and keeping existing parsimonious alleles even when dropping an upstream mutation at the same base)
Hi!
I recently noticed a possible issue with the clade mutation paths returned by
matUtils extract -i [tree] -C [clade_paths]
. For example, theAT.1
lineage path looks like:AT.1 node_581021 > C14408T > C241T > C3037T > A23403G > G28881A > G28882A,G28883C > C9778T > G3281T,A3542G,A5176G,C7005A,T9070C,C10029T,A11451G,T12620A,C16985A,G17562T,G19180T,C20759T,C21588T,G23012A,G23900A,C24370T,C25000T,C25603T,T25675A,G26211T,C26568A,G28079T,A28271G,A28882G,C28883G > C1392T,A22206G,A22296C ,T28881A, G28882A,G28883C
where there seems to be a missing mutation (A28881T). Any thoughts on what's going on here?
Josh