yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
121 stars 41 forks source link

Possible issue with AT.1 lineage path #231

Open joshuailevy opened 2 years ago

joshuailevy commented 2 years ago

Hi!

I recently noticed a possible issue with the clade mutation paths returned by matUtils extract -i [tree] -C [clade_paths]. For example, the AT.1 lineage path looks like:

AT.1 node_581021 > C14408T > C241T > C3037T > A23403G > G28881A > G28882A,G28883C > C9778T > G3281T,A3542G,A5176G,C7005A,T9070C,C10029T,A11451G,T12620A,C16985A,G17562T,G19180T,C20759T,C21588T,G23012A,G23900A,C24370T,C25000T,C25603T,T25675A,G26211T,C26568A,G28079T,A28271G,A28882G,C28883G > C1392T,A22206G,A22296C ,T28881A, G28882A,G28883C

where there seems to be a missing mutation (A28881T). Any thoughts on what's going on here?

Josh

corneliusroemer commented 2 years ago

I'm not quite sure what you mean by missing mutation. Can you explain?

Should 28881 be T or A? Right now the path is for A, it's reverted (tree building artefact).

joshuailevy commented 2 years ago

Ah, sorry... Usually if there's a reversion I would expect to see G28881A > (some mutations) > A28881G, but we don't see that here. Instead, we have G28881A > (some mutations) > T28881A, but there was never a mutation (or series of mutations ) that led to a T at 28881, so it's unclear how we can have a T28881A mutation.

corneliusroemer commented 2 years ago

Ah you're right! I didn't catch that detail 🙈 Thanks for explaining. @AngieHinrichs may know why this happens and whether it's a problem.

AngieHinrichs commented 2 years ago

Instead, we have G28881A > (some mutations) > T28881A, but there was never a mutation (or series of mutations ) that led to a T at 28881, so it's unclear how we can have a T28881A mutation.

Yes, that is very odd, and I don't know off the top how that would happen. @yatisht @jmcbroome ?

@joshuailevy can you share the protobuf file that gave this result from matUtils, and how the protobuf file was generated? Was it a UCSC public tree from https://hgdownload.gi.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ ? Were there any matUtils processing steps that would modify the tree before your matUtils extract -i ... -C ... command above? Thanks!

joshuailevy commented 2 years ago

Yep! I pulled the most recent protobuf file from that exact link (too big to attach here, but can provide a dropbox link if needed), and then processed the tree (with no modifications) using the extract command, as shown here: https://github.com/andersen-lab/Freyja/blob/main/freyja/updates.py

Josh

AngieHinrichs commented 2 years ago

OK, thanks Josh! The public tree is a pruned-down version of a larger tree that incorporates GISAID sequences in addition to GenBank/COG-UK. I think some mutations are being removed from nodes during the pruning, but the downstream parsimonious alleles (like the T in T28881A) are not being updated. So when I take the -C clade-paths file from the GISAID+public tree, I see a more complete path that has A28881G > ... G28881T > ... T28881A:

AT.1    node_1109926     > C14408T > C241T > C3037T > A23403G > G28881A > G28883C > G28882A > C9778T > C1392T,C10029T,A11451G,G19180T,C21588T,C25000T,A28271G > T12620A,C16985A,G23012A,C26568A > A3542G,A5176G,T9070C,G17562T,T19180G,C20759T,A28881G,A28882G,C28883G > T1392C,C24370T,C25603T,G26211T,G28079T,G28881T > G3281T,C7005A,G19180T,G23900A,T25675A > C1392T,T28881A,G28882A > G28883C > A22296C

But from the public-only tree I only see A28881G > ... T28881A like you're seeing:

AT.1    node_581346      > C14408T > C241T > C3037T > A23403G > G28881A > G28882A,G28883C > C9778T > G3281T,A3542G,A5176G,C7005A,T9070C,C10029T,A11451G,T12620A,C16985A,G17562T,G19180T,C20759T,C21588T,G23012A,G23900A,C24370T,C25000T,C25603T,T25675A,G26211T,C26568A,G28079T,A28271G,A28882G,C28883G > C1392T,A22206G,A22296C,T28881A,G28882A,G28883C

My guess is that a very small number of samples in GISAID have 28881T, and all of the GenBank/COG-UK samples have either A or G at 28881. When all samples that have 28881T are removed from the tree, the node with 28881T is collapsed out, but matUtils is not checking all the downstream nodes to see if their parsimonious alleles need to be updated (that might be slow?). @jmcbroome any thoughts about this? (matUtils extract pruning out GISAID samples and keeping existing parsimonious alleles even when dropping an upstream mutation at the same base)