yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
121 stars 40 forks source link

bug: collapse_tree (matUtils extract -O) drops mutation in rare case #308

Closed AngieHinrichs closed 1 year ago

AngieHinrichs commented 1 year ago

In cov-lineages/pango-designation#1356, @ryhisner reported that all uploaded BA.2.3.20 sequences were shown as having C23013G as a private mutation, despite C23013G being part of BA.2.3.20. When I placed sequences on the full SARS-CoV-2 tree, I did not see that misbehavior -- C23013G was part of the path to the BA.2.3.20 lineage branch as expected. However, when I tried the public-sequence-only tree, I could see that its BA.2.3.20 branch was in fact missing C23013G which is not good. I traced through the steps of the public tree's construction to see at what stage C23013G was lost, and at one point matUtils extract -O is used to collapse nodes that have only one remaining child after filtering, and that is when we lose C23013G.

Here is a test case to reproduce the problem:

Search for 23013 in sample-paths.preTrim and you will see these two consecutive nodes late in the path:

node_1586491:G23012A,C23013A node_1586492:A23013G

Note that position 23013 appears in both nodes, first changing to A and then to G.

Now run extract with -O / --collapse-tree and check the path again:

matUtils extract -i public-2022-12-12.all.masked.preTrim.pb -O -o test_collapsed.pb matUtils extract -i test_collapsed.pb -s test.sample -S sample-paths.postCollapse

Search for 23013 in sample-paths.postCollapse -- C23013A and A23013G no longer appear where we would expect C23013G.

Instead there is a new collapsed node that includes only G23012A.



I believe the problem is that `MAT::collapse_tree` adds parent mutations to the child's list when collapsing a parent node that has only one child (lines 1287-1292):
https://github.com/yatisht/usher/blob/35711bcdcb82682871b8bc372e093cc4a7041b97/src/mutation_annotated_tree.cpp#L1287

-- but `Node::add_mutation` assumes that mutations are added in "chronological" order (lines 717-729 etc):
https://github.com/yatisht/usher/blob/35711bcdcb82682871b8bc372e093cc4a7041b97/src/mutation_annotated_tree.cpp#L717

In the test case, with the parent node having C23013A and the child node having A23013G, the effect of `child->add_mutation(m.copy());` is like evaluating A23013G > C23013A instead of C23013A > A23013G as intended.  Inside `add_mutation`, `iter->par_nuc` is A23013G's A, and `mut.mut_nuc` is C23013A's A, so it looks like we just have an immediate reversion of the mutation and they cancel out, oops.

Fortunately I think it is pretty rare for the bug conditions to be met:
* a parent and child have mutations at the same position that are not mutation-reversion (or reversion-remutation)
* the child is the only remaining child of the parent
-- but BA.3.20 is still active and it meets those conditions so I'm glad @ryhisner noticed that something was off!

In order to preserve the "chronological" order that `add_mutation` expects, I think `collapse_tree` needs to change so that instead of adding the parent's mutations to the child, it adds the child's mutations to the parent (the node that will be collapsed away), and then replaces the child's mutations with the parent's extended mutations.  I will try that out locally and hopefully send a PR soon.