yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
120 stars 40 forks source link

matUtils updates: new subcommand "fix"; improvements to MAT::Tree:move_node and mask --move-nodes #357

Closed AngieHinrichs closed 8 months ago

AngieHinrichs commented 8 months ago

This bundles several improvements to matUtils. Let me know if you would prefer for them to be pulled out into separate PRs.

  1. I added a new subcommand fix to find and fix a recurring pattern that causes trouble for lineage designation: often there is a node whose mutation is simply a reversion of its grandparent's mutation, for example the final node in a path like this: ... > A1C > G2T > C3A > T2G In that case it would be preferable to move that final node to ... > A1C > C3A, i.e. to move it to become a child of its great-grandparent with only its parent's mutation. As long as the grandparent, parent and final node all have only one mutation each, that operation is parsimony-preserving. It represents two independent occurrences of C3A on the A1C branch, as opposed to only one occurrence of C3A followed by an immediate reversion of the previous mutation in the path. This is motivated by the Pango lineage team's experience of SARS-CoV-2 with advantageous Spike mutations occurring multiple times and causing an increase in transmissions.

  2. I found that MAT::Tree::collapse_tree was moving nodes without first looking to see if the new parent node already had a child with the same mutation(s) as the incoming node. This could result in the new parent node having multiple children with the same mutation(s), which causes incorrect structure and trouble for correctly annotating lineages. So I updated MAT::Tree:move_node to first search the new parent node's children for the same mutations as the moved node. Both the existing child and the moved node could be either an internal node or a leaf node (sample); all combinations are handled. When both nodes are internal nodes, the children of the moved node become children of the new parent's existing child, and the moved node is then removed. Since the moved node might be removed, MAT::Tree::collapse_tree could no longer use BFS order, so I changed it to start with the deepest nodes and work back towards root.

  3. mask --move-nodes used to require that the new parent have exactly the same set of mutations as the original parent. I added support for a new parent with a strict subset of the original parent's mutations, adding the original parent's extra mutations to the moved node. [Then I could manually make moves like the new fix subcommand's moves. I implemented this before fix, for testing.] I also found that mask --mask-mutations was doing more copying than necessary and sped it up a bit.