yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
121 stars 41 forks source link

ENH: Add callback for discovered moves #252

Closed ognian- closed 1 year ago

ognian- commented 2 years ago

As an initial step for integration with Larch, we would need a mechanism to feed a stream of moves to Larch's code. They don't need to be conflict-free, so we should choose some earlier stage to add the callback. Driving the callback from multiple threads won't be an issue for Larch.

ognian- commented 2 years ago

I imagine a good initial callback signature could be:

void callback(const MAT::Tree&, const output_t& moves, void* opaque)

And maybe called after individual_move() and find_moves_bounded().

yatisht commented 2 years ago

@yceh Could you please help with this?

willdumm commented 2 years ago

In order to apply a move to the mutation annotated DAG, we would require not only a source and target node, but we also need to compute the mutations on all new edges in the DAG. This will be much easier if we know the mutations on the edges of the modified tree after the move is applied. That is, we need to compute all differences in edge mutations between the original tree and the tree with the move applied.

It would be great to use Usher infrastructure for computing these edge mutations, so as part of this issue, we should think about the easiest way to compute/recover that information for each move. It seems that when a move is applied to the tree by matOptimize, possibly many Fitch sets and boundary sets need to be updated, and in general there may be many possible choices of edge mutations that correspond to a maximally parsimonious tree. @yceh am I thinking about this correctly?

In other words, is it true that a move can change the inferred ancestral sequence at many more tree nodes than just the source and target? Does matOptimize/Usher update node fitch/boundary sets intelligently after applying the move, to avoid visiting all the nodes of the tree?

yceh commented 2 years ago

I added the callback as a functor for every individual move found, regardless of their profitability, so tree drifting (accepting certain suboptimal moves) can be implemented. The callback can also stir matOptimize the consider or ignore a move through the boolean it returns. "output_result" is a scorching hot function, so I prefer it to be a functor defined in the header instead of a function pointer, so the compiler can inline and optimize it.

During the profitable move enumeration phase, matOptimize does not store the differences in edge mutations, because after the move is applied, they may be different due to interactions between multiple moves.

Yes, Fitch set at ancestors of source and the destination node may need to be updated, even after their most recent common ancestor. Therefore, matOptimize batches together all non-conflicting moves, and update the Fitch/boundary set in post-order traversal order until there is no update.

I need more time to think about the DAG and see whether we can play the same trick.

yceh commented 2 years ago

One limitation of matOptimize is it only works with a particular choice of fitch set among all parsimony optimal assignments -- the fitch set has to be the most common alleles among the fitch set of its children. This has actually caused some issues on clade assignments before because it artificially choose Fitch assignments that have more back mutations (eg. A->T->A) and is unstable during optimization.

yceh commented 2 years ago

Under the condition of binary tree and unweighted parsimony, the previous comment summarizes to "find_profitable_moves" and "individual_move" assume their input tree has optimally assigned ancestral sequences. When there are polytomies, it means for all nodes, only considering itself and its direct descendant, that node has its ancestral sequence optimally assigned only considering that one-level tree and its descendant sequences are fully resolved. Thanks for the example from willdumm, it is more than just resolved allele, but also the Fitch set.