yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
122 stars 41 forks source link

Bugfixes and set intersection refactor #246

Closed jmcbroome closed 2 years ago

jmcbroome commented 2 years ago

This PR addresses two minor bugs and some inefficient behavior around identifying sets of samples which match multiple conditions. It addresses the following issues:

The bugs are:

To elaborate on the refactor- filters implemented in matUtils extract come in two broad categories. The first type are implemented as a consecutive filter- they take a list of samples, compute some metric for each of those samples, and return the subset that pass. A vector is fine for this- any iterable structure is going to have O(N) runtime for a constant metric computed for those samples. I'll note that if we further wanted to improve performance for these functions, it should be possible to parallelize these steps, but mostly they're quite fast already. The second type, which are less common, are more straightforward to compute as a subset of all samples on the tree, such as all short paths (as that's a dynamic programming implementation relying on a full tree traversal), then figure out which of your current selection are also in the set you just pulled from the tree. The latter type require intersecting two collections of samples. Angie rightfully pointed out that intersecting two vectors is slow, and also the short_paths function was poorly designed in general. I've given it a significant overhaul, and adjusted all other uses of sample_intersect to first convert the input to a set and use the set+vector implementation, which is significantly faster. I've also made some edits elsewhere in matUtils extract that clean up some redundant casting and converting between sets and vectors.

yatisht commented 2 years ago

Thanks for the detailed comment.