This PR addresses two minor bugs and some inefficient behavior around identifying sets of samples which match multiple conditions. It addresses the following issues:
if, somehow, a node has an annotation vector with size 0, attempting to create JSON output will cause a segfault. This is because it specifically tries to access the first k indeces where k is the length of the annotations on the root. Instead, I changed it to iterate to the length of the clade annotations of the node its trying to copy. If that clade annotations vector is empty, because new nodes are initalized with vectors with length equal to the root annotation vector containing blanks, it will produce a normal blank annotation vector with length equal to the annotation vector at the root (e.g. ["",""]) instead of a segfault.
code I originally wrote to generate names for unnamed sequences in json input as a simple count of sequences read did not actually work, because I didn't use a pointer for the recursive function. This edge case came up in testing converting older Nextstrain builds to protobuf formats.
To elaborate on the refactor- filters implemented in matUtils extract come in two broad categories. The first type are implemented as a consecutive filter- they take a list of samples, compute some metric for each of those samples, and return the subset that pass. A vector is fine for this- any iterable structure is going to have O(N) runtime for a constant metric computed for those samples. I'll note that if we further wanted to improve performance for these functions, it should be possible to parallelize these steps, but mostly they're quite fast already. The second type, which are less common, are more straightforward to compute as a subset of all samples on the tree, such as all short paths (as that's a dynamic programming implementation relying on a full tree traversal), then figure out which of your current selection are also in the set you just pulled from the tree. The latter type require intersecting two collections of samples. Angie rightfully pointed out that intersecting two vectors is slow, and also the short_paths function was poorly designed in general. I've given it a significant overhaul, and adjusted all other uses of sample_intersect to first convert the input to a set and use the set+vector implementation, which is significantly faster. I've also made some edits elsewhere in matUtils extract that clean up some redundant casting and converting between sets and vectors.
This PR addresses two minor bugs and some inefficient behavior around identifying sets of samples which match multiple conditions. It addresses the following issues:
The bugs are:
To elaborate on the refactor- filters implemented in matUtils extract come in two broad categories. The first type are implemented as a consecutive filter- they take a list of samples, compute some metric for each of those samples, and return the subset that pass. A vector is fine for this- any iterable structure is going to have O(N) runtime for a constant metric computed for those samples. I'll note that if we further wanted to improve performance for these functions, it should be possible to parallelize these steps, but mostly they're quite fast already. The second type, which are less common, are more straightforward to compute as a subset of all samples on the tree, such as all short paths (as that's a dynamic programming implementation relying on a full tree traversal), then figure out which of your current selection are also in the set you just pulled from the tree. The latter type require intersecting two collections of samples. Angie rightfully pointed out that intersecting two vectors is slow, and also the short_paths function was poorly designed in general. I've given it a significant overhaul, and adjusted all other uses of sample_intersect to first convert the input to a set and use the set+vector implementation, which is significantly faster. I've also made some edits elsewhere in matUtils extract that clean up some redundant casting and converting between sets and vectors.