yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License

--haplotype rework and metadata loading flag #329

Closed jmcbroome closed 1 year ago

jmcbroome commented 1 year ago

This PR addresses two issues.

First, it addresses https://github.com/yatisht/usher/issues/303. Typically, only metadata for samples in the user's query set is loaded into memory. This was originally implemented to reduce the memory footprint of our approach. However, with -N, -K, and similar options, users may want full metadata for every sample in their output, including non-query context samples. Accordingly, I have added a flag, --load-all-metadata (it has no single-letter form), to matUtils extract. It indicates that all available metadata should be loaded and made available for output.
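For reviewers, the behavior change can be sketched in a few lines of Python. This is only an illustration of the filtering logic described above, not the matUtils code (which is C++); the function name `load_metadata`, the `load_all` parameter, and the TSV layout (sample name in the first column) are assumptions made for the example.

```python
import csv
import io


def load_metadata(tsv_text, query_samples, load_all=False):
    """Return {sample: {column: value}} parsed from tab-separated text.

    Hypothetical helper: assumes the first column holds the sample name.
    With load_all=False, only rows for samples in query_samples are kept
    (the memory-saving behavior); with load_all=True, every row is kept.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    columns = header[1:]
    metadata = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        sample = row[0]
        if load_all or sample in query_samples:
            metadata[sample] = dict(zip(columns, row[1:]))
    return metadata


# Made-up example data: one query sample and two context samples.
EXAMPLE_TSV = (
    "strain\tcountry\n"
    "query_1\tUSA\n"
    "context_1\tUK\n"
    "context_2\tFrance\n"
)

query_only = load_metadata(EXAMPLE_TSV, {"query_1"})
everything = load_metadata(EXAMPLE_TSV, {"query_1"}, load_all=True)
```

In this sketch `query_only` holds a single entry, while `everything` also carries the metadata for the two context samples, which is the case the new flag is meant to cover.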

Second, it addresses https://github.com/yatisht/usher/issues/326. This is a significant rework of the implementation and output of matUtils summary --haplotype. Haplotypes are now computed dynamically, which significantly reduces runtime. Instead of being represented as unordered mutational paths, they are now represented as a set of location-state strings (e.g. '56A,60G' denotes a haplotype in which position 56 is A, position 60 is G, and all other positions match the reference).