yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License

ENH: Quick way to get Auspice/Nextstrain subtree by uploading a single EPI ISL #347

Open corneliusroemer opened 11 months ago

corneliusroemer commented 11 months ago

Usher is an amazing, extremely important tool for my lineage designation workflow. A major time sink/inefficiency right now is that it takes quite long to get an Auspice tree, even if I don't upload a single new sequence. If I want to get the Auspice subtree for a single sequence that's already in the Usher tree, it still takes ~2 minutes.

Would it be possible to speed up that use case, maybe with a new command/view/mode? I imagine it would simply involve: a) finding the sequence in the protobuf, b) identifying the earliest ancestor with <5k descendants, and c) exporting the subtree descending from that node to auspice.json.
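To make the three steps concrete, here is a minimal Python sketch on a toy tree. It does not use UShER's protobuf format or any UShER/matUtils API: the `parent` dict, the node names, and the helper functions are all hypothetical, and "descendants" is assumed to mean leaf samples. The threshold is 4 instead of 5k so the toy example stays readable.

```python
from collections import defaultdict

# Hypothetical toy tree as child -> parent (None marks the root).
parent = {
    "root": None, "A": "root", "B": "root",
    "s1": "A", "s2": "A", "s3": "B", "C": "B", "s4": "C", "s5": "C",
}

def build_children(parent):
    """Invert the child -> parent map into parent -> [children]."""
    children = defaultdict(list)
    for node, par in parent.items():
        if par is not None:
            children[par].append(node)
    return children

def count_leaves(root, children):
    """Count leaf descendants of every node (iterative post-order)."""
    counts = {}
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        kids = children.get(node, [])
        if not kids:
            counts[node] = 1                      # a leaf counts itself
        elif visited:
            counts[node] = sum(counts[k] for k in kids)
        else:
            stack.append((node, True))            # revisit after the children
            stack.extend((k, False) for k in kids)
    return counts

def subtree_root(sample, parent, counts, max_leaves):
    """Walk up from the sample while the parent still has < max_leaves leaves."""
    node = sample
    while parent[node] is not None and counts[parent[node]] < max_leaves:
        node = parent[node]
    return node

def subtree_leaves(root, children):
    """Collect the leaf samples below the chosen node (the part to export)."""
    leaves, stack = [], [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if kids:
            stack.extend(kids)
        else:
            leaves.append(node)
    return leaves

children = build_children(parent)
counts = count_leaves("root", children)
chosen = subtree_root("s4", parent, counts, max_leaves=4)
samples = sorted(subtree_leaves(chosen, children))
print(chosen, samples)
```

Once the tree is in memory, the walk up and the subtree collection are cheap; this sketch leaves out loading the tree and writing the Auspice JSON.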

This should be possible in seconds rather than minutes. I suspect the reason the above is currently slow is that it's not a mode/use case you have optimized for. However, at least for SC2, it's a super useful/important one. It's at least 50% of how I use Usher (query something on covSpectrum, open EPI_ISLs in Usher, no new alignment/sequences involved, it's purely ISL querying).

You will say I can just use Taxonium. Unfortunately, Taxonium lacks a lot of the capabilities - it's the best if you must look at large trees, but when you want to study a potential new lineage, it's not got the features I love from Auspice.

If this is too specialized a feature request/use case, an alternative could be to (help me) write a CLI that takes the usher tree, an epi isl, and does the above. That should be possible, no? :)

@AngieHinrichs (and @theosanderson for the feedback on Taxonium )

AngieHinrichs commented 11 months ago

I'm not sure a command-line tool would be faster, since it takes over two minutes just to read and parse the full tree protobuf.

Would an Omicron-only tree (7.6 million samples instead of 15 million samples) be sufficient for most lineage work? That should halve the time.

The web version is using a server instance of usher-sampled, which saves the two minutes of protobuf-reading, but there are other things that take significant time:

AngieHinrichs commented 11 months ago

Alternatively, I could make a reduced version of the tree by sample collection date. Currently 98.5% of samples in the full tree have full YYYY-MM-DD dates. If I take the most recent 25% of those samples, the dates go back to 2022-05-26 (should be ~4x as fast). If I take the most recent 1 million samples, the dates go back to 2022-12-16 (should be ~15x as fast).
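As a rough illustration of picking such a cutoff, the sketch below keeps only samples with full YYYY-MM-DD dates and returns the earliest date among the N most recent. The function names and the toy date list are hypothetical and are not taken from the actual tree-building scripts. The speedup estimate assumes run time scales linearly with sample count.

```python
import re
from datetime import date

FULL_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def parse_full_date(text):
    """Return a date for a full YYYY-MM-DD string, else None."""
    if not FULL_DATE.fullmatch(text):
        return None                      # partial dates such as "2022" or "2022-07"
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None                      # well-formed but invalid, e.g. "2022-13-40"

def cutoff_for_most_recent(date_strings, keep):
    """Earliest collection date among the `keep` most recent fully dated samples."""
    parsed = (parse_full_date(s) for s in date_strings)
    dates = sorted((d for d in parsed if d is not None), reverse=True)
    if keep <= 0 or not dates:
        raise ValueError("need keep > 0 and at least one fully dated sample")
    return dates[min(keep, len(dates)) - 1]

# Toy input (made up): four full dates and two partial ones.
toy = ["2021-03-14", "2022-05-26", "2022-12-16", "2023-01-09", "2022", "2022-07"]
cutoff = cutoff_for_most_recent(toy, keep=2)
print(cutoff.isoformat())

# Rough speedup if time is proportional to the number of samples kept.
total_samples, kept_samples = 15_000_000, 1_000_000
speedup = total_samples / kept_samples
print(speedup)
```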

Is there some horizon of months-ago before which you don't really care when looking for new lineages?

[Every New Year, we would miss some new samples submitted with the wrong year due to hardcoding in scripts, until Emma notices and tracks down the submitters. 🙃]