ENH: Quick way to get Auspice/Nextstrain subtree by uploading a single EPI ISL

corneliusroemer commented 11 months ago

Usher is an amazing, extremely important tool for my lineage designation workflow. A major time sink/inefficiency right now is that it takes quite long to get an Auspice tree, even if I don't upload a single new sequence. If I want to get the Auspice subtree for a single sequence that's already in the Usher tree, it still takes ~2 minutes.

Would it be possible to speed up that use case, maybe with a new command/view/mode? I imagine it would involve simply: a) finding the sequence in the protobuf, b) identifying the earliest ancestor with <5k descendants, c) exporting the subtree descending from that node to auspice.json
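To make steps a) to c) concrete, here is a minimal sketch on a toy tree held in plain Python dicts. It does not use UShER's protobuf format or any UShER API. The `parent` map, the function names, and the tiny `max_leaves` limit are all hypothetical, and "descendants" is read here as leaf (sample) descendants.

```python
from collections import defaultdict

def build_children(parent):
    """Invert a child -> parent map into a parent -> [children] map."""
    children = defaultdict(list)
    for node, par in parent.items():
        if par is not None:
            children[par].append(node)
    return children

def count_leaves(root, children):
    """Return {node: number of leaf descendants}, computed iteratively."""
    counts = {}
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        kids = children.get(node, [])
        if not kids:
            counts[node] = 1            # a leaf counts itself
        elif visited:
            counts[node] = sum(counts[k] for k in kids)
        else:
            stack.append((node, True))  # revisit after the children
            stack.extend((k, False) for k in kids)
    return counts

def subtree_root_for(sample, parent, counts, max_leaves):
    """Walk up from the sample while the parent still has < max_leaves leaves."""
    node = sample
    while parent[node] is not None and counts[parent[node]] < max_leaves:
        node = parent[node]
    return node

def leaves_under(node, children):
    """Collect the sample names below a node (the subtree to export)."""
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        kids = children.get(n, [])
        if kids:
            stack.extend(kids)
        else:
            out.append(n)
    return sorted(out)

# Toy tree (hypothetical): root -> A, B; A -> A1, A2; samples s1..s5 are leaves.
parent = {"root": None, "A": "root", "B": "root", "A1": "A", "A2": "A",
          "s1": "A1", "s2": "A1", "s3": "A2", "s4": "B", "s5": "B"}
children = build_children(parent)
counts = count_leaves("root", children)
top = subtree_root_for("s1", parent, counts, max_leaves=4)
samples = leaves_under(top, children)
```

With a limit of 4, the walk from `s1` stops at `A`, because the next ancestor (`root`) has 5 leaves. In a real tool the limit would be 5000 and the leaf counts would be precomputed once, so each query only needs the short walk up from the sample.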

This should be possible in seconds rather than minutes. I suspect the reason the above is currently slow is that it's not a mode/use case you have optimized for. However, at least for SC2, it's a super useful/important one. It's at least 50% of how I use Usher (query something on covSpectrum, open EPI_ISLs in Usher, no new alignment/sequences involved, it's purely ISL querying).

You will say I can just use Taxonium. Unfortunately, Taxonium lacks a lot of the capabilities - it's the best if you must look at large trees, but when you want to study a potential new lineage, it's not got the features I love from Auspice.

If this is too specialized a feature request/use case, an alternative could be to (help me) write a CLI that takes the usher tree, an epi isl, and does the above. That should be possible, no? :)

@AngieHinrichs (and @theosanderson for the feedback on Taxonium)

AngieHinrichs commented 11 months ago

I'm not sure a command-line tool would be faster, since it takes over two minutes just to read and parse the full tree protobuf.

Would an Omicron-only tree (7.6 million samples instead of 15 million samples) be sufficient for most lineage work? That should halve the time.

The web version is using a server instance of usher-sampled, which saves the two minutes of protobuf-reading, but there are other things that take significant time:

- reading in a text file of all full names used in the tree, breaking them down into parts, and storing them for lookup. E.g. for USA/CA-QDX-2597/2020|MW191321.1|2020-03-16, the user might paste in "MW191321" or "MW191321.1" or "USA/CA-QDX-2597/2020" and we need to store the mapping of potential user-pasted values to full name. IIRC that alone takes a minute with 15 million names, and is necessary before we query the tree (for which we need the exact full name). I suppose creating a tree-name lookup server could save that minute but would require scarce dev time.
- extracting subtrees -- if there are multiple subtrees with 5000 samples each, the time for that can add up
- reading and parsing full metadata -- it is read in and parsed in parallel with the tree query, but of course it takes even longer than just parsing the file of names. Reading in the metadata is necessary before we can parse the subtrees written by usher-sampled-server, join in metadata, and write JSON (which doesn't take long; getting the metadata is the real delay).
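
The name-lookup step described above can be illustrated with a small sketch. This is not the actual server code: the function names are hypothetical, and it assumes names are `|`-separated, that a trailing `.N` on a part is an accession version, and that date-like parts should not be indexed because they are not unique.

```python
import re

# Parts that look like a (possibly partial) date are skipped: assumption.
DATE_LIKE = re.compile(r"\d{4}(-\d{2}){0,2}")

def candidate_keys(full_name):
    """Return the strings a user might plausibly paste for this full name."""
    keys = {full_name}
    for part in full_name.split("|"):
        if DATE_LIKE.fullmatch(part):
            continue
        keys.add(part)
        # "MW191321.1" -> also index "MW191321" (accession without version)
        base, dot, version = part.rpartition(".")
        if dot and base and version.isdigit():
            keys.add(base)
    return keys

def build_lookup(full_names):
    """Map every candidate key to its full tree name (first name wins on ties)."""
    lookup = {}
    for name in full_names:
        for key in candidate_keys(name):
            lookup.setdefault(key, name)
    return lookup

def resolve(query, lookup):
    """Return the exact full name for a pasted identifier, or None."""
    return lookup.get(query.strip())

full = "USA/CA-QDX-2597/2020|MW191321.1|2020-03-16"
lookup = build_lookup([full])
```

Building this dict is the part that scales with the 15 million names. Once it exists, each `resolve` call is a single hash lookup, which is why keeping the index in a long-running process would avoid the per-request cost.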

AngieHinrichs commented 11 months ago

Alternatively, I could make a reduced version of the tree by sample collection date. Currently 98.5% of samples in the full tree have full YYYY-MM-DD dates. If I take the most recent 25% of those samples, the dates go back to 2022-05-26 (should be ~4x as fast). If I take the most recent 1 million samples, the dates go back to 2022-12-16 (should be ~15x as fast).
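
For illustration, a date horizon of this kind could be computed as in the sketch below. It is a toy example with made-up dates, not the script used for the real tree, and `date_horizon` is a hypothetical helper. It relies on full ISO dates sorting correctly as strings and ignores samples without a full YYYY-MM-DD date.

```python
import re

FULL_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def date_horizon(dates, keep):
    """Return the oldest date among the `keep` most recent fully dated samples."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # ISO YYYY-MM-DD strings sort chronologically, so no date parsing is needed.
    full = sorted((d for d in dates if FULL_DATE.fullmatch(d)), reverse=True)
    if not full:
        return None
    return full[min(keep, len(full)) - 1]

# Made-up collection dates, including two incomplete ones that are ignored.
dates = ["2023-01-05", "2022", "2022-12-16", "2021-03-01", "2022-05-26", "2020-03"]
cutoff = date_horizon(dates, keep=2)
kept = [d for d in dates if FULL_DATE.fullmatch(d) and d >= cutoff]
```

If several samples share the cutoff date, filtering with `d >= cutoff` keeps slightly more than `keep` samples. The speedup estimates follow from the size ratio alone: 1 million of 15 million samples is about 1/15 of the tree, and the most recent 25% is 1/4.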

Is there some horizon of months-ago before which you don't really care when looking for new lineages?

[Every New Year, we would miss some new samples submitted with the wrong year due to hardcoding in scripts, until Emma notices and tracks down the submitters. 🙃]

yatisht / usher

ENH: Quick way to get Auspice/Nextstrain subtree by uploading a single EPI ISL #347