yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
120 stars 40 forks source link

ENH: Store imputed nucleotides for each tip #349

Open corneliusroemer opened 11 months ago

corneliusroemer commented 11 months ago

Usher currently imputes unknown nucleotides and as far as I'm aware, throws away information on which nucleotides were imputed.

It would be great, if you could keep that information around. If done the right way, this wouldn't actually require too much extra space. Here's a proposal:

In general, keeping track of all unknown positions in input sequences would require a lot of space though not crazy amounts, depending on how it's done:

  1. If you one hot bit-encode Ns, it would be ~30kb ~ 3kB per sequence entirely uncompressed, which for 10m sequences is ~30GB, a bit much
  2. Better: run length encode the one hot bit encoded Ns, this should compress things a lot, because we usually have stretches of N. Together with the fact that you only put sequences with <10% N in Usher trees, this should reduce storage by at least a factor of 10. You could e.g. encode the distance between flips from known to unknown in a 16bit uint: each stretch would thus only cost 2 bytes, and since sequences have on average maybe 10 stretches, this means only 20 bytes per sequence, or 200MB for all sequences

For most purposes, I think it's most important to know about Ns that were imputed to be different from the reference - not those that were imputed to be identical to reference. As most Ns will be imputed to be same as reference, this drastically reduces the number of Ns to keep track of! Again, there are a few ways to keep track of these:

  1. One hot encode which mutation was imputed along the path to a sequence in question. This requires as many bits as a sequence has divergence. For SARS-CoV-2, with ~100 mutations and 10m sequences, this would mean ~100 MB. Great, that's not too bad, small enough to put in RAM
  2. But we can easily do better since usually only very few mutations will be imputed, let's encode it sparsely: indexes of imputed mutations. I would guess that on average at most 5% of mutations are imputed. If most paths are <256 long, we can encode indexes in 1 byte, so 1 byte per mutation, that means around 5 bytes per sequence or 50MB for all sequences.
  3. We can do even better: as Ns often come in stretches along the genome, there's a correlation of Ns we can take advantage of. Let's not one-hot encode mutations along the path but along the genome. Let's then store start and end point of mutations that are imputed. This pays off at the very latest when half of the imputations are surrounded by other imputations.

Overall, the above thoughts show that it should be possible to efficiently store Ns of imputed sequences together with the mutation annotated tree. This would allow Usher exports to show Ns on tips which would help users figure out which areas of the trees to trust and which not. This could also help Usher decide which tips to trust more than others: when optimizing the tree, one should first remove sequences with imputed bases as these are less informative and can fit in more places.