yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
120 stars 40 forks source link

Unclear why reversion is not indicated in Auspice export #354

Open corneliusroemer opened 8 months ago

corneliusroemer commented 8 months ago

I was trying to figure out why the pango consensus sequence for XCY lacked C1889T and a bunch of other mutations it should have.

When looking at the Usher subtree of all the designated sequences, I couldn't find any relevant reversion shown, e.g. the Italian sequence didn't show any reversion at C1889T: Italy/PIE_IRCC_15895021/2023|EPI_ISL_18331235|2023-09-24 - even though it has 1889C at that position (reference) - and it isn't N either.

image

Note that this is a context sequence - it's not one I uploaded myself.

I know you do some branch specific filtering, and also mask certain problematic sites - but it's unclear to me what is going on here. Can you maybe have a look @AngieHinrichs?

As food for thought, I think it would be very helpful in general to surface masked sites better. The Auspice export shouldn't show imputed values on the tips but indicate uncertainty. For the problematic sites that are always masked, this should be as easy as setting all the nucloetides to N when creating the auspice json (this would mean one wouldn't have to look up the list for each site but it was neatly shown in the tree).

I've previously put down some thoughts regarding representation of unknown states in this issue: https://github.com/yatisht/usher/issues/349

corneliusroemer commented 8 months ago

Looking at the Italian sequence in Nextclade, I notice that it has 45 ambiguous bases - I think that explains what's happening with Usher here: Usher imputes ambiguous and unknown nucleotides. I think it would be good to add to your QC that you should throw out sequences with: a) more than say 5 or 10 ambiguous nucleotides, these are often due to contamination b) throw out sequences that have a large number of imputed bases that differ from reference. While you kick out sequences that have more than 10% Ns (if I'm not mistaken that's the threshold), this doesn't take care of sequences that have singlet Ns instead of ranges. To apply the 10% N rule to these sequences would mean excluding all sequences from the tree that have >10% of their bases imputed to be different to reference.

Both of the above rules would have caused the Italian sequence to be excluded - I think this could really help cleaning up the tree a bit more.

Meanwhile, I will implement some rules like that to filter those sequences out from the pipeline that generates synthetic pango sequences - this isn't too hard as I have information on ambiguous nucleotides and masked nucleotides in the nextclade.tsv

image
AngieHinrichs commented 8 months ago

I just downloaded Italy/PIE_IRCC_15895021/2023|EPI_ISL_18331235|2023-09-24 and aligned it to the reference. I believe its base 1882 corresponds to reference base 1889, and I see a 'Y' there (C or T), so it makes sense to me that UShER infers T by placement.

The UShER web interface results page, in addition to linking to Nextstrain to display the Auspice JSON, does tell about ambiguous and imputed bases below the results table (see the "C1889Y" on the right in the first line of differences from ref, and "1889:T" at the bottom in the list of imputed bases):

image

As food for thought, I think it would be very helpful in general to surface masked sites better. The Auspice export shouldn't show imputed values on the tips but indicate uncertainty.

Masked sites and imputed values of ambiguous bases are two different matters. Problematic Sites are masked before placement, both when adding sequences to the big tree and in the web interface before calling UShER. The web interface shows them in the "Masked SNVs" column of the results table:

image

Branch-specific masking is a step in the daily build process and is not applied to web interface results. (It takes >15 minutes... I'm sure you don't want that much time added to each web request 😄.)

Showing ambiguous bases in the Auspice JSON would be a great improvement, but currently that information is lost when UShER places a sequence. I'm generating Auspice JSON from UShER output files and some metadata, not the full Augur pipeline. For sequences in the big tree, it would be possible to keep track of ambiguous bases as additional metadata -- but the metadata is already really large, the way I'm fetching it is slow, and that would make the process even slower. First I need to have time to improve how I fetch metadata, then I can entertain thoughts of adding more metadata for 16 million sequences.

Since I align uploaded sequences to the reference, and UShER reports imputed bases for placed sequences, I guess it would be possible and not too slow to work that into the Auspice JSON for uploaded sequences. Sequences that are already in the tree (looked up by ID or added in subtree for context) would need to get that info as metadata though.