Question/discussion about low coverage samples

theosanderson commented 1 year ago

A super basic question and apologies if it's already been answered here.

As I have been exploring molnupiravir-associated sequences I relatively often see patterns like this:

When I look at this in Nextclade, or build a tree, I find that A is a low-coverage sample, and my interpretation of the data is that it actually has the same genotype as B and C.

I guess the behaviour here depends on the order of sample placement.

My naive questions would be:

Would the idealised behaviour of an UShER/matOptimize algorithm (ignoring performance limitations etc.) place A, B and C together?
Is this achievable or planned? I can see that it's super challenging because the MAT doesn't capture the ambiguity at these positions. I guess I'm wondering whether some periodic process that looks at the VCF and the local tree could allow this.

corneliusroemer commented 1 year ago

I agree it would be nice if the ambiguity was eventually included in the MAT - like deletions.

I didn't know that the MAT didn't keep track of ambiguities. That may explain why we often see ladders with apparent reversions on the way to good sequences. Here A is annotated as lacking some defining mutations of B/C, which is objectively wrong because it is just N at those positions.
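
To make the distinction concrete, here is a minimal sketch of how one might check whether the "missing" defining mutations of a sample are real reference calls or just N. The function name, the toy sequences and the `(position, alt_base)` representation are all assumptions for illustration; this is not how UShER or Nextclade do it internally.

```python
def classify_defining_mutations(sample_seq, reference_seq, defining_mutations):
    """Sort defining mutations by what the sample actually shows at each site.

    defining_mutations: list of (1-based position, alternate base) tuples
    (a hypothetical representation chosen for this sketch).
    """
    result = {"present": [], "masked": [], "reference_call": [], "other": []}
    for pos, alt in defining_mutations:
        base = sample_seq[pos - 1].upper()
        if base == alt.upper():
            result["present"].append(pos)          # mutation is really there
        elif base == "N":
            result["masked"].append(pos)           # no coverage: not a reversion
        elif base == reference_seq[pos - 1].upper():
            result["reference_call"].append(pos)   # sample asserts the reference allele
        else:
            result["other"].append(pos)            # some third base
    return result


# Toy data: B/C are defined by G3T, A5G and T8A; sample A has N at sites 3 and 8.
reference = "ACGTACGTAC"
sample_a = "ACNTGCGNAC"
bc_defining = [(3, "T"), (5, "G"), (8, "A")]

report = classify_defining_mutations(sample_a, reference, bc_defining)
print(report)
```

In this toy case nothing lands in `reference_call`, so there is no evidence that A differs from B/C; the apparent reversions are only missing data.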

Cleaning up periodically sounds like a good idea! Alternatively, they could be thrown out altogether.

AngieHinrichs commented 1 year ago

Yes, we could be doing better for this.

In the short term: send me sequence names when you think they could be better placed. Periodically I remove some sequences from the usher tree (because they seem to me to be causing problematic branches that are reported in pango-designation issues or that I happen to notice) and re-optimize, and then let the sequences be added back in the next daily build. Sometimes due to Ns the sequences are added back in a better place; sometimes they don't have Ns but assert the reference allele, and are added back in the same place as before. Occasionally, if some sequences really seem to be causing trouble, I permanently exclude them.

In the longer term, I should build and maintain an all-sequence MSA equivalent so I can give matOptimize the full information for all sequences (including where they have ambiguous bases) instead of leaving it to work with already-imputed alleles. VCF (which I generate for new sequences in the daily build as the input for UShER) is a very inefficient format for that, but matOptimize now supports a format developed by Nicola De Maio for MAPLE (https://www.biorxiv.org/content/10.1101/2022.03.22.485312v2.full) that could be incrementally built up as new sequences are added.
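
As a rough illustration of why a per-sample diff representation can be built up incrementally and still keep the ambiguous positions, here is a small sketch. It is only loosely modeled on the idea of a diff-against-reference format; the tuple layout and the function name are assumptions for this example and it does not reproduce the actual MAPLE file format.

```python
def to_diff_entries(reference, sample):
    """Encode an aligned sample as differences from the reference.

    Substitutions become (base, position); runs of N or '-' become
    (symbol, start_position, run_length). Positions are 1-based.
    This layout is a simplification assumed for illustration.
    """
    if len(reference) != len(sample):
        raise ValueError("sample must be aligned to the reference")
    entries = []
    i = 0
    n = len(reference)
    while i < n:
        base = sample[i].upper()
        if base in ("N", "-"):
            # Collapse a run of missing or deleted sites into one entry.
            start = i
            while i < n and sample[i].upper() == base:
                i += 1
            entries.append((base.lower(), start + 1, i - start))
        else:
            if base != reference[i].upper():
                entries.append((base.lower(), i + 1))
            i += 1
    return entries


reference = "ACGTACGTAC"
low_coverage_sample = "ACNNNCGAAC"
entries = to_diff_entries(reference, low_coverage_sample)
print(entries)  # [('n', 3, 3), ('a', 8)]
```

Each new sequence only adds its own short list of entries, and the N run is stored explicitly rather than being resolved to some allele.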

theosanderson commented 1 year ago

Thanks @AngieHinrichs -- that's really interesting. It's great that matOptimize is, in theory, already able to handle this if given the full MSA. It also wasn't totally obvious to me what the "right" position would be, because I guess by parsimony the two placements are equally valid in the case above.
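
A toy version of that tie, and of how it disappears once the Ns have been imputed, might look like the following. The sequences and both helper functions are made up for illustration, and the cost is just a count of mismatches on the new branch; it is a sketch of the reasoning, not of UShER's or matOptimize's actual implementation.

```python
def branch_cost(sample_seq, node_seq):
    """Count mutations needed on the branch from node to sample, skipping N sites."""
    return sum(1 for s, n in zip(sample_seq, node_seq) if s != "N" and s != n)


def resolve_ambiguity(sample_seq, node_seq):
    """Fill each N with the allele of the node the sample was attached to."""
    return "".join(n if s == "N" else s for s, n in zip(sample_seq, node_seq))


parent_node = "ACGTGCGTAC"   # hypothetical parent genotype
bc_node = "ACTTGCGAAC"       # parent plus G3T and T8A (the B/C genotype)
sample_a = "ACNTGCGNAC"      # low coverage: N at sites 3 and 8

# With the Ns kept, both placements cost nothing, so parsimony cannot choose.
cost_at_parent = branch_cost(sample_a, parent_node)
cost_at_bc = branch_cost(sample_a, bc_node)

# If A was attached to the parent first, its Ns take the parent's alleles,
# and moving it next to B/C afterwards looks like two extra mutations.
imputed_a = resolve_ambiguity(sample_a, parent_node)
cost_at_bc_after_imputation = branch_cost(imputed_a, bc_node)

print(cost_at_parent, cost_at_bc, cost_at_bc_after_imputation)  # 0 0 2
```

So with the ambiguity retained the two positions tie, while with imputed alleles the sample stays wherever it happened to be placed first, which fits the guess that placement order matters.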

yatisht / usher

Question/discussion about low coverage samples #328