yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License

Mask fewer sites, the mask sites include lots of false positives #384

Open corneliusroemer opened 2 weeks ago

corneliusroemer commented 2 weeks ago

UShER's SARS-CoV-2 setup masks quite a lot of sites (I think around 270, i.e. almost 1% of the genome) based on this VCF: https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/problematic_sites_sarsCov2.vcf. I think that list includes quite a few sites that are no longer problematic.

The last update to that mask list was more than 3 years ago, so it's clearly no longer maintained. It might be worth transitioning away from it. Maybe turn the existing sites into branch-specific masks for old clades, but not for new, recent ones?

I noticed this when designating lineages within KS.1.1.1 and trying to untangle what happened. Two of the masked sites make this much harder to untangle: C2091T and C16887T.

These are the relevant lines:

MN908947.3  16887   .   C   T,Y .   mask    SUB=NDM,RCD;EXC=highly_homoplasic;SRC_COUNTRY=.;SRC_LAB=.;GENE=gene-orf1ab;AA_POS=5541;AA_REF=Y;AA_ALT=I,X
MN908947.3  2091    .   C   T,Y .   mask    SUB=NDM;EXC=highly_ambiguous,homoplasic,narrow_src;SRC_COUNTRY=India,UK;SRC_LAB=NCDC,NU-OMICS;GENE=gene-orf1ab;AA_POS=609;AA_REF=T;AA_ALT=I,X
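To make those two records easier to read, here is a minimal Python sketch that splits a line of the problematic-sites VCF into its columns and pulls out the exclusion reasons. The names `ProblematicSite` and `parse_site` are made up for illustration and are not part of UShER; the sketch assumes the INFO column contains no whitespace, which holds for the two lines quoted above but has not been checked against the whole file.

```python
from typing import Dict, List, NamedTuple


class ProblematicSite(NamedTuple):
    pos: int                 # 1-based position on MN908947.3
    ref: str                 # reference allele
    alts: List[str]          # alternate alleles, e.g. ["T", "Y"]
    filter_status: str       # FILTER column, e.g. "mask"
    info: Dict[str, str]     # INFO column as key -> raw value


def parse_site(line: str) -> ProblematicSite:
    """Parse one data line of the problematic-sites VCF (hypothetical helper)."""
    # Real VCF files are tab-separated; the lines pasted in this thread use
    # spaces, so fall back to generic whitespace splitting when no tab is present.
    fields = line.rstrip("\n").split("\t") if "\t" in line else line.split()
    _chrom, pos, _id, ref, alt, _qual, flt, info = fields[:8]
    info_dict: Dict[str, str] = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_dict[key] = value
    return ProblematicSite(int(pos), ref, alt.split(","), flt, info_dict)


lines = [
    "MN908947.3  16887   .   C   T,Y .   mask    SUB=NDM,RCD;EXC=highly_homoplasic;"
    "SRC_COUNTRY=.;SRC_LAB=.;GENE=gene-orf1ab;AA_POS=5541;AA_REF=Y;AA_ALT=I,X",
    "MN908947.3  2091    .   C   T,Y .   mask    SUB=NDM;EXC=highly_ambiguous,homoplasic,"
    "narrow_src;SRC_COUNTRY=India,UK;SRC_LAB=NCDC,NU-OMICS;GENE=gene-orf1ab;"
    "AA_POS=609;AA_REF=T;AA_ALT=I,X",
]

sites = [parse_site(line) for line in lines]
for site in sites:
    if site.filter_status == "mask":
        reasons = site.info["EXC"].split(",")
        print(site.pos, site.ref, site.alts, reasons)
```

Both sites are masked because they were flagged as homoplasic (plus ambiguity and a narrow source for 2091), which is exactly the kind of property that can change as new lineages arise.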

russcd commented 2 weeks ago

You are definitely right about this. Many of those recommendations have outlived their usefulness and it is something @AngieHinrichs and I have been thinking about how to clean up.

Briefly, a proposed solution is:

  1. Refactor so that samples are in MAPLE representation --- we need this for other reasons but it will be easiest to add with a big overhaul and will make some of this easier.
  2. Determine which subset of sites has remained potentially problematic (e.g., 11083 is still likely not usable?) and mask them. One way to do this is to just compute the parsimony score for each currently listed problematic site without updating the topology --- well-behaved sites should not stand out tremendously with respect to parsimony:allele_freq, even if they were not used to infer the tree.
  3. Reoptimize the existing tree using the new, less-masked samples in MAPLE format.
  4. ???
  5. Profit.
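Step 2 can be illustrated with a toy example. The sketch below is not UShER or matUtils code: it runs the classic Fitch algorithm on a small, strictly binary tree written as nested tuples and divides the resulting parsimony score by the number of samples carrying a non-reference allele. That ratio is one possible reading of parsimony:allele_freq, and the function names, the tree, and the alleles are all invented for illustration. Real UShER trees contain polytomies and millions of leaves, so an actual implementation would need a generalized, non-recursive version.

```python
from typing import Dict, FrozenSet, Tuple, Union

# A tree is either a leaf name or a pair of subtrees (strictly binary).
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree, alleles: Dict[str, str]) -> Tuple[FrozenSet[str], int]:
    """Return (candidate states, minimum number of mutations) for a subtree."""
    if isinstance(tree, str):
        return frozenset([alleles[tree]]), 0
    left, right = tree
    left_states, left_cost = fitch(left, alleles)
    right_states, right_cost = fitch(right, alleles)
    common = left_states & right_states
    if common:
        # Children agree on at least one state: no extra mutation needed here.
        return common, left_cost + right_cost
    # Children disagree: take the union and count one mutation.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_per_alt(tree: Tree, alleles: Dict[str, str], ref: str) -> float:
    """Parsimony score divided by the number of non-reference samples."""
    _, score = fitch(tree, alleles)
    alt_count = sum(1 for allele in alleles.values() if allele != ref)
    return score / alt_count if alt_count else 0.0


# Fixed topology with eight samples; it is never updated below.
tree: Tree = ((("s1", "s2"), ("s3", "s4")), (("s5", "s6"), ("s7", "s8")))

# Well-behaved site: the alternate allele T defines a single clade.
clean = {"s1": "T", "s2": "T", "s3": "T", "s4": "T",
         "s5": "C", "s6": "C", "s7": "C", "s8": "C"}

# Badly behaved site: T is scattered across the tree.
noisy = {"s1": "T", "s2": "C", "s3": "T", "s4": "C",
         "s5": "T", "s6": "C", "s7": "T", "s8": "C"}

clean_ratio = parsimony_per_alt(tree, clean, ref="C")
noisy_ratio = parsimony_per_alt(tree, noisy, ref="C")
print(clean_ratio, noisy_ratio)  # prints 0.25 1.0
```

Both sites have the same alternate-allele count (4 of 8 samples), but the clean site needs one mutation on the fixed tree while the scattered one needs four, so only the latter stands out.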

This will certainly break some stuff and we'll have to figure that out when we get there. @AngieHinrichs and @corneliusroemer what do you think?

I also think it may be a good time to operationalize the branch-specific screwy site detection approach I made.

AngieHinrichs commented 1 week ago

Sounds good. I'll try to get to the MAPLE-ification and matOptimize soon. Yes, for sure I expect 11083 and some others to still be problematic, at least in some major lineages, but let's find out!

> I also think it may be a good time to operationalize the branch-specific screwy site detection approach I made.

As in, recode in C++ in matUtils so it runs faster than its current Pythonic 19 hours? Or just run it every week or month or so, and mask accordingly?

russcd commented 1 week ago

Cool. Thanks, Angie.

Let's not bother with a recode until we decide we really like it and want to run it often. It is not clear to me that branch masking will be something we want to run more than say monthly-ish?