Open corneliusroemer opened 2 weeks ago
You are definitely right about this. Many of those recommendations have outlived their usefulness and it is something @AngieHinrichs and I have been thinking about how to clean up.
Briefly, a proposed solution is:
This will certainly break some stuff and we'll have to figure that out when we get there. @AngieHinrichs and @corneliusroemer what do you think?
I also think it may be a good time to operationalize the branch-specific screwy site detection approach I made.
Sounds good. I'll try to get to the MAPLE-ification and matOptimize soon. Yes, for sure I expect 11083 and some others to still be problematic, at least in some major lineages, but let's find out!
> I also think it may be a good time to operationalize the branch-specific screwy site detection approach I made.
As in, recode in C++ in matUtils so it runs faster than its current Pythonic 19 hours? Or just run it every week or month or so, and mask accordingly?
Cool. Thanks, Angie.
Let's not bother with a recode until we decide we really like it and want to run it often. It is not clear to me that branch masking will be something we want to run more than say monthly-ish?
UShER SARS-CoV-2 masks quite a lot of sites (I think around 270, i.e. almost 1% of the genome) based on this VCF: https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/problematic_sites_sarsCov2.vcf. I think that list includes quite a few sites that are no longer problematic.
The last update to that mask list was more than 3 years ago, so it's clearly no longer maintained. It might be worth transitioning away from it. Maybe turn the existing sites into branch-specific masks for old clades, but not for new, recent ones?
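For anyone who wants to check the "around 270 sites, almost 1%" estimate, here is a minimal sketch of how the masked sites could be counted from the VCF text. It assumes the FILTER column (7th tab-separated field) carries the value `mask` for sites to be masked, and that only those entries matter; both are assumptions about the file and about how UShER uses it, not confirmed here. The function names and the example records are made up for illustration.

```python
GENOME_LENGTH = 29903  # length of the SARS-CoV-2 reference (NC_045512.2 / MN908947.3)


def masked_positions(vcf_text, filter_value="mask"):
    """Return the set of 1-based positions whose FILTER column equals filter_value.

    Assumption: the problematic-sites VCF marks sites to mask via FILTER.
    """
    positions = set()
    for line in vcf_text.splitlines():
        # Skip blank lines and VCF header/meta lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 7:
            continue  # not a complete VCF record
        if fields[6] == filter_value:
            positions.add(int(fields[1]))
    return positions


def masked_fraction(positions, genome_length=GENOME_LENGTH):
    """Fraction of the genome covered by the given masked positions."""
    return len(positions) / genome_length


# Made-up records, only to show the expected input shape.
example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "MN908947.3\t100\t.\tC\tT\t.\tmask\t.",
    "MN908947.3\t200\t.\tC\tT\t.\tcaution\t.",
    "MN908947.3\t300\t.\tG\tA\t.\tmask\t.",
])
```

With 270 masked positions, `masked_fraction` gives 270 / 29903, i.e. about 0.9% of the genome, which matches the rough estimate above.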
I noticed this when designating stuff within KS.1.1.1, trying to untangle what happened. The two sites here being masked really makes things more difficult to untangle: C2091T and C16887T. These are the relevant lines: