Closed jbloom closed 1 year ago
Nucleotide 21846 (the S:95 mutation) is masked in the Delta branch of the UCSC/UShER tree because it was very unreliably detected and that led to false branching problems in the tree. Starting with Delta, and increasingly with Omicron, amplicon dropout and suboptimal default configs in some genome assembly pipelines have caused a lot of problems, so as time goes on I'm masking more sites. In addition to mutations that frequently have false reversions to reference due to amplicon dropout, I often mask indel regions because sometimes stray/contam read alignments extend into those gaps and cause false substitutions. My script that masks specific mutations in specific branches of the tree after new sequences are placed in the tree and before matOptimize is here.
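For readers wondering how nucleotide 21846 corresponds to spike site 95, here is a minimal sketch of the arithmetic. It assumes the spike coding sequence starts at nucleotide 21563 in the Wuhan-Hu-1 reference (NC_045512.2); that coordinate and the function name are not from this thread.

```python
# Assumption: spike (S) CDS starts at nucleotide 21563 in the
# Wuhan-Hu-1 reference genome (NC_045512.2), 1-based coordinates.
SPIKE_START = 21563

def spike_codon(nt_pos, gene_start=SPIKE_START):
    """Return (codon number, position within codon), both 1-based."""
    offset = nt_pos - gene_start          # 0-based offset into the gene
    return offset // 3 + 1, offset % 3 + 1

codon, pos_in_codon = spike_codon(21846)
print(codon, pos_in_codon)
```

Under that assumption, 21846 is the second base of spike codon 95, which is why masking this one nucleotide hides substitutions at S:95.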
@AngieHinrichs, great, this is really helpful and explains things! We really appreciate all your quick help in understanding these things and in general maintaining such great MATs for easy use.
This may be too much of a request to implement easily, but I was wondering if you might consider eventually somehow putting the masked sites for each clade in a YAML or some other machine-readable format and documenting a bit more clearly that this masking happens? It totally makes sense when I look at this script, but for the types of analyses I often am trying to do (see which mutations are present to what extent in different clades), it's helpful to have an easy way to distinguish between what mutations actually have zero counts versus are just masked in a specific clade.
(Eg, we stumbled across this T95I when we were trying to relate our deep mutational scanning data to mutations that might preferentially occur in Omicron versus Delta lineages.)
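To illustrate the distinction being asked for, here is a rough sketch of how a machine-readable mask table could be used to separate "masked" from "truly zero" counts. The table layout, the clade key, and the function name are all hypothetical; only the masking of nucleotide 21846 in Delta comes from this thread, and applying it to clade 21J is an assumption.

```python
# Hypothetical mask table: clade name -> set of masked nucleotide sites.
# Only 21846 (masked in Delta) is taken from the discussion; keying it
# under "21J" is an assumption made for illustration.
MASKED_SITES = {
    "21J": {21846},
}

def classify_count(clade, nt_site, count, masked=MASKED_SITES):
    """Label a mutation count as 'masked', 'zero', or 'observed'."""
    if nt_site in masked.get(clade, set()):
        # The count carries no information: the site was masked.
        return "masked"
    return "observed" if count > 0 else "zero"

print(classify_count("21J", 21846, 0))
print(classify_count("21L", 21846, 0))
```

With a table like this, a zero count in 21J at 21846 would be reported as masked rather than as a real absence of mutations.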
Actually, I was able to extract the masked mutations from the script without much problem, so the request above is probably pretty low priority.
I'll go ahead and close this issue.
I have been using matUtils, and I have encountered what I think is a bug in how mutations are counted, specifically for mutations to I at site 95 in spike for the Delta clades (eg, 21J). I cannot for the life of me figure out what is causing this, as the results seem sensible for other mutations in Delta and for other clades.
Specifically, clade 21J should have a fair number of mutations to/from I at site 95 in spike. See for instance this nextstrain view: https://nextstrain.org/ncov/gisaid/global/6m?c=gt-S_95
But when I use matUtils to count the mutations either to or from I at site 95 in spike, I get 0 counts for clade 21J. This contrasts with other clades, where there are counts. See the code block below:
The files 21J_lines_with_95I.txt and 21J_lines_with_I95.txt both have counts of zero, reflecting the (seemingly incorrect) count of no such mutations in the matUtils-parsed mutation-annotated tree. In contrast, the files 21L_lines_with_95I.txt and 21L_lines_with_I95.txt have non-zero counts, reflecting the (presumably correct) existence of such mutations in clade 21L.
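The kind of line counting described here can be sketched in Python. This is not the original code block from the issue. It assumes each line of the per-clade text file lists amino-acid mutations in the form GENE:<ref><site><alt> (for example S:T95I); that line format and the example data are assumptions for illustration.

```python
import re

# Assumed mutation syntax: GENE:<ref aa><site><alt aa>, e.g. "S:T95I".
TO_I = re.compile(r"\bS:[A-Z*]95I\b")    # any residue -> I at spike 95
FROM_I = re.compile(r"\bS:I95[A-Z*]")    # I -> any residue at spike 95

def count_matching_lines(lines, pattern):
    """Count lines that contain at least one match for pattern."""
    return sum(1 for line in lines if pattern.search(line))

# Made-up example lines, not real matUtils output.
example_lines = [
    "node_1\tS:T95I;ORF1a:A1306S",
    "node_2\tS:I95T",
    "node_3\tS:D950N",   # site 950, must not be counted as site 95
    "node_4\tS:T95I",
]

to_i = count_matching_lines(example_lines, TO_I)
from_i = count_matching_lines(example_lines, FROM_I)
print(to_i, from_i)
```

If the site is masked on a branch, counts like these would come out as zero for that clade even though the mutation is common in the underlying sequences.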