yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
122 stars 41 forks source link

bug in counting of mutations at site 95 in 21J clade #312

Closed jbloom closed 1 year ago

jbloom commented 1 year ago

I have been using matUtils, and I have encountered what I think is some bug in how mutations are being counted specifically for mutations to I at site 95 in spike for the Delta clades (eg, 21J). I cannot for the life of me figure out what is causing this as the results seems sensible for other mutations for Delta and for other clades.

Specifically, clade 21J should have a fair number of mutations to/from I at site 95 in spike. See for instance this nextstrain view: https://nextstrain.org/ncov/gisaid/global/6m?c=gt-S_95

But when I use matUtils to count the mutations either to or from I at site 95 in spike, I get 0 counts for clade 21J. This contrasts to other clades where there are counts.

See the code block below:

#!/bin/bash

curl http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/2022/12/18/public-2022-12-18.all.masked.nextclade.pangolin.pb.gz > mat.pb.gz
wget -O - http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/bigZips/wuhCor1.fa.gz | gunzip -c > ref.fasta
wget -O - http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/bigZips/genes/ncbiGenes.gtf.gz | gunzip -c > ref.gtf

matUtils extract -i mat.pb.gz --clade "21J (Delta)" -o 21J.pb.gz
matUtils summary -i 21J.pb.gz -g ref.gtf -f ref.fasta -t 21J.tsv
cat 21J.tsv | grep S:.95I | wc -l > 21J_lines_with_95I.txt
cat 21J.tsv | grep S:I95 | wc -l > 21J_lines_with_I95.txt

matUtils extract -i mat.pb.gz --clade "21L (Omicron)" -o 21L.pb.gz
matUtils summary -i 21L.pb.gz -g ref.gtf -f ref.fasta -t 21L.tsv
cat 21L.tsv | grep S:.95I | wc -l > 21L_lines_with_95I.txt
cat 21L.tsv | grep S:I95 | wc -l > 21L_lines_with_I95.txt

The files 21J_lines_with_95I.txt and 21J_lines_with_I95.txt both have counts of zero, reflecting the (seemingly incorrect) count of no such mutations in the matUtils parsed mutation-annotated tree. In contract, the files 21L_lines_with_95I.txt and 21L_lines_with_I95.txt have non-zero counts reflecting the (presumably correct) existence of such mutations in clade 21L.

AngieHinrichs commented 1 year ago

Nucleotide 21846 (the S:95 mutation) is masked in the Delta branch of the UCSC/UShER tree because it was very unreliably detected and that led to false branching problems in the tree. Starting with Delta, and increasingly with Omicron, amplicon dropout and suboptimal default configs in some genome assembly pipelines have caused a lot of problems, so as time goes on I'm masking more sites. In addition to mutations that frequently have false reversions to reference due to amplicon dropout, I often mask indel regions because sometimes stray/contam read alignments extend into those gaps and cause false substitutions. My script that masks specific mutations in specific branches of the tree after new sequences are placed in the tree and before matOptimize is here.

jbloom commented 1 year ago

@AngieHinrichs, great, this is really helpful and explains things! We really appreciate all your quick help in understanding these things and in general maintaining such great MATs for easy use.

This may be too much of a request to implement easily, but I was wondering if you might consider eventually somehow putting the masked sites for each clade in a YAML or some other machine-readable format and documenting a bit more clearly that this masking happens? It totally makes sense when I look at this script, but for the types of analyses I often am trying to do (see which mutations are present to what extent in different clades), it's helpful to have an easy way to distinguish between what mutations actually have zero counts versus are just masked in a specific clade.

(Eg, we stumbled across this T95I when we were trying to relate our deep mutational scanning data to mutations that might preferentially occur in Omicron versus Delta lineages.)

jbloom commented 1 year ago

Actually, I was able to extract mutations without much problem from script, so above is probably pretty low priority.

I'll go ahead and close this issue.