Imputation and null genotypes/indels

martinghunt commented 1 year ago

We've encountered some unexpected behaviour at a position where many of the samples have a deletion, and so have null genotype calls in the VCF file. Could you please advise?

The position is Spike 157. The nucleotide coords in the genome are 22031-22033.

There are 9807 samples. Here's a breakdown of genotype counts from the VCF made by faToVcf:

Position  GT=.    GT=0   GT=1   GT=2
22031     4226    5580     1       0
22032     4225    5572    10       0
22033     4225    5566     5      11

ie most are REF or null calls. Very few are non-null and non-ref.

Most of the null genotype samples are imputed as having the non-ref amino acid C. There's a taxonium screenshot attached (screenshot.all.png). We would have expected these to be ref calls (amino acid F).

Out of curiosity, I took a clade of the tree rooted at node 5219, highlighted in the screenshot. This looked to be about an even split between F and C. Then reran on just that subset (plus the covid ref genome). On this subset, the non-ref imputation switches to mostly amino acid S instead of C - screenshot attached screenshot.subset.png.

Nearly all the genotypes in the subset are actually null. The non-nulls have slightly more REF calls.:

Position  GT=.    GT=0   GT=1
22031     4000      17      1
22032     3992      16     10
22033     3992      16     10

Is this expected when imputing? I guess we might be unlucky in the order the samples get added to the tree and one non-ref imputation could propagate down the rest of the tree?

There's a zip file attached with shell script with the commands, the input files, and output files including trees and jsonl files.

test_files.zip

AngieHinrichs commented 1 year ago

Thanks for the very complete test_files. I think your guess is right, that when so many sequences have N (null call), and a few sequences have alternate alleles, the order in which they are placed can affect which alternate allele gets imputed for subsequent Ns.

And I see it's the Delta branch you're looking at, and the site of a familiar deletion in Delta (22029-22034). You're working with raw data from the SRA, so your perspective on this might be a little different, but my impression from looking at many consensus genomes placed in the UShER tree is that when there is a deletion, it can manifest in several different ways in different genome assemblies:

genome assembly pipelines might omit the deleted bases (the best outcome) or they might insert Ns in the consensus fasta (not great but OK for usher), or in the worst case, they may backfill with reference bases (a horror for phylogenetics [Edit: I've changed my mind about this! reference backfilling may be better for deletions in this case, but is still a horror for substitutions because it causes false reversions])
genome assembly pipelines might fail to detect the deletion due to a small number of stray contaminant reads, not caring that the coverage drops from 1000 to 1 or 2 in the deleted region
genome assembly pipelines might report substitutions in the deleted region due to imperfectly matching reads whose alignment is extended a few bases into the deleted region instead of skipping to the other side

Due to those issues, it's maddening to try to tell from consensus genome sequences exactly when a deletion happened relative to other mutations, and IMO substitutions within a region believed to be deleted are dubious and best ignored. (It's even worse for insertions.)

The Delta era was when amplicon dropout and inconsistent detection of deletions started to really make things difficult for UShER (and those problems only got worse with all the Omicrons). In response we added branch-specific masking to the daily build, so that we could mask certain sites only for Delta (and later, mask other sites only for BA.1 etc.). In retrospect we realized that a deletion at 11288-11296 had also caused trouble for the Alpha, Beta and Gamma branches.

We do two kinds of masking at two different stages of the daily build.

After aligning sequences to the reference and before placing them in the tree with UShER, we mask out all sites with mask in the 7th column of https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/problematic_sites_sarsCov2.vcf. Those positions are omitted from the VCF file passed to UShER, as if all sequences had the reference allele. We have done this type of masking pretty much since the beginning. In Delta, inconsistent detection of S:D142G (G21987A) was so bad that 21987 was added to that list, so it is completely invisible in the UShER tree.
After UShER places sequences in the tree, we do branch-specific masking: we make a list of mutations to discard from all descendants of specified nodes. Details are in https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/utils/otto/sarscov2phylo/maskDelta.sh . These are the sites that we mask in Delta:
- 22027-22034: positions affected by deletion of S:157-158 (22029-22034)
- 28246-28253: positions affected by deletion of ORF8:119-120 (28248-28253)
- 28271: another deleted base
- 21846: inconsistently detected S:T95I (C21846T)
- 21302, 21304, 21305: believed to be either artefact or homoplasic (cov-lineages/pango-designation#398)
- 28254, 28461 also considered problematic in Delta (cov-lineages/pango-designation#435)

We run matOptimize after the branch-specific masking (using matUtils mask -m with the file of sites and node IDs created by the maskDelta.sh script above), so that sequences that were joined together by substitutions that we don't trust (such as those in deletions or sites especially prone to dropout/contam) can be moved to wherever they fit best according to their non-masked substitutions.

If you're working on a uniform genome assembly pipeline then I'd love to hear your perspective on deletions and other difficult-to-assemble situations, and whether it's appropriate to mask them away to make a cleaner-looking tree. 🙂

iqbal-lab commented 1 year ago

Thanks for this detailed reply! Those positions you generally mask are really valuable info which we can use to test our own assemblies. I completely agree, for heterogeneous data you have no choice but to mask like this. In our case we are working on a uniform pipeline and looking at our output only. so this example is very striking - all of those thousands of samples have the same deletion called the same way.

maybe it is possible to modify the VCF passed into usher to avoid this problem by handling SNPs that overlap deletions differently? Seems unlikely to me, as you only want "special treatment" where an deletion dominates a clade, and so we want to ignore SNPs at that site
suppose you had uniform assemblies that didn't have contamination, and didn't put Ns for deletions. Wouldn't you want to treat the deletion as "not there" and essentiually replace with the ancestral allele for that clade? Would it be possible to pass in two inputs to usher, one which is snps only and one which is indels only. You then do a 3 stage process: use the snp input to infer an initial tree. Then use the indel inoput to infer where the indels occur in the tree. Then combine this info with the first tree to impute the ancestral allele onto the clade which has the indel? Or is this mad?

AngieHinrichs commented 1 year ago

so this example is very striking - all of those thousands of samples have the same deletion called the same way.

Yes! So I'm curious... if you look at BAMs for a few examples where the deletion is not called (ref in VCF instead of '.' null call), do they have a noticeable drop in coverage at the site of the deletion? And for the very few cases where a non-reference base is called, do they have not only a big drop in coverage, but maybe also some reads whose alignment is extended into the deletion region because it's near the end of the read and there aren't enough bases to make the alignment skip to the other side of the deletion??

maybe it is possible to modify the VCF passed into usher to avoid this problem by handling SNPs that overlap deletions differently? Seems unlikely to me, as you only want "special treatment" where an deletion dominates a clade, and so we want to ignore SNPs at that site

Right, I don't want to mask out deleted-in-Delta positions for every sequence regardless of whether it's Delta or not. I have idly considered using nextclade to identify sequences in clades known to have deletions, and then construct the VCF with appropriate masking for each sequence, but have not tried it yet.

suppose you had uniform assemblies that didn't have contamination, and didn't put Ns for deletions. Wouldn't you want to treat the deletion as "not there" and essentiually replace with the ancestral allele for that clade?

I would love to have uniform perfect assemblies as input! 🙂 Indels are not in usher's vocabulary, so for sequences with deletions, Ns or reference bases don't cause any trouble. The real trouble is when there are spurious "substitutions" in deleted regions... and Ns. So yeah, in retrospect, maybe faToVcf should have been asserting the reference base for "-" instead of N all this time... [Edit] but then so many existing consensus genome sequences have Ns in deleted regions, so regardless of faToVcf there would still be lots of Ns, unless I had only your uniform assembles. So yes, reference would be better as long as usher does not know about indels.

Would it be possible to pass in two inputs to usher, one which is snps only and one which is indels only. You then do a 3 stage process: use the snp input to infer an initial tree. Then use the indel inoput to infer where the indels occur in the tree. Then combine this info with the first tree to impute the ancestral allele onto the clade which has the indel? Or is this mad?

That would be less of a hack than using nextclade and hardcoded indel expectations, or hardcoded branch-specific masking. 🙂 But a lot more work, since usher does not have a concept of deletion much less insertion. (yet; Yatish et al. are redesigning UShER to handle indels too.)

Honestly, the current state of genome assembly pipelines and indels makes me worry that if & when usher does know about indels, the wildly inconsistent indel inputs will only confuse it, and intervention (masking and/or input-tweaking) will be even more necessary.

AngieHinrichs commented 1 year ago

I'm realizing now that I did a 180 on on reference-backfilling in deleted regions: first I said

in the worst case, they may backfill with reference bases (a horror for phylogenetics)

and then

Indels are not in usher's vocabulary, so for sequences with deletions, Ns or reference bases don't cause any trouble. The real trouble is when there are spurious "substitutions" in deleted regions... and Ns. ... So yes, reference would be better as long as usher does not know about indels.

So I need to clarify.

For deletions, since usher does not know about indels, and some sequences have spurious substitutions in regions that I think are deleted, reference-backfilling instead of N-filling in the consensus fasta (or omitting deleted bases in fasta and then converting deleted bases to reference instead of N in VCF) would limit the damage that those spurious substitutions do in the presence of lots of Ns.
For plain old substitutions in amplicon dropout regions (a completely different problem, not what we're discussing here), reference backfilling is still a horror because it causes false reversions that mess up the tree.

I still think it's best to get rid of the spurious substitutions in deleted regions if possible -- the less noise going into UShER, the better. Unless anyone really believes that the deletion is spontaneously undone and substitutions really happen there. 🙂

iqbal-lab commented 1 year ago

Thanks @AngieHinrichs ! Will digest, but it does make me think the solution for us is to make our own covid-specific faToVcf which replaces deletion (minus sign) with ref. (It only works for sars-cov-2 as we know ref==ancestral).

iqbal-lab commented 1 year ago

Actually could keep faToVcf but change "-" to Ref in the input to faToVcf . This would make it v quick to test if things look better.

iqbal-lab commented 1 year ago

Also @AngieHinrichs , we'll check out and answer this question :

if you look at BAMs for a few examples where the deletion is not called (ref in VCF instead of '.' null call), do they have a noticeable drop in coverage at the site of the deletion? And for the very few cases where a non-reference base is called, do they have not only a big drop in coverage, but maybe also some reads whose alignment is extended into the deletion region because it's near the end of the read and there aren't enough bases to make the align

AngieHinrichs commented 1 year ago

I did a quick test of branch specific masking -- files are in test_files_added.tar.gz. I made a script maskDeletions.sh which is like my maskDelta.sh, but only the deletions, and only for the VoCs that I can spot in your out.all tree (Beta, Delta, BA.1, BA.2), and a minimal script maskDeletionsDelta.sh which masks only the Delta deletions so I could apply that to your out.subset tree (since that is Delta, it ends up effectively masking those sites in the whole out.subset tree), then optimize, and see what happened to the structure. These are the commands that I ran:

bash ./maskDeletions.sh out.all.optimized.pb out.all.opt.bsmask.pb
matOptimize --min-improvement 0.0000001 -i $treeOutPb -o out.all.opt.bsmask.opt.pb

bash ./maskDeletionsDelta.sh out.subset.optimized.pb out.subset.opt.bsmask.pb
matOptimize --min-improvement 0.0000001 -i  out.subset.opt.bsmask.pb  -o out.subset.opt.bsmask.opt.pb

usher_to_taxonium -i out.subset.opt.bsmask.opt.pb \
    --genbank ~/github/taxonium/taxoniumtools/test_data/hu1.gb \
    --name_internal_nodes \
    --title "subset (optimized), deletions masked, re-optimized" \
    --output out.subset.opt.bsmask.opt.jsonl.gz

(Your --genbank path will be different; I haven't looked at the changes in the out.all tree.) From a quick look at your subset tree vs. the masked and re-optimized tree in taxonium, as expected there is no longer a big S:157S branch:

I noticed another little structural difference by eye, and that came down to deleted base 28253:

Deleted base 28271 also removes a branch.

Regarding the substitutions that we believe to be inconsistently detected in Delta: S:G142D (G21987A) still looks a little...incomplete? Are there a lot of sequences with N at 21987? How about S:T95I (C21846T)?

AngieHinrichs commented 1 year ago

(and of course I'd like to know if tweaking the MSA fasta so that deletions are forced to reference instead of N has a similar structural effect on the tree -- only a few samples should have the oddball substitutions, and may end up on their own little branch, but not on a big branch that pulls in so many sequences with Ns.)

AngieHinrichs commented 1 year ago

FYI since there have been other questions about masking recently e.g. #324 I replaced the bash script (maskDelta.sh) with a YAML description of branches and masked sites/ranges/reversions:

https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/utils/otto/sarscov2phylo/branchSpecificMask.yml

(and a python script that reads branchSpecificMask.yml and does what the bash script did)

yatisht / usher

Imputation and null genotypes/indels #325