yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
122 stars 41 forks source link

an automated way of identifying masked sites? #324

Closed jbloom closed 1 year ago

jbloom commented 1 year ago

I'm not sure if this is feasible given the current structure of the data, but I at least wanted to raise the issue for consideration.

Right now in the pre-built mutation-annotated tree, some sites are masked either globally (#314) or in a clade-specific manner (#312). The latter masking in particular requires detailed analysis of a bash script to identify which sites are masked.

At least for us, a common use case of UShER is looking to see how often a mutation occurs along the tree. However, when doing those queries, it is not possible to distinguish mutations that are masked either globally or in a specific clade just from mutations that don't have any counts. (Or if it's possible to distinguish this, I'm not sure how.)

Is it somehow possible to encode within the mutation-annotated tree or some associated metadata the information about which sites are masked both globally and clade specifically?

AngieHinrichs commented 1 year ago

Yes, sorry it's not easy to discover what sites are masked, especially for branch-specific sites. @rneher has pointed this out too, and the issue comes up now and then in cov-lineages/pango-designation issues.

Here is a quick way to get a list of the Problematic Sites that are masked in all sequences after alignment to reference and before usher adds them to the tree:

curl -Ss  https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/problematic_sites_sarsCov2.vcf \
| grep -v ^# | grep -w mask | cut -f 2 \
    > positionsMaskedInAll.txt

But the branch-specific masking (performed after usher adds sequences to the tree and before optimization) is currently encoded in a non-portable bash script https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/utils/otto/sarscov2phylo/maskDelta.sh and that requires some real effort to analyze.

No promises, but what if the script were rewritten to use a JSON file that specified sites like this:

[ { "lineage": "B.1.617.2",
    "representative": "IND/GBRC714b/2021",
    "ranges": [ [22027, 22034], [28246, 28253] ],
    "sites": [ 21302, 21304, 21305, 21846, 28254, 28271, 28461 ],
    "comment": "Delta has deletions at S:157-158 (22029-22034), ORF8:119-120 (28248-28253) and 28271.  S:95 (21846) is also very unreliably detected in Delta and caused problems e.g. AY.100 split.  21302, 21304, 21305 caused withdrawn lineage AY.89 (p-d #398).  28254 and 28461 caused withdrawn lineage AY.96 (p-d #435)"
  },
  { "lineage": "B.1.1.7",
    "representative": "Italy/TAA-1900553896/2021",
    "representativeBacktrack": 1,
    "ranges": [ [11288, 11296] ]
  },
...
  { "lineage": "BA.1.1,
    "representative": "England/ALDP-2BEB0A0/2021",
    "sites": [76, 77],
    "reversions": ["G26530A"],
    "comment": "False reversions on 26530 caused fragmented BA.1.1.1; also lots of noise on 76 and 77."
  }
]

Would a format like that be useful to you? (specifically the lineage, ranges, sites and reversions elements)

In general, you can expect sublineages to inherit branch-specific masking from parent lineages although there are rare disagreements between the Pango lineage hierarchy and UShER tree structure (less rare for lineages designated before mid-2021).

BA.4 and BA.5 inherit all of the BA.2 branch-specific masking because they're placed on the BA.2 branch (even if they're not descended from BA.2).

Recombinant lineages (including, in retrospect, BA.*) are placed wherever they fit into the tree by parsimony, usually near the parent from which they inherited more substitutions, if that parent has been sampled. Branch-specific masking can cause them to appear in the UShER tree as though they have some parental mutations that they actually don't have.

jbloom commented 1 year ago

Great, thanks. This is super helpful in terms of being able to get all masked sites.

If you are able to put the lineage-specific masking into a JSON format like you suggest that would be awesome, but also understand if this is not easy to do.

AngieHinrichs commented 1 year ago

OK, I decided to go with YAML instead of JSON because it's a more flexible format that supports comments, and it's very easy to convert YAML to JSON with python (or even online converters) and of course python etc. have libraries that just read YAML into an object. So this is the new source of branch-specific masking info:

https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/utils/otto/sarscov2phylo/branchSpecificMask.yml

jbloom commented 1 year ago

@AngieHinrichs, sorry for the very slow reply, I got sidetracked with other things. This is awesome and really helpful, and I greatly appreciate it!

AngieHinrichs commented 1 year ago

No worries, glad to hear you find it helpful! I'm finding it easier to update than my old bash script too. 😄

russcd commented 1 year ago

Closing since this looks resolved. Thanks, all.