yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License

Recombinant lineage nomenclature miscBA1BA2PostX #235

Open ktmeaton opened 2 years ago

ktmeaton commented 2 years ago

This week (2022-04-17), I noticed the public trees have a new(?) nomenclature for recombinant lineages that takes the format miscBA1BA2PostX. Some specific examples:

Could you provide some guidance on what this nomenclature means? Thanks!

corneliusroemer commented 2 years ago

It means it's probably one or more recombinant lineages of BA.1 and BA.2 with a breakpoint after nuc8393 etc. that have not (yet) been given a Pango name, nor been proposed in an issue.

The reason they need to be given a name: if they aren't, they would wrongly be called something else, like XM, since they descend from the branch that has XM on it. To avoid false-positive XM calls, one needs to label them something else. And so Angie gives them a label that summarizes what they are.

Makes sense? The above is my guess after having discussed this with @AngieHinrichs, but I haven't seen the labels myself. I'm sure she'll comment, but I thought I'd write down what I suspect the meaning is.

ktmeaton commented 2 years ago

That makes perfect sense, thank you! Giving these clades names is super helpful; previously they got misassigned to XM, just as you indicate. @AngieHinrichs can I ask how you do the breakpoint detection? I'm guessing ripples in this application?

corneliusroemer commented 2 years ago

My guess would be that Angie uses the breakpoint of the Pango-designated recombinant; one can look up where it is in the corresponding issue.


AngieHinrichs commented 2 years ago

@ktmeaton they are very ad-hoc & experimental, and the "Pre/Post" labels are determined by me looking at what mutations are present/absent at the nodes of the UCSC/UShER tree where I place the labels.

The main purpose for having the misc labels is to distinguish real X recombinant lineage assignments from "this sequence is a potential recombinant with a similar breakpoint as some designated recombinant lineages, but it doesn't quite belong in any of those lineages." @corneliusroemer first pointed out the need for this when lots of sequences were assigned XM in the UCSC/UShER tree, because they were placed on a branch descended from the XM root node, but they had many other mutations that made it clear they were not truly XM. So I added the label "miscRecombOrContam" to that not-XM branch, but this week I changed that one to miscBA1BA2Post17k and added a bunch of other misc* labels as you noticed.
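To make the XM example concrete, here is a minimal sketch of why a label on the not-XM branch changes the assignment. It assumes a sequence takes the label of its nearest labeled ancestor; the toy tree, the node names, and the `nearest_label` helper are all hypothetical and are not taken from the UShER code.

```python
from typing import Dict, Optional


def nearest_label(node: str, parent: Dict[str, str], labels: Dict[str, str]) -> Optional[str]:
    """Walk from a node toward the root and return the first label found."""
    current: Optional[str] = node
    while current is not None:
        if current in labels:
            return labels[current]
        current = parent.get(current)  # None once we step past the root
    return None


# Hypothetical toy tree, stored as child -> parent.
parent = {
    "xm_root": "root",
    "true_xm_sample": "xm_root",
    "not_xm_branch": "xm_root",
    "odd_sample": "not_xm_branch",
}

# Before: only the XM root node carries a label.
labels_before = {"xm_root": "XM"}
# After: the not-XM branch gets its own misc label.
labels_after = {"xm_root": "XM", "not_xm_branch": "miscBA1BA2Post17k"}

print(nearest_label("odd_sample", parent, labels_before))     # XM (false positive)
print(nearest_label("odd_sample", parent, labels_after))      # miscBA1BA2Post17k
print(nearest_label("true_xm_sample", parent, labels_after))  # XM
```

The misc label acts as a stop sign on the way up the tree: sequences below it no longer reach the XM label, while true XM sequences are unaffected.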

When pruning down the big UCSC/UShER tree to make the minimal tree for use by pangolin, I am actually keeping those branches in the minimal tree, but changing the misc and proposed labels to officially designated lineages so that I don't break pangolin which at this point does not know what to do with labels that are not officially designated lineages. At this point, the misc/proposed labels are replaced with BA.2 or BA.1 depending on which parent lineage donated more mutations to the potential recombinant. My hope is that retaining these undesignated branches as attractors for not-quite-designated-recombinant sequences will make pangolin/usher's assignments of X* recombinant lineages more specific & reliable.
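As a rough illustration of that relabeling rule, the sketch below maps misc/proposed labels to BA.1 or BA.2 by comparing mutation counts. The function name, the prefix check, the counts, and the tie-break toward BA.2 are assumptions made for the example; the actual pruning scripts may work differently.

```python
def pangolin_safe_label(label: str, ba1_count: int, ba2_count: int) -> str:
    """Return a designated-lineage label for use in the minimal tree.

    ba1_count / ba2_count: number of mutations the branch appears to have
    received from each parent lineage (hypothetical inputs).
    """
    # Assumption: undesignated labels are recognizable by these prefixes.
    if not label.startswith(("misc", "proposed")):
        return label  # already an officially designated lineage
    # Assumption: ties go to BA.2; the thread does not say how ties are handled.
    return "BA.1" if ba1_count > ba2_count else "BA.2"


print(pangolin_safe_label("miscBA1BA2Post17k", 12, 30))  # BA.2
print(pangolin_safe_label("proposed467", 25, 9))         # BA.1
print(pangolin_safe_label("XM", 20, 20))                 # XM
```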

It would be nice to eventually enhance pangolin to support not-quite-lineage labels (to some extent this is done with Scorpio's designations like "Omicron - Unassigned").

Having recombinants in a "tree" data structure (as opposed to a proper ancestral recombination graph) is not ideal, but it's the best we can do at the moment! :) Take "tree" relationships between recombinants with a big grain of salt; they're just clustered by similarity, not in ancestor/descendant relationships like we expect in normal, recombination-free parts of the tree.

ktmeaton commented 2 years ago

Wow, you preemptively answered all of my follow-up questions!

> At this point, the misc/proposed labels are replaced with BA.2 or BA.1 depending on which parent lineage donated more mutations to the potential recombinant.

I was wondering if we would start to see the "misc" labels in pangolin, but understand it's better to stick to the designated lineages!

ilevade commented 2 years ago

Hi! Following up on Katherine's question and your first answer, Cornelius: we actually opened an issue for the sample "miscBA1BA2Post8393" | EPI_ISL_11360224

https://github.com/cov-lineages/pango-designation/issues/557

But if I understood correctly, there were not enough sequences to propose a Pango name in this case. I found other sequences with the same breakpoint (11537-12880), and I added them to the issue. In our case they were assigned XE by Nextclade, and not XM.
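For readers unfamiliar with how a breakpoint interval such as 11537-12880 is usually read, here is a minimal sketch. It assumes a single breakpoint and reports the interval between the last marker mutation from the 5' parent and the first marker from the 3' parent. The marker positions and parent assignments below are illustrative placeholders chosen to reproduce that interval, not a verified list, and `breakpoint_interval` is a hypothetical helper rather than part of any tool mentioned in this thread.

```python
from typing import List, Optional, Tuple


def breakpoint_interval(markers: List[Tuple[int, str]]) -> Optional[Tuple[int, int]]:
    """Return (last 5' parent marker, first 3' parent marker).

    markers: (genome position, parent lineage) pairs for lineage-specific
    mutations seen in the sample. Only the first parent switch is reported.
    """
    ordered = sorted(markers)
    for (pos_a, parent_a), (pos_b, parent_b) in zip(ordered, ordered[1:]):
        if parent_a != parent_b:
            return (pos_a, pos_b)
    return None  # no switch found: not recombinant by this crude test


# Illustrative placeholder markers (assumed, not verified).
sample_markers = [
    (2832, "BA.1"),
    (8393, "BA.1"),
    (11537, "BA.1"),
    (12880, "BA.2"),
    (17410, "BA.2"),
]

print(breakpoint_interval(sample_markers))  # (11537, 12880)
```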

AngieHinrichs commented 2 years ago

Yes @ilevade, that is a nice cluster of sequences, with several mutations that distinguish them from other recombinants with a similar breakpoint. But so many distinct recombinants were detected that it got a little overwhelming for the Pango group (!), and they decided to raise the minimum number of sequences, at least for countries that can afford to sequence a lot of samples.

"miscBA1BA2Post8393" is a much larger catch-all. To illustrate, here is a screen grab from the Taxonium viewer (cov2tree.org) where I am viewing a local file with the complete UCSC/UShER tree (including both public and GISAID sequences, not public-only like the public tree; I can share the complete tree with GISAID sequences with registered GISAID users, let me know if you are interested):

[image: Taxonium screen grab of the complete UCSC/UShER tree around miscBA1BA2Post8393]

Purple is miscBA1BA2Post8393; the dark red at the bottom is XE (truncated); the dark red somewhat up and to the right is XM; also shown are miscBA1BA2Pre10k, XH, XJ, proposed463, miscBA1BA2Post13k, proposed467, and then past XM, miscBA1BA2Post17k (light gray-green) and XK.

Notice how there is a lot of purple, with more specific branches & colorings interspersed. There is a small red circle that shows the root node of the cluster from cov-lineages/pango-designation/issues/557 . I think it's safe to say that most sequences in miscBA1BA2Post8393 are not directly related to the cluster from Canada -- they're just placed somewhat nearby because of overall similarity (and the impossibility in a tree structure of placing a recombinant adjacent to both of its parents, if its true parents are even included in the tree).

> In our case they were assigned XE by NextClade, and not XM.

Nextclade places your sequence on a tree to assign the lineage. It's a similar idea to UShER, but uses a separate tree maintained by @corneliusroemer and Ivan Aksamentov, and a slightly different placement algorithm. Undesignated recombinants are inherently tough for a tree-based approach to get right. It's possible for the most similar lineage based on placement to change from one data release of the tree to the next, especially as new lineages and/or sequences are added, and even more so between trees constructed by different methods like Nextclade vs. UShER. In general there is a lot of room for improvement in how recombinants are analyzed and reported!

ilevade commented 2 years ago

Totally understandable, thank you so much for the quick reply. We will continue monitoring and will let you know if the number of sequences from this cluster increases. Thanks again!