yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License

Same Node, Different Lineage #236

Open ktmeaton opened 2 years ago

ktmeaton commented 2 years ago

image

The placement is at the junction of XM and miscBA1BA2Post17k (which I think implies some uncertainty). By what mechanism would the lineage assignments be different in this case? And any suggestions on how to detect/quantify uncertainty in this case (perhaps by the number of placements)?

Thanks!

corneliusroemer commented 2 years ago

Maybe the lineage assignment was done using a previous (slightly different) tree? @AngieHinrichs should know, I'm just guessing.

ktmeaton commented 2 years ago

Thanks for the idea, that makes sense so I'll try to reproduce this analysis.

ktmeaton commented 2 years ago

Interestingly, I get the same results using the public tree and the web browser (https://genome.ucsc.edu/cgi-bin/hgPhyloPlace).

Identical node placement (link):

image

But different lineage assignments:

image

The "true" assignment should be the "misc" lineage based on the GISAID tree. So the sample with lower genomic quality is incorrect. But I'm still curious why their assignments differ.

AngieHinrichs commented 2 years ago

That is really strange @ktmeaton! Was the auspice JSON for your first image generated using matUtils? After using usher to add the Canadian sequences to the public tree?

In your hgPhyloPlace view, I notice that the placement of the two sequences splits the branch from XM to miscBA1BA2Post17k in the public tree. (In your first image, there is one more "miscBA1BA2Post17k" that looks a little out of place but I can't tell what sequence that is.) usher would have had to place one sequence first (splitting the branch), then the other (adjacent to the first sequence), and it's possible that somehow that would cause a difference in how their nearest-neighbor-for-purpose-of-guessing-lineage would be found. I will have to look at the code to figure out what's really going on there.

But if you're using hgPhyloPlace, since those two samples are already in the non-public tree, there's a kind of roundabout way to check their assignment in the non-public tree. I pasted in the name of a nearby sequence (Denmark/DCGC-474438/2022) to get the branch with those two Canadian sequences (without the annoying "uploaded sample" labels for all attributes, sorry about those):

https://nextstrain.org/fetch/hgwdev.gi.ucsc.edu/~angie/usher-236.json?c=pango_lineage_usher&label=nuc%20mutations:T19955C,G20055A

If you zoom in to the branch with the red sequence, you can see that the two Canadian sequences are solidly part of the miscBA1BA2Post17k branch of the non-public tree:

image

That's not a very satisfying answer to give, sorry about that. I can share the full tree privately with registered GISAID users if you would like to try that instead of the public tree -- if so, email angie at soe dot ucsc dot edu.

ktmeaton commented 2 years ago

> That is really strange @ktmeaton! Was the auspice JSON for your first image generated using matUtils? After using usher to add the Canadian sequences to the public tree?

Yup, that's exactly what I did! Nextclade align -> faToVcf -> UShER -> matUtils extract
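For anyone trying to reproduce this, here is a rough sketch of the last three steps of that pipeline as command lines assembled in Python. It is only an illustration under assumptions: the file names (`aligned.fasta`, `public.pb`, `out/...`) and the helper `placement_commands` are hypothetical, the Nextclade alignment is assumed to be done already, and the aligned FASTA is assumed to have the reference as its first sequence, since faToVcf treats the first sequence as the reference by default. The exact options used in this issue were not posted, so these may differ.

```python
import shlex


def placement_commands(aligned_fasta, tree_pb, out_dir):
    """Build (but do not run) the faToVcf -> usher -> matUtils extract commands.

    All paths are hypothetical placeholders chosen for this sketch.
    """
    vcf = f"{out_dir}/samples.vcf"
    placed_pb = f"{out_dir}/placed.pb"
    auspice_json = f"{out_dir}/placed.json"
    return [
        # Convert the aligned multi-FASTA to VCF (first sequence = reference).
        ["faToVcf", aligned_fasta, vcf],
        # Place the new samples on the existing mutation-annotated tree.
        ["usher", "-i", tree_pb, "-v", vcf, "-o", placed_pb, "-d", out_dir],
        # Export the updated tree as Auspice JSON for viewing.
        ["matUtils", "extract", "-i", placed_pb, "-j", auspice_json],
    ]


commands = placement_commands("aligned.fasta", "public.pb", "out")
for cmd in commands:
    # Print shell-quoted commands for inspection before running anything.
    print(shlex.join(cmd))
```

To actually execute the steps, each list could be passed to `subprocess.run(cmd, check=True)` once the tools are installed and the paths point at real files.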

> (In your first image, there is one more "miscBA1BA2Post17k" that looks a little out of place but I can't tell what sequence that is.)

Those two aren't public yet, but are also Canadian sequences with odd lineage assignments in this junction.

> But if you're using hgPhyloPlace, since those two samples are already in the non-public tree, there's a kind of roundabout way to check their assignment in the non-public tree.

I've been doing that sometimes to check assignments, I'm glad to hear that's a good approach!

Is it fair to summarize this issue as:

  1. Placing samples between clades is an edge case that leads to unstable lineage calls. (Although what is a "stable" recombinant lineage call, really? Got to manage expectations.)
  2. These edge cases might be identified by checking the assignment of a sister sample (k=1 nearest neighbour).
  3. The best workaround is to use a larger, more diverse tree (GISAID+Public), which minimizes the likelihood that a sample will fall between these clades.
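As a rough illustration of point 2, the sister check could look something like the sketch below. It is not how usher or matUtils assign lineages; it only assumes the tree is available as a child-to-parent mapping plus a per-sample lineage table. The function names, the node names, and the toy tree are all made up for the example.

```python
from collections import defaultdict


def build_children(parent_of):
    """Invert a child -> parent mapping into parent -> [children]."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    return children


def leaves_under(node, children):
    """Return all leaf names at or below `node`."""
    stack, leaves = [node], []
    while stack:
        current = stack.pop()
        kids = children.get(current, [])
        if kids:
            stack.extend(kids)
        else:
            leaves.append(current)
    return leaves


def sister_lineages(sample, parent_of, lineage_of):
    """Lineages of every leaf that shares the sample's parent node."""
    children = build_children(parent_of)
    sisters = []
    for sibling in children[parent_of[sample]]:
        if sibling != sample:
            sisters.extend(leaves_under(sibling, children))
    return {name: lineage_of.get(name) for name in sisters}


def is_unstable(sample, parent_of, lineage_of):
    """Flag a sample whose lineage differs from any sister leaf."""
    own = lineage_of.get(sample)
    sisters = sister_lineages(sample, parent_of, lineage_of)
    return any(lineage != own for lineage in sisters.values())


# Hypothetical toy tree: two samples on the same node, with different calls.
parent_of = {
    "node_1": "root",
    "sample_A": "node_1",
    "sample_B": "node_1",
    "other_leaf": "root",
}
lineage_of = {
    "sample_A": "XM",
    "sample_B": "miscBA1BA2Post17k",
    "other_leaf": "XM",
}

flagged = is_unstable("sample_A", parent_of, lineage_of)
```

A sample flagged this way would then be worth checking by hand, for example against the non-public tree in hgPhyloPlace as described above. Sisters with no lineage recorded also count as a mismatch here, which may or may not be the desired behaviour.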