yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License

Do the pre-built `UShER` trees mis-categorize 21L sequences as 21M? #301

Closed: jbloom closed this issue 1 year ago

jbloom commented 1 year ago

The pre-built UShER trees have many more sequences assigned Nextstrain clade 21M than 21L.

Talking to @rneher, he thinks this might be some sort of mis-categorization? @AngieHinrichs, he suggested asking you about it.

For instance, here are the clade counts I get for the 2022-09-20 tree. You can see there are many more 21M samples (829,164) than 21L samples (10,648):

nextstrain_clade,count
19A,12026
19B,6359
20A,111567
20B,100879
20C,60502
20D,5782
20E,101067
20F,11458
20G,71929
20H,7863
20I,598540
20J,26607
21A,1882
21B,1004
21C,41460
21D,2562
21E,46
21F,33152
21G,1316
21H,5547
21I,191121
21J,2552950
21K,960424
21L,10648
21M,829164
22A,61044
22B,318544
22C,151588
22D,890
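
For anyone wanting to reproduce a table like this, here is a minimal sketch of tallying clade labels from a metadata table. It assumes a tab-separated file with one row per sample and a clade column named `nextstrain_clade` (the header used in the counts above); the actual metadata file name and column name may differ, and `count_clades` is a hypothetical helper, not part of UShER.

```python
import csv
import io
from collections import Counter


def count_clades(handle, column="nextstrain_clade", delimiter="\t"):
    """Count rows per clade label in a delimited metadata table.

    The column name is an assumption; pass the real one for your file.
    Rows with an empty or missing clade value are skipped.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    counts = Counter(row[column] for row in reader if row.get(column))
    # Sort by clade name so the output order matches the tables in this thread.
    return dict(sorted(counts.items()))


# Tiny made-up table standing in for a real metadata file.
example = io.StringIO(
    "strain\tnextstrain_clade\n"
    "s1\t21L\n"
    "s2\t21M\n"
    "s3\t21L\n"
)

clade_counts = count_clades(example)

print("nextstrain_clade,count")
for clade, n in clade_counts.items():
    print(f"{clade},{n}")
```

For a real file, the `io.StringIO` object would be replaced by an open file handle (for a gzipped file, `gzip.open(path, "rt")`).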

It's similar with the 2022-11-04 tree:

nextstrain_clade,count
19A,12016
19B,6377
20A,112216
20B,101241
20C,61202
20D,5784
20E,101100
20F,11458
20G,72845
20H,7892
20I,598864
20J,26625
21A,1886
21B,1004
21C,41487
21D,2592
21E,46
21F,33177
21G,1323
21H,5553
21I,191535
21J,2558299
21K,968016
21L,10794
21M,836171
22A,74029
22B,425872
22C,154586
22D,4664
22E,8453
22F,284

AngieHinrichs commented 1 year ago

Oops, sorry about that, and thanks for pointing it out! It should be fixed in the 2022-11-07 build; look for it around 7pm Pacific time this evening.

jbloom commented 1 year ago

Awesome, thanks so much!

jbloom commented 1 year ago

Thanks so much, @AngieHinrichs. The counts now seem more sensible in the 2022-11-07 tree: very few 21M counts and many more 21L:

nextstrain_clade,count
19A,12026
19B,6377
20A,112252
20B,101283
20C,61246
20D,5785
20E,101100
20F,11458
20G,72943
20H,7896
20I,599147
20J,26651
21A,1886
21B,1004
21C,41500
21D,2592
21E,46
21F,33178
21G,1323
21H,5553
21I,191546
21J,2558394
21K,968268
21L,848186
21M,170
22A,75017
22B,434838
22C,154997
22D,5165
22E,9987
22F,340
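
As a quick sanity check on the numbers above, the sketch below compares the 21L and 21M counts from the 2022-11-04 and 2022-11-07 tables. The combined 21L + 21M total barely changes between the two builds while the split between the two labels reverses, which is consistent with samples having been relabeled rather than added or removed. This is only arithmetic on the counts posted in this thread; it says nothing about how the fix was made.

```python
# 21L and 21M counts copied from the tables in this thread.
before = {"21L": 10794, "21M": 836171}  # 2022-11-04 tree
after = {"21L": 848186, "21M": 170}     # 2022-11-07 tree

total_before = sum(before.values())
total_after = sum(after.values())

# Change in the combined 21L + 21M total between the two builds.
drift = total_after - total_before

# Fraction of the combined total that is labeled 21L in each build.
share_21l_before = before["21L"] / total_before
share_21l_after = after["21L"] / total_after

print(f"combined 21L+21M: {total_before} -> {total_after} (change {drift})")
print(f"21L share: {share_21l_before:.4f} -> {share_21l_after:.4f}")
```

The small increase in the combined total is in line with the modest growth seen in the other clades between the two builds.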