yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
121 stars 41 forks source link

indexing bug can cause overspecific clade assignment #233

Closed AngieHinrichs closed 2 years ago

AngieHinrichs commented 2 years ago

I believe the index of node_has_unique should be best_j_vec[k], not just k, in line 610 of usher_common.cpp:

609                             for (size_t k=0; k < best_j_vec.size(); k++) {
610                                 bool include_self = !bfs[best_j_vec[k]]->is_leaf() && !node_has_unique[k];

In this loop k is iterating over the number of optimal placements, but node_has_unique[] needs a node index as for bfs[].

In my test case (sequence can't be shared publicly but I can share privately or find/construct another), the input sequence is placed at a node that has a clade annotation -- but that node has several mutations that the input sequence does not have, so in mapper2_body, has_unique is set to true for that node. I expected that that node's clade would not be assigned, but instead the next ancestral clade. However, include_self was set to true because node_has_unique[k] (with k=0) was false, so the node's annotated clade was assigned even though the sequence was missing several of its mutations.

When I changed node_has_unique[k] to node_has_unique[best_j_vec[k]], the has_unique = true for that node was passed forward, include_self was false as expected, and the ancestral clade was assigned as expected.

It's a very small change so I will send a pull request in a few minutes...