yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
121 stars 40 forks source link

matUtils introduce: possible bug in handling cluster date range #281

Closed jen-martin closed 1 year ago

jen-martin commented 1 year ago

@jmcbroome There may be a bug in how matUtils introduce assigns date ranges to clusters. Extra samples not in the cluster (but in the cluster's clade(?)) seem to be evaluated when assigning a date range for that cluster.

To reproduce: Using the example in the docs (public-2021-06-09.all.masked.nextclade.pangolin.pb.gz and regional-samples.txt) and using the command: matUtils introduce -i public-2021-06-09.all.masked.nextclade.pangolin.pb.gz -s regional-samples.txt -o regional-introductions.tsv

Here are some snippets from the last few lines in the regional-introductions.tsv output file:

sample  introduction_node   introduction_rank   growth_score    earliest_date   latest_date cluster_size    cluster_span    intro_confidence    parent_confidence   distance    origin_gap  annotation_1    annotation_2    mutation_path
...
England/PORT-2DB0FD/2021|2021-02-05 default_node_90896  19  0.011236    2021-Jan-29 2021-Feb-05 1   1   1   0   1   2   20I/501Y.V1 B.1.1.7 C11750T,T26972C<A20015G,A21788G<T17865C<G11296T<A21194G<G4136T<C2453T<T28095A<T11296G<A28095T<G28048T<
...
England/CAMC-10FABD8/2021|OD950207.1|2021-01-20 default_node_90895  22  0.010989    2021-Jan-20 2021-Feb-05 1   1   1   0   1   2   20I/501Y.V1 B.1.1.7 A20015G,A21788G<T17865C<G11296T<A21194G<G4136T<C2453T<T28095A<T11296G<A28095T<G28048T<
...

Both of these are single sample clusters, so the date range should simply be the date of the sample, e.g., default_node_90896 should have earliest_date = latest_date = 2021-Feb-05 and for default_node_90895 this should be 2021-Jan-20.

It looks perhaps like all the leaves in the cluster's clade, instead of just the samples in the cluster, are being evaluated in the get_nearest_date() function. For example, it looks like these are the samples that are evaluated for default_node_90895 in the get_nearest_date() function: England/CAMC-10FABD8/2021|OD950207.1|2021-01-20 --> the sample of interest in this cluster, earliest_date and latest_date set to 2021-Jan-20, as expected England/CAMC-CB887C/2020|OD918301.1|2020-12-18 --> skipped, not in the regions file England/PORT-2DB109/2021|2021-02-05 --> skipped, not in the regions file England/PORT-2DB0FD/2021|2021-02-05 --> processed, latest_date now equals 2021-Feb-05! England/PORT-2DB0EE/2021|2021-02-05 --> processed, earliest_date and latest_date unchanged England/PORT-2DB136/2021|2021-02-05 --> skipped, not in the regions file England/PORT-2DB127/2021|2021-02-05 --> processed, earliest_date and latest_date unchanged England/MILK-11EF346/2021|OD970188.1|2021-01-29 --> processed, earliest_date and latest_date unchanged

However, England/PORT-2DB0FD/2021|2021-02-05 is part of another cluster (default_node_90896), as are England/PORT-2DB0EE/2021|2021-02-05 and England/PORT-2DB127/2021|2021-02-05 (both are in default_node_90897).

Perhaps there needs to be an extra step before this line to filter the samples to just those in the cluster of interest?

jmcbroome commented 1 year ago

Thank you for the detailed report and reproduction example. I think your read of the issue is correct. If you have written a solution, please open a pull request to this repository and tag me, and I will review/merge it.

jen-martin commented 1 year ago

@jmcbroome - see PR #287