Closed jen-martin closed 1 year ago
Thank you for the detailed report and reproduction example. I think your read of the issue is correct. If you have written a solution, please open a pull request to this repository and tag me, and I will review/merge it.
@jmcbroome - see PR #287
@jmcbroome There may be a bug in how matUtils introduce assigns date ranges to clusters. Samples that are not in the cluster (but appear to be in the cluster's clade) seem to be evaluated when assigning the date range for that cluster.
To reproduce: use the example files from the docs (public-2021-06-09.all.masked.nextclade.pangolin.pb.gz and regional-samples.txt) and run the command:
matUtils introduce -i public-2021-06-09.all.masked.nextclade.pangolin.pb.gz -s regional-samples.txt -o regional-introductions.tsv
Here are some snippets from the last few lines in the regional-introductions.tsv output file:
Both of these are single-sample clusters, so the date range should simply be the date of the sample, e.g., default_node_90896 should have earliest_date = latest_date = 2021-Feb-05, and for default_node_90895 this should be 2021-Jan-20.
It looks like all the leaves in the cluster's clade, instead of just the samples in the cluster, are being evaluated in the get_nearest_date() function. For example, these appear to be the samples that are evaluated for default_node_90895 in the get_nearest_date() function:
England/CAMC-10FABD8/2021|OD950207.1|2021-01-20 --> the sample of interest in this cluster, earliest_date and latest_date set to 2021-Jan-20, as expected
England/CAMC-CB887C/2020|OD918301.1|2020-12-18 --> skipped, not in the regions file
England/PORT-2DB109/2021|2021-02-05 --> skipped, not in the regions file
England/PORT-2DB0FD/2021|2021-02-05 --> processed, latest_date now equals 2021-Feb-05!
England/PORT-2DB0EE/2021|2021-02-05 --> processed, earliest_date and latest_date unchanged
England/PORT-2DB136/2021|2021-02-05 --> skipped, not in the regions file
England/PORT-2DB127/2021|2021-02-05 --> processed, earliest_date and latest_date unchanged
England/MILK-11EF346/2021|OD970188.1|2021-01-29 --> processed, earliest_date and latest_date unchanged
However, England/PORT-2DB0FD/2021|2021-02-05 is part of another cluster (default_node_90896), as are England/PORT-2DB0EE/2021|2021-02-05 and England/PORT-2DB127/2021|2021-02-05 (both are in default_node_90897).
Perhaps there needs to be an extra step before this line to filter the samples to just those in the cluster of interest?
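To make the suggested filtering step concrete, here is a minimal Python sketch. It is not the matUtils C++ code: the helper names (sample_date, date_range) are hypothetical, and it assumes the collection date is the last |-separated field of the sample name. It compares a date range taken over every regional leaf in the clade with one taken only over the members of default_node_90895, using the samples listed above.

```python
from datetime import date

def sample_date(name: str) -> date:
    # Assumption: the last '|'-separated field of the name is an ISO date.
    return date.fromisoformat(name.rsplit("|", 1)[-1])

def date_range(leaves, keep):
    # Hypothetical helper: (earliest, latest) over the leaves found in `keep`.
    dates = [sample_date(s) for s in leaves if s in keep]
    return (min(dates), max(dates)) if dates else None

# Leaves of the clade under default_node_90895, as listed in this report.
clade_leaves = [
    "England/CAMC-10FABD8/2021|OD950207.1|2021-01-20",
    "England/CAMC-CB887C/2020|OD918301.1|2020-12-18",
    "England/PORT-2DB109/2021|2021-02-05",
    "England/PORT-2DB0FD/2021|2021-02-05",
    "England/PORT-2DB0EE/2021|2021-02-05",
    "England/PORT-2DB136/2021|2021-02-05",
    "England/PORT-2DB127/2021|2021-02-05",
    "England/MILK-11EF346/2021|OD970188.1|2021-01-29",
]

# Samples reported above as "processed", i.e. present in the regions file.
regional = {
    "England/CAMC-10FABD8/2021|OD950207.1|2021-01-20",
    "England/PORT-2DB0FD/2021|2021-02-05",
    "England/PORT-2DB0EE/2021|2021-02-05",
    "England/PORT-2DB127/2021|2021-02-05",
    "England/MILK-11EF346/2021|OD970188.1|2021-01-29",
}

# The single member of cluster default_node_90895.
cluster_90895 = {"England/CAMC-10FABD8/2021|OD950207.1|2021-01-20"}

# Behaviour described in this report: every regional leaf in the clade counts.
observed = date_range(clade_leaves, regional)

# Suggested behaviour: restrict to the cluster's own samples first.
expected = date_range(clade_leaves, cluster_90895)

print("observed:", observed)
print("expected:", expected)
```

With the filter applied, the single-sample cluster gets the same earliest and latest date, which is the behaviour expected above.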