Closed AngieHinrichs closed 2 years ago
Tagging @jmcbroome.
@jmcbroome I found sample_intersect
in select.cpp, so the easy solution is
- return good_samples;
+ return sample_intersect(good_samples, samples_to_check);
However...
(
std::unordered_set
is probably more efficient thanstd::vector
for quickly checking whether a sample is in the set of samples that passed previous filtering steps)
That's actually a big deal when there are ~10 million samples. Looking up a name in a std::vector
is O(N), while looking up a name in a std::unordered_set
is O(constant). So intersecting two vectors is O(N^2), while intersecting a vector and an unordered set is O(N)*O(constant) so O(N). So just for fun I tried sample_intersect as-is, and left that running while I made another version of sample_intersect that takes std::unordered_set
for one of the inputs, changed good_samples
to be an unordered set instead of vector. The unordered set version takes about a minute to intersect, while the two-vectors version was still running after more than an hour so I killed it.
I will send a PR with the unordered set change.
Oops, hit send too soon -- I think it would be best (though not urgent) to use unordered_set instead of vector everywhere, or at least in the places that call sample_intersect
.
To summarize- you've handled this specific case of the bug where max path length overrides other options, but you'd appreciate a refactor where I replace use of vectors with sets for sample handling? I can look into that for sure.
Yep, thanks @jmcbroome.
The big tree daily build includes a filtering step with this matUtils command:
I noticed that some samples definitely should not pass those filters but still appear in the output, for example a sample with parsimony score of 60 is included in the output. I found that if I omitted the
--max-path-length
option, then the sample was excluded as if should be. Looking at select.cppget_short_paths
, it ignores its inputsamples_to_check
when checking path lengths. (std::unordered_set
is probably more efficient thanstd::vector
for quickly checking whether a sample is in the set of samples that passed previous filtering steps).(Or as a quick fix, if
get_short_paths
is the only filtering step that ignores the incoming list of samples, then it could go first and the other steps could follow it.)