yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
121 stars 41 forks source link

matUtils extract --max-path-length includes samples that should be excluded by --max-parsimony, --max-branch-length #243

Closed AngieHinrichs closed 2 years ago

AngieHinrichs commented 2 years ago

The big tree daily build includes a filtering step with this matUtils command:

    matUtils extract -i gisaidAndPublic.$today.masked.preTrim.pb \
        --max-parsimony 20 \
        --max-branch-length 45 \
        --max-path-length 150 \
        -O -o gisaidAndPublic.$today.masked.pb

I noticed that some samples definitely should not pass those filters but still appear in the output, for example a sample with parsimony score of 60 is included in the output. I found that if I omitted the --max-path-length option, then the sample was excluded as if should be. Looking at select.cpp get_short_paths, it ignores its input samples_to_check when checking path lengths. (std::unordered_set is probably more efficient than std::vector for quickly checking whether a sample is in the set of samples that passed previous filtering steps).

(Or as a quick fix, if get_short_paths is the only filtering step that ignores the incoming list of samples, then it could go first and the other steps could follow it.)

yatisht commented 2 years ago

Tagging @jmcbroome.

AngieHinrichs commented 2 years ago

@jmcbroome I found sample_intersect in select.cpp, so the easy solution is

-    return good_samples;
+    return sample_intersect(good_samples, samples_to_check);

However...

(std::unordered_set is probably more efficient than std::vector for quickly checking whether a sample is in the set of samples that passed previous filtering steps)

That's actually a big deal when there are ~10 million samples. Looking up a name in a std::vector is O(N), while looking up a name in a std::unordered_set is O(constant). So intersecting two vectors is O(N^2), while intersecting a vector and an unordered set is O(N)*O(constant) so O(N). So just for fun I tried sample_intersect as-is, and left that running while I made another version of sample_intersect that takes std::unordered_set for one of the inputs, changed good_samples to be an unordered set instead of vector. The unordered set version takes about a minute to intersect, while the two-vectors version was still running after more than an hour so I killed it.

I will send a PR with the unordered set change.

AngieHinrichs commented 2 years ago

Oops, hit send too soon -- I think it would be best (though not urgent) to use unordered_set instead of vector everywhere, or at least in the places that call sample_intersect.

jmcbroome commented 2 years ago

To summarize- you've handled this specific case of the bug where max path length overrides other options, but you'd appreciate a refactor where I replace use of vectors with sets for sample handling? I can look into that for sure.

AngieHinrichs commented 2 years ago

Yep, thanks @jmcbroome.