yatisht / usher

Ultrafast Sample Placement on Existing Trees

Allow --clade-paths to work in conjunction with --clade-mutations and --clade-names #242

Closed AngieHinrichs closed 2 years ago

AngieHinrichs commented 2 years ago

Currently in matUtils annotate, --clade-mutations and --clade-names can be used simultaneously, with --clade-mutations taking precedence, but --clade-paths is mutually exclusive with all other --clade-* options. This change allows all three of --clade-paths, --clade-mutations and --clade-names to be used simultaneously, with --clade-paths taking highest precedence. (The options can still be used separately.)

When all three options are used together, we try the paths file first; any clade that was successfully annotated by path is skipped if we encounter it again in the mutations or names files. We then try the mutations file, and after that the names file, as fallbacks for clades whose paths can't be found because the order of mutations has changed. This combines the speed of the paths method with the robustness of the mutations/names method.
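The precedence and fallback order described above can be illustrated with a minimal Python sketch. This is not the matUtils implementation (which is C++); the function name `annotate_with_fallback`, the `(label, entries, resolver)` source tuples, and the toy locator strings are all hypothetical, chosen only to show how a clade placed by a higher-precedence source is skipped later, while an unresolved clade falls through to the next source.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A source is (label, {clade: locator}, resolver). All names here are
# hypothetical; the resolver stands in for whatever lookup matUtils performs.
Resolver = Callable[[str], Optional[str]]
Source = Tuple[str, Dict[str, str], Resolver]


def annotate_with_fallback(sources: List[Source]) -> Dict[str, Tuple[str, str]]:
    """Return {clade: (node_id, label_of_source_that_placed_it)}."""
    assigned: Dict[str, Tuple[str, str]] = {}
    for label, entries, resolve in sources:  # highest precedence first
        for clade, locator in entries.items():
            if clade in assigned:
                continue  # already placed by a higher-precedence source
            node = resolve(locator)
            if node is not None:  # unresolved entries fall through to later sources
                assigned[clade] = (node, label)
    return assigned


# Toy data (assumed): the stored path for B.1 no longer matches the tree,
# so B.1 is picked up by the mutations source instead.
known_paths = {"m1>m2": "node_5"}
known_mutation_sets = {"m2,m3": "node_9"}
sources = [
    ("paths", {"A": "m1>m2", "B.1": "m2>m3"}, known_paths.get),
    ("mutations", {"A": "m1,m2", "B.1": "m2,m3"}, known_mutation_sets.get),
]
result = annotate_with_fallback(sources)
```

In this toy run, clade A is placed by its path and its mutations entry is never consulted, while B.1 is placed by the mutations fallback. A names source would simply be a third entry at the end of the list.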

I have tested this by running a local test case with different combinations of options, and have been using all three options together in the big tree daily build for over a week. Annotating nodes on the big tree (approaching 10M sequences) now takes 5-30 minutes (depending on how many paths change, usually 5 minutes) instead of 6-7 hours.

Also, matUtils summary -C sample-clades -i ... was hardcoded to write out two columns of annotations (assuming Nextstrain clade and Pango lineage) regardless of how many annotations are present; this PR also includes an update so that it prints out however many annotations are present.
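As a rough illustration of the difference between a hardcoded two-column layout and one sized to the data, here is a small Python sketch. The function `format_sample_clades`, the generic `annotation_N` header names, and the tab-separated layout are assumptions made for this example; they are not taken from the actual matUtils output format.

```python
from typing import Dict, List


def format_sample_clades(sample_annotations: Dict[str, List[str]]) -> List[str]:
    """Build tab-separated rows with one column per annotation present."""
    # Column count comes from the data rather than a fixed value of 2.
    width = max((len(a) for a in sample_annotations.values()), default=0)
    header = ["sample"] + [f"annotation_{i + 1}" for i in range(width)]
    lines = ["\t".join(header)]
    for sample, annotations in sample_annotations.items():
        # Pad short rows with empty fields so every row has the same width.
        padded = annotations + [""] * (width - len(annotations))
        lines.append("\t".join([sample] + padded))
    return lines


# Toy input (assumed): three annotations on one sample, one on another.
rows = format_sample_clades({
    "sample1": ["20A", "B.1", "extra"],
    "sample2": ["20B"],
})
```

With the toy input, every row has a sample column plus three annotation columns, and the second sample's missing annotations are left empty.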