Closed AngieHinrichs closed 1 year ago
@jmcbroome I am going to merge this, but since --node-stats filename only writes summary stats, maybe it's better suited to matUtils summary? I'll let you decide if you want to move it there or not.
re: Angie, thanks for getting this in! I'll check it out and see about using it with the automated lineage output.
re: Yatish, I agree that matUtils extract is extremely bloated already. I'll look into reorganizing it. Thanks!
I added two options to matUtils extract to prune problematic sequences and branches from the SARS-CoV-2 tree before making a minimized tree for use with pangolin:
--node-stats filename
writes out various statistics of each internal node and sample, including the number of reversions to reference since the most recent ancestor annotated as a clade/lineage root. (For recent pangolin data releases, I've been pruning samples with 2 or more reversions since lineage root.)
--max-mutation-density D
removes samples descended from nodes whose "mutation density" (sum of mutation counts divided by number of leaves) is greater than D (I've been using D=2 for pangolin), with several exceptions: