yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
121 stars 41 forks source link

feature request: don't place high parsimony samples #145

Closed roblanf closed 3 years ago

roblanf commented 3 years ago

Currently there's an option to not place uncertain samples:

  -e [ --max-uncertainty-per-sample ] arg (=1000000)
                                        Maximum number of equally parsimonious 
                                        placements allowed per sample beyond 
                                        which the sample is ignored

My request (as always, I know everyone is busy, I am probably the only person who wants this option, and I am very happy to be completely ignored) is for something that won't place really rubbish samples that are obviously garbage:

  -E [ --max-parsimony-per-sample ] arg (=1000000)
                                        Maximum parsimony score of most parsimonious 
                                        placement allowed per sample beyond 
                                        which the sample is ignored

This would allow one to try and place many samples, but to ignore samples that were missed in prior filtering steps but that parsimony shows are obviously no good. E.g. those on terminal branches >100, which I sometimes see and then have to prune out later.

russcd commented 3 years ago

I agree that this is a good thing to add. We do basically prune out the real trash basically immediately after we finish addition. The big concern is that these samples can affect the placement of other (good) samples and shouldn't.

roblanf commented 3 years ago

Long Placement Attraction? There's a PhD chapter in that when it's all a bit less full on.

russcd commented 3 years ago

I definitely think LPA is a thing. Would be a fun problem particularly when you consider the effects of correlated error.

yatisht commented 3 years ago

I added this option in the latest commit. It should become available via conda once I create a new release. Thanks for the suggestion, Rob!

russcd commented 3 years ago

@AngieHinrichs could we add this filter to the public tree workflow?

AngieHinrichs commented 3 years ago

Sure. Currently I'm filtering post-placement with matUtils extract --max-parsimony 20 --max-branch-length 30. So just to confirm, I'll be adding --max-parsimony-per-sample 20 to usher, and then just matUtils extract --max-branch-length 30.

russcd commented 3 years ago

This option is available in the latest commits. Thanks @yatisht !