yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
121 stars 41 forks source link

potential issue with -E? #155

Closed roblanf closed 3 years ago

roblanf commented 3 years ago

I'm running UShER with -E 20, which should mean that samples with parsimony >20 are not placed.

In some cases (like really awful sequences) it works as expected, and reports that the threshold was exceeded and the sample wasn't placed:

Current tree size (#nodes): 1592295     Sample name: EPI_ISL_1699443    Parsimony score: 5860   Number of parsimony-optimal placements: 3

WARNING: Parsimony score of the most parsimonious placement exceeds the maximum allowed value (20). Ignoring sample EPI_ISL_1699443.

but in others that are closer to the cutoff, there's no such output. So it seems like these are still being placed (it hasn't finished running yet so I can't confirm if they were actually placed. I'll update this issue in a bit when I have that info).

Current tree size (#nodes): 1592636     Sample name: EPI_ISL_2150778    Parsimony score: 22     Number of parsimony-optimal placements: 1
Completed in 472 msec                                                                                                                                                                                                                                                                                                 

then again, some with modest parsimony scores are excluded:

Current tree size (#nodes): 1592668     Sample name: EPI_ISL_2158548    Parsimony score: 24     Number of parsimony-optimal placements: 2                   
WARNING: Parsimony score of the most parsimonious placement exceeds the maximum allowed value (20). Ignoring sample EPI_ISL_2158548.                        
Completed in 736 msec   

but it doesn't look like a simple issue with a threshold:


Current tree size (#nodes): 1592773     Sample name: EPI_ISL_1822602    Parsimony score: 26     Number of parsimony-optimal placements: 1                   
Completed in 599 msec                                                                                                                                       

The closest guess I have (backed up by only the data I've managed to swipe here as the output goes past) is that the filter only seems to kick in when there's >1 optimal placement.

My understanding (and I think probably the best behaviour for this setting) was that it shouldn't place any sample with a score > the threshold.

I could just be missing something though, of course.

roblanf commented 3 years ago

OK, watching another 5 minutes of output scroll past, I can confirm that the only time the filter seems to kick in is when there's more than one most parsimonious placement. So I guess the check for the parsimony cutoff is sitting inside a conditional for >1 parsimony placement.

Would it be better to move it out? I.e. be able to exclude sequences with parsimony >E, regardless of how many placements there are? I'd argue yes, if we're right that there could be issue with long-placement attraction.

yatisht commented 3 years ago

Thanks for pointing this out. It seems like it was doing the right thing, i.e. not placing any sample with parsimony >20, but it printed the WARNING message only when the sample also had >1 parsimonious placements. I have fixed this in the latest commit, which would appear in the next release.

roblanf commented 3 years ago

Ah cool. Thanks for double checking!