find_singletons bug

njbernstein commented 1 year ago

Hi there,

Some variants that are not singletons are called singletons by the create_singleton_dump.py script. This stems from the fact that not_singletons is never actually used.

Simply put, if a variant is removed from one batch (batch 1) for not being a singleton, and that variant occurs only once in another batch (batch 2), it will incorrectly be called a singleton, because the output from batch 1 contains no evidence of that variant. When the singletons of batch 1 and batch 2 are compared, the variant appears only in the batch 2 output and is therefore treated as a singleton.
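
A minimal sketch of the failure mode described above, and of one way the per-batch not_singletons sets could be used when merging. This is not the code from create_singleton_dump.py: the function names, the string representation of a variant, and the assumption that each batch is a flat list of variant calls (one entry per occurrence) are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def find_batch_singletons(batch):
    """Split one batch into (singletons, not_singletons) by within-batch count."""
    counts = Counter(batch)
    singletons = {v for v, n in counts.items() if n == 1}
    not_singletons = set(counts) - singletons
    return singletons, not_singletons


def merge_ignoring_not_singletons(results):
    """Merge using only the per-batch singleton sets (the behavior reported above)."""
    counts = Counter(v for singletons, _ in results for v in singletons)
    return {v for v, n in counts.items() if n == 1}


def merge_using_not_singletons(results):
    """Merge, but also drop any variant rejected as a non-singleton in any batch."""
    counts = Counter(v for singletons, _ in results for v in singletons)
    rejected = set().union(*(not_singletons for _, not_singletons in results))
    return {v for v, n in counts.items() if n == 1 and v not in rejected}


# Hypothetical data: the first variant occurs twice in batch 1 and once in batch 2.
batch_1 = ["chr1:100:A>T", "chr1:100:A>T", "chr2:200:G>C"]
batch_2 = ["chr1:100:A>T", "chr3:300:C>G"]
results = [find_batch_singletons(b) for b in (batch_1, batch_2)]

buggy = merge_ignoring_not_singletons(results)
fixed = merge_using_not_singletons(results)
```

In this toy input, chr1:100:A>T occurs three times overall, yet the first merge still reports it because batch 1 emits no record of it. Carrying the rejected set forward is only one possible fix; counting occurrences across all batches before filtering would work as well.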

Best, Nick

weinstockj commented 1 year ago

Interesting, thanks. Let me look into this. In the meantime, it should be straightforward to exclude these from the downstream parquet file.

weinstockj commented 1 year ago

To clarify, for others who come across this: this is only relevant if multi-processing is used to speed up the parsing of the Mutect2 output. If parsing is performed in a single batch, all variants called singletons will be true singletons. If multi-processing is used and the samples are divided into batches, singletons are identified within each batch, so a variant can be a singleton within its batch without being one across all samples. Each mutation will be included only once in the final parquet file.

weinstockj / passenger_count_variant_calling

find_singletons bug #1