Open njbernstein opened 1 year ago
Interesting, thanks. Let me look into this. In the meantime, it should be straightforward to exclude these from the downstream parquet file.
To clarify for others who come across this: it is only relevant if multiprocessing is used to speed up parsing of the Mutect2 output. If parsing is performed in a single batch, every variant called a singleton is a true singleton. If multiprocessing is used and the samples are divided into batches, singletons are identified within each batch. Each mutation will be included only once in the final parquet file.
Hi there,
Some variants that are not singletons are called singletons by the `create_singleton_dump.py` script. This stems from the fact that `not_singletons` is not actually used. Simply put, if a variant was removed for not being a singleton in one batch (batch 1) and that variant occurs only once in another batch (batch 2), it will incorrectly be called a singleton, because the output from batch 1 contains no evidence of that variant. So when the singletons of batch 1 and batch 2 are compared, the variant that was removed from batch 1 and occurred only once in batch 2 is considered a singleton.
Best, Nick
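The failure mode described above can be reproduced with a small, self-contained sketch. This is not the code from `create_singleton_dump.py`; the function names (`batch_singletons`, `merge_naive`, `merge_fixed`), the variant keys, and the sample names are all hypothetical and only illustrate the logic. It assumes each batch reports both its singletons and the variants it rejected, and that the merge step consults the rejected set.

```python
from collections import Counter


def batch_singletons(batch):
    """Split one batch's variants into singletons and non-singletons.

    `batch` maps a sample name to the variant keys seen in that sample
    (hypothetical layout, chosen only for this illustration).
    """
    # Count each variant once per sample within the batch.
    counts = Counter(v for variants in batch.values() for v in set(variants))
    singletons = {v for v, n in counts.items() if n == 1}
    not_singletons = set(counts) - singletons
    return singletons, not_singletons


def merge_naive(results):
    """Merge per-batch results using only the singleton sets (the reported bug)."""
    seen = Counter()
    for singletons, _ in results:
        seen.update(singletons)
    return {v for v, n in seen.items() if n == 1}


def merge_fixed(results):
    """Merge per-batch results, also dropping variants rejected in any batch."""
    seen = Counter()
    rejected = set()
    for singletons, not_singletons in results:
        seen.update(singletons)
        rejected |= not_singletons
    return {v for v, n in seen.items() if n == 1 and v not in rejected}


# chr1:100:A>T occurs twice in batch 1 and once in batch 2.
batch1 = {"s1": ["chr1:100:A>T", "chr2:50:G>C"], "s2": ["chr1:100:A>T"]}
batch2 = {"s3": ["chr1:100:A>T", "chr3:7:C>G"], "s4": ["chr2:50:G>C"]}
results = [batch_singletons(batch1), batch_singletons(batch2)]

naive = merge_naive(results)   # wrongly keeps chr1:100:A>T
fixed = merge_fixed(results)   # keeps only chr3:7:C>G
```

In this toy data, `chr2:50:G>C` is a singleton in both batches and is dropped by either merge, while `chr1:100:A>T` is only dropped once the per-batch `not_singletons` sets are carried into the comparison.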