Closed scdunatun closed 2 weeks ago
Hi @scdunatun, sorry for the delayed response. I will try to reproduce the error and fix the bug. Will let you know when the code is updated. Thanks.
Hi @scdunatun, I was running MuSE2 sump with different number of threads (i.e., 5, 40 and 80) for 2 pairs of WGS data. My results are: for the same pair, the calls keep the same in terms of chrom, position, reference allele, alternate allele and tier for different thread setting. According to your observation, this issue seems to relate to thread racing in parallel computing. May I ask for the MuSE.txt (obtained from MuSE call) and the dbsnp.vcf.gz for debugging? I appreciate your help that can improve our software.
Hi @jiyunmaths, I sent an email to address that I sent download links for issue#16 to (from the same address). Please let me know if you do not get them today or are unable to download the required files. Please also let me know if I have missed providing any required files.
Hi @scdunatun. Thank you very much for sharing the data. I can download them. I will debug the issue and get back to you.
This issue is closed since the developer can not reproduce it.
Starting Question: Is this a bug or is this expected?
Description of Bug MuSE v2.0.4 produces a non-deterministic tiering of variants for a calls files generated on DRAGEN-aligned bamfiles when running on DNAnexus (within a WDL applet). We have not seen this with our BWA-generated BAMs nor does always happen on the DRAGEN-aligned BAMs.
To Reproduce Steps to reproduce the behavior:
MuSE call -O calls -f hg19.fa -n 72 bam1.bam bam2.bam
. (succeeds) c. Runs MuSE sump e.g.MuSE sump -G -I calls.MuSE.txt -O muse-{try#}.vcf -n 72 -D dbsnp.vcf.gz
.1.c
two or more times with the same calls file from1.b
.1.c
and2
.Expected Behavior MuSE VCFs between runs of
MuSE sump
for the same calls file contain the same tier rankings for calls.Actual Behaviour Ran
MuSE sump
on the same calls filecalls-202403281711652663.MuSE.txt
4 times and compared the outputs.Here's the log file for the calls file generation: calls_generation_log.txt
These are the only outputs I saw from the
MuSE sump
calls:The files have the same positions called and in the same order; only the header information about the MuSE execution is different when not including the tier ranking:
However, we have diffs between pairs of VCFs:
Diff outputs, in order as above: diff_1.txt diff_2.txt diff_3.txt diff_4.txt diff_5.txt diff_6.txt
Additional context We are specifically using bamfiles produced by Illumina's DRAGEN alignment on ICA after sequencing, but not run though additional steps. We attempted to filter the bams with the following filter options from samtools, but these actually increased the tier ranking disparity between multiple executions of
MuSE sump
:Remove duplicates:
samtools view -F 1024 <bam>
Remove improper pairs:samtools view -f 2 <bam>
Remove non-primary alignments:samtools view -F 256 <bam>
Remove duplicates and improper pairs:samtools view -f 2 -F 1024 <bam>
Attempt to use samtools modules (withfixmate
andsort
s):samtools markdup <bam> <outbam>
Additional Debugging Attempts
[x] Split bamfile to chr1 / chr10 / chr19 to see if the calls -> multiple
MuSE sump
executions would present this issue. The individual chromosome bams/calls did not have disparate tier rankings between multipleMuSE sump
executions.[x] Run
MuSE sump
on a 16 core instance / with fewer than 72 threads. I was unable to reproduce the non-determinism in tier ranking when running on a 16-core instance. Still in progress with some more executions, as I have observed that there is about a 20% chance (empirical observation) of getting the same tier ranking in the VCF.MuSE sump
. I was able to reproduce the issue when using 36 threads.% MuSE sump -G -I calls-202403281711652663.MuSE.txt -O muse-015.vcf -n 36 -D 00-All_chr_renamed.vcf.gz EMEstimate size: 39
% diff muse-015.vcf muse-014.vcf | wc -l 10894