wwylab / MuSE

Somatic point mutation caller
GNU General Public License v2.0
22 stars 8 forks source link

MuSEv2 multiple executions of `MuSE sump` produces different tier rankings for the same calls file #18

Closed scdunatun closed 2 weeks ago

scdunatun commented 6 months ago

Starting Question: Is this a bug or is this expected?

Description of Bug MuSE v2.0.4 produces a non-deterministic tiering of variants for a calls files generated on DRAGEN-aligned bamfiles when running on DNAnexus (within a WDL applet). We have not seen this with our BWA-generated BAMs nor does always happen on the DRAGEN-aligned BAMs.

To Reproduce Steps to reproduce the behavior:

  1. Run DNAnexus custom applet for MuSE with input bamfile which: a. Indexes the bamfile using samtools. b. Runs MuSE e.g. MuSE call -O calls -f hg19.fa -n 72 bam1.bam bam2.bam. (succeeds) c. Runs MuSE sump e.g. MuSE sump -G -I calls.MuSE.txt -O muse-{try#}.vcf -n 72 -D dbsnp.vcf.gz.
  2. Repeat 1.c two or more times with the same calls file from 1.b.
  3. Compare the muse VCFs from 1.c and 2.

Expected Behavior MuSE VCFs between runs of MuSE sump for the same calls file contain the same tier rankings for calls.

Actual Behaviour Ran MuSE sump on the same calls file calls-202403281711652663.MuSE.txt 4 times and compared the outputs.

Here's the log file for the calls file generation: calls_generation_log.txt

These are the only outputs I saw from the MuSE sump calls:

% MuSE sump -G -I calls-202403281711652663.MuSE.txt -O muse-01-202403281711652663.vcf -n 72 -D 00-All_chr_renamed.vcf.gz
EMEstimate size: 32

% MuSE sump -G -I calls-202403281711652663.MuSE.txt -O muse-011.vcf -n 72 -D 00-All_chr_renamed.vcf.gz
EMEstimate size: 34

% MuSE sump -G -I calls-202403281711652663.MuSE.txt -O muse-012.vcf -n 72 -D 00-All_chr_renamed.vcf.gz
EMEstimate size: 25

% MuSE sump -G -I calls-202403281711652663.MuSE.txt -O muse-013.vcf -n 72 -D 00-All_chr_renamed.vcf.gz
EMEstimate size: 43

The files have the same positions called and in the same order; only the header information about the MuSE execution is different when not including the tier ranking:

% diff <(cat muse-01-202403281711652663.vcf | cut -f1,2) <(cat muse-011.vcf | cut -f1,2)
17c17
< ##MuSE_sump="sump -G -I calls-202403281711652663.MuSE.txt -O muse-01-202403281711652663.vcf -n 72 -D 00-All_chr_renamed.vcf.gz"
---
> ##MuSE_sump="sump -G -I calls-202403281711652663.MuSE.txt -O muse-011.vcf -n 72 -D 00-All_chr_renamed.vcf.gz"

% diff <(cat muse-012.vcf | cut -f1,2) <(cat muse-011.vcf | cut -f1,2)
17c17
< ##MuSE_sump="sump -G -I calls-202403281711652663.MuSE.txt -O muse-012.vcf -n 72 -D 00-All_chr_renamed.vcf.gz"
---
> ##MuSE_sump="sump -G -I calls-202403281711652663.MuSE.txt -O muse-011.vcf -n 72 -D 00-All_chr_renamed.vcf.gz"

% diff <(cat muse-012.vcf | cut -f1,2) <(cat muse-013.vcf | cut -f1,2)
17c17
< ##MuSE_sump="sump -G -I calls-202403281711652663.MuSE.txt -O muse-012.vcf -n 72 -D 00-All_chr_renamed.vcf.gz"
---
> ##MuSE_sump="sump -G -I calls-202403281711652663.MuSE.txt -O muse-013.vcf -n 72 -D 00-All_chr_renamed.vcf.gz"

However, we have diffs between pairs of VCFs:

% diff muse-011.vcf muse-01-202403281711652663.vcf  | wc -l
1796
% diff muse-012.vcf muse-01-202403281711652663.vcf | wc -l
4422
% diff muse-013.vcf muse-01-202403281711652663.vcf | wc -l
20124
% diff muse-011.vcf muse-012.vcf  | wc -l
2910
% diff muse-011.vcf muse-013.vcf  | wc -l
19164
% diff muse-012.vcf muse-013.vcf  | wc -l
18936

Diff outputs, in order as above: diff_1.txt diff_2.txt diff_3.txt diff_4.txt diff_5.txt diff_6.txt

Additional context We are specifically using bamfiles produced by Illumina's DRAGEN alignment on ICA after sequencing, but not run though additional steps. We attempted to filter the bams with the following filter options from samtools, but these actually increased the tier ranking disparity between multiple executions of MuSE sump:

Remove duplicates: samtools view -F 1024 <bam> Remove improper pairs: samtools view -f 2 <bam> Remove non-primary alignments: samtools view -F 256 <bam> Remove duplicates and improper pairs: samtools view -f 2 -F 1024 <bam> Attempt to use samtools modules (with fixmate and sorts): samtools markdup <bam> <outbam>

Additional Debugging Attempts

% MuSE sump -G -I calls-202403281711652663.MuSE.txt -O muse-015.vcf -n 36 -D 00-All_chr_renamed.vcf.gz EMEstimate size: 39

% diff muse-015.vcf muse-014.vcf | wc -l 10894



**Further Information**
I am able to provide the calls file as well as the dbsnp.vcf.gz that we use for our `MuSE sump` call. These would need to be shared in the same way that I previously shared data for Issue 16, since they are still quite large files.
jiyunmaths commented 5 months ago

Hi @scdunatun, sorry for the delayed response. I will try to reproduce the error and fix the bug. Will let you know when the code is updated. Thanks.

jiyunmaths commented 5 months ago

Hi @scdunatun, I was running MuSE2 sump with different number of threads (i.e., 5, 40 and 80) for 2 pairs of WGS data. My results are: for the same pair, the calls keep the same in terms of chrom, position, reference allele, alternate allele and tier for different thread setting. According to your observation, this issue seems to relate to thread racing in parallel computing. May I ask for the MuSE.txt (obtained from MuSE call) and the dbsnp.vcf.gz for debugging? I appreciate your help that can improve our software.

scdunatun commented 5 months ago

Hi @jiyunmaths, I sent an email to address that I sent download links for issue#16 to (from the same address). Please let me know if you do not get them today or are unable to download the required files. Please also let me know if I have missed providing any required files.

jiyunmaths commented 5 months ago

Hi @scdunatun. Thank you very much for sharing the data. I can download them. I will debug the issue and get back to you.

jiyunmaths commented 2 weeks ago

This issue is closed since the developer can not reproduce it.