wwylab / MuSE

Somatic point mutation caller
GNU General Public License v2.0

limit calling to specific chromosomes/contigs #11

Closed: sorelfitzgibbon closed this issue 3 days ago

sorelfitzgibbon commented 1 year ago

Hi, I'm not seeing any way to limit MuSE to run only on specific genome intervals. At a minimum we would find it very helpful to be able to limit the calls to the main chromosomes, avoiding for example decoy and alternate contigs that may be present in the reference/BAM.

Related to this, I'd like to know whether this could help with an issue we're seeing where some samples take many days (over a week, even) to run, while others run relatively quickly. The difficult samples are not always the largest input BAMs, and I wonder whether the hang-ups could be caused by outlier regions with very high depth of coverage or some other kind of complexity.

Thank you for any help!

wwylab commented 1 year ago

Hi there,

The issue with running MuSE on specific genomic intervals is that the training of the error model would be off. It is not impossible, though: if you stop after getting the per-position pi estimate, which is independent of the genomic interval, you could then train your own error model using some truth data. What MuSE_Sump does is apply a WGS-specific model to identify a sample-specific cutoff on the pi value using all pi values across the whole genome, and a WES-specific model to identify a sample-specific cutoff using all pi values across the whole exome. We haven't trained the error model on other targeted sequencing studies.
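
For reference, the pi cutoff described above is applied in the second step of the usual two-step workflow. A minimal sketch is below; file names are placeholders, and the option names follow my reading of the MuSE usage text, so please check MuSE call -h / MuSE sump -h for your build.

    # Step 1: compute per-position pi estimates (interval-independent, as noted above).
    MuSE call -f reference.fa -O sample_prefix tumor.bam normal.bam

    # Step 2: derive the sample-specific cutoff on pi and write the VCF.
    # -G selects the WGS-trained model, -E the WES-trained model;
    # -D expects a bgzipped, tabix-indexed dbSNP VCF.
    MuSE sump -I sample_prefix.MuSE.txt -G -D dbsnp.vcf.gz -O sample.vcf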

wwylab commented 1 year ago

On the speed question, are you using MuSE 2.0? It shouldn't take days to run on one sample. Please provide further details on your data. -Wenyi

sorelfitzgibbon commented 1 year ago

Hi - Thanks for your quick response! To clarify, I did not mean limiting to a small set of targeted regions, but rather to whole chromosomes, or even just being able to exclude regions like the decoy and alternate contigs. We have seen a big change in runtime with Strelka2 when these regions are excluded, and wonder if that could also be true for MuSE.
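
For reference, the kind of restriction we have in mind could be approximated by subsetting the BAMs to the primary chromosomes before calling. A rough sketch, assuming GRCh38-style contig names, indexed BAMs, and placeholder file names (the error-model caveat above would still apply):

    # Keep only the primary chromosomes; contig names assume a GRCh38-style reference.
    samtools view -@ 8 -b tumor.bam chr{1..22} chrX chrY > tumor.primary.bam
    samtools index tumor.primary.bam
    samtools view -@ 8 -b normal.bam chr{1..22} chrX chrY > normal.primary.bam
    samtools index normal.primary.bam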

We are using MuSE 2.0 on some large input BAMs (200-400 GB). The issue does not seem to occur more often with the very largest files; rather, it is somewhat randomly distributed across the set of large inputs. We are still checking whether it is reproducible with the same input sample.

Thanks again!

wwylab commented 1 year ago

Can you confirm that you are using MuSE 2.0.2? We fixed a read group issue in that version. It is the latest release.

nkwang24 commented 1 year ago

Hi, I am working with Sorel on this. I have an example log file from one of our stuck samples. Would you be able to tell whether our issue is related to the read group issue that was fixed? We will try with the new release. Thanks!

Last few lines of the log file:

    [21:59:11] chr6:33103000 BamRead 200000 processQSize 0 writeQSize 10000 readPool 93721
    [21:59:12] chr6:33103000 BamRead 200000 processQSize 0 writeQSize 10000 readPool 93721
    [21:59:13] chr6:33103000 BamRead 200000 processQSize 0 writeQSize 10000 readPool 93721
    [21:59:14] chr6:33103000 BamRead 200000 processQSize 0 writeQSize 10000 readPool 93721
    [21:59:15] chr6:33103000 BamRead 200000 processQSize 0 writeQSize 10000 readPool 93721
    [21:59:16] chr6:33103000 BamRead 200000 processQSize 0 writeQSize 10000 readPool 93721
    [21:59:17] chr6:33103000 BamRead 200000 processQSize 0 writeQSize 10000 readPool 93721

Similar issue with another sample but stuck at a different place:

    [22:11:56] chr17:36767000 BamRead 200000 processQSize 0 writeQSize 10000 readPool 195558
    [22:11:57] chr17:36767000 BamRead 200000 processQSize 0 writeQSize 10000 readPool 195558
    [22:11:58] chr17:36767000 BamRead 200000 processQSize 0 writeQSize 10000 readPool 195558
    [22:11:59] chr17:36767000 BamRead 200000 processQSize 0 writeQSize 10000 readPool 195558
    [22:12:00] chr17:36767000 BamRead 200000 processQSize 0 writeQSize 10000 readPool 195558
    [22:12:01] chr17:36767000 BamRead 200000 processQSize 0 writeQSize 10000 readPool 195558
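
To test the earlier guess about outlier coverage, one quick check would be the depth around the position where the log stops advancing. A sketch (BAM name and window are placeholders):

    # Rough coverage check around the stall position from the first log.
    samtools depth -a -r chr6:33100000-33106000 tumor.bam \
        | awk '{sum += $3; if ($3 > max) max = $3} END {print "mean:", sum/NR, "max:", max}'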

jiyunmaths commented 1 year ago

@nkwang24 Could you please also tell us the memory and CPU usage during your run? This would be helpful for us in debugging. Thanks.

nkwang24 commented 1 year ago

We are running this on a node with 72 CPUs and 144 GB of RAM. This is the output from top on the node running the currently stalled job. Does this help?

[screenshot of top output showing per-process CPU and memory usage]

jiyunmaths commented 1 year ago

@nkwang24 Thanks for the information. Both CPU and RAM usage look reasonable for MuSE 2. Also, the issue appears to be in the 'MuSE call' step, so running the new release cannot fix it, since we only updated the 'MuSE sump' step. Can you tell us how you preprocessed the BAM files? We describe the steps in the README.

nkwang24 commented 1 year ago

I believe we are following the same preprocessing steps:

  1. Align FASTQs with BWA-MEM2
  2. Sort, merge, mark duplicates and index BAMs
  3. Indel realignment and apply BQSR
  4. Reheader, index and merge BAMs

Does this help?
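
For concreteness, a condensed sketch of the commands behind steps 1-3; file names, read groups, and resource files are placeholders, and our actual pipeline differs in detail:

    # 1. Align and coordinate-sort (placeholder read group and file names).
    bwa-mem2 mem -t 8 -R '@RG\tID:rg1\tSM:tumor\tPL:ILLUMINA' ref.fa r1.fq.gz r2.fq.gz \
        | samtools sort -@ 8 -o tumor.sorted.bam
    samtools index tumor.sorted.bam

    # 2. Mark duplicates (GATK4 in our current pipeline).
    gatk MarkDuplicates -I tumor.sorted.bam -O tumor.md.bam -M tumor.md.metrics.txt

    # 3a. Indel realignment (GATK3.7; the tool was removed from GATK4).
    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fa -I tumor.md.bam -o tumor.intervals
    java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fa -I tumor.md.bam \
        -targetIntervals tumor.intervals -o tumor.realigned.bam

    # 3b. BQSR (GATK4).
    gatk BaseRecalibrator -R ref.fa -I tumor.realigned.bam --known-sites dbsnp.vcf.gz -O recal.table
    gatk ApplyBQSR -R ref.fa -I tumor.realigned.bam --bqsr-recal-file recal.table -O tumor.bqsr.bam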

wwylab commented 1 year ago

Did you use GATK4 or GATK3? Which version specifically?

nkwang24 commented 1 year ago

We are using GATK4

tyamaguchi-ucla commented 1 year ago

For indel realignment we're using GATK3.7, but otherwise GATK4.

jiyunmaths commented 1 year ago

@nkwang24 @tyamaguchi-ucla Please use GATK3.7 for the preprocessing of BAM files wherever GATK is required: marking duplicates, indel realignment, and BQSR. Let us know if you still have the issue.

tyamaguchi-ucla commented 1 year ago

@jiyunmaths This request is not straightforward for us and is not feasible at the moment. Could you elaborate on the potential issues arising from the use of GATK4 versus GATK3?

wwylab commented 1 year ago

@tyamaguchi-ucla As far as GATK3 vs. GATK4 goes, we just haven't fully benchmarked MuSE's compatibility with GATK4, although we have seen it run OK with it. So far, we have only benchmarked with GATK3.7. We suspect the MuSE runtime issue may be related to the switch to GATK4. We would be happy to debug this with you, which will give us further insight into it.

pboutros commented 1 year ago

I'll second the comment that swapping back to GATK3 for those steps would be difficult for us. I'll work with the team here to identify a sample for which we can share the FASTQs and let you work from there, OK?

jiyunmaths commented 3 days ago

We benchmarked GATK3 vs. GATK4 for preprocessing and found very small differences.