ylab-hi / ScanNeo2

Snakemake-based computational workflow for neoantigen prediction from diverse sources
MIT License

Can I run the pipeline with slurm on HPC? #26

Open Xiao-Zhong opened 3 weeks ago

Xiao-Zhong commented 3 weeks ago

Hi, I'm new to Snakemake and have no idea how to submit ScanNeo2 jobs to an HPC cluster. Any instructions would be very helpful! Thanks!

Regards, Xiao

riasc commented 3 weeks ago

Hi Xiao,

thanks for your interest in using the tool. There is a snakemake executor plugin that allows you to run Snakemake workflows over SLURM: https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/slurm.html

Once the plugin is installed, you only need to add `--executor slurm` to the snakemake call.
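A minimal sketch (assuming Snakemake >= 8, where the executor plugins are shipped as separate packages):

```bash
# install the executor plugin once, e.g. via pip
pip install snakemake-executor-plugin-slurm

# each rule instance is then submitted as its own SLURM job; --jobs caps concurrency
snakemake --executor slurm --jobs 20 --software-deployment-method conda
```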

Another way is to simply run it normally in SLURM's interactive mode, e.g.,

srun --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i

and then use the basic call (`snakemake --cores all --software-deployment-method conda ...`) as before.

This may be a bit tedious if you have many different samples you want analyzed simultaneously. I'm actually working on making this more straightforward in v0.3, which may be released sometime next week.

Xiao-Zhong commented 2 weeks ago

Thanks! A simple way for me, for now, is to put the command into a SLURM script as below and submit the script with sbatch.

```bash
#!/bin/bash
#SBATCH --job-name=scanneo2
#SBATCH --nodes=1
#SBATCH --partition=work
#SBATCH --ntasks=19
#SBATCH --mem=120G
#SBATCH --time=36:00:00

snakemake --cores all --software-deployment-method conda
```

Can I ask what the best way is to run ScanNeo2 for multiple samples? I ran it for another sample in a separate directory, as below, and it looks like the ScanNeo2 pipeline downloads its dependencies and reference databases again.

Step 1:

```bash
mkdir -p /path/to/your/working/directory/
cd /path/to/your/working/directory/
git clone --recurse-submodules https://github.com/ylab-hi/ScanNeo2.git
cd ScanNeo2
```

Step 2: edit config/config.yaml

Step 3: run the SLURM script above.

Thank you!

riasc commented 2 weeks ago

Hi,

thank you for your feedback. In principle, you can run ScanNeo2 from the same directory (for different files) using a different config file. Just change the name (https://github.com/ylab-hi/ScanNeo2/blob/52a3818ec3189af502f18eba6b6a1a69b9b3a8c3/config/config.yaml#L12) in the config file, and the results will end up in a different folder within "results". But make sure you first run it until all the necessary files, like the genome and VEP, are downloaded... otherwise different calls may overwrite these files...
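For example (a sketch; the copied file name is arbitrary):

```bash
# one config per sample; --configfile overrides the default config/config.yaml
cp config/config.yaml config/sample2.yaml   # then edit the "name" field and the read paths
snakemake --cores all --software-deployment-method conda --configfile config/sample2.yaml
```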

I need to check whether snakemake somehow allows locking these files so that they are only downloaded once and not overwritten by multiple processes. However, if you create a new working directory each time, you would have to download them again each time.

But you gave me an idea. I could write a small script that accepts multiple files and distributes them across different nodes... let me look into that.

Thanks

Xiao-Zhong commented 2 weeks ago

OK, I see. Thanks! I believe the two jobs successfully downloaded and/or installed the dependencies; however, both terminated during data pre-processing. The error is below:

```
[Tue Jun 18 00:36:20 2024]
Job 102: Retrieve paired-end reads (DNA) for HLA genotyping - sample:s11578 group:dna_tumor
Reason: Missing output files: results/s11578/hla/reads/dna_tumor_DNA_PE_R2.fq, results/s11578/hla/reads/dna_tumor_DNA_PE_R1.fq; Input files updated by another job: results/s11578/dnaseq/reads/dna_tumor_R1_pre>

Activating conda environment: .snakemake/conda/d8900894438122f24cfe02e35dcb6bfd_
No input file specified.
[W::sam_read1] Parse error at line 2
samtools sort: truncated file. Aborting
[Tue Jun 18 00:36:20 2024]
Error in rule get_reads_hlatyping_PE:
    jobid: 102
    input: results/s11578/dnaseq/reads/dna_tumor_R1_preproc.fq.gz, results/s11578/dnaseq/reads/dna_tumor_R2_preproc.fq.gz
    output: results/s11578/hla/reads/dna_tumor_DNA_PE_R1.fq, results/s11578/hla/reads/dna_tumor_DNA_PE_R2.fq
    log: logs/s11578/hla/get_reads_hlatyping_PE_dna_tumor_DNA.log (check log file(s) for error details)
    conda-env: /scratch/ms002/xzhong/04ScanNeo2/ScanNeo2/.snakemake/conda/d8900894438122f24cfe02e35dcb6bfd
    shell:
        samtools sort -n results/s11578/dnaseq/reads/dna_tumor_R1_preproc.fq.gz -T tmp/ | samtools fastq > results/s11578/hla/reads/dna_tumor_DNA_PE_R1.fq
        samtools sort -n results/s11578/dnaseq/reads/dna_tumor_R2_preproc.fq.gz -T tmp/ | samtools fastq > results/s11578/hla/reads/dna_tumor_DNA_PE_R2.fq
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job get_reads_hlatyping_PE since they might be corrupted: results/s11578/hla/reads/dna_tumor_DNA_PE_R1.fq
Finished job 15. 7 of 116 steps (6%) done
Finished job 16. 8 of 116 steps (7%) done
[Tue Jun 18 01:38:17 2024] Finished job 56. 9 of 116 steps (8%) done
[Tue Jun 18 02:20:37 2024] Finished job 19. 10 of 116 steps (9%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-06-18T003023.457455.snakemake.log
WorkflowError: At least one job did not complete successfully.
```

The whole log files are attached: slurm-452668.out.txt, slurm-452676.out.txt

Could you help fix the problem? I'm trying to check what's going on but haven't figured out why they crashed. The only thing that looks strange to me is the samtools command shown above; isn't samtools meant for BAM/SAM files rather than FASTQ? `samtools sort -n results/s11578/dnaseq/reads/dna_tumor_R1_preproc.fq.gz -T tmp/ | samtools fastq > results/s11578/hla/reads/dna_tumor_DNA_PE_R1.fq`

Thanks!

riasc commented 2 weeks ago

Hi,

thank you for reporting this. As it turned out, the issue was the non-existent tmp folder (which was not checked for with PE reads). I have resolved this in #27. I did a test run with PE reads and no longer see that error. Could you re-run this with v0.2.5? Thanks a lot.
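For context, `samtools sort -T` takes a temporary-file prefix, so the directory part of that prefix has to exist before the sort starts. A minimal illustration (file names are placeholders, not the actual rule):

```bash
# samtools sort -T PREFIX writes temporary files like PREFIX.NNNN.bam,
# so the directory component of PREFIX must already exist
mkdir -p tmp/
samtools sort -n reads_preproc.fq.gz -T tmp/sorttmp | samtools fastq - > reads.fq
```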

Xiao-Zhong commented 2 weeks ago

Hi Richard,

This issue looks solved. Thanks! But unfortunately I encountered another error at the BWA alignment step:

```
Activating conda environment: .snakemake/conda/10d82fd738e7656fc31d371d8af4d92f_
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1804118 sequences (180000082 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (5, 788379, 13, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (163, 197, 245)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 409)
[M::mem_pestat] mean and std.dev: (208.28, 60.13)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 491)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (164, 444, 953)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 2531)
[M::mem_pestat] mean and std.dev: (482.00, 340.81)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 3320)
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_pestat] skip orientation RF
[M::mem_process_seqs] Processed 1804118 reads in 241.201 CPU sec, 15.241 real sec
[E::aux_parse] unrecognized type ':'
[W::sam_read1_sam] Parse error at line 27
samtools addreplacerg: [readgroupise] Error reading from input file: Invalid argument
[Thu Jun 20 16:35:08 2024]
Error in rule bwa_align_dnaseq:
    jobid: 42
    input: results/s11578/dnaseq/reads/dna_tumor_R1_preproc.fq.gz, results/s11578/dnaseq/reads/dna_tumor_R2_preproc.fq.gz, resources/refs/bwa/genome.amb, resources/refs/bwa/genome.ann, resources/refs/bwa/genome.bwt, resources/refs/bwa/genome.pac, resources/refs/bwa/genome.sa
    output: results/s11578/dnaseq/align/dna_tumor_aligned_BWA.bam
    log: logs/s11578/bwa_align/dnaseq_dna_tumor.log (check log file(s) for error details)
    conda-env: /scratch/ms002/xzhong/04ScanNeo2/v0.2.5/ScanNeo2/.snakemake/conda/10d82fd738e7656fc31d371d8af4d92f
    shell:
        bwa mem -t18 -C resources/refs/bwa/genome results/s11578/dnaseq/reads/dna_tumor_R1_preproc.fq.gz results/s11578/dnaseq/reads/dna_tumor_R2_preproc.fq.gz
        | samtools addreplacerg -r ID:dna_tumor -r SM:s11578 -r LB:s11578 -r PL:ILLUMINA -r PU:dna_tumor - -
        | samtools sort -@ 6 -n -m1g - -o results/s11578/dnaseq/align/dna_tumor_aligned_BWA.bam > logs/s11578/bwa_align/dnaseq_dna_tumor.log 2>&1
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job bwa_align_dnaseq since they might be corrupted:
```

The full log file is attached. slurm-453110.out.txt

This time, I created a new working folder, downloaded your new pipeline and ran it.

Cheers, Xiao

Xiao-Zhong commented 2 weeks ago

Oh, I forgot to say: the error looks related to adding the read group tag to the SAM/BAM file with samtools.

I usually do this with BWA's -R option, as below:

```bash
$BWA -t $THREADS \
    -M \
    -R "@RG\tID:1\tSM:$TYPE\tLB:lib1\tPL:ILLUMINA" \
    $REF/hg38.fa \
    $READ1 $READ2 > $OUTPUT/$SAMPLE.sam
```

riasc commented 2 weeks ago

Hi, sorry for the issues you are experiencing... I tested it with another dataset and it runs through. Would it be possible to share some of the data with which you are experiencing the issues (or a small subset of it)?

Yes, you are right. In principle one could set the read groups directly in the BWA call, but it should not make any difference.

Also, does the command you posted, with the RG set directly in the BWA call, work with your data? Thanks

Xiao-Zhong commented 2 weeks ago

Hi, thanks for your support!

Attached is a small set of the reads (1000 reads per file), including normal, tumour, and RNA-seq: reads.tar.gz

I'm wondering if the issue is caused by the read IDs in my FASTQ files, even though they're raw sequencing data. See below.

```
==> normal_R1.fastq <==
@700666F:201:C9WM3ANXX:1:1101:1383:2000 1:N:0:GACAGTGC
NTGTGAGATGTTGGTCAAAGAATACAAAGTTTGTTATACAAGATGCATAAGTCATGGAGAGCTAATTACAGCATGATGACTGTAGTTAATTCTATTGTAT
+
<FGGBGCGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGFCEFEGGGGGGGGGGGGGGFGGGGGGGGG

==> normal_R2.fastq <==
@700666F:201:C9WM3ANXX:1:1101:1383:2000 2:N:0:GACAGTGC
GTATAATTGACAAGTAGAAACTATATATATTTAAGGTATACAGTGTGTGAATTACATTATCAAACAATCGAGCTAGTTAACCTATCCATCACCTCACATT
+
CBBCCGGGGGGGGGGG1FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGEGGGGGFGGGGGGGGGGGGGGGGFEF0DGGGGGG

==> RNA_R1.fastq <==
@700666F:200:C9WECANXX:8:2211:1148:1958 1:N:0:CTTGTA
NTGGCACCTAAGAGTACTTTGGCAGAGGGGTGGGGGTCATGGCTGGATTCCCTGGGCCAGAATATCTGTTTTCAGAATGCTCCAGAGAAAGGCAGGGCTGAGATGACTCTGGATGGCACGCACAG
+
<=BBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGG

==> RNA_R2.fastq <==
@700666F:200:C9WECANXX:8:2211:1148:1958 2:N:0:CTTGTA
CAGCCCCAGAAGGGGGCATTATGGGCCCTGCCTCCCCATAGGCCATTTGGACTCTGCCTTCAAACAAAGGCAGTTCAGTCCACAGGCATGGAAGCTGTGAGGGGACAGGCCTGTGCGTGCCATCC
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGEGGGGGGGGGGGGGGGGGE

==> tumor_R1.fastq <==
@700666F:201:C9WM3ANXX:1:1101:1717:1994 1:N:0:AAGGACAC
NATGATAGTATAATCCTCTCTTTTTTGTACACAGAGTAAAGAGGACAAATAGGTGAAAGAATAAATGAAAGGCTGGAATCCCACTTCCCCCGCTGTCCCA
+
<AABEFGGGGGGGGGGGGGGGGGGGGG@CCGGGG=DGGGGGGGGGFDGGGGGGGGGCGBGG@FDGGGGGEGGGGGGGGGGGFGGGGGGG@GGG0FFG

==> tumor_R2.fastq <==
@700666F:201:C9WM3ANXX:1:1101:1717:1994 1:N:0:AAGGACAC
NATGATAGTATAATCCTCTCTTTTTTGTACACAGAGTAAAGAGGACAAATAGGTGAAAGAATAAATGAAAGGCTGGAATCCCACTTCCCCCGCTGTCCCA
+
<AABEFGGGGGGGGGGGGGGGGGGGGG@CCGGGG=DGGGGGGGGGFDGGGGGGGGGCGBGG@FDGGGGGEGGGGGGGGGGGFGGGGGGG@GGG0FFG
```

Cheers, Xiao

Xiao-Zhong commented 2 weeks ago

OK. I found the FASTQ files in .tests/integration. They appear to have been downloaded from the NCBI SRA database, while my data was generated by an Illumina sequencer and has never been uploaded to or downloaded from the SRA database.

```
@SRR22881654.1 1 length=254
CAGCGCCCCCTGCTGGCGCCGGGGCACTGCAGGGCCCTCTTGCTTACTGTATAGTGGTGGCACGCCGCCTGCTGGCAGCTAGGGACATTGCAGGGTCCTCTTGCTCAAGGTGTAGTGGCAGCACGCCGGCGTGCTGCCACTACACCTTGAGCAAGAGGACCCTGCAATGTCCCTAGCTGCCAGCAGGCGGCGTGCCACCACTATACAGTAAGCAAGAGGGCCCTGCAGTGCCCCGGCGCCAGCAGGGGGCGCTG
+SRR22881654.1 1 length=254
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ-AAFFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJAJ
```

See the details of the formats. https://en.wikipedia.org/wiki/FASTQ_format

riasc commented 2 weeks ago

Thanks for providing me with the data. It seems like the -C flag is the culprit in the bwa mem call: it copies the FASTQ comment (e.g., "1:N:0:GACAGTGC" in your headers) into the SAM record, where it is not a valid aux tag, hence the [E::aux_parse] error. The flag is only needed when the input is a BAM file. I changed that in v0.2.6, which should resolve the error with your dataset. I will test the whole workflow with the test data later... but so far it runs through...
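Roughly, the change amounts to dropping -C from the alignment call, along these lines (a sketch; paths and read-group values are illustrative, not the exact rule):

```bash
# without -C, the Illumina comment is no longer copied into the SAM record,
# so samtools no longer trips over the invalid aux tag
bwa mem -t 18 resources/refs/bwa/genome dna_tumor_R1_preproc.fq.gz dna_tumor_R2_preproc.fq.gz \
    | samtools addreplacerg -r ID:dna_tumor -r SM:s11578 -r LB:s11578 -r PL:ILLUMINA -r PU:dna_tumor - \
    | samtools sort -@ 6 -n -m 1g -o dna_tumor_aligned_BWA.bam -
```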

Thanks

Xiao-Zhong commented 2 weeks ago

Your pipeline moved a little bit further, beyond BWA.

slurm-453236.out.txt

Here, the error was caused by an older version of samtools (1.9, which lacks the -u option) that is incompatible with the command (which uses -u).
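For reference, the samtools version actually picked up can be checked from inside the activated conda environment (a quick sanity check, nothing pipeline-specific):

```bash
# print the version string of whichever samtools binary is first on PATH
samtools --version | head -n 2
```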

Thanks! A thorough check and test would be very helpful.

Xiao-Zhong commented 1 week ago

Hi, I think I have now run ScanNeo2 v0.2.5 through for one sample, after working around a few issues with pyfaidx, regtools, bedtools, SplAdder, etc. I cannot remember exactly what the issues were, but in most cases I installed the tools manually or edited problematic commands.

BTW, the way I ran ScanNeo2, shown above, is very slow; I believe it ran the jobs sequentially on a single compute node, as in a single VM. I need to figure out how to run the ScanNeo2 processes in parallel if possible.
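Perhaps the executor plugin mentioned above is the way to go (untested on my side; my understanding is that `--cores all` only parallelizes within the one allocated node, while `--jobs` with the SLURM executor spreads rule instances across separate jobs):

```bash
# up to 20 concurrent SLURM jobs, one per independent rule instance
snakemake --executor slurm --jobs 20 --software-deployment-method conda
```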

Plus, does ScanNeo2 use the dna_normal reads? My input data includes dna_tumor, rna_tumor, and normal in config.yaml, as below. During the run, I haven't seen the normal DNA reads being used. I will check your paper and my analysis further.

```yaml
data:
  name: s12558
  dnaseq:
    dna_tumor: /group/ms002/project/Data_Human/Lung/12558/12558T2019205_1.fastq.gz /group/ms002/project/Data_Human/Lung/12558/12558T2019205_2.fastq.gz
  rnaseq:
    rna_tumor: /group/ms002/project/Data_Human/Lung/12558/ID12558T2019205_1.fastq.gz /group/ms002/project/Data_Human/Lung/12558/ID12558T2019205_2.fastq.gz
  normal: /group/ms002/project/Data_Human/Lung/12558/DNA12558D1237_1.fastq.gz /group/ms002/project/Data_Human/Lung/12558/DNA12558D1237_2.fastq.gz
```

Look forward to your test and update for ScanNeo2. Thank you!

riasc commented 1 week ago

Hi,

that's strange, as the versions from the conda environments should suffice. I will test it again on another HPC instance.

Well, yes, the normal samples should also be processed. The only difference (at the moment) is that in some steps the normal samples are ignored.

The major bottleneck is probably the (short) indel calling. Unfortunately, GATK cannot be parallelized much... I'm distributing this over each chromosome, but GATK does not provide an option to speed it up further... (especially since the best practice requires doing the germline calling first, filtering out specific variants, and then doing another round of variant calling).
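To illustrate the per-chromosome scatter idea (this is not ScanNeo2's actual rule; the tool, file names, and chromosome list are placeholders):

```bash
# illustrative scatter: one variant-calling job per chromosome, run concurrently
for chr in chr{1..22} chrX chrY; do
    gatk Mutect2 -R genome.fa -I dna_tumor.bam -L "$chr" -O "calls.${chr}.vcf.gz" &
done
wait   # block until all per-chromosome calls have finished
```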

riasc commented 1 week ago

So, for example, regtools should be used via the Singularity container... using the software-deployment option. It's strange that you had to install it yourself.

Xiao-Zhong commented 1 week ago

> So, for example, regtools should be used via the Singularity container... using the software-deployment option. It's strange that you had to install it yourself.

The HPC I'm using doesn't have Singularity, so I had to run it with conda, as below: `snakemake --rerun-incomplete --cores all --software-deployment-method conda`

Does your pipeline install ScanExitron with both Singularity and conda, or with Singularity only? If Singularity only, my run probably didn't install ScanExitron properly (including regtools).

Xiao-Zhong commented 1 week ago

> Hi,
>
> that's strange, as the versions from the conda environments should suffice. I will test it again on another HPC instance.
>
> Well, yes, the normal samples should also be processed. The only difference (at the moment) is that in some steps the normal samples are ignored.
>
> The major bottleneck is probably the (short) indel calling. Unfortunately, GATK cannot be parallelized much... I'm distributing this over each chromosome, but GATK does not provide an option to speed it up further... (especially since the best practice requires doing the germline calling first, filtering out specific variants, and then doing another round of variant calling).

Thanks very much! It's very helpful to have a test on a new platform or environment from scratch, as a user would.

riasc commented 1 week ago

> Does your pipeline install ScanExitron with both Singularity and conda, or with Singularity only? If Singularity only, my run probably didn't install ScanExitron properly (including regtools).

Yes, the dependencies for ScanExitron are resolved via the Docker container that is loaded through Docker/Singularity. It's the only option...

https://github.com/ylab-hi/ScanNeo2/blob/7dcdff88d73d99de2f71d13447b1982355e6e75f/workflow/rules/exitron.smk#L53-L54

The main reason for that is that regtools v0.4.2 is not available via conda (only newer versions). But yeah, you would have to install it manually along with the other packages... which I think you did. I'm assuming you removed the container directive then? Does it work like this? It's also possible to install Singularity via conda, but I haven't tried it so far because on our HPC the installed Singularity prevents the use of Singularity via conda.

Unfortunately, it's the only solution so far. A colleague from the group is working on an improved version that should run without Singularity; I have to ask him how far along he is with that. However, do you get it to work when you install it manually? Another solution, which isn't great, is to disable the exitron module, determine the exitrons yourself (using ScanExitron or any other tool), and provide the resulting VCF file as a custom file to the workflow.
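For reference, a rough sketch of a manual regtools build at the pinned version (build steps follow the regtools README; the exact tag name on the griffithlab repository is an assumption, so verify with `git tag`):

```bash
# sketch only: build regtools from source (requires git, cmake, and a C++ compiler)
git clone https://github.com/griffithlab/regtools.git
cd regtools
git checkout 0.4.2        # assumed tag name for v0.4.2
mkdir build && cd build
cmake .. && make          # produces the regtools binary in build/
```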

Xiao-Zhong commented 1 week ago

Thanks!

I first had to run it without Singularity, because our HPC, like yours, doesn't have Singularity: `snakemake --rerun-incomplete --cores all --software-deployment-method conda`

But I managed to install Singularity via conda, as below. Installing Singularity is said to require root permission, while executing it doesn't; in my case, conda seems to have overcome that limitation: `conda create -n singularity-env`, then `conda activate singularity-env`, then `conda install -c conda-forge singularity`.

Later, I also reran the pipeline a few times as: `snakemake --rerun-incomplete --cores all --software-deployment-method conda singularity`.

I'm still checking whether the results are complete and as expected. Meanwhile, I'm running your pipeline on another sample and haven't seen any issues. Great!

Sorry, I cannot describe exactly what I did. Sometimes I edited your code a little, and sometimes I installed dependencies manually, though the installed version may not be identical to the one defined in your pipeline (e.g., regtools).

riasc commented 1 week ago

> Thanks!
>
> I first had to run it without Singularity, because our HPC, like yours, doesn't have Singularity: `snakemake --rerun-incomplete --cores all --software-deployment-method conda`
>
> But I managed to install Singularity via conda, as below. Installing Singularity is said to require root permission, while executing it doesn't; in my case, conda seems to have overcome that limitation: `conda create -n singularity-env`, then `conda activate singularity-env`, then `conda install -c conda-forge singularity`.
>
> Later, I also reran the pipeline a few times as: `snakemake --rerun-incomplete --cores all --software-deployment-method conda singularity`.
>
> I'm still checking whether the results are complete and as expected. Meanwhile, I'm running your pipeline on another sample and haven't seen any issues. Great!
>
> Sorry, I cannot describe exactly what I did. Sometimes I edited your code a little, and sometimes I installed dependencies manually, though the installed version may not be identical to the one defined in your pipeline (e.g., regtools).

Thanks for your insights. It's no problem as long as you get it to run. Let me know if you run into other problems. I did a re-run with the original TESLA data on a fresh install. So far it runs through... at least until the prioritization. I may be able to install regtools into an alternative conda environment and use that instead... need to look into that.

Thanks again