Hi Xiao,
thanks for your interest in using the tool. There is a snakemake executor plugin that allows you to run Snakemake workflows over SLURM: https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/slurm.html
Once this is installed you only have to add `--executor slurm` to the snakemake call.
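A rough sketch of such a call (untested here; `--jobs` just caps the number of concurrently submitted jobs, and cluster account/partition settings would be passed via the plugin's resource options as documented in the catalog page above):

```
# one-time install of the executor plugin (e.g. into the same env as snakemake)
pip install snakemake-executor-plugin-slurm

# let snakemake submit each rule as its own SLURM job
snakemake --executor slurm --jobs 20 --software-deployment-method conda
```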
Another way is to simply run it normally in SLURM's interactive mode, e.g.,
srun --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
and then use the basic call (`snakemake --cores all --software-deployment-method conda ...`) as before.
This may be a bit tedious if you have many different samples you want to have analyzed simultaneously. Actually, I am working on making this more straightforward in v0.3, which may be released sometime next week.
Thanks! A simple way for me now is to put the command into a SLURM script as below, and submit the script with sbatch.
```
#!/bin/bash
snakemake --cores all --software-deployment-method conda
```
Can I ask what's the best way to run ScanNeo2 for multiple samples? I ran it for another sample in a separate directory as below, and it looks like the ScanNeo2 pipeline downloads the dependencies and reference database again.
Step 1:
```
mkdir -p /path/to/your/working/directory/
cd /path/to/your/working/directory/
git clone --recurse-submodules https://github.com/ylab-hi/ScanNeo2.git
cd ScanNeo2
```
Step 2: Edit config/config.yaml
Step 3: Run the SLURM script above by submitting it with sbatch, as shown below.
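For example, assuming the two lines above were saved as scanneo2_run.sh (the file name is just a placeholder):

```
sbatch scanneo2_run.sh
```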
Thank you!
Hi,
thank you for your feedback. In principle you can run ScanNeo2 from the same directory (for different files) using a different config file. Just change the name
(https://github.com/ylab-hi/ScanNeo2/blob/52a3818ec3189af502f18eba6b6a1a69b9b3a8c3/config/config.yaml#L12) in the config file and the results will end up in a different folder within "results". But make sure you first run it until all the necessary files (like the genome and VEP) are downloaded... otherwise different calls may overwrite these files...
I need to check if snakemake somehow allows locking these files such that they are only downloaded once and not overwritten by multiple processes. However, if you create a new folder each time, you would have to download them again each time.
But you gave me an idea. I could write a small script that accepts multiple files and distributes them across different nodes... let me look into that.
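For now, something along these lines should already work from a single working directory (a rough sketch; the per-sample config file names are hypothetical, each differing only in the name field and the input paths):

```
#!/bin/bash
# run ScanNeo2 once per sample, each run with its own config file;
# results then end up under results/<name>/ for each sample
for cfg in config/config_s11578.yaml config/config_s12558.yaml; do
    snakemake --cores all \
        --software-deployment-method conda \
        --configfile "$cfg"
done
```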
Thanks
OK, I see. Thanks! I believe the two jobs successfully downloaded and/or installed the dependencies, however both terminated during data pre-processing. The error is as below:
```
[Tue Jun 18 00:36:20 2024]
Job 102: Retrieve paired-end reads (DNA) for HLA genotyping - sample:s11578 group:dna_tumor
Reason: Missing output files: results/s11578/hla/reads/dna_tumor_DNA_PE_R2.fq, results/s11578/hla/reads/dna_tumor_DNA_PE_R1.fq; Input files updated by another job: results/s11578/dnaseq/reads/dna_tumor_R1_pre>

Activating conda environment: .snakemake/conda/d8900894438122f24cfe02e35dcb6bfd_
No input file specified.
[W::sam_read1] Parse error at line 2
samtools sort: truncated file. Aborting

[Tue Jun 18 00:36:20 2024]
Error in rule get_reads_hlatyping_PE:
    jobid: 102
    input: results/s11578/dnaseq/reads/dna_tumor_R1_preproc.fq.gz, results/s11578/dnaseq/reads/dna_tumor_R2_preproc.fq.gz
    output: results/s11578/hla/reads/dna_tumor_DNA_PE_R1.fq, results/s11578/hla/reads/dna_tumor_DNA_PE_R2.fq
    log: logs/s11578/hla/get_reads_hlatyping_PE_dna_tumor_DNA.log (check log file(s) for error details)
    conda-env: /scratch/ms002/xzhong/04ScanNeo2/ScanNeo2/.snakemake/conda/d8900894438122f24cfe02e35dcb6bfd
    shell:
        samtools sort -n results/s11578/dnaseq/reads/dna_tumor_R1_preproc.fq.gz -T tmp/ | samtools fastq > results/s11578/hla/reads/dna_tumor_DNA_PE_R1.fq
        samtools sort -n results/s11578/dnaseq/reads/dna_tumor_R2_preproc.fq.gz -T tmp/ | samtools fastq > results/s11578/hla/reads/dna_tumor_DNA_PE_R2.fq
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job get_reads_hlatyping_PE since they might be corrupted: results/s11578/hla/reads/dna_tumor_DNA_PE_R1.fq
Finished job 15. 7 of 116 steps (6%) done
Finished job 16. 8 of 116 steps (7%) done
[Tue Jun 18 01:38:17 2024] Finished job 56. 9 of 116 steps (8%) done
[Tue Jun 18 02:20:37 2024] Finished job 19. 10 of 116 steps (9%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-06-18T003023.457455.snakemake.log
WorkflowError: At least one job did not complete successfully.
```
The full log files are attached: slurm-452668.out.txt, slurm-452676.out.txt
Could you help fix the problem? I'm trying to check what's going on, but haven't figured out why they crashed. The only thing that looks strange to me is the samtools command shown above. The issue here is that samtools should be used on BAM/SAM files, shouldn't it?
samtools sort -n results/s11578/dnaseq/reads/dna_tumor_R1_preproc.fq.gz -T tmp/ | samtools fastq > results/s11578/hla/reads/dna_tumor_DNA_PE_R1.fq
Thanks!
Hi,
thank you for reporting this. As it turned out, the issue was the non-existent tmp folder (which is not checked for PE reads). I have resolved this in #27. I did a test run with PE reads and I don't see that error anymore. Could you re-run this with v0.2.5? Thanks a lot.
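In the meantime, a possible manual workaround on the current install (just a sketch based on the cause above, assuming snakemake is started from the ScanNeo2 working directory) would be to create the missing temp directory before resuming:

```
# samtools sort -T tmp/ needs this directory to exist
mkdir -p tmp/
snakemake --rerun-incomplete --cores all --software-deployment-method conda
```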
Hi Richard,
This issue looks solved, thanks! But unfortunately I ran into another one, at the BWA alignment step:

```
Activating conda environment: .snakemake/conda/10d82fd738e7656fc31d371d8af4d92f_
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1804118 sequences (180000082 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (5, 788379, 13, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (163, 197, 245)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 409)
[M::mem_pestat] mean and std.dev: (208.28, 60.13)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 491)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (164, 444, 953)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 2531)
[M::mem_pestat] mean and std.dev: (482.00, 340.81)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 3320)
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_pestat] skip orientation RF
[M::mem_process_seqs] Processed 1804118 reads in 241.201 CPU sec, 15.241 real sec
[E::aux_parse] unrecognized type ':'
[W::sam_read1_sam] Parse error at line 27
samtools addreplacerg: [readgroupise] Error reading from input file: Invalid argument

[Thu Jun 20 16:35:08 2024]
Error in rule bwa_align_dnaseq:
    jobid: 42
    input: results/s11578/dnaseq/reads/dna_tumor_R1_preproc.fq.gz, results/s11578/dnaseq/reads/dna_tumor_R2_preproc.fq.gz, resources/refs/bwa/genome.amb, resources/refs/bwa/genome.ann, resources/refs/bwa/genome.bwt, resources/refs/bwa/genome.pac, resources/refs/bwa/genome.sa
    output: results/s11578/dnaseq/align/dna_tumor_aligned_BWA.bam
    log: logs/s11578/bwa_align/dnaseq_dna_tumor.log (check log file(s) for error details)
    conda-env: /scratch/ms002/xzhong/04ScanNeo2/v0.2.5/ScanNeo2/.snakemake/conda/10d82fd738e7656fc31d371d8af4d92f
    shell:
        bwa mem -t18 -C resources/refs/bwa/genome results/s11578/dnaseq/reads/dna_tumor_R1_preproc.fq.gz results/s11578/dnaseq/reads/dna_tumor_R2_preproc.fq.gz | samtools addreplacerg -r ID:dna_tumor -r SM:s11578 -r LB:s11578 -r PL:ILLUMINA -r PU:dna_tumor - - | samtools sort -@ 6 -n -m1g - -o results/s11578/dnaseq/align/dna_tumor_aligned_BWA.bam > logs/s11578/bwa_align/dnaseq_dna_tumor.log 2>&1
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job bwa_align_dnaseq since they might be corrupted:
```
The full log file is attached. slurm-453110.out.txt
This time, I created a new working folder, downloaded your new pipeline and ran it.
Cheers, Xiao
Oh, forgot to say: the error looks related to adding the read group tag to the SAM/BAM file with samtools.
I usually do this with BWA's -R option, as below:
```
$BWA -t $THREADS \
    -M \
    -R "@RG\tID:1\tSM:$TYPE\tLB:lib1\tPL:ILLUMINA" \
    $REF/hg38.fa \
    $READ1 $READ2 > $OUTPUT/$SAMPLE.sam
```
Hi, sorry for the issues you are experiencing... I tested it with another dataset and it runs through. Would it be possible to share some of the data with which you are experiencing the issues (or a small subset of it)?
Yes, you are right. In principle one could set the read groups directly in the BWA call. But it should not make any difference.
Also, does the command you posted, with the RG set directly in the BWA call, work with your data? Thanks
Hi, Thanks for your support!
Attached is a small set of the reads (1000 reads per file), including normal, tumour and RNA-Seq. reads.tar.gz
I'm wondering if the issue is caused by the read IDs in my fastq files, though they're raw sequencing data. See below.
```
==> normal_R1.fastq <==
@700666F:201:C9WM3ANXX:1:1101:1383:2000 1:N:0:GACAGTGC
NTGTGAGATGTTGGTCAAAGAATACAAAGTTTGTTATACAAGATGCATAAGTCATGGAGAGCTAATTACAGCATGATGACTGTAGTTAATTCTATTGTAT
+

==> normal_R2.fastq <==
@700666F:201:C9WM3ANXX:1:1101:1383:2000 2:N:0:GACAGTGC
GTATAATTGACAAGTAGAAACTATATATATTTAAGGTATACAGTGTGTGAATTACATTATCAAACAATCGAGCTAGTTAACCTATCCATCACCTCACATT
+
CBBCCGGGGGGGGGGG1FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGEGGGGGFGGGGGGGGGGGGGGGGFEF0DGGGGGG

==> RNA_R1.fastq <==
@700666F:200:C9WECANXX:8:2211:1148:1958 1:N:0:CTTGTA
NTGGCACCTAAGAGTACTTTGGCAGAGGGGTGGGGGTCATGGCTGGATTCCCTGGGCCAGAATATCTGTTTTCAGAATGCTCCAGAGAAAGGCAGGGCTGAGATGACTCTGGATGGCACGCACAG
+

==> RNA_R2.fastq <==
@700666F:200:C9WECANXX:8:2211:1148:1958 2:N:0:CTTGTA
CAGCCCCAGAAGGGGGCATTATGGGCCCTGCCTCCCCATAGGCCATTTGGACTCTGCCTTCAAACAAAGGCAGTTCAGTCCACAGGCATGGAAGCTGTGAGGGGACAGGCCTGTGCGTGCCATCC
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGEGGGGGGGGGGGGGGGGGE

==> tumor_R1.fastq <==
@700666F:201:C9WM3ANXX:1:1101:1717:1994 1:N:0:AAGGACAC
NATGATAGTATAATCCTCTCTTTTTTGTACACAGAGTAAAGAGGACAAATAGGTGAAAGAATAAATGAAAGGCTGGAATCCCACTTCCCCCGCTGTCCCA
+

==> tumor_R2.fastq <==
@700666F:201:C9WM3ANXX:1:1101:1717:1994 1:N:0:AAGGACAC
NATGATAGTATAATCCTCTCTTTTTTGTACACAGAGTAAAGAGGACAAATAGGTGAAAGAATAAATGAAAGGCTGGAATCCCACTTCCCCCGCTGTCCCA
+
```
Cheers, Xiao
OK, I found the fastq files in .tests/integration. They appear to have been downloaded from the NCBI SRA database, while my data was generated by an Illumina sequencer and hasn't been uploaded to or downloaded from the SRA database.
```
@SRR22881654.1 1 length=254
CAGCGCCCCCTGCTGGCGCCGGGGCACTGCAGGGCCCTCTTGCTTACTGTATAGTGGTGGCACGCCGCCTGCTGGCAGCTAGGGACATTGCAGGGTCCTCTTGCTCAAGGTGTAGTGGCAGCACGCCGGCGTGCTGCCACTACACCTTGAGCAAGAGGACCCTGCAATGTCCCTAGCTGCCAGCAGGCGGCGTGCCACCACTATACAGTAAGCAAGAGGGCCCTGCAGTGCCCCGGCGCCAGCAGGGGGCGCTG
+SRR22881654.1 1 length=254
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ-AAFFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJAJ
```
See the details of the formats. https://en.wikipedia.org/wiki/FASTQ_format
Thanks for providing me with the data. Seems like the -C flag is the culprit in the bwa mem call; it is only needed when the input is a BAM file. I changed that in v0.2.6, which should resolve the error with your dataset. Will test the whole workflow with the test data later... but so far it runs through...
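For context: with plain fastq input, -C copies the Illumina comment (e.g. 1:N:0:GACAGTGC) into the SAM record, where it is not a valid aux tag, hence the [E::aux_parse] unrecognized type ':' message. Roughly, the call then becomes something like this (a sketch based on the failing shell command in the log above, not the exact rule in the repo):

```
# same pipeline as in the log, with -C dropped
bwa mem -t 18 resources/refs/bwa/genome \
    results/s11578/dnaseq/reads/dna_tumor_R1_preproc.fq.gz \
    results/s11578/dnaseq/reads/dna_tumor_R2_preproc.fq.gz \
  | samtools addreplacerg -r ID:dna_tumor -r SM:s11578 -r LB:s11578 -r PL:ILLUMINA -r PU:dna_tumor - \
  | samtools sort -@ 6 -n -m1g - -o results/s11578/dnaseq/align/dna_tumor_aligned_BWA.bam
```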
Thanks
Your pipeline moved a little bit further, beyond BWA.
Here, the error below was caused by an older samtools version being picked up (1.9, which doesn't support -u) that is incompatible with the command (which uses -u).
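For reference, one can check which samtools binary and version a rule actually picks up, e.g.:

```
# run inside the conda env that the failing rule activates;
# the env created by the workflow should normally provide a newer samtools
which samtools
samtools --version | head -n 1
```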
Thanks! That would be very helpful to have a thorough check and test.
Hi, I think I have run ScanNeo2/v0.2.5 through for one sample, after working around a few issues with pyfaidx, regtools, bedtools, spladder, etc. I cannot remember exactly what the issues were, but in most cases I installed the tools manually or edited problematic commands.
BTW, the way I ran ScanNeo2 as shown above is very slow; I believe it ran the jobs sequentially on a single compute node, as in a single VM. I need to figure out how to run the ScanNeo2 processes in parallel, if possible.
Plus, does ScanNeo2 use the dna_normal reads? My input data includes dna_tumor, rna_tumor and normal in config.yaml, as below. During the run, I haven't seen the normal DNA reads being used. I will check your paper and my analysis further.
```
data:
  name: s12558
  dnaseq:
    dna_tumor: /group/ms002/project/Data_Human/Lung/12558/12558T2019205_1.fastq.gz /group/ms002/project/Data_Human/Lung/12558/12558T2019205_2.fastq.gz
  rnaseq:
    rna_tumor: /group/ms002/project/Data_Human/Lung/12558/ID12558T2019205_1.fastq.gz /group/ms002/project/Data_Human/Lung/12558/ID12558T2019205_2.fastq.gz
  normal: /group/ms002/project/Data_Human/Lung/12558/DNA12558D1237_1.fastq.gz /group/ms002/project/Data_Human/Lung/12558/DNA12558D1237_2.fastq.gz
```
Looking forward to your test and update for ScanNeo2. Thank you!
Hi,
that's strange, as the versions from the conda environments should suffice. I will test it again on another HPC instance.
Well, yes, the normal samples should also be processed. The only difference (at the moment) is that in some steps the normal samples are ignored.
The major bottleneck is probably the (short) indel calling. Unfortunately, GATK cannot be parallelized too much... I'm distributing this over each chromosome, but GATK does not provide an option to speed it up further (especially since the best practice requires doing the germline calling first, filtering out specific variants, and then doing another round of variant calling).
So, for example, regtools should be used via the Singularity container... using the software-deployment option. It's strange that you had to install it yourself.
The HPC I'm using doesn't have singularity, so I had to run it with conda as below.
snakemake --rerun-incomplete --cores all --software-deployment-method conda
Does your pipeline install "scanexitron" with both Singularity and conda, or with Singularity only? If Singularity only, my run probably didn't install scanexitron properly (including regtools).
Thanks very much! It would be very helpful to have a test on a new platform or environment from scratch, as a user would.
Yes, the dependencies for scanexitron are resolved via the Docker container that is loaded via Docker/Singularity. It's the only option...
The main reason for that is that regtools v0.4.2 is not available via conda (only newer versions). But yeah, you would have to install it manually along with the other packages... which I think you did. But I'm assuming you removed the container directive then? Does it work like this? It's also possible to install Singularity via conda, but I haven't tried it so far because on our HPC the installed Singularity prevents using Singularity via conda.
Unfortunately, it's the only solution so far. A colleague from the group is working on an improved version that should run without Singularity. I have to ask him how far along he is with that. However, do you get it to work when you install it manually? Another solution, which isn't great, is to disable the exitron module, determine the exitrons yourself (using ScanExitron or any other tool) and then provide the resulting VCF file as a custom file to the workflow.
Thanks!
I first had to run it without Singularity, because our HPC, like yours, doesn't have Singularity.
snakemake --rerun-incomplete --cores all --software-deployment-method conda
But I managed to install Singularity via conda, as below. It's said that installing Singularity requires root permission, while executing it doesn't; in my case, conda seems to have overcome that limit.

```
conda create -n singularity-env
conda activate singularity-env
conda install -c conda-forge singularity
```
Later, I also reran the pipeline a few times as:
snakemake --rerun-incomplete --cores all --software-deployment-method conda singularity
I'm still checking if results are complete and expected. Meanwhile I'm running your pipeline for another sample and haven't seen any issue. Great!
Sorry, I cannot describe exactly what I did. Sometimes I edited your code a little bit, sometimes I installed dependencies manually, though the installed version may not be identical to the one defined in your pipeline (like regtools).
Thanks!
Thanks for your insights. It's no problem as long as you get it to run. Let me know if you run into other problems. I did a re-run with the original TESLA data on a new install. So far it runs through... at least until the prioritization. I may be able to install regtools into an alternative conda environment and use that instead... need to look into that.
Thanks again
Hi, so I tried it on HPC myself:
#!/bin/bash
#SBATCH --account=XXX
#SBATCH --partition=XXX
#SBATCH --time=36:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=50
#SBATCH --mem=64G
#SBATCH --job-name=scanneo2_GBMtest
#SBATCH --output=scanneo2.log
PROJECT_DIR=<project/to/scanneo2>
cd $PROJECT_DIR
source /mambaforge/bin/activate /mambaforge/envs/scanneo2
module load singularity
snakemake --cores 50 --software-deployment-method conda apptainer --configfile ${PROJECT_DIR}/config/config.yaml
This assumes the config file is located in config/config.yaml and that scanneo2 is the name of the conda environment. I will probably automate this in the next release. Let me know if you have any other questions. Thanks
Hi, I'm new to Snakemake and have no idea how to submit ScanNeo2 jobs to an HPC. Any instructions would be very helpful! Thanks!
Regards, Xiao