Closed johnbradley closed 2 years ago
Thanks John, I will take a look into them too. I don't think I am familiar with them. I have seen snakemake for a mapping pipeline known as spades I think
Nextflow workflows are written in a domain specific language processed by the groovy programming language. Nextflow runs on a java VM. Nextflow has support for running on public cloud infrastructure. A Nextflow workflow consists of a set of processes that receive data via channels for parallel processing. Each process (step) can have cpus and memory directives that will be passed to slurm. There is also a new version of the nextflow domain specific language, but I found the default language to be simpler. You can run a nextflow workflow like so:
nextflow run <nextflowFilename>
Below are the contents of a nextflow file that runs a single process that will index a genome and save the resulting index files in the same directory.
#!/usr/bin/env nextflow
params.genome = "${baseDir}/NC_045512.fasta"
params.outdir = "${file(params.genome).toRealPath().getParent()}"
process indexReferenceGenome {
conda 'environment.yml'
input:
path GENOME from params.genome
output:
path "*" into indexedGenome
publishDir params.outdir, mode: 'copy'
shell:
'''
bwa index -a is !{GENOME}
'''
}
You can see it will load our conda environment when running this workflow, but the conda support does not work when running on a public cloud. Notice the !{GENOME}
syntax. This expression is replaced with the name of our genome file.
So the shell command becomes bwa index -a is NC_045512.fasta
.
This syntax allows us to use environment variables within our shell commands.
The shell command be put in a separate file.
Below is an example that processes fastq files in parallel trimming and mapping.
#!/usr/bin/env nextflow
params.genome = "${baseDir}/NC_045512.fasta"
# collect the paths so genome index files will be passed together
genome_index_files = Channel.fromPath("${params.genome}.*").collect()
params.fastqFiles = "${baseDir}/*.fastq.gz"
# create a channel so the fastq.gz files will be processed in parallel
fastqChannel = Channel.fromPath(params.fastqFiles)
params.outdir = "${baseDir}/results"
process removeNexteraAdapters {
conda 'environment.yml'
input:
path READFILE from fastqChannel
output:
path '*_trimmed.fq.gz' into trimmedFastqChannel
publishDir params.outdir, mode: 'copy'
shell:
'''
trim_galore --fastqc --nextera !{READFILE}
'''
}
process mapBWA {
conda 'environment.yml'
publishDir params.outdir, mode: 'copy'
input:
path READFILE from trimmedFastqChannel
path GENOME from params.genome
path genome_index_files
output:
path '*.sam' into samFilesChannel
shell:
template 'map-bwa-cleaned-libs.sh'
}
map-bwa-cleaned-libs.sh
root=$(basename !{READFILE} _trimmed.fq.gz)
root2=$(basename $root _R1_001)
root3=$(echo ${root} | cut -d'_' -f 1)
bwa mem -R "@RG\tID:ID_${root3}\tPU:PU_${root3}\tSM:${root3}\tLB:${root}" !{GENOME} !{READFILE} > $root3.sam
This example uses a channel to process the *.fastq.gz files in parallel.
Notice how we run collect()
on the genome index files so they will be passed as a group.
My understanding is the removeNexteraAdapters
process output creates another channel with the into trimmedFastqChannel
syntax.
Notice how the multi-line script for mapping is a separate file:
You can resume a failed run by passing the -resume
flag.
nextflow run <nextflowFilename> -resume
Nextflow does not automatically cleanup intermediate files so you should clean them up when done using:
nextflow clean -n <nextflowFilename>
Instead of using array jobs Nextflow will submit a number of sbatch jobs. There is a queueSize option to control how many sbatch jobs nextflow can run at the same time. Each sbatch job will be limited by the per-process cpus
and memory
directives.
Snakemake workflows are written in a domain specific language processed by python.
Snakemake has support for running on public cloud infrastructure. Conda is supported for running on the public cloud unlike nextflow which requires containers.
A Snakemake workflow consists of a set of rules for creating output files from input files.
You tell Snakemake what output files you want and it will read through the rules and execute them to create your output files. Snakemake creates files in parallel based on the --cores
command line argument.
Each rule can specify requested threads and memory. Snakemake expects a file named Snakefile
containing your
rules in the current directory. To run Snakemake enabling conda with 1 core run the following command:
snakemake --cores 1 --use-conda
Below is the contents of a Snakefile file that will index a genome and save the resulting index files in the same directory.
if not "genome" in config:
config["genome"] = "NC_045512.fasta"
rule all:
input: multiext(config["genome"], ".amb", ".ann", ".bwt", ".pac", ".sa")
rule bwa_index:
input:
config["genome"]
output:
multiext("{genome}", ".amb", ".ann", ".bwt", ".pac", ".sa")
conda:
"environment.yml"
shell:
"bwa index -a is {input}"
The first bit of python code sets a default value for the "genome" file. A user can specify a path for a specific "genome" on the command line.
The first rule in a Snakefile is the default rule in this case the all
rule.
The all
rule says to make NC_045512.fasta.*
files. Snakemake sees that these files do not exist and looks for a rule that outputs these files. Since thebwa_index
rule creates these files it will be run. If the files already exist you will get a message to that effect and no work is done.
The input
block says to input the "genome" which is defaulted to NC_045512.fasta
based on the all
rule.
The output
block builds a list of filenames to create using a function: NC_045512.fasta.amb
,NC_045512.fasta.ann
...
The conda
block says what environment.yaml file to use to create an environment.
The shell
block of the bwa_index
rule specifies what to run. The {...}
notation is used to replace values within it.
In this case we would run bwa index -a is NC_045512.fasta
.
Below is an example that processes fastq files in parallel trimming and mapping.
# set default config[genome] if not specified
if not "genome" in config:
config["genome"] = "NC_045512.fasta"
SAMPLES, = glob_wildcards("{basefilename}_R1_001.fastq.gz")
rule all:
input: expand("{prefix}.sam", prefix=SAMPLES)
rule trim_adapters:
output: "{prefix}_trimmed.fq.gz"
input: "{prefix}.fastq.gz"
conda: "environment.yml"
shell: "trim_galore --fastqc --nextera {input}"
rule map_with_bwa:
output: "{sample}.sam"
input:
genome=config["genome"],
readfile="{sample}_R1_001_trimmed.fq.gz"
params:
read_group_header=r"@RG\tID:ID_{sample}\tPU:PU_{sample}\tSM:{sample}\tLB:{sample}_R1_001"
conda: "environment.yml"
shell: "bwa mem -R \"{params.read_group_header}\" {input.genome} {input.readfile} > {output}"
The SAMPLES, = glob_wildcards
line looks for sample names in *fastq.gz files.
The all
(default) rule says to create .sam
files for the SAMPLE names found.
The map_with_bwa
rule doesn't require a separate file like Nextflow since we can pull out the data we need.
I find the bwa mem ..
line much easier to read here.
NOTE Due to how snakemake works we could combine the above scripts into one file since Snakemake's smart enough to not recreate the index every time.
You just re-run the same command to resume.
snakemake --cores 1 --use-conda
Instead of array jobs Snakemake also runs sbatch jobs. The cores
command line parameter better specifies how many threads/cores to consume. With Nextflow choosing queueSize is less precise and may cause larger allocations than desired for uneven workflows.
The snakemake documentation takes you through an example workflow that maps reads, sorts, indexes, and calls variants. It is certainly worth a read through: https://snakemake.readthedocs.io/en/stable/tutorial/basics.html
Snakemake seems the better option to me because
Something interesting that I think both tools do is create a graphic for the workflow.
For example the Snakemake Trim and map example
you can create a svg like so:
snakemake --dag | dot -Tsvg > dag.svg
In this example I had three input *fastq.gz files.
Snakemake seems the better option to me because
- The language seems more concise and simpler - IMO
- It is easier to control parallelism on HARDAC(Slurm)
- It has conda compatibility when running on the public cloud
- Another group at Duke is using Snakemake Thoughts @wodanaz ?
snakemake seems to be a better option to me too and I think the options you mention are very convenient.
Something interesting that I think both tools do is create a graphic for the workflow. For example the Snakemake
Trim and map example
you can create a svg like so:snakemake --dag | dot -Tsvg > dag.svg
In this example I had three input *fastq.gz files.
This is really cool
There is some guidance on how Snakemake suggests to structure your project: https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#distribution-and-reproducibility Following those guidelines we could have this workflow added to https://snakemake.github.io/snakemake-workflow-catalog (once our workflow is public that is).
For a first step I want to add a simple Snakemake file to this repo that handles the work done for setup-escape-variants.sh. Then we can build up the Snakemake pipeline and remove the duplicated scripts once the Snakemake pipeline is complete.
Below is my plan for running snakemake. This plan is in keeping with our approach so far of making just the minimal changes necessary. This will allow us to continue to make changes to the existing scripts while enjoying the benefits of snakemake. After we have made the transition to snakemake we can revisit the many options snakemake has to further simplify and improve our workflow.
Snakemake handles running multiple jobs at once without overwhelming the cluster using the --cores <number>
argument. So we no longer need to the sbatch array job logic.
For example we would replace lines like
FILENAMES_FILE=$1
BAMFILE=$(awk NR==$SLURM_ARRAY_TASK_ID $FILENAMES_FILE)
with
BAMFILE=$1
Snakemake has the concept of a working directory where new files are created.
By default the current directory is used as the working directory.
A user can specify where this working directory is with the --directory
argument for snakemake
.
To me the snakemake working directory is the equivalent of the EVDIR directory the workflow already uses.
Part of the changes here will be to stop using EVDIR since snakemake handles this seamlessly.
This will simplify the scripts and be closer to the original Escape_Variants.md markdown.
For example MarkDuplicates
will be changed from
picard -Xmx14g -Djava.io.tmpdir=$EVDIR MarkDuplicates I=$BAMFILE O=$EVDIR/$root.dedup.bam M=$EVDIR/$root.metric.txt
to
picard -Xmx14g -Djava.io.tmpdir=. MarkDuplicates I=$BAMFILE O=$root.dedup.bam M=$root.metric.txt
FYI: The "." passed to tmpdir is shorthand for the current directory.
Snakemake has the concept of a YAML file that contains all the settings for running the workflow. Currently we use environment variables to specify how to run the workflow. With the current workflow you must check the logs to determine how to setup these environment variables to re-run the workflow. To simplify re-running the entire workflow (or even just one step) I think we should leverage the snakemake config file. Here is an example config file:
project: assembledVirus
genome: /data/somedir/NC_045512.fasta
spike: /data/somedir/spike.bed
inputdir: /data/somedir/inputfastqdir
mode: campus
datetab: /data/somedir/Assembling_viruses/date.tab
For running the workflow I plan on putting the config file within the related snakemake working directory so there will be no question where to look for it.
By default snakemake will automatically create and use conda environments when the --use-conda
argument is passed.
All we need to do is reference the conda environment yaml file from a rule.
By default snakemake stores the conda environments in the working directory.
To avoid needlessly recreating the conda environment every time we run the workflow we will use the --conda-prefix <directory>
argument.
In addition to the arguments mentioned above we need to specify the slurm profile. The slurm profile is just the slurm
directory that is part of this repo. To run the pipeline you will need to be in the root directory of this repo.
Then given a working directory created at /data/itlab/data/Bradley123
that contains a config.yaml
file snakemake can be run like so:
srun -p interactive --pty bash
module load Anaconda3/2019.10-gcb02
conda activate snakemake
snakemake --cores 10 \
--directory /data/itlab/data/Bradley123 \
--use-conda --conda-prefix /data/itlab/snakemakeconda \
--profile /data/itlab/work/Assembling_viruses/slurm
Details:
10
cores/threads will be used at a time for the running jobs./data/itlab/data/Bradley123
./data/itlab/snakemakeconda
(this could be any directory you want).slurm
directory in the repo.The above command assumes snakemake has already been installed into a conda environment named snakemake
.
I have a plan to switch to snakemake re-using the existing scripts. I would be glad to give a walkthrough of the code changes and running on HARDAC once the code changes are complete.
I have a branch containing a snakemake workflow that has all of the steps: https://github.com/wodanaz/Assembling_viruses/tree/snakemake-workflow I still need to go over it and make sure everything is properly setup but the general idea is all there.
I created a diagram of the workflow based on running the SampleD file through all the rules: dag.pdf
I cleaned up the snakemake branch and it is now at https://github.com/wodanaz/Assembling_viruses/tree/snakemake-clean.
This branch includes changes to the run-dds-escape-variants.sh
script to use the snakemake workflow.
With this branch I tried to strike a compromise between the snakemake best practices and our current scripts, but some scripts and files have moved around. In the long term there are further changes I would like to implement from the snakemake best practices.
@wodanaz Since this is such a big change I would like to walk you through trying it out on HARDAC before merging. Does that sound ok?
Yeah It sounds good! When can you do the walk though?
@wodanaz I'll email you some times.
When the pipeline fails, for example due to not enough memory, it would be helpful to restart after the last successful step.