wodanaz / Assembling_viruses


Add support for restarting the pipeline #46

Closed · johnbradley closed this 2 years ago

johnbradley commented 3 years ago

When the pipeline fails, for example due to insufficient memory, it would be helpful to be able to restart from the last successful step.

johnbradley commented 3 years ago

To support restarting we would need to switch to a full-fledged workflow language. Two workflow languages that are relatively easy to read are Nextflow and Snakemake. I'm going to look into these two options to see which might be a better fit for our needs.

wodanaz commented 3 years ago

Thanks John, I will take a look at them too. I don't think I am familiar with them. I have seen Snakemake used for a mapping pipeline known as SPAdes, I think.

johnbradley commented 3 years ago

Nextflow

Nextflow workflows are written in a domain-specific language based on the Groovy programming language, and Nextflow runs on a Java VM. Nextflow has support for running on public cloud infrastructure. A Nextflow workflow consists of a set of processes that receive data via channels for parallel processing. Each process (step) can have cpus and memory directives that will be passed to Slurm. There is also a newer version of the Nextflow domain-specific language (DSL2), but I found the default language to be simpler. You can run a Nextflow workflow like so:

nextflow run <nextflowFilename>

Simple bwa index example

Below are the contents of a nextflow file that runs a single process that will index a genome and save the resulting index files in the same directory.

#!/usr/bin/env nextflow

params.genome = "${baseDir}/NC_045512.fasta"
params.outdir = "${file(params.genome).toRealPath().getParent()}"

process indexReferenceGenome {
  conda 'environment.yml'

  input:
  path GENOME from params.genome

  output:
  path "*" into indexedGenome

  publishDir params.outdir, mode: 'copy'

  shell:
  '''
  bwa index -a is !{GENOME}
  '''
}

You can see it will load our conda environment when running this workflow, but the conda support does not work when running on a public cloud. Notice the !{GENOME} syntax: this expression is replaced with the name of our genome file, so the shell command becomes bwa index -a is NC_045512.fasta. This syntax also allows us to use environment variables within our shell commands. The shell command can alternatively be put in a separate file.

Trim and map example

Below is an example that processes fastq files in parallel, trimming and mapping them.

#!/usr/bin/env nextflow

params.genome = "${baseDir}/NC_045512.fasta"
// collect the paths so genome index files will be passed together
genome_index_files = Channel.fromPath("${params.genome}.*").collect()

params.fastqFiles = "${baseDir}/*.fastq.gz"
// create a channel so the fastq.gz files will be processed in parallel
fastqChannel = Channel.fromPath(params.fastqFiles)

params.outdir = "${baseDir}/results"

process removeNexteraAdapters {
  conda 'environment.yml'

  input:
  path READFILE from fastqChannel

  output:
  path '*_trimmed.fq.gz' into trimmedFastqChannel

  publishDir params.outdir, mode: 'copy'

  shell:
  '''
  trim_galore --fastqc --nextera !{READFILE}
  '''
}

process mapBWA {
  conda 'environment.yml'

  publishDir params.outdir, mode: 'copy'

  input:
  path READFILE from trimmedFastqChannel
  path GENOME from params.genome
  path genome_index_files

  output:
  path '*.sam' into samFilesChannel

  shell:
  template 'map-bwa-cleaned-libs.sh'
}

map-bwa-cleaned-libs.sh

root=$(basename !{READFILE} _trimmed.fq.gz)
root2=$(basename $root _R1_001)
root3=$(echo ${root} | cut -d'_' -f 1)

bwa mem -R "@RG\tID:ID_${root3}\tPU:PU_${root3}\tSM:${root3}\tLB:${root}" !{GENOME} !{READFILE} > $root3.sam

This example uses a channel to process the *.fastq.gz files in parallel. Notice how we run collect() on the genome index files so they will be passed as a group. My understanding is that the removeNexteraAdapters process output creates another channel via the into trimmedFastqChannel syntax. Also notice how the multi-line script for mapping is kept in the separate template file map-bwa-cleaned-libs.sh shown above.

Resuming

You can resume a failed run by passing the -resume flag.

nextflow run <nextflowFilename> -resume

Nextflow does not automatically clean up intermediate files, so you should clean them up when done using:

nextflow clean -n <nextflowFilename>

Slurm Compatibility

Instead of using array jobs, Nextflow submits a number of individual sbatch jobs. There is a queueSize option to control how many sbatch jobs Nextflow can run at the same time. Each sbatch job will be limited by the per-process cpus and memory directives.

johnbradley commented 3 years ago

Snakemake

Snakemake workflows are written in a domain-specific language processed by Python. Snakemake has support for running on public cloud infrastructure, and conda is supported there, unlike Nextflow, which requires containers in the cloud. A Snakemake workflow consists of a set of rules for creating output files from input files. You tell Snakemake what output files you want and it reads through the rules and executes them to create those files. Snakemake creates files in parallel based on the --cores command line argument, and each rule can specify requested threads and memory. Snakemake expects a file named Snakefile containing your rules in the current directory. To run Snakemake with conda enabled and 1 core, run the following command:

snakemake --cores 1 --use-conda
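For illustration, here is a minimal sketch of the per-rule thread and memory requests mentioned above. The rule name, file names, tool, and resource values are hypothetical and not part of our pipeline:

# hypothetical rule: request 4 threads and ~8 GB of memory for this step
rule sort_bam:
    input: "{sample}.bam"
    output: "{sample}.sorted.bam"
    threads: 4
    resources:
        mem_mb=8000
    conda: "environment.yml"
    shell: "samtools sort -@ {threads} -o {output} {input}"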

Simple bwa index example

Below are the contents of a Snakefile that will index a genome and save the resulting index files in the same directory.

if "genome" not in config:
    config["genome"] = "NC_045512.fasta"

rule all:
    input: multiext(config["genome"], ".amb", ".ann", ".bwt", ".pac", ".sa")

rule bwa_index:
    input:
        config["genome"]
    output:
        multiext("{genome}", ".amb", ".ann", ".bwt", ".pac", ".sa")
    conda:
        "environment.yml"
    shell:
        "bwa index -a is {input}"

The first bit of Python code sets a default value for the "genome" file; a user can specify a path to a different genome on the command line (e.g. snakemake --config genome=/some/path/ref.fasta). The first rule in a Snakefile is the default rule, in this case the all rule. The all rule says to make the NC_045512.fasta.* index files. Snakemake sees that these files do not exist and looks for a rule that outputs them; since the bwa_index rule creates these files it will be run. If the files already exist you will get a message to that effect and no work is done. The input block says to input the "genome", which defaults to NC_045512.fasta. The output block builds a list of filenames to create using the multiext function: NC_045512.fasta.amb, NC_045512.fasta.ann, and so on. The conda block says which environment.yml file to use to create an environment. The shell block of the bwa_index rule specifies what to run. The {...} notation is replaced with the corresponding values, so in this case we would run bwa index -a is NC_045512.fasta.
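To make the output block concrete, this is what the multiext call evaluates to when config["genome"] has its default value (illustration only):

# with config["genome"] == "NC_045512.fasta"
multiext("NC_045512.fasta", ".amb", ".ann", ".bwt", ".pac", ".sa")
# == ["NC_045512.fasta.amb", "NC_045512.fasta.ann", "NC_045512.fasta.bwt",
#     "NC_045512.fasta.pac", "NC_045512.fasta.sa"]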

Trim and map example

Below is an example that processes fastq files in parallel, trimming and mapping them.

# set default config["genome"] if not specified
if "genome" not in config:
    config["genome"] = "NC_045512.fasta"

SAMPLES, = glob_wildcards("{basefilename}_R1_001.fastq.gz")

rule all:
    input: expand("{prefix}.sam", prefix=SAMPLES)

rule trim_adapters:
    output: "{prefix}_trimmed.fq.gz"
    input: "{prefix}.fastq.gz"
    conda: "environment.yml"
    shell:  "trim_galore --fastqc --nextera {input}"

rule map_with_bwa:
    output: "{sample}.sam"
    input:
        genome=config["genome"],
        readfile="{sample}_R1_001_trimmed.fq.gz"
    params:
        read_group_header=r"@RG\tID:ID_{sample}\tPU:PU_{sample}\tSM:{sample}\tLB:{sample}_R1_001"
    conda: "environment.yml"
    shell: "bwa mem -R \"{params.read_group_header}\" {input.genome} {input.readfile} > {output}"

The SAMPLES, = glob_wildcards line looks for sample names in the *_R1_001.fastq.gz files. The all (default) rule says to create a .sam file for each sample name found. The map_with_bwa rule doesn't require a separate script file like the Nextflow version, since we can pull out the data we need (for example the read group header) via params. I find the bwa mem ... line much easier to read here.
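To make the wildcard matching concrete, here is a small illustration assuming two hypothetical input files, SampleA_R1_001.fastq.gz and SampleB_R1_001.fastq.gz:

SAMPLES, = glob_wildcards("{basefilename}_R1_001.fastq.gz")
# SAMPLES == ["SampleA", "SampleB"]

rule all:
    input: expand("{prefix}.sam", prefix=SAMPLES)
    # expands to ["SampleA.sam", "SampleB.sam"]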

NOTE: Due to how Snakemake works we could combine the two examples above into one file, since Snakemake is smart enough to not recreate the index every time.
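A minimal sketch of what that could look like, assuming the bwa_index, trim_adapters, and map_with_bwa rules from the two examples live in the same Snakefile; only the map_with_bwa input block changes so that the index is declared as a dependency:

rule map_with_bwa:
    output: "{sample}.sam"
    input:
        genome=config["genome"],
        # declaring the index files as inputs makes Snakemake run bwa_index
        # only when the index is missing or older than the genome
        index=multiext(config["genome"], ".amb", ".ann", ".bwt", ".pac", ".sa"),
        readfile="{sample}_R1_001_trimmed.fq.gz"
    params:
        read_group_header=r"@RG\tID:ID_{sample}\tPU:PU_{sample}\tSM:{sample}\tLB:{sample}_R1_001"
    conda: "environment.yml"
    shell: "bwa mem -R \"{params.read_group_header}\" {input.genome} {input.readfile} > {output}"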

Resuming

You just re-run the same command to resume.

snakemake --cores 1 --use-conda

Slurm Compatibility

Like Nextflow, Snakemake submits individual sbatch jobs instead of array jobs. The --cores command line parameter specifies more precisely how many threads/cores to consume; with Nextflow, choosing queueSize is less precise and may cause larger allocations than desired for uneven workflows.

Tutorial

The snakemake documentation takes you through an example workflow that maps reads, sorts, indexes, and calls variants. It is certainly worth a read through: https://snakemake.readthedocs.io/en/stable/tutorial/basics.html

johnbradley commented 3 years ago

Snakemake seems the better option to me because:

  • The language seems more concise and simpler - IMO
  • It is easier to control parallelism on HARDAC (Slurm)
  • It has conda compatibility when running on the public cloud
  • Another group at Duke is using Snakemake

Thoughts @wodanaz ?

johnbradley commented 3 years ago

Something interesting that I think both tools can do is create a graphic of the workflow. For example, for the Snakemake trim and map example you can create an SVG like so:

snakemake --dag | dot -Tsvg > dag.svg

[Screenshot: DAG of the trim and map workflow] In this example I had three input *.fastq.gz files.

wodanaz commented 3 years ago

Snakemake seems the better option to me because

  • The language seems more concise and simpler - IMO
  • It is easier to control parallelism on HARDAC (Slurm)
  • It has conda compatibility when running on the public cloud
  • Another group at Duke is using Snakemake

Thoughts @wodanaz ?

snakemake seems to be a better option to me too and I think the options you mention are very convenient.

wodanaz commented 3 years ago

Something interesting that I think both tools can do is create a graphic of the workflow. For example, for the Snakemake trim and map example you can create an SVG like so:

snakemake --dag | dot -Tsvg > dag.svg

[Screenshot: DAG of the trim and map workflow] In this example I had three input *.fastq.gz files.

This is really cool

johnbradley commented 3 years ago

There is some guidance on how Snakemake suggests structuring your project: https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#distribution-and-reproducibility Following those guidelines we could have this workflow added to https://snakemake.github.io/snakemake-workflow-catalog (once our workflow is public, that is).

johnbradley commented 3 years ago

As a first step I want to add a simple Snakemake file to this repo that handles the work done by setup-escape-variants.sh. Then we can build up the Snakemake pipeline and remove the duplicated scripts once the Snakemake pipeline is complete.

johnbradley commented 3 years ago

Plan for running snakemake

Below is my plan for running snakemake. This plan is in keeping with our approach so far of making just the minimal changes necessary. This will allow us to continue to make changes to the existing scripts while enjoying the benefits of snakemake. After we have made the transition to snakemake we can revisit the many options snakemake has to further simplify and improve our workflow.

Removing sbatch Array Job Logic

Snakemake handles running multiple jobs at once without overwhelming the cluster using the --cores <number> argument, so we no longer need the sbatch array job logic. For example we would replace lines like

FILENAMES_FILE=$1
BAMFILE=$(awk NR==$SLURM_ARRAY_TASK_ID $FILENAMES_FILE)

with

BAMFILE=$1
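As a sketch (the rule, script, and file names here are hypothetical), Snakemake invokes the shell command once per matching file and substitutes the file name itself, so the script only needs its $1 argument:

rule run_step:
    input: "{sample}.bam"
    output: "{sample}.step.done"
    conda: "environment.yml"
    # snakemake fills in {input} with the bam file, replacing the
    # $SLURM_ARRAY_TASK_ID lookup from the old array job scripts
    shell: "bash scripts/some-step.sh {input} && touch {output}"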

Working Directory

Snakemake has the concept of a working directory where new files are created. By default the current directory is used as the working directory; a user can specify a different working directory with the --directory argument for snakemake. To me the snakemake working directory is the equivalent of the EVDIR directory the workflow already uses. Part of the changes here will be to stop using EVDIR, since snakemake handles this seamlessly. This will simplify the scripts and be closer to the original Escape_Variants.md markdown. For example the MarkDuplicates command will be changed from

picard -Xmx14g -Djava.io.tmpdir=$EVDIR MarkDuplicates I=$BAMFILE O=$EVDIR/$root.dedup.bam M=$EVDIR/$root.metric.txt

to

picard -Xmx14g -Djava.io.tmpdir=. MarkDuplicates I=$BAMFILE O=$root.dedup.bam M=$root.metric.txt

FYI: The "." passed to tmpdir is shorthand for the current directory.
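Put together, here is a sketch of how the simplified command could live in a rule (the rule name and wildcard are hypothetical); because all paths are relative, the outputs land in whatever working directory was chosen with --directory:

rule mark_duplicates:
    input: "{root}.bam"
    output:
        bam="{root}.dedup.bam",
        metric="{root}.metric.txt"
    conda: "environment.yml"
    shell:
        "picard -Xmx14g -Djava.io.tmpdir=. MarkDuplicates "
        "I={input} O={output.bam} M={output.metric}"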

Config File

Snakemake supports a YAML config file that contains all the settings for running the workflow. Currently we use environment variables to specify how to run the workflow, and you must check the logs to determine how to set up these environment variables to re-run it. To simplify re-running the entire workflow (or even just one step) I think we should leverage the snakemake config file. Here is an example config file:

project: assembledVirus
genome: /data/somedir/NC_045512.fasta
spike: /data/somedir/spike.bed
inputdir: /data/somedir/inputfastqdir
mode: campus
datetab: /data/somedir/Assembling_viruses/date.tab

For running the workflow I plan on putting the config file within the related snakemake working directory so there will be no question where to look for it.
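For illustration, a sketch of how the Snakefile could pick up these settings, assuming the config.yaml sits in the snakemake working directory (it could also be passed explicitly with --configfile); the rule content here is hypothetical:

configfile: "config.yaml"

GENOME = config["genome"]      # e.g. /data/somedir/NC_045512.fasta
INPUTDIR = config["inputdir"]  # e.g. /data/somedir/inputfastqdir

# hypothetical: find the sample names in the input directory and build
# a default rule from them
SAMPLES, = glob_wildcards(INPUTDIR + "/{sample}_R1_001.fastq.gz")

rule all:
    input: expand("{sample}.sam", sample=SAMPLES)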

Conda Environment Prefix

By default snakemake will automatically create and use conda environments when the --use-conda argument is passed; all we need to do is reference the conda environment yaml file from a rule. By default snakemake stores the conda environments in the working directory, so to avoid needlessly recreating the conda environment every time we run the workflow we will use the --conda-prefix <directory> argument.

Running snakemake on HARDAC

In addition to the arguments mentioned above, we need to specify the Slurm profile. The Slurm profile is just the slurm directory that is part of this repo. To run the pipeline you will need to be in the root directory of this repo. Then, given a working directory created at /data/itlab/data/Bradley123 that contains a config.yaml file, snakemake can be run like so:

srun -p interactive --pty bash
module load Anaconda3/2019.10-gcb02
conda activate snakemake
snakemake --cores 10 \
       --directory /data/itlab/data/Bradley123 \
       --use-conda --conda-prefix /data/itlab/snakemakeconda \
       --profile /data/itlab/work/Assembling_viruses/slurm

Details:

The above command assumes snakemake has already been installed into a conda environment named snakemake.

TLDR

I have a plan to switch to snakemake re-using the existing scripts. I would be glad to give a walkthrough of the code changes and running on HARDAC once the code changes are complete.

johnbradley commented 3 years ago

I have a branch containing a snakemake workflow that has all of the steps: https://github.com/wodanaz/Assembling_viruses/tree/snakemake-workflow. I still need to go over it and make sure everything is properly set up, but the general idea is all there.

I created a diagram of the workflow based on running the SampleD file through all the rules: dag.pdf

johnbradley commented 3 years ago

I cleaned up the snakemake branch and it is now at https://github.com/wodanaz/Assembling_viruses/tree/snakemake-clean. This branch includes changes to the run-dds-escape-variants.sh script to use the snakemake workflow. With this branch I tried to strike a compromise between the snakemake best practices and our current scripts, but some scripts and files have moved around. In the long term there are further changes I would like to implement from the snakemake best practices. @wodanaz Since this is such a big change I would like to walk you through trying it out on HARDAC before merging. Does that sound ok?

wodanaz commented 3 years ago

Yeah, it sounds good! When can you do the walkthrough?

johnbradley commented 3 years ago

@wodanaz I'll email you some times.