The escape variants pipeline is based on Escape_Variants.md.
Snakemake 6.4.1 is required due to incompatible parameter changes in newer versions of Snakemake.
A config.sh file must be created by copying the example file into place:
cp example-config.sh config.sh
This config file contains environment variables used by the pipeline.
config.sh Environment Variables
- USE_MODULES - Controls whether the scripts use environment modules to provide software.
- FOREGROUND_MODE - Runs snakemake in the foreground, for use when sacct cannot be run on worker nodes.
- SM_CONDA_PREFIX - Overrides the conda cache directory; by default conda environments are created within the data directory.
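As a minimal sketch, a filled-in config.sh might look like the following (the values shown are assumptions; example-config.sh is the authoritative reference):
# config.sh - environment variables read by the pipeline scripts
USE_MODULES=Y          # provide software via environment modules
FOREGROUND_MODE=N      # assumed default: run snakemake in the background
SM_CONDA_PREFIX=/path/to/conda/cache   # hypothetical override of the conda cache directory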
Snakemake must be installed in a conda environment named snakemake. To create this environment using the conda command run:
conda create -n snakemake -c conda-forge -c bioconda mamba snakemake==6.4.1
On the HARDAC cluster the snakemake environment can be created on an interactive node like so:
module load Anaconda3/2019.10-gcb02
conda create -n snakemake -c conda-forge -c bioconda mamba snakemake==6.4.1
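To confirm the environment was created correctly, activate it and check that the expected version is reported:
conda activate snakemake
snakemake --version   # should print 6.4.1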
If not using environment modules, a conda environment named dds should be created by running:
conda env create -f dds-environment.yml
When using environment modules (USE_MODULES=Y) the ddsclient environment module will be used.
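Either way, a quick sanity check is to confirm the ddsclient command is available (activate the dds environment when not using modules):
conda activate dds
ddsclient --help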
The pipeline uses a data directory to organize input, conda environments, and output.
By default the pipeline stages data in from DDS so a project name is required.
The main pipeline script has two required arguments:
./run-escape-variants.sh -d <datadir> -i <projectname>
So to use "data" as the name of your data
directory and "CVTestSamples" as your project name run:
./run-escape-variants.sh -d data -i CVTestSamples
The above command will:
- download the CVTestSamples project from DDS into data/input/CVTestSamples
- run the snakemake pipeline, writing output to data/output/CVTestSamples_snakemake
- copy the output files from data/output/CVTestSamples_snakemake to data/output/CVTestSamples
- upload the output files to a DDS project named CVTestSamples_results
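After a successful run the data directory should look roughly like this (layout inferred from the steps above):
data/
  input/CVTestSamples/             staged-in fastq.gz files
  output/CVTestSamples_snakemake/  intermediate snakemake output
  output/CVTestSamples/            final output files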
If a problem occurs when running the pipeline, the same run-escape-variants.sh command can be run again after making a change (such as increasing the requested memory).
Snakemake will resume where it left off.
If you do not want to download and upload data to DDS, put the input fastq.gz files in the appropriate <datadir>/input/<projectname> directory and add the -s and -S arguments to run-escape-variants.sh.
- -s disables staging data in (downloading) from DDS
- -S disables staging data out (uploading) to DDS
For example, using data as the datadir and CVTestSamples as the project name, input files should be placed in data/input/CVTestSamples and the following command run:
./run-escape-variants.sh -d data -i CVTestSamples -s -S
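As a concrete sketch (the source path is hypothetical), staging the input yourself might look like:
mkdir -p data/input/CVTestSamples
cp /path/to/your/samples/*.fastq.gz data/input/CVTestSamples/
./run-escape-variants.sh -d data -i CVTestSamples -s -S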
The -e <email> argument will cause the run-escape-variants.sh script to send an email when the job completes.
For example:
./run-escape-variants.sh -d data -i CVTestSamples -e yourusername@example.com
By default the pipeline will delete the input directory and intermediate files/directories.
If you want to preserve the input and intermediate directories pass the -k argument.
You can see command line help by running ./run-escape-variants.sh without arguments.
The snakemake pipeline can be run directly, bypassing the helper scripts used above. This can be helpful when making changes to the pipeline.
All settings for the pipeline are stored in the config/config.yaml file.
Configuration options:
- genome - the reference genome; defaults to resources/NC_045512.fasta
- spike - the spike BED file; defaults to resources/spike.bed
- inputdir - the directory containing input files; defaults to "" (the base directory of this repo)
If inputdir is changed to a path other than "" then genome, spike, and datetab should be set to absolute paths. Otherwise the workflow will not be able to find these files.
Copy your *.fastq.gz samples into the base directory of this repo. From within the base directory of this repo run snakemake:
snakemake --cores 1 --use-conda
The above command will use one core and create conda environments for the tools used in the pipeline. The number of cores can be increased to improve throughput. After the command finishes the pipeline output files and directories will be in the base directory of this repo.
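Rather than editing config/config.yaml, Snakemake's standard --config flag can also override configuration values on the command line (the paths below are hypothetical):
snakemake --cores 1 --use-conda \
    --config inputdir=/absolute/path/to/fastqs \
        genome=/absolute/path/to/repo/resources/NC_045512.fasta \
        spike=/absolute/path/to/repo/resources/spike.bed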
The snakemake command creates multiple jobs and monitors their progress. Since the snakemake command runs in the foreground it is best to create an sbatch script to run it.
For example, to run with 10 concurrent processes, using conda for the required tools and the slurm profile, you could create an sbatch script named ev.sh with the following content:
#!/bin/bash
module load Anaconda3/2019.10-gcb02
conda activate snakemake
snakemake --cores 10 --use-conda --profile slurm/
Then run the sbatch script like so:
sbatch ev.sh
If the environment module is not available on your cluster, remove the module load Anaconda3/2019.10-gcb02 line and set up conda yourself.
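One hedged sketch of setting up conda without modules, using the standard Miniconda3 installer (the install location is an assumption):
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
source "$HOME/miniconda3/etc/profile.d/conda.sh"
# then create the snakemake environment as described above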
The pipeline can be run on a single Slurm node utilizing local storage. Using this mode may help if a Slurm cluster's shared storage is a bottleneck. When using this mode snakemake, conda environments, input data, and intermediate data will all be stored on local storage. Only the output files and top level logs will be stored on the shared storage. The top level log files must be on shared storage for the initial sbatch job to run.
Running in the single node mode requires special configuration.
To create the config files run ./setup-config.sh and answer the prompts.
This script creates/replaces the contents of the config.sh and smprofile/config.yaml files.
The script will prompt for the items needed to fill in these files.
Miniconda3 must be installed on the local (scratch) storage on the node to run the workflow. The snakemake and dds (DukeDSClient) environments must be created within that Miniconda3 installation.
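A minimal sketch of that setup, assuming /scratch/user1 is the local storage path used in the example below:
bash Miniconda3-latest-Linux-x86_64.sh -b -p /scratch/user1/miniconda3
source /scratch/user1/miniconda3/etc/profile.d/conda.sh
conda create -n snakemake -c conda-forge -c bioconda mamba snakemake==6.4.1
conda env create -f dds-environment.yml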
Running on a single node requires the optional -o output directory argument, which specifies the location for output files.
For example, if /scratch/user1/data is the local storage path and /hpc/group/mygroup/user1 is the shared path for output files, the pipeline could be run on project CVTestSamples with 120 jobs like so:
./run-escape-variants.sh -d /scratch/user1/data -o /hpc/group/mygroup/user1 -i CVTestSamples -j 120