ZARP (Zavolab Automated RNA-seq Pipeline) is a generic RNA-Seq analysis workflow that allows users to process and analyze Illumina short-read sequencing libraries with minimum effort. Better yet: With our companion ZARP-cli command line interface, you can start ZARP runs with the simplest and most intuitive commands.
RNA-seq analysis doesn't get simpler than that!
ZARP relies on publicly available bioinformatics tools and currently handles single or paired-end stranded bulk RNA-seq data. The workflow is developed in Snakemake, a widely used workflow management system in the bioinformatics community.
ZARP will pre-process, align and quantify your single- or paired-end stranded bulk RNA-seq libraries with publicly available, state-of-the-art bioinformatics tools. ZARP's rich, browser-based reports and visualizations will give you meaningful initial insights into the quality and composition of your sequencing experiments - fast and simple. Whether you are an experimentalist struggling with large-scale data analysis or an experienced bioinformatician, when there's RNA-seq data to analyze, just zarp 'em!
Note: For a more detailed description of each step, please refer to the workflow documentation.
The workflow has been tested on:
NOTE: Currently, we only support Linux execution.
IMPORTANT: Rather than installing the ZARP workflow as described in this section, we recommend installing ZARP-cli for most use cases! If you follow its installation instructions, you can skip the instructions below.
Go to the desired directory/folder on your file system, then clone/get the repository and move into the respective directory with:
git clone https://github.com/zavolanlab/zarp.git
cd zarp
Workflow dependencies can be conveniently installed with the Conda package manager. We recommend that you install Miniconda for your system (Linux). Be sure to select the Python 3 option. The workflow was built and tested with `miniconda 4.7.12`. Other versions are not guaranteed to work as expected.
Given that Miniconda has been installed and is available in the current shell, the first dependency for ZARP is the Mamba package manager, which needs to be installed in the `base` Conda environment with:
conda install mamba -n base -c conda-forge
For improved reproducibility and reusability of the workflow, each individual step of the workflow runs either in its own Singularity container or in its own Conda virtual environment. As a consequence, running this workflow has very few individual dependencies. Container execution requires Singularity to be installed on the system where the workflow is run. As the functional installation of Singularity requires root privileges, and Conda currently only provides Singularity for Linux architectures, the installation instructions differ slightly depending on your system/setup:
If you do not have root privileges on the machine you want to run the workflow on or if you do not have a Linux machine, please install Singularity separately and in privileged mode, depending on your system. You may have to ask an authorized person (e.g., a systems administrator) to do that. This will almost certainly be required if you want to run the workflow on a high-performance computing (HPC) cluster.
NOTE: The workflow has been tested with the following Singularity versions:
v2.6.2
v3.5.2
After installing Singularity, install the remaining dependencies with:
mamba env create -f install/environment.yml
If you have a Linux machine as well as root privileges (e.g., if you plan to run the workflow on your own computer), you can execute the following command to include Singularity in the Conda environment:
mamba env update -f install/environment.root.yml
Activate the Conda environment with:
conda activate zarp
Most tests have additional dependencies. If you are planning to run tests, you will need to install these by executing the following command in your active Conda environment:
mamba env update -f install/environment.dev.yml
We have prepared several tests to check the integrity of the workflow and its components. These can be found in subdirectories of the `tests/` directory. The most critical of these tests enable you to execute the entire workflow on a set of small example input files. Note that for this and other tests to complete successfully, additional dependencies need to be installed.
Execute one of the following commands to run the test workflow on your local machine:
Test workflow on local machine with Singularity:
bash tests/test_integration_workflow/test.local.sh
Test workflow on local machine with Conda:
bash tests/test_integration_workflow_with_conda/test.local.sh
Execute one of the following commands to run the test workflow on a Slurm-managed high-performance computing (HPC) cluster:
Test workflow with Singularity:
bash tests/test_integration_workflow/test.slurm.sh
Test workflow with Conda:
bash tests/test_integration_workflow_with_conda/test.slurm.sh
NOTE: Depending on the configuration of your Slurm installation, you may need to adapt the file `slurm-config.json` (located directly under the `profiles` directory) and the arguments to the options `--cores` and `--jobs` in the file `config.yaml` of the respective profile. Consult the manual of your workload manager as well as the section of the Snakemake manual dealing with profiles.
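As an illustration, the concurrency limits in a profile's `config.yaml` might look like the following (the values are placeholders, not recommendations; each key maps to the Snakemake command-line option of the same name):

```yaml
# Hypothetical excerpt of profiles/<profile>/config.yaml
cores: 256  # total number of CPU cores Snakemake may use
jobs: 100   # maximum number of jobs submitted to the cluster at once
```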
Head over to ZARP-cli to learn how to start ZARP runs with very simple commands, like:
zarp SRR23590181
Assuming that your current directory is the workflow repository's root directory, create a directory for your workflow run and move into it with:
mkdir config/my_run
cd config/my_run
Create an empty sample table and a workflow configuration file:
touch samples.tsv
touch config.yaml
Use your editor of choice to populate these files with appropriate values. Have a look at the examples in the `tests/` directory to see what the files should look like, specifically:
For more details and explanations, refer to the pipeline-documentation.
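For orientation, a minimal `samples.tsv` with a single paired-end sample could look like the following (tab-separated; the sample name and file paths are placeholders, and the examples in `tests/` show the full set of supported columns):

```
sample	fq1	fq2
mysample1	data/mysample1_R1.fastq.gz	data/mysample1_R2.fastq.gz
```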
Create a runner script. Pick one of the following choices for either local or cluster execution. Before executing the respective command, remember to update the argument of the `--singularity-args` option of the respective profile (file: `profiles/{profile}/config.yaml`) so that it contains a comma-separated list of all directories containing input data files (samples, annotation files, etc.) required for your run.
Runner script for local execution:
cat << "EOF" > run.sh
#!/bin/bash
snakemake \
--profile="../../profiles/local-singularity" \
--configfile="config.yaml"
EOF
OR
Runner script for Slurm cluster execution (note that you may need to modify the arguments to `--jobs` and `--cores` in the file `profiles/slurm-singularity/config.yaml`, depending on your HPC and workload manager configuration):
cat << "EOF" > run.sh
#!/bin/bash
mkdir -p logs/cluster_log
snakemake \
--profile="../../profiles/slurm-singularity" \
--configfile="config.yaml"
EOF
Note: When running the pipeline with Conda, you should use the `local-conda` and `slurm-conda` profiles instead.

Note: The Slurm profiles are adapted to a cluster that uses the quality-of-service (QOS) keyword. If QOS is not supported by your Slurm instance, you have to remove all lines with "qos" in `profiles/slurm-config.json`.
Start your workflow run:
bash run.sh
An independent Snakemake workflow, `workflow/rules/sra_download.smk`, is included for downloading sequencing libraries from the Sequence Read Archive (SRA) and converting them into FASTQ.
The workflow expects the following parameters in the configuration file:

- `samples`, a sample table (TSV) with a column `sample` containing SRR identifiers (ERR and DRR are also supported); see the example
- `outdir`, an output directory
- `samples_out`, a pointer to a modified sample table with the locations of the corresponding FASTQ files
- `cluster_log_dir`, the cluster log directory

For executing the example with Conda environments, one can use the following command (from within the activated `zarp` Conda environment):
snakemake --snakefile="workflow/rules/sra_download.smk" \
--profile="profiles/local-conda" \
--config samples="tests/input_files/sra_samples.tsv" \
outdir="results/sra_downloads" \
samples_out="results/sra_downloads/sra_samples.out.tsv" \
log_dir="logs" \
cluster_log_dir="logs/cluster_log"
Alternatively, change the argument to `--profile` from `local-conda` to `local-singularity` to execute the workflow steps within Singularity containers.
After successful execution, `results/sra_downloads/sra_samples.out.tsv` should contain:
sample fq1 fq2
SRR18552868 results/sra_downloads/compress/SRR18552868/SRR18552868.fastq.gz
SRR18549672 results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq.gz results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq.gz
ERR2248142 results/sra_downloads/compress/ERR2248142/ERR2248142.fastq.gz
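Downstream steps distinguish single-end from paired-end samples by whether the `fq2` column is populated. The following standalone Python sketch (not part of ZARP) shows how such a table can be parsed:

```python
import csv
import io

# Example contents of results/sra_downloads/sra_samples.out.tsv (from above)
TSV = """sample\tfq1\tfq2
SRR18552868\tresults/sra_downloads/compress/SRR18552868/SRR18552868.fastq.gz
SRR18549672\tresults/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq.gz\tresults/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq.gz
ERR2248142\tresults/sra_downloads/compress/ERR2248142/ERR2248142.fastq.gz
"""

def classify(handle):
    """Map each sample to 'paired' or 'single', based on whether fq2 is set."""
    return {
        row["sample"]: "paired" if row.get("fq2") else "single"
        for row in csv.DictReader(handle, delimiter="\t")
    }

print(classify(io.StringIO(TSV)))
# → {'SRR18552868': 'single', 'SRR18549672': 'paired', 'ERR2248142': 'single'}
```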
An independent Snakemake workflow, `workflow/rules/htsinfer.smk`, is included that populates the `samples.tsv` required by ZARP with the sample-specific parameters `seqmode`, `f1_3p`, `f2_3p`, `organism`, `libtype` and `index_size`. These parameters are inferred from the provided `fastq.gz` files by HTSinfer.
Note: The workflow uses Snakemake's implicit temporary directory, which is accessed via `resources.tmpdir`.
The workflow expects the following configuration parameters:

- `samples`, a sample table (TSV) with a column `sample` containing sample identifiers, as well as columns `fq1` and `fq2` containing the paths to the input FASTQ files; see the example. If the table contains further ZARP-compatible columns (see the pipeline documentation), the values specified there by the user take priority over HTSinfer's results.
- `outdir`, an output directory
- `samples_out`, the path to a modified sample table with the inferred parameters
- `records`, set to 100000 by default

For executing the example, one can use the following (with the `zarp` Conda environment activated):
cd tests/test_htsinfer_workflow
snakemake \
--snakefile="../../workflow/rules/htsinfer.smk" \
--restart-times=0 \
--profile="../../profiles/local-singularity" \
--config outdir="results" \
samples="../input_files/htsinfer_samples.tsv" \
samples_out="samples_htsinfer.tsv" \
--notemp \
--keep-incomplete
However, this call will exit with an error, as not all parameters can be inferred from the example files. The argument `--keep-incomplete` ensures that the `samples_htsinfer.tsv` file can nevertheless be inspected.
After successful execution - if all parameters could either be inferred or were specified by the user - `[OUTDIR]/[SAMPLES_OUT]` should contain a populated table with the parameters `seqmode`, `f1_3p`, `f2_3p`, `organism`, `libtype` and `index_size` for all input samples, as described in the pipeline documentation.
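Since a ZARP run depends on these parameters being filled in, it can be useful to check the populated table before proceeding. A minimal sketch, not part of ZARP (the file name in the usage comment is hypothetical):

```python
import csv

# Parameters that HTSinfer is expected to populate (see above)
REQUIRED = ["seqmode", "f1_3p", "f2_3p", "organism", "libtype", "index_size"]

def missing_params(rows, required=REQUIRED):
    """Return {sample: [parameters that are empty or absent]}."""
    gaps = {}
    for row in rows:
        empty = [col for col in required if not (row.get(col) or "").strip()]
        if empty:
            gaps[row["sample"]] = empty
    return gaps

# Hypothetical usage on the populated table:
# with open("samples_htsinfer.tsv") as handle:
#     print(missing_params(csv.DictReader(handle, delimiter="\t")))
```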