A pipeline for running single-cell demultiplexing simulations with demuxlet.
demux is a Snakemake pipeline for simulating a multiplexed droplet scRNA-seq (dscRNA-seq) experiment using data from individual scRNA-seq samples and quantifying how effectively demuxlet deconvolutes the sample identity of each cell in the simulated dataset. Such an analysis is helpful for reducing the cost of library preparation for dscRNA-seq experiments.
Here is an example flowchart depicting the demux pipeline with five input samples.
Each step is briefly described below:
Clone the repository with the following command:
git clone https://github.com/zrcjessica/demux.git
The pipeline is written as a Snakefile which can be executed via Snakemake. We recommend installing version 5.18.0:
conda create -n snakemake -c bioconda -c conda-forge 'snakemake==5.18.0' --no-channel-priority
We highly recommend installing Snakemake via conda like this so that you can use the --use-conda flag when calling snakemake, which lets it automatically handle all dependencies of the pipeline. Otherwise, you must manually install the dependencies listed in the env files.
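demux minimally requires the following inputs, which must be specified in the config.yml file: a VCF file containing genotypes for all of your samples and, for each sample, the BAM and filtered barcodes files output by the cellranger count pipeline.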
See below for additional input parameters.
It is recommended to symlink your data into the gitignored data/ folder:
ln -s /path/to/your/data data
If you ever need to switch the input to a different dataset, you can just change the symlink path.
demux returns a table summarizing the performance of demuxlet on the simulated data and a plot showing the precision-recall curves.
You can also symlink your output, if you think you might want to change it in the future:
ln -s /path/to/your/output out
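As a rough illustration of what a precision-recall summary measures in this setting, the sketch below computes one (threshold, precision, recall) point per confidence threshold. It assumes each simulated cell has a known true sample, a predicted sample, and a confidence score for that prediction. The function name, the data layout, and the exact definitions of precision and recall are hypothetical choices made for this example; they are not taken from demux or demuxlet.

```python
def precision_recall_points(true_labels, predicted_labels, scores, thresholds):
    """Return a list of (threshold, precision, recall) tuples.

    Assumed definitions for this sketch:
      - a cell is "called" at threshold t if its score is >= t
      - precision = correctly called cells / called cells
      - recall    = correctly called cells / all cells
    """
    if not (len(true_labels) == len(predicted_labels) == len(scores)):
        raise ValueError("all inputs must have one entry per cell")

    total = len(true_labels)
    points = []
    for threshold in thresholds:
        # Keep only the cells whose prediction is confident enough.
        called = [
            (truth, predicted)
            for truth, predicted, score in zip(true_labels, predicted_labels, scores)
            if score >= threshold
        ]
        correct = sum(1 for truth, predicted in called if truth == predicted)
        # With no called cells there are no false positives, so report 1.0.
        precision = correct / len(called) if called else 1.0
        recall = correct / total if total else 0.0
        points.append((threshold, precision, recall))
    return points


# Made-up example: five cells from three samples.
truth = ["A", "A", "B", "B", "C"]
predicted = ["A", "B", "B", "B", "C"]
scores = [0.99, 0.60, 0.95, 0.80, 0.70]

points = precision_recall_points(truth, predicted, scores, thresholds=[0.5, 0.9])
for threshold, precision, recall in points:
    print(f"threshold={threshold} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold typically trades recall for precision, which is the trade-off a precision-recall curve visualizes.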
Locally:
./run &
or on an SGE cluster:
qsub run
You must modify the config.yml file to specify paths to your data. The config file is currently configured to run the pipeline on our data (in the gitignored data/ folder). The config file contains the following variables:
data*
The data variable contains nested variables for each of your samples, with the paths to their corresponding BAM (reads) and filtered barcodes (barcodes) files (Cell Ranger output) as well as the sample's vcf_id.
vcf*
The path to the VCF file containing genotypes for all samples nested in the data variable.
samples
A list of the samples, from those nested in the data variable, to include as input to the demultiplexing simulation. If this line is omitted or commented out, all samples from the data variable will be used.
rate
Doublet rate to be used for demultiplexing simulations. Defaults to 0.3.
out
Path to the directory in which to write output files. If not provided, defaults to out. The directory will be created if it does not already exist.
* Required input
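To make the variables and defaults described above concrete, here is a minimal Python sketch of how a parsed config.yml could be checked and completed. The sample names, file paths, and the resolve_config helper are all hypothetical and exist only for illustration; the pipeline's own Snakefile may handle its configuration differently.

```python
from copy import deepcopy

# Hypothetical parsed contents of config.yml (sample names and paths are made up).
example_config = {
    "data": {
        "sample1": {
            "reads": "data/sample1/possorted_genome_bam.bam",
            "barcodes": "data/sample1/barcodes.tsv.gz",
            "vcf_id": "S1",
        },
        "sample2": {
            "reads": "data/sample2/possorted_genome_bam.bam",
            "barcodes": "data/sample2/barcodes.tsv.gz",
            "vcf_id": "S2",
        },
    },
    "vcf": "data/genotypes.vcf.gz",
    # "samples", "rate" and "out" are left out to show the documented defaults.
}


def resolve_config(config):
    """Return a copy of config with the documented defaults filled in."""
    resolved = deepcopy(config)

    # data and vcf are the two required inputs.
    for key in ("data", "vcf"):
        if not resolved.get(key):
            raise ValueError(f"config is missing required input: {key}")

    # Each sample needs its BAM (reads), filtered barcodes and vcf_id.
    for name, entry in resolved["data"].items():
        missing = [f for f in ("reads", "barcodes", "vcf_id") if f not in entry]
        if missing:
            raise ValueError(f"sample {name} is missing: {', '.join(missing)}")

    # samples defaults to every sample nested under data.
    if not resolved.get("samples"):
        resolved["samples"] = list(resolved["data"])
    unknown = [s for s in resolved["samples"] if s not in resolved["data"]]
    if unknown:
        raise ValueError(f"samples not found in data: {', '.join(unknown)}")

    # rate defaults to 0.3 and out defaults to the out directory.
    if resolved.get("rate") is None:
        resolved["rate"] = 0.3
    if resolved.get("out") is None:
        resolved["out"] = "out"
    return resolved


resolved = resolve_config(example_config)
print(resolved["samples"], resolved["rate"], resolved["out"])
```

Working on a deep copy leaves the original dictionary untouched, so the same parsed config can be inspected before and after the defaults are applied.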
The Snakefile is a Snakemake pipeline for running the demultiplexing simulation.
The config.yml file defines options and input for the pipeline.
Various scripts used by the pipeline. See the script README for more information.
The dependencies of our pipeline, specified as conda environment files. These are used by Snakemake to automatically install our dependencies at runtime.
The run file is an example bash script for executing the pipeline using snakemake and conda. Any arguments to this script are passed directly to snakemake.