demux

A pipeline for running single-cell demultiplexing simulations with demuxlet.

Introduction

demux is a Snakemake pipeline that simulates a multiplexed droplet scRNA-seq (dscRNA-seq) experiment from individual scRNA-seq samples and quantifies how well demuxlet recovers the sample identity of each cell in the simulated dataset. Such an analysis is helpful for reducing the cost of library preparations for dscRNA-seq experiments.

Here is an example flowchart depicting the demux pipeline with five input samples.

Each step is briefly described below:

unique_barcodes: aggregate cell barcodes across all samples provided as input and remove any cell barcodes that appear more than once
simulate: simulate a multiplexed dscRNA-seq experiment with a specified doublet rate (default: 0.3). The doublet rate specifies the fraction of cells from the aggregate dataset expected to be found in doublets. We define two types of doublets: (1) doublets containing cells from different samples, and (2) doublets containing cells from the same sample
table: create a reference table mapping the original (ground truth) barcodes to the new barcodes (for analyzing demuxlet performance)
new bam: edit the BAM files corresponding to each sample provided as input to reflect simulated doublets. For every pair of cells randomly selected to be in a doublet, we change the cell barcode of one cell in the pair to match that of the other cell.
merge: merge the edited BAM files into one BAM file to reflect a multiplexed experiment.
sort: sort the merged BAM file
demux: run demuxlet with the BAM file as input
results: analyze demuxlet performance
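
The first three steps can be sketched in plain Python. This is a minimal illustration and not the pipeline's actual scripts: the function names, the random pairing strategy, the toy barcodes, and the reading of the doublet rate as the fraction of cells placed into doublets are all assumptions made for the example.

```python
import random
from collections import Counter


def unique_barcodes(barcodes_by_sample):
    """Return (sample, barcode) pairs whose barcode occurs exactly once overall."""
    counts = Counter(bc for bcs in barcodes_by_sample.values() for bc in bcs)
    return [
        (sample, bc)
        for sample, bcs in barcodes_by_sample.items()
        for bc in bcs
        if counts[bc] == 1
    ]


def simulate_doublets(cells, rate=0.3, seed=0):
    """Pair up roughly a fraction `rate` of cells and build a reference table.

    Returns a dict mapping each original barcode to its new barcode.
    Within a pair, the second cell takes the barcode of the first.
    """
    rng = random.Random(seed)
    n_pairs = int(len(cells) * rate) // 2
    # Sampling without replacement: a cell joins at most one doublet.
    chosen = rng.sample(cells, 2 * n_pairs)
    table = {bc: bc for _, bc in cells}  # singlets keep their barcode
    for (_, keep), (_, change) in zip(chosen[::2], chosen[1::2]):
        table[change] = keep
    return table


# Hypothetical toy input: TTTT appears in both samples and is dropped.
barcodes = {
    "sampleA": ["AAAC", "AAAG", "AACC", "TTTT"],
    "sampleB": ["CCCA", "CCCG", "CCGG", "TTTT"],
}
cells = unique_barcodes(barcodes)
table = simulate_doublets(cells, rate=0.5)
```

Because pairs are drawn at random from the pooled cells, a pair may combine cells from different samples or from the same sample, which corresponds to the two doublet types defined above. The resulting table is what a BAM-editing step would consult to rewrite cell barcodes.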

Download

Execute the following command.

git clone https://github.com/zrcjessica/demux.git

Setup

Dependencies

The pipeline is written as a Snakefile which can be executed via Snakemake. We recommend installing version 5.18.0:

conda create -n snakemake -c bioconda -c conda-forge 'snakemake==5.18.0' --no-channel-priority

We highly recommend you install Snakemake via conda like this so that you can use the --use-conda flag when calling snakemake to let it automatically handle all dependencies of the pipeline. Otherwise, you must manually install the dependencies listed in the env files.

Input

demux minimally requires the following inputs, which must be specified in the config.yml file:

a list of individually processed samples
for each sample above, the following Cell Ranger outputs from the cellranger count pipeline:
- Barcoded BAM
- Cell barcodes from Filtered Feature-Barcode Matrix
a VCF file containing the genotypes of all samples listed above

See below for additional input parameters.

It is recommended to symlink your data into the gitignored data/ folder:

ln -s /path/to/your/data data

If you ever need to switch the input to a different dataset, you can just change the symlink path.

Output

demux returns a table summarizing the performance of demuxlet on the simulated data and a plot showing the precision-recall curves.
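
As a rough illustration of the metrics behind such a summary, the sketch below computes precision and recall from ground-truth and predicted labels. How the pipeline itself defines the positive class and its thresholds is not described here, so treating doublet calls as positives, the label strings, and the function name are assumptions.

```python
def precision_recall(truth, predicted, positive="doublet"):
    """Compute precision and recall for one class from paired label lists."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for five cells.
truth = ["singlet", "doublet", "doublet", "singlet", "doublet"]
predicted = ["singlet", "doublet", "singlet", "doublet", "doublet"]
precision, recall = precision_recall(truth, predicted)
```

A precision-recall curve could be traced by repeating this calculation while varying the score threshold used to turn demuxlet's output into calls.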

You can also symlink your output, if you think you might want to change it in the future:

ln -s /path/to/your/output out

Execution

Locally:

./run &

or on an SGE cluster:

qsub run

Executing the pipeline on your own data

You must modify the config.yml file to specify paths to your data. The config file is currently configured to run the pipeline on our data (in the git-ignored data/ folder). The config file contains the following variables:

`data`*

The data variable contains nested variables for each of your samples, with the paths to their corresponding BAM (reads) and filtered barcodes (barcodes) files (Cell Ranger output) as well as the sample's vcf_id.

`vcf`*

Path to the VCF file containing genotypes for all samples nested in the data variable.

`samples`

List the samples, chosen from those nested in the data variable, that you want to include as input to the demultiplexing simulation. If this line is omitted or commented out, all samples from the data variable will be used.

`rate`

Doublet rate to be used for demultiplexing simulations. Defaults to 0.3.

`out`

Path to directory in which to write output files. If not provided, defaults to out. The directory will be created if it does not already exist.

* Inputs required
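
To make the structure concrete, here is a sketch of the config as a Python dict (the form it would take once config.yml is parsed), together with the documented defaults applied. The sample names, file paths, and the `resolve_config` helper are hypothetical; only the keys `data`, `vcf`, `samples`, `rate`, `out`, `reads`, `barcodes`, and `vcf_id` come from the descriptions above.

```python
def resolve_config(config):
    """Check required entries and fill in the documented defaults."""
    for key in ("data", "vcf"):
        if key not in config:
            raise KeyError(f"missing required config entry: {key}")
    resolved = dict(config)
    # samples: defaults to every sample nested under `data`
    resolved["samples"] = config.get("samples") or list(config["data"])
    resolved["rate"] = config.get("rate", 0.3)  # doublet rate
    resolved["out"] = config.get("out", "out")  # output directory
    return resolved


# Hypothetical example values; paths and IDs are placeholders.
config = {
    "data": {
        "sample1": {
            "reads": "data/sample1/possorted_genome_bam.bam",
            "barcodes": "data/sample1/barcodes.tsv.gz",
            "vcf_id": "S1",
        },
        "sample2": {
            "reads": "data/sample2/possorted_genome_bam.bam",
            "barcodes": "data/sample2/barcodes.tsv.gz",
            "vcf_id": "S2",
        },
    },
    "vcf": "data/genotypes.vcf.gz",
}
resolved = resolve_config(config)
```

With only the two required entries given, `resolved` uses both samples, a doublet rate of 0.3, and the output directory out.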
