This version of Longcell-pre has been deprecated. For the newest version, please refer to https://github.com/yuntianf/LongcellPre.git
Longcell-pre is a pipeline for analyzing Nanopore long-read sequencing datasets generated with the 10X single-cell sequencing toolkit. The pipeline preprocesses reads by assigning cell barcodes and unique molecular identifiers (UMIs) to give an accurate isoform quantification. Based on the isoform quantification from Longcell-pre, our companion pipeline Longcell performs downstream splicing analysis, including identification of highly variable exons and differential alternative splicing analysis between cell populations.
requires:
install.packages("knapsack", repos="http://R-Forge.R-project.org")
git clone https://github.com/yuntianf/Longcell-pre.git
cd ./Longcell-pre/scripts/
dos2unix Longcell-pre.sh
chmod a+x Longcell-pre.sh
cd ./BarcodeMatch/
g++ -O2 BarcodeMatch.cpp bc.cpp edit.cpp normal.cpp -o BarcodeMatch
Rscript ./Auxiliary/gtf2bed.R -g $gtf -o $bed_folder
This step will transform the isoform annotation in gtf into non-overlapping sub-exons and save it as a bed file for each gene. The output is a table with 5 columns, including:
required:
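The idea of non-overlapping sub-exons from the gtf-to-bed step can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the logic of gtf2bed.R: the function name `split_into_sub_exons` is hypothetical, and coordinates are treated as half-open (start, end) pairs.

```python
def split_into_sub_exons(exons):
    """Split possibly overlapping exons into non-overlapping sub-exons.

    `exons` is a list of (start, end) pairs with start < end.
    Assumption: half-open coordinates; the real annotation may differ.
    """
    # Every exon boundary is a potential sub-exon boundary.
    breakpoints = sorted({pos for start, end in exons for pos in (start, end)})
    sub_exons = []
    for left, right in zip(breakpoints, breakpoints[1:]):
        # Keep a segment only if at least one annotated exon covers it.
        if any(start <= left and right <= end for start, end in exons):
            sub_exons.append((left, right))
    return sub_exons


# Two isoforms of a hypothetical gene share an exon with different 3' ends.
exons = [(100, 200), (150, 250), (400, 500)]
print(split_into_sub_exons(exons))
```

Here the overlapping exons (100, 200) and (150, 250) are cut at every boundary, while the intronic gap between 250 and 400 is dropped because no exon covers it.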
./Longcell-pre.sh -b $bam_file -d $bed_dir -w $barcode_whitelist -c $cores_num -o $out_dir
This is an integrated pipeline that directly generates single cell isoform quantification from the bam. The output includes three folders:
required:
optional:
python ./SoftclipsExon/softclip_splicesite.py -b $bam -t $toolkit -g $bed -o $outdir/exon_reads/
This step extracts reads from the bam within the designated region annotated in the bed file. A single call usually extracts reads for just one gene, so the bash script uses GNU parallel to traverse all bed files.
find $bed_folder -name EN*.bed | parallel -j $cores python ./SoftclipsExon/softclip_splicesite.py -b $bam -t $toolkit -g {} -o $outdir/exon_reads/
find $outdir/exon_reads/ -name "*" -type f -size 0c | xargs -n 1 rm -f
cat $outdir/exon_reads/EN*.bed > $outdir/exon_reads/exon_reads.txt
required:
The output exon_reads.txt is a table with 7 columns, including:
cut -f 1,2 $outdir/exon_reads/exon_reads.txt > $outdir/softclips/softclips.txt
python ./BarcodeMatch/BarcodeMatch.py -q $outdir/softclips/softclips.txt -c $barcodes -o "$outdir/barcode_match/bc.txt" -co $cores
This step identifies the cell barcode in the softclips of the long reads, using the barcode whitelist as the reference. Two methods are applied to speed this process up, and the intersection of their results gives the highest accuracy.
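To make the matching problem concrete, here is a brute-force Python sketch of finding a whitelist barcode inside a softclip while tolerating sequencing errors. It is not one of the two accelerated methods used by BarcodeMatch; the function names and the `max_dist` parameter are hypothetical and chosen only for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def match_barcode(softclip, whitelist, max_dist=1):
    """Return whitelist barcodes found in `softclip` within `max_dist` edits."""
    hits = set()
    for barcode in whitelist:
        k = len(barcode)
        # Slide a barcode-sized window along the softclip.
        for i in range(len(softclip) - k + 1):
            if edit_distance(softclip[i:i + k], barcode) <= max_dist:
                hits.add(barcode)
                break
    return sorted(hits)


# Toy whitelist; the softclip carries the first barcode with one mismatch.
whitelist = ["ACGTACGT", "TTTTCCCC"]
print(match_barcode("GGGACGTTCGTAAA", whitelist))
```

A read for which this function returns no barcode or more than one barcode is ambiguous, which is the kind of read filtered out in the following merge step.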
required:
optional:
The output bc.txt is a table with 2 columns, including:
Rscript ./BarcodeMatch/barcode_merge.R $outdir/barcode_match/bc.txt $outdir/exon_reads/exon_reads.txt $num $outdir/sub_cell_exon/
This step merges the identified barcodes with their corresponding reads and filters out reads with no barcode or more than one barcode. The output is split into subfiles for parallelization in the UMI deduplication step. As UMI deduplication treats the gene as the minimal unit, the number of subfiles cannot exceed the number of genes.
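The filtering and splitting logic of this step can be sketched in Python as follows. This is a simplified illustration, not the code of barcode_merge.R: the function name, the dictionary-based input layout, and the round-robin assignment of genes to subfiles are all assumptions.

```python
from collections import defaultdict


def merge_and_split(barcode_hits, read_genes, n_files):
    """Attach barcodes to reads, drop ambiguous reads, and bin reads by gene.

    barcode_hits: dict read_id -> list of matched barcodes (hypothetical layout)
    read_genes:   dict read_id -> gene name (hypothetical layout)
    n_files:      requested number of subfiles
    """
    # Keep only reads with exactly one matched barcode.
    kept = {read: hits[0] for read, hits in barcode_hits.items()
            if len(hits) == 1 and read in read_genes}

    genes = sorted({read_genes[read] for read in kept})
    # A gene is never split across subfiles, so cap the count at the gene count.
    n_files = max(1, min(n_files, len(genes)))
    gene_to_file = {gene: i % n_files for i, gene in enumerate(genes)}

    subfiles = defaultdict(list)
    for read, barcode in sorted(kept.items()):
        gene = read_genes[read]
        subfiles[gene_to_file[gene]].append((read, barcode, gene))
    return dict(subfiles)


barcode_hits = {"r1": ["AAAC"], "r2": [], "r3": ["AAAC", "GGGT"], "r4": ["GGGT"]}
read_genes = {"r1": "GENE_A", "r2": "GENE_A", "r3": "GENE_B", "r4": "GENE_B"}
print(merge_and_split(barcode_hits, read_genes, n_files=5))
```

Although five subfiles are requested, only two are produced because the toy data contains two genes, which mirrors the constraint described above.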
required:
The output sub_cell_exon.id.txt is a table with 9 columns, including:
Rscript $script -c $outdir/sub_cell_exon/sub_cell_exon.*.txt -u $UMI_len -s $thresh -o $outdir/cell_gene_splice_count/
This step performs UMI deduplication for each gene in each single cell. Correction for wrong mapping and truncation is also embedded in this step. The step loops over all sub_cell_exon.*.txt files output from step 4, so it is simple to parallelize. As correction for wrong mapping should integrate UMI clusters from all cells, the minimal unit in parallelization is a gene across all cells.
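As a rough illustration of UMI deduplication, the Python sketch below groups reads by cell and gene and greedily clusters similar UMIs, counting each cluster as one molecule. It is a simplification under stated assumptions: it uses Hamming distance on equal-length UMIs, the `max_mismatch` parameter is hypothetical, and it does not include the wrong-mapping or truncation correction that the actual step performs.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def dedup_umis(reads, max_mismatch=1):
    """Count distinct molecules per (cell, gene) by greedy UMI clustering.

    reads: iterable of (cell, gene, umi) tuples with equal-length UMIs.
    """
    groups = defaultdict(Counter)
    for cell, gene, umi in reads:
        groups[(cell, gene)][umi] += 1

    counts = {}
    for key, umi_counts in groups.items():
        centers = []
        # Visit abundant UMIs first so they become cluster centers.
        for umi, _ in sorted(umi_counts.items(), key=lambda kv: (-kv[1], kv[0])):
            if not any(hamming(umi, c) <= max_mismatch for c in centers):
                centers.append(umi)
        counts[key] = len(centers)
    return counts


reads = [
    ("cell1", "GENE_A", "AACCGG"),
    ("cell1", "GENE_A", "AACCGG"),
    ("cell1", "GENE_A", "AACCGT"),  # one mismatch from AACCGG, same molecule
    ("cell1", "GENE_A", "TTTTTT"),
    ("cell2", "GENE_A", "AACCGG"),
]
print(dedup_umis(reads))
```

Because each (cell, gene) group is processed independently in this sketch, groups belonging to different genes could be handled by different workers, in line with the gene being the minimal unit of parallelization.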
required:
optional:
The output sub_cell_gene_splice_count.*.txt is a table with 7 columns, including:
|). Each exon is represented by its start and end sites (separated by ",").
Rscript ./spliceob/createExonList.R $outdir/cell_gene_splice_count/ $bed_folder/gene_bed.rds $outdir/cell_gene_exon_count/sub_cell_gene_exon_count.*.txt
awk 'FNR>1 || NR==1' $outdir/cell_gene_exon_count/sub_cell_gene_exon_count.*.txt > $outdir/cell_gene_exon_count/cell_gene_exon_count.txt
This step transforms the exon bins into exon ids according to the input bed annotation. The bed annotation can be canonical or built from the data. This step loops over all files in the input folder, which is the output of step 5, so it is also parallelized with GNU parallel.
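The mapping from exon bins to exon ids can be sketched in Python as below. This is an assumption-laden illustration rather than the logic of createExonList.R: the function name is hypothetical, the isoform string layout ("start,end" pairs joined by "|") is assumed, and a bin is mapped to every annotated exon it overlaps on closed intervals.

```python
def bins_to_exon_ids(isoform, annotation):
    """Map an isoform given as exon bins to annotated exon ids.

    isoform:    string such as "100,200|400,500" (assumed layout)
    annotation: list of (exon_id, start, end) for one gene (assumed layout)
    """
    ids = []
    for exon_bin in isoform.split("|"):
        start, end = (int(x) for x in exon_bin.split(","))
        for exon_id, a_start, a_end in annotation:
            # Closed intervals overlap when each starts before the other ends.
            if start <= a_end and a_start <= end and exon_id not in ids:
                ids.append(exon_id)
    return ids


# Toy annotation with three sub-exons for a hypothetical gene.
annotation = [(1, 100, 150), (2, 151, 200), (3, 400, 500)]
print(bins_to_exon_ids("100,200|400,500", annotation))
```

In this toy case the first bin spans two annotated sub-exons, so the isoform is expressed as three exon ids.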
required:
The gene_bed.rds output from step 1 can be used directly.
The output sub_cell_gene_exon_count.*.txt is generally the same as sub_cell_gene_splice_count.*.txt, except for the representation of isoforms.
Rscript ./spliceob/saveExonList.R $outdir/cell_gene_exon_count/cell_gene_exon_count.txt $outdir/cell_gene_exon_count/
required:
This step stores the single cell isoform expression as a sparse matrix to save memory, which is also the input format for Longcell.
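To show why a sparse representation saves memory, the Python sketch below converts isoform counts into coordinate (COO) triplets that store only nonzero entries. It is a generic illustration and not the storage format written by saveExonList.R; the function name and the input layout are hypothetical.

```python
def to_sparse(counts):
    """Convert {(isoform, cell): count} into coordinate (COO) triplets.

    Only nonzero entries are stored, along with the row and column labels.
    """
    rows = sorted({isoform for isoform, _ in counts})
    cols = sorted({cell for _, cell in counts})
    row_index = {name: i for i, name in enumerate(rows)}
    col_index = {name: j for j, name in enumerate(cols)}
    triplets = [(row_index[r], col_index[c], n)
                for (r, c), n in sorted(counts.items()) if n != 0]
    return rows, cols, triplets


counts = {("iso1", "cellA"): 3, ("iso2", "cellB"): 1, ("iso1", "cellB"): 0}
rows, cols, triplets = to_sparse(counts)
print(rows, cols, triplets)
```

Since most isoforms are not observed in most cells, the number of stored triplets is typically far smaller than the number of cells times the number of isoforms.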
For a tutorial on downstream alternative splicing analysis, please refer to the vignette:
If you use Longcell for published work, please cite our manuscript:
Single cell and spatial alternative splicing analysis with long read sequencing
Yuntian Fu, Heonseok Kim, Jenea I. Adams, Susan M. Grimes, Sijia Huang, Billy T. Lau, Anuja Sathe, Paul Hess, Hanlee P. Ji, Nancy R. Zhang
bioRxiv 2023.02.23.529769; doi: https://doi.org/10.1101/2023.02.23.529769