zerodel / sailfish-cir

a pipeline for quantification of circular RNA.
14 stars 9 forks source link

Introduction

Sailfish-cir is a computational tool to estimate the relative abundance of circular RNA transcripts from high-throughput RNA-seq data.

It accepts output of circRNA identification tools (CIRI, KNIFE, circRNA_finder) or a BED-format file to specify the reference set of circular RNA transcripts in RNA-seq data. Then, it transforms all circular transcripts to pseudo-linear transcripts. Finally, it estimates the expression of both linear and circular transcripts using Sailfish framework.

Prerequisites

The following three tools should be installed before running Sailfish-cir.

1. Sailfish
   Sailfish (http://www.cs.cmu.edu/~ckingsf/software/sailfish/) is used to quantify RNA expression. Sailfish 0.7.0 or above is recommended.

2. gffread
   gffread is part of Cufflinks (http://cole-trapnell-lab.github.io/cufflinks/). This GFF/GTF parsing utility is used to extract sequence.

3. gffutils 
   gffutils (https://pypi.python.org/pypi/gffutils) is used to build a database for reference set of all RNA transcripts.

We assume these pre-installed tools are running under a Unix-like environment, and paths of gffread and sailfish binary executive should be added to your PATH variable.

Other than above third party tools, several data files are required for Sailfish-cir as well.

1. Raw sequencing reads file in fasta/fastq format;
2. A CIRI output file or a BED file;
3. Gene annotation file in .gtf format, and a single fasta file or multi-fasta files containing RNA sequences. 

Usage

python path/to/your/sailfish_cir.py -g path/to/your/genomic/sequence/foo.fa -a path/to/your/annotation/bar.gtf -1 path/to/your/reads/mate1.fastq -2 /path/to/your/reads/mate2.fastq -o /path/to/where/you/want/your/result -c /path/to/your/CIRI/output/file
-g  path to genomic sequence fasta file
-a  path to gene annotation file, ie, .gtf or .gff files
-r  path to single-end raw sequencing reads file.
-1  path to raw pair-end reads, mate 1
-2  path to raw pair-end reads, mate 2

-c  path to CIRI output file to specify circular RNA
--bed  path to bed file which contains circular RNA.
--circRNA_finder path to circRNA_finder output file.
--KNIFE_report_folder path to KNIFE output folder, make sure it has the subdirectory "circReads"

-o  output folder that contains the index built by sailfish and quantification results
-k  minimum match size used during sailfish quantification,   default is 21
--libtype   format string describing the library type of your reads. default is "IU", [read more on libtype of Sailfish](http://sailfish.readthedocs.org/en/master/library_type.html)
--mll mean library length, this option is to fix up the effective length.
-h/--help   print this help message

Output file

Sailfish-cir expression estimats can be found at a subdirectory named quant_circular under the path set by -o parameter.

Notes

  1. .gtf file

    In order to generate reference sequences of circular RNA transcripts, a database of your genomic annotation is needed. It is recommended to create the database file manually using the same filename with extension ".db" in the same directory. Since gffutils don't guess GTF file format, and GTF format was changed after GRCH37.75. It will be time-consuming when building the database using the same option for earlier gtf version.

  2. output file from circRNA identification tools

    This script accepts BED-format file or output files of circRNA identification tools (CIRI, circRNA_finder, KNIFE are supported) to specify the circular RNA transcripts.

    We also provide a small utility to convert those outputs to a BED-format file. usage as follows:

    python path/to/your/sailfish_cir.py convert_CIRI your/CIRI/output/file
    python path/to/your/sailfish_cir.py convert_KNIFE your/KNIFE/output/folder

    For CIRI, this will create a .bed file in the same folder. the 'name' (4th) column follows this "circular_transcript@host_gene" pattern. For KNIFE, this will create a "summarized_knife_junction.bed" under your KNIFE output folder. circRNA_finder output is actually BED-format.

    If you want to quantify circRNA from multiple circRNA detection results, convert those output files into BED format and merge them into a single .bed file, then use "--bed" option.

ChangeLog

Release v0.11

Release v0.10

Example files

Example files are available in MEGA

  1. Human gene Annotation file: hg19 gtf file OR binary database file by gffutils
  2. Genome sequence file: hg19 fasta file
  3. Sailfish (ver 0.9.0): Sailfish Linux binary
  4. RNA-seq simulation scripts simulation
  5. Simulated RNA-seq datasets 11 simulated datasets (uniform expresion distribution)

    one simulated dataset (empirical expression distribution)

    uniform sequencing error with error rate 0.01, script and other files

    uniform sequencing error with error rate 0.02, script and other files

    empirical error model: illumina 4, script and other files

    empirical error model: illumina 5, script and other files

    simulation size two fold, script and other files

    simulation size 4 fold, script and other files

    simulation using whole genome, script and other files

Reference

Musheng Li, Xueying Xie, Jing Zhou, Mengying Sheng, Xiaofeng Yin, Eun-A Ko, Tong Zhou and Wanjun Gu. Quantifying circular RNA expression from RNA-seq data using model-based framework. Bioinformatics, 2017, in press.

Contact

Musheng Li (zerodel@126.com)