wyang17 / SQuIRE

Software for Quantifying Interspersed Repeat Expression
Other
49 stars 30 forks source link

Software for Quantifying Interspersed Repeat Expression

Installation

SQuIRE Pipeline Overview

SQuIRE Pipeline Options

FAQs

Example Pipeline

Read about SQuIRE in:

Wan R Yang, Daniel Ardeljan, Clarissa N Pacyna, Lindsay M Payer, Kathleen H Burns; SQuIRE reveals locus-specific regulation of interspersed repeat expression, Nucleic Acids Research, , gky1301, https://doi.org/10.1093/nar/gky1301

Installation

SQuIRE is available on bioconda and can be installed using conda. We suggest running conda with mamba for speedup.

Conda is a package manager that installs and runs packages and their dependencies. Conda also creates virtual environments and allows users to switch between those environments. Mamba is a reimplementation of conda in C++ with faster dependency solving.

The instructions below install conda, mamba, and a conda virtual environment for SQuIRE.

  1. Download Miniconda (a lightweight distribution of conda) from https://conda.io/miniconda.html

    • wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    • Documentation will appear as the software downloads
  2. Run the Miniconda installer

    • bash Miniconda3-latest-Linux-x86_64.sh
    • Press ENTER key to review the Miniconda license
    • Type yes to approve the license terms
    • Pres ENTER key to confirm install locatino (or enter a preferred location)
    • Type yes to add Miniconda2 into your PATH
  3. Restart shell

    • exec $SHELL
  4. Configure conda channel priority

    • Add conda channels in the following order:
      conda config --add channels defaults 
      conda config --add channels bioconda
      conda config --add channels conda-forge
    • Confirm that the channels are listed in the following order by running cat $(dirname $CONDA_PREFIX)/.condarc:
      channels:
       - conda-forge
       - bioconda
       - defaults
      • Each channel has a higher priority than those listed below it. Conda will install packages from a higher priority channel over any version from a lower priority channel by default.
    • Check that channel priority is set to flexible with conda config --describe channel_priority
      • Channel priority should be set to flexible by default. If it is not, run conda config --set channel_priority flexible
  5. Install mamba in the base environment

    • conda install mamba -n base -c conda-forge
  6. Create a new environment and install SQuIRE!

    • mamba create -n squire -c bioconda squire
    • To begin using SQuIRE, activate the environment: conda activate squire

Many thanks to Rohini Gadde for setting up the SQuIRE Bioconda package!

Notes

*SQuIRE is compatible with the following specific versions of software:*

If installing these software with conda is unsuccessful, we recommend installing these versions with squire Build to ensure compatibility with SQuIRE.

Pipeline Overview

Preparation Stage 1) Fetch: Downloads input files from RefGene and generates STAR index Only needs to be done once initially to acquire genomic input files or if a new build is desired.

2) Clean: Filters Repeatmasker file for Repeats of interest, collapses overlapping repeats, and returns as BED file.

Optional: Incorporation of non-reference TE sequence

Quantification Stage

1) Map: Aligns RNAseq data

2) Count: Quantifies RNAseq reads aligning to TEs

Analysis Stage

1) Call: Compiles and outputs differential expression from multiple alignments

Follow-up Stage 1) Draw: Creates BEDgraphs from RNAseq data

2) Seek: Reports individual transposable element sequences

An example pipeline with sample scripts is described here.

Arguments for each step

squire Build

Use Build only if conda create does not successfully install software.

Arguments:
-b, --folder Destination folder for downloaded UCSC file(s). Optional; default='squire_build'
-s , --software Install required SQuIRE software and add to PATH - specify 'all' or provide comma-separated list (no spaces) of: STAR,bedtools,samtools,stringtie. Optional; default = False
-v, --verbosity Want messages and runtime printed to stderr. Optional.

Preparation Stage

squire Fetch

Arguments
-h, --help show this help message and exit
-b , --build UCSC designation for genome build, eg. 'hg38'
-o , --fetch_folder Destination folder for downloaded UCSC file(s), default folder is 'squire_fetch'
-f, --fasta Download chromosome fasta files for build chromosomes. Optional
-c, --chrom_info Download chrom_info.txt file with chromosome lengths. Optional
-r, --rmsk Download Repeatmasker file. Optional
-g, --gene Download UCSC gene annotation. Optional
-x, --index Create STAR index (WARNING: will take a lot of time and memory!), optional
-p , --pthreads Launch parallel threads. Optional, default = 1
-k, --keep Keep downloaded compressed files. Optional, default = False
-v, --verbosity Print messages and runtime records to stderr. Optional; default = False

squire Clean

Arguments
-h, --help show this help message and exit
-r , --rmsk Repeatmasker file, default will search 'squire_fetch' folder for rmsk.txt or .out file. Optional
-b , --build UCSC designation for genome build, eg. 'hg37'
-o , --clean_folder Destination folder for output BED file, default folder is 'squire_clean'
-c , --repclass Comma-separated list of desired repeat classes (AKA superfamilies), eg 'DNA,LTR'. Column 12 in repeatmasker file. Can use UNIX wildcard patterns. Optional
-f , --family Comma-separated list of desired repeat families, eg 'ERV1,ERVK,ERVL'. Column 13 on repeatmasker file. Can use UNIX wildcard patterns. Optional
-s , --subfamily Comma-separated list of desired repeat subfamilies, eg 'L1HS,AluYb'. Column 11 in repeatmasker file. Can use UNIX wildcard patterns. Optional
-e , --extra Filepath of extra tab-delimited file containing non-reference repeat sequences. Columns should be chr, start, stop, strand, subfamily, and sequence. Optional; default = False
-v, --verbosity Print messages and runtime records to stderr. Optional; default = False

Non-reference File Format

For known TE sequences that are not included in the reference genome, a tab-delimited file can be provided to SQuIRE to incorporate the non-reference TEs into the analysis. This file can be inputted into the Map and Clean steps with the --extra parameter.

The following information should be included in the file:

  1. Chromosome or Plasmid Identification
    • SQuIRE will add an identifier with an underscore "_" and the insertion type to distinguish the annotation from the reference genome.
  2. Insertion Start
    • 0-based numerical start location of the non-reference repeat.
  3. Insertion End
    • 0-based numerical end location. For chromosome insertions, this will only be one base different from Insertion Start.
  4. Strand
    • + or - Orientation of 'sense' strand of TE annotation.
  5. TE classification
    • Provide TE Subfamily, Family and Order, separated by colons ":".
  6. Insertion Type
    • Must be one of: polymorphic insertion, novel insertion, plasmid, or transgene.
  7. Left-Flank Sequence
    • Flanking sequence before the TE insertion.
  8. Right-Flank Sequence
    • Flanking sequence after the TE insertion.
  9. TE Sequence
    • Non-reference TE sequence.
Example File

Quantification Stage

squire Map

Arguments
-h, --help show this help message and exit
-1 , --read1 RNASeq data fastq file(s); read1 if providing paired end data. If more than one file, separate with commas, no spaces. Can be gzipped.
-2 , --read2 RNASeq data read2 fastq file(s). if more than one file, separate with commas, no spaces. Can be gzipped. Optional if unpaired data.
-o , --map_folder Destination folder for output files. Optional, default = 'squire_map'
-f ,--fetch_folder Folder location of outputs from SQuIRE Fetch (optional, default = 'squire_fetch'
-r , --read_length Read length (if trim3 selected, after trimming; required)
-n , --name Common basename for RNAseq input. Optional, default = basename of read1

-b ,--build , UCSC designation for genome build, eg. 'hg38' (required if more than 1 build in clean_folder) -3 , --trim3 | Trim bases from right end of each read before alignment. Optional; default = 0 -e , --extra | Filepath of text file containing non-reference repeat sequence and genome information. Optional, default = False -g , --gtf | Optional GTF of genome transcripts. For those interested in gene transcription -p , --pthreads | Launch parallel threads. Optional, default = '1' -v, --verbosity | Print messages and runtime records to stderr. Optional; default = False

squire Count

Arguments:
-h, --help show this help message and exit
-m , --map_folder Folder location of outputs from SQuIRE Map (optional,default = 'squire_map')
-c , --clean_folder Folder location of outputs from SQuIRE Clean (optional, default = 'squire_clean')
-o , --count_folder Destination folder for output files(optional, default = 'squire_count')
-t , --tempfolder Folder for tempfiles (optional; default=count_folder')
-f , --fetch_folder Folder location of outputs from SQuIRE Fetch (optional, default = 'squire_fetch')
-r , --read_length Read length (if trim3 selected, after trimming; required).
-n , --name Common basename for input files (required if more than one bam file in map_folder)
-b , --build UCSC designation for genome build, eg. 'hg38' (required if more than 1 build in clean_folder)
-p , --pthreads Launch parallel threads(optional; default='1')
-s , --strandedness '0' if unstranded eg Standard Illumina, 1 if first- strand eg Illumina Truseq, dUTP, NSR, NNSR, 2 if second-strand, eg Ligation, Standard SOLiD (optional,default=0)
-e , --EM Run estimation-maximization on TE counts given numberof times (optional, specify 0 if no EM desired; default=auto)
-v, --verbosity Want messages and runtime printed to stderr (optional; default=False)

Analysis Stage

squire Call

Follow-up Stage

squire Draw

Arguments
-h, --help show this help message and exit
-f , --fetch_folder Folder location of outputs from SQuIRE Fetch (optional, default = 'squire_fetch')
-m , --map_folder Folder location of outputs from SQuIRE Map (optional, default = 'squire_map')
-o , --draw_folder Destination folder for output files (optional; default='squire_draw')
-n , --name Basename for bam file (required if more than one bam file in map_folder)
-s , --strandedness '0' if unstranded, 1 if first-strand eg Illumina Truseq, dUTP, NSR, NNSR, 2 if second-strand, eg Ligation, Standard (optional,default=1)
-b , --build UCSC designation for genome build, eg. 'hg38' (required)
-l, --normlib Normalize bedgraphs by library size (optional; default=False)
-p , --pthreads Launch parallel threads(optional; default='1')
-v, --verbosity Want messages and runtime printed to stderr (optional; default=False)

squire Seek

Arguments
-h, --help show this help message and exit
-i, --infile Repeat genomic coordinates, can be TE_ID, bedfile, or gff
-o, --outfile Repeat sequences output file (FASTA), can use "-" for stdout
-g, --genome Genome build's fasta chromosomes - .fa file or .chromFa folder
-v, --verbosity Print messages and runtime records to stderr. Optional; default = False

FAQs

How do I know if my data is stranded or not?

The RNA-seqlopedia by Cresko Lab at University of Oregon outlines strand specific data in section 3.7 Preparation of stranded libraries. You can verify the strand specificity with the researcher who collected the data, or use an outside program like infer-experiment.py in RSeQC or the libtype option in Salmon.

How much memory does each step require?

You can gauge how much vmem to assign to each job based on the number of reads in your datasets.

Can SQuIRE be used on ChIP or small RNA?

SQuIRE has not yet been tested with ChIP or small RNA sequencing data, so its compatibility has not yet been determined.

Example Pipeline

INSTRUCTIONS

  1. Copy the sample_scripts folder to your project folder

    • mkdir <project folder>/scripts
    • cp SQuIRE/sample_scripts/* <project folder>/scripts
    • cd <project folder>/scripts
  2. Fill out the arguments.sh file

  3. Replace "squire@email.com" in the #$ -M squire@email.com line with your email address to get alert of script completion and memory usage

  4. Submit jobs to SGE cluster (the -cwd option results in error and output files associated to stay in your current working directory)

    • qsub -cwd fetch.sh arguments.sh
    • qsub -cwd clean.sh arguments.sh
    • qsub -cwd loop_map.sh arguments.sh
    • qsub -cwd loop_count.sh arguments.sh
    • qsub -cwd call.sh arguments.sh
    • qsub -cwd loop_draw.sh arguments.sh
  5. If a memory or segmentation fault error occurs, edit the #$ -l mem_free and #$ -l h_vmem lines to increase memory usage for the appropriate script.