Wan R Yang, Daniel Ardeljan, Clarissa N Pacyna, Lindsay M Payer, Kathleen H Burns; SQuIRE reveals locus-specific regulation of interspersed repeat expression, Nucleic Acids Research, gky1301, https://doi.org/10.1093/nar/gky1301
SQuIRE is available on bioconda and can be installed using conda. We suggest running conda with mamba for speedup.
Conda is a package manager that installs and runs packages and their dependencies. Conda also creates virtual environments and allows users to switch between those environments. Mamba is a reimplementation of conda in C++ with faster dependency solving.
The instructions below install conda, mamba, and a conda virtual environment for SQuIRE.
Download Miniconda (a lightweight distribution of conda) from https://conda.io/miniconda.html
wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
Run the Miniconda installer
bash Miniconda3-latest-Linux-x86_64.sh
yes to approve the license terms
yes to add Miniconda3 to your PATH
Restart shell
exec $SHELL
Configure conda channel priority
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
cat $(dirname $CONDA_PREFIX)/.condarc
The output should list the channels in this order:
channels:
- conda-forge
- bioconda
- defaults
conda config --describe channel_priority
conda config --set channel_priority flexible
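
To double-check the result, the channel names can be read back from the configuration file and compared with the order shown above. The following is a minimal sketch, assuming the file is at ~/.condarc; `list_channels` and `CONDARC_FILE` are illustrative names, not part of conda.

```shell
# Print the channel names listed under "channels:" in a .condarc file, in order.
list_channels() {
    awk '
        /^channels:/ { in_list = 1; next }
        in_list && /^[ \t]*-[ \t]/ {
            sub(/^[ \t]*-[ \t]*/, "")
            print
            next
        }
        in_list { in_list = 0 }
    ' "$1"
}

# Assumed location of the conda configuration file; adjust if yours differs.
CONDARC_FILE="${CONDARC_FILE:-$HOME/.condarc}"

if [ -f "$CONDARC_FILE" ]; then
    # Expected order, highest priority first: conda-forge, bioconda, defaults.
    list_channels "$CONDARC_FILE"
else
    echo "No .condarc found at $CONDARC_FILE" >&2
fi
```

If the printed order differs, re-running the three `conda config --add channels` commands in the order given above should restore it.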
Install mamba in the base environment
conda install mamba -n base -c conda-forge
Create a new environment and install SQuIRE!
mamba create -n squire -c bioconda squire
conda activate squire
Many thanks to Rohini Gadde for setting up the SQuIRE Bioconda package!
*SQuIRE is compatible with the following specific versions of software:*
STAR 2.5.3a
bedtools 2.25.0
samtools 1.1
stringtie 1.3.3
DESeq2 1.16.1
R 3.4.1
Python 2.7
If installing this software with conda is unsuccessful, we recommend installing these versions with squire Build to ensure compatibility with SQuIRE.
squire Build -s all
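
Before falling back to Build, it can help to see which of the tools are already visible on your PATH. This is a rough sketch: the executable names are assumptions based on each tool's usual command name, `check_tools` is a hypothetical helper, and DESeq2 is left out because it is an R package rather than a command.

```shell
# Report which of the given commands are on PATH; return the number missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool ($(command -v "$tool"))"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Run inside the activated squire environment.
check_tools STAR bedtools samtools stringtie R python || \
    echo "At least one tool was not found on PATH."
```

The sketch only checks for presence; the versions of the tools it finds still need to be compared by hand with the list above.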
Preparation Stage
1) Fetch: Downloads input files from RefGene and generates STAR index. Only needs to be done once initially to acquire genomic input files or if a new build is desired.
2) Clean: Filters Repeatmasker file for Repeats of interest, collapses overlapping repeats, and returns as BED file.
Optional: Incorporation of non-reference TE sequence
Quantification Stage
1) Map: Aligns RNAseq data
2) Count: Quantifies RNAseq reads aligning to TEs
Analysis Stage
1) Call: Compiles and outputs differential expression from multiple alignments
Follow-up Stage
1) Draw: Creates BEDgraphs from RNAseq data
2) Seek: Reports individual transposable element sequences
An example pipeline with sample scripts is described here.
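
The stages above can be chained in a single script. The sketch below uses only flags documented in the argument tables of this README; the sample names, FASTQ file names, read length, thread count, and project name are all hypothetical, and the `run` helper is an illustrative wrapper that prints each command and executes it only when `DRY_RUN=0`.

```shell
# Print each command; execute it only when DRY_RUN=0 (dry run is the default).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Hypothetical project settings; replace with your own values.
BUILD=hg38
READ_LENGTH=100
THREADS=8

# Preparation stage: only needed once per genome build.
run squire Fetch -b "$BUILD" -f -c -r -g -x -p "$THREADS" -v
run squire Clean -b "$BUILD" -v

# Quantification stage: one Map and one Count call per sample.
for sample in treated_1 treated_2 control_1 control_2; do
    run squire Map -1 "${sample}_R1.fastq.gz" -2 "${sample}_R2.fastq.gz" \
        -r "$READ_LENGTH" -n "$sample" -b "$BUILD" -p "$THREADS" -v
    run squire Count -r "$READ_LENGTH" -n "$sample" -b "$BUILD" -p "$THREADS" -v
done

# Analysis stage: compare the two groups.
run squire Call -1 treated_1,treated_2 -2 control_1,control_2 \
    -A treated -B control -N my_project -f pdf -p "$THREADS" -v

# Follow-up stage: one Draw call per sample of interest.
run squire Draw -b "$BUILD" -n treated_1 -p "$THREADS" -v
```

Because every step is left on its default folder (squire_fetch, squire_clean, squire_map, squire_count), each stage should find the output of the previous one without extra folder arguments. Seek is omitted because its inputs depend on which elements you want to inspect.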
Use Build only if conda create does not successfully install software.
Download and install required software (STAR, Bedtools, Samtools, and/or Stringtie)
Adds software to PATH
usage squire Build -o -s STAR,bedtools,samtools,stringtie -v
Arguments: | |
---|---|
-b, --folder | Destination folder for downloaded UCSC file(s). Optional; default='squire_build' |
-s | Install required SQuIRE software and add to PATH - specify 'all' or provide comma-separated list (no spaces) of: STAR,bedtools,samtools,stringtie. Optional; default = False |
-v, --verbosity | Want messages and runtime printed to stderr. Optional. |
Downloads required files from repeatmasker
Only needs to be used the first time SQuIRE is used to transfer required genomic build references to your machine
Outputs annotation files, chromosome fasta file(s) and STAR index
usage: squire Fetch [-h] -b <build> [-o <folder>] [-f] [-c] [-r] [-g] [-x] [-p <int>] [-k] [-v]
Arguments | |
---|---|
-h, --help | show this help message and exit |
-b | UCSC designation for genome build, eg. 'hg38' |
-o | Destination folder for downloaded UCSC file(s), default folder is 'squire_fetch' |
-f, --fasta | Download chromosome fasta files for build chromosomes. Optional |
-c, --chrom_info | Download chrom_info.txt file with chromosome lengths. Optional |
-r, --rmsk | Download Repeatmasker file. Optional |
-g, --gene | Download UCSC gene annotation. Optional |
-x, --index | Create STAR index (WARNING: will take a lot of time and memory!), optional |
-p | Launch |
-k, --keep | Keep downloaded compressed files. Optional, default = False |
-v, --verbosity | Print messages and runtime records to stderr. Optional; default = False |
Filters genomic coordinates of Repeats of interest from repeatmasker, collapses overlapping TEs, and returns BED file and count of subfamily copies.
Only needs to be done at the first use of SQuIRE pipeline to clean up the index files
Outputs .bed file of TE coordinates, strand and divergence
usage: squire Clean [-h] [-r <rmsk.txt or file.out>] [-b <build>] [-o <folder>] [-c <classes>] [-f <families>] [-s <subfamilies>] [-e <file>] [-v]
Arguments | |
---|---|
-h, --help | show this help message and exit |
-r | Repeatmasker file, default will search 'squire_fetch' folder for rmsk.txt or .out file. Optional |
-b | UCSC designation for genome build, eg. 'hg38' |
-o | Destination folder for output BED file, default folder is 'squire_clean' |
-c | Comma-separated list of desired repeat classes (AKA superfamilies), eg 'DNA,LTR'. Column 12 in repeatmasker file. Can use UNIX wildcard patterns. Optional |
-f | Comma-separated list of desired repeat families, eg 'ERV1,ERVK,ERVL'. Column 13 in repeatmasker file. Can use UNIX wildcard patterns. Optional |
-s | Comma-separated list of desired repeat subfamilies, eg 'L1HS,AluYb'. Column 11 in repeatmasker file. Can use UNIX wildcard patterns. Optional |
-e | Filepath of extra tab-delimited file containing non-reference repeat sequences. Columns should be chr, start, stop, strand, subfamily, and sequence. Optional; default = False |
-v, --verbosity | Print messages and runtime records to stderr. Optional; default = False |
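
Since the class, family, and subfamily filters accept UNIX wildcard patterns, the patterns should be quoted so that the shell passes them through unchanged instead of expanding them against file names in the current directory. The sketch below illustrates this with a hypothetical `show_args` helper and then echoes (rather than runs) a Clean command; the subfamily pattern `AluY*` is only an example.

```shell
# Show each argument on its own line, to see exactly what a program would receive.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Quoted: the pattern reaches the program intact, whatever files exist here.
show_args -s 'L1HS,AluY*'

# The corresponding SQuIRE call, printed instead of executed in this sketch.
echo squire Clean -b hg38 -s 'L1HS,AluY*' -v
```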
For known TE sequences that are not included in the reference genome, a tab-delimited file can be provided to SQuIRE to incorporate the non-reference TEs into the analysis. This file can be passed to the Map and Clean steps with the --extra parameter.
The file should contain the following columns: chr, start, stop, strand, subfamily, and sequence.
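
As an illustration, such a file can be written and checked from the shell. The sketch below follows the column order given in the Clean argument table (chr, start, stop, strand, subfamily, sequence); the file name, coordinates, and sequences are placeholders, and it writes no header line because this README does not say whether one is expected.

```shell
# Hypothetical output path and placeholder values; replace with real insertions.
EXTRA_FILE="${EXTRA_FILE:-nonreference_TEs.txt}"

# One line per non-reference TE: chr, start, stop, strand, subfamily, sequence.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr1 1000000 1000001 + L1HS ACGTACGTACGT \
    chr7 2500000 2500001 - AluYb8 GGCCGGGCGCGG \
    > "$EXTRA_FILE"

# Sanity check: report any line without exactly six tab-separated fields.
awk -F '\t' '
    NF != 6 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
    END { exit bad + 0 }
' "$EXTRA_FILE" && echo "$EXTRA_FILE has six columns on every line"
```

The resulting file could then be supplied to Clean with `-e "$EXTRA_FILE"`.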
Aligns RNAseq reads to STAR index allowing for multiple alignments
Outputs .bam file
usage: squire Map [-h] [-1 <file_1.fastq or file_1.fastq.gz>] [-2 <file_2.fastq or file_2.fastq.gz>] [-o <folder>] [-f <folder>] -r <int> [-n <str>] [-3 <int>] [-e <file.txt>] [-b <build>] [-p <int>] [-v]
Arguments | |
---|---|
-h, --help | show this help message and exit |
-1 | RNASeq data fastq file(s); read1 if providing paired end data. If more than one file, separate with commas, no spaces. Can be gzipped. |
-2 | RNASeq data read2 fastq file(s). If more than one file, separate with commas, no spaces. Can be gzipped. Optional if unpaired data. |
-o | Destination folder for output files. Optional, default = 'squire_map' |
-f | Folder location of outputs from SQuIRE Fetch (optional, default = 'squire_fetch') |
-r | Read length (if trim3 selected, after trimming; required) |
-n | Common basename for RNAseq input. Optional, default = basename of read1 |
-b
usage: squire Count [-h] [-m <folder>] [-c <folder>] [-o <folder>] [-t <folder>] [-f <folder>] -r <int> [-n <str>] [-b <build>] [-p <int>] [-s <int>] [-e EM] [-v]
Arguments: | |
---|---|
-h, --help | show this help message and exit |
-m | Folder location of outputs from SQuIRE Map (optional, default = 'squire_map') |
-c | Folder location of outputs from SQuIRE Clean (optional, default = 'squire_clean') |
-o | Destination folder for output files (optional, default = 'squire_count') |
-t | Folder for tempfiles (optional; default = count_folder) |
-f | Folder location of outputs from SQuIRE Fetch (optional, default = 'squire_fetch') |
-r | Read length (if trim3 selected, after trimming; required). |
-n | Common basename for input files (required if more than one bam file in map_folder) |
-b | UCSC designation for genome build, eg. 'hg38' (required if more than 1 build in clean_folder) |
-p | Launch |
-s | '0' if unstranded eg Standard Illumina, 1 if first-strand eg Illumina Truseq, dUTP, NSR, NNSR, 2 if second-strand, eg Ligation, Standard SOLiD (optional, default=0) |
-e, --EM | Run expectation-maximization on TE counts the given number of times (optional, specify 0 if no EM desired; default=auto) |
-v, --verbosity | Want messages and runtime printed to stderr (optional; default=False) |
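
Because `-n` is required once the map folder holds more than one BAM file, Count is typically run once per sample. The sketch below prints one Count command per BAM file. It assumes each BAM is named `<basename>.bam`, which this README does not confirm, and the read length, build, strandedness, and thread count are placeholder values; `count_commands` and `MAP_FOLDER` are illustrative names.

```shell
# Print one squire Count command per BAM file found in a map folder.
# Assumption: each BAM is named <basename>.bam.
count_commands() {
    for bam in "$1"/*.bam; do
        [ -e "$bam" ] || continue          # the glob matched nothing
        name=$(basename "$bam" .bam)
        echo "squire Count -m $1 -r 100 -n $name -b hg38 -s 0 -p 4 -v"
    done
}

count_commands "${MAP_FOLDER:-squire_map}"
```

Review the printed commands before running them, for example by piping the output to `sh` once the values look right.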
Performs differential expression analysis on TEs and genes
Outputs DESeq2 output and plots
usage squire Call [-h] -1 <str1,str2> or <str> -2 <str1,str2> or <str> -A
Arguments | |
---|---|
-h, --help | show this help message and exit |
-1 <str1,str2> or <str>, --group1 <str1,str2> or <str> | List of basenames for group1 (Treatment) samples, can also provide string pattern common to all group1 basenames |
-2 <str1,str2> or <str>, --group2 <str1,str2> or <str> | List of basenames for group2 (Control) samples, can also provide string pattern common to all group2 basenames |
-A | Name of condition for group1 |
-B | Name of condition for group2 |
-o | Destination folder for output files (optional; default='squire_call') |
-s, --subfamily | Compare TE counts by subfamily. Otherwise, compares TEs at locus level (optional; default=False) |
-p | Launch |
-N | Basename for project |
-f | Output figures as html or pdf |
-v, --verbosity | Want messages and runtime printed to stderr (optional; default=False) |
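
The group arguments take comma-separated basenames with no spaces, which is easy to get wrong when typing long lists by hand. A small sketch of building them from sample names follows; the sample names, condition names, project name, and thread count are hypothetical, `join_by_comma` is an illustrative helper, and the Call command is echoed rather than executed.

```shell
# Join arguments with commas, e.g. "a b c" -> "a,b,c".
join_by_comma() {
    (IFS=,; echo "$*")
}

GROUP1=$(join_by_comma treated_1 treated_2 treated_3)
GROUP2=$(join_by_comma control_1 control_2 control_3)

# Print the resulting command for inspection.
echo squire Call -1 "$GROUP1" -2 "$GROUP2" -A treated -B control \
    -N my_project -f pdf -p 4 -v
```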
Creates bedgraphs and bigwigs from RNAseq data
usage squire Draw [-h] [-f
Arguments | |
---|---|
-h, --help | show this help message and exit |
-f | Folder location of outputs from SQuIRE Fetch (optional, default = 'squire_fetch') |
-m | Folder location of outputs from SQuIRE Map (optional, default = 'squire_map') |
-o | Destination folder for output files (optional; default='squire_draw') |
-n | Basename for bam file (required if more than one bam file in map_folder) |
-s | '0' if unstranded, 1 if first-strand eg Illumina Truseq, dUTP, NSR, NNSR, 2 if second-strand, eg Ligation, Standard (optional,default=1) |
-b | UCSC designation for genome build, eg. 'hg38' (required) |
-l, --normlib | Normalize bedgraphs by library size (optional; default=False) |
-p | Launch |
-v, --verbosity | Want messages and runtime printed to stderr (optional; default=False) |
Retrieves transposable element sequences from chromosome fasta files
Outputs sequences in FASTA format
usage squire Seek [-h] -i
Arguments | |
---|---|
-h, --help | show this help message and exit |
-i, --infile | Repeat genomic coordinates, can be TE_ID, bedfile, or gff |
-o, --outfile | Repeat sequences output file (FASTA), can use "-" for stdout |
-g, --genome | Genome build's fasta chromosomes - .fa file or .chromFa folder |
-v, --verbosity | Print messages and runtime records to stderr. Optional; default = False |
The RNA-seqlopedia by the Cresko Lab at the University of Oregon outlines strand-specific data in section 3.7, Preparation of stranded libraries. You can verify the strand specificity with the researcher who collected the data, or use an outside program such as infer_experiment.py in RSeQC or the --libType option in Salmon.
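
If you use RSeQC, its report gives two fractions for paired-end data, which then have to be translated into the `-s` value used by Count and Draw. The sketch below shows one way to do that. The mapping is an assumption based on the usual library conventions (dUTP libraries show up under "1+-,1-+,2++,2--"), the 0.8 cutoff is arbitrary, and `squire_strand_code` is a hypothetical helper, not part of SQuIRE or RSeQC.

```shell
# Map the two fractions reported by infer_experiment.py to a SQuIRE -s value.
#   $1: fraction explained by "1++,1--,2+-,2-+" (read 1 on the transcript strand)
#   $2: fraction explained by "1+-,1-+,2++,2--" (read 1 on the opposite strand)
# The report itself would come from something like:
#   infer_experiment.py -r genes.bed -i sample.bam
squire_strand_code() {
    awk -v same="$1" -v opposite="$2" 'BEGIN {
        if (opposite >= 0.8)  print 1   # first-strand, e.g. dUTP
        else if (same >= 0.8) print 2   # second-strand, e.g. ligation
        else                  print 0   # treat as unstranded
    }'
}

squire_strand_code 0.02 0.96
squire_strand_code 0.49 0.49
```

Fractions that fall between the two extremes deserve a closer look at the library protocol rather than a blind choice of 0.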
You can gauge how much vmem to assign to each job based on the number of reads in your datasets.
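
The number of reads per dataset can be obtained directly from the FASTQ files. This is a minimal sketch, assuming standard four-line FASTQ records; `count_reads` is an illustrative helper and the files are passed as script arguments.

```shell
# Count reads in a FASTQ file (plain or gzipped), assuming four lines per read.
count_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print int(NR / 4) }'
}

# Print "file<TAB>reads" for every FASTQ path given on the command line.
for fastq in "$@"; do
    printf '%s\t%s\n' "$fastq" "$(count_reads "$fastq")"
done
```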
SQuIRE has not yet been tested with ChIP or small RNA sequencing data, so its compatibility has not yet been determined.
INSTRUCTIONS
Copy the sample_scripts folder to your project folder
mkdir <project folder>/scripts
cp SQuIRE/sample_scripts/* <project folder>/scripts
cd <project folder>/scripts
Fill out the arguments.sh file
Replace "squire@email.com" in the #$ -M squire@email.com line with your email address to be alerted of script completion and memory usage
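
Rather than editing each script by hand, the placeholder address can be replaced in all scripts at once. The sketch below assumes the copied scripts end in `.sh` and that your address contains no `/` or `&` characters, which would need escaping for sed; `set_email` is a hypothetical helper.

```shell
# Replace the placeholder address in every .sh script in a folder.
# Usage: set_email <scripts folder> <your address>
# A backup of each script is kept next to it with a .bak suffix.
set_email() {
    for script in "$1"/*.sh; do
        [ -e "$script" ] || continue       # the glob matched nothing
        sed -i.bak "s/squire@email\.com/$2/g" "$script"
    done
}

# Example call (commented out; fill in your own folder and address):
# set_email "<project folder>/scripts" you@example.org
```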
Submit jobs to SGE cluster (the -cwd option keeps the associated error and output files in your current working directory)
qsub -cwd fetch.sh arguments.sh
qsub -cwd clean.sh arguments.sh
qsub -cwd loop_map.sh arguments.sh
qsub -cwd loop_count.sh arguments.sh
qsub -cwd call.sh arguments.sh
qsub -cwd loop_draw.sh arguments.sh
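
Each stage consumes the output of an earlier one, so a job should not start before the previous one has finished. One option, if you prefer to submit everything at once, is SGE's `-hold_jid` flag together with named jobs. The sketch below is an assumption about how this could be wired up, not part of the sample scripts: the job names are made up, the `submit` wrapper only prints commands unless `DRY_RUN=0`, and it presumes each script does its work inside the submitted job rather than spawning further jobs.

```shell
# Print each qsub command; submit for real only when DRY_RUN=0.
submit() {
    echo "+ qsub $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        qsub "$@"
    fi
}

# Submit the stages in order, each held until the previous named job ends.
submit_pipeline() {
    prev=""
    for script in fetch.sh clean.sh loop_map.sh loop_count.sh call.sh loop_draw.sh; do
        name="squire_${script%.sh}"
        if [ -n "$prev" ]; then
            submit -cwd -N "$name" -hold_jid "$prev" "$script" arguments.sh
        else
            submit -cwd -N "$name" "$script" arguments.sh
        fi
        prev="$name"
    done
}

submit_pipeline
```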
If a memory or segmentation fault error occurs, edit the #$ -l mem_free and #$ -l h_vmem lines to increase memory usage for the appropriate script.
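
The two memory lines can also be changed from the command line. The sketch below assumes the lines have the form `#$ -l mem_free=<value>` and `#$ -l h_vmem=<value>`, each on its own line; check the sample scripts before relying on it. `set_memory` is a hypothetical helper and 40G is only an example value.

```shell
# Set the mem_free and h_vmem requests in one SGE script.
# Usage: set_memory <script> <value, e.g. 40G>
# A backup of the script is kept next to it with a .bak suffix.
set_memory() {
    sed -i.bak \
        -e 's/^#\$ -l mem_free=.*/#$ -l mem_free='"$2"'/' \
        -e 's/^#\$ -l h_vmem=.*/#$ -l h_vmem='"$2"'/' \
        "$1"
}

# Example call (commented out; pick the script that failed):
# set_memory loop_map.sh 40G
```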