yangli557 / AnnoSINE

SINE annotation tool for plant genomes
MIT License

AnnoSINE

SINE Annotation Tool for Plant Genomes

Introduction

AnnoSINE is a SINE annotation tool for plant genomes. The program is designed to generate high-quality non-redundant SINE libraries for genome annotation. It uses the manually curated SINE library in the Oryza sativa genome to benchmark the annotation performance.

AnnoSINE has eight major modules. The first identifies putative SINE candidates using a hidden Markov model (HMM)-based homology search, a structure-based de novo search, or a combination of the two. This step is usually sensitive but can output many false SINE candidates. The second step searches for target site duplications (TSDs) in the flanking regions to further verify each SINE candidate. Because TSDs are a significant feature of SINEs, this step is highly effective at removing non-SINEs. Although the TSD search could be conducted later in the pipeline, removing false positives early saves computation time in the downstream analysis.

The third step examines the copy number and the alignment of SINE copies to remove sequences with low copy numbers or with shifted, fragmented, or extended alignments. In addition, it can identify some lineage-specific differences, such as the length of the 3' end, using the alignment profile. The fourth step determines the superfamily of each candidate SINE sequence and removes candidates that are highly similar to known non-coding RNAs, since sequences that closely resemble RNAs are false positives. The fifth step removes candidates with a large proportion of tandem repeats. The sixth step removes other TEs by detecting inverted repeats adjacent to TSDs.

These steps focus on identifying complete SINEs (i.e., seed sequences) in the query genome. Redundant seeds are then filtered to generate the SINE library. In the last step, AnnoSINE applies RepeatMasker with the non-redundant seed sequences to identify the remaining SINEs and complete the whole-genome SINE annotation.
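To make the TSD verification step more concrete, here is a minimal sketch of how a flanking-region TSD search might look. It is not AnnoSINE's implementation: the function name `find_tsd`, the flank size, and the length limits are all hypothetical, and it only accepts exact matches, whereas a real search would likely tolerate mismatches.

```python
def find_tsd(sequence, start, end, flank=20, min_len=5, max_len=15):
    """Return (tsd, left_offset, right_offset) for the longest exact direct
    repeat found in both flanks of a candidate element, or None.

    start/end are 0-based, end-exclusive coordinates of the candidate.
    Offsets are relative to the start of each flank window.
    All parameter defaults are illustrative assumptions.
    """
    left = sequence[max(0, start - flank):start].upper()
    right = sequence[end:end + flank].upper()
    longest = min(max_len, len(left), len(right))
    # Try the longest repeat first so the first hit is the best one.
    for k in range(longest, min_len - 1, -1):
        for i in range(len(left) - k + 1):
            kmer = left[i:i + k]
            if "N" in kmer:  # skip ambiguous bases
                continue
            j = right.find(kmer)
            if j != -1:
                return kmer, i, j
    return None


if __name__ == "__main__":
    # Toy sequence: upstream + TSD + element + TSD + downstream.
    tsd = "ATTCAGGCA"
    seq = "G" * 11 + tsd + "TG" * 10 + tsd + "C" * 11
    print(find_tsd(seq, 20, 40))
```

A candidate for which this kind of search returns nothing would be discarded at this stage, which is why running it early reduces the work left for the later steps.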

Prerequisites

To use AnnoSINE, you need to install the tools listed below.

Installation

cd ./AnnoSINE/bin
pip3 install -r requirements.txt

Usage

python3 AnnoSINE.py [options] <mode> <input_filename> <output_filename>

Argument

positional arguments:
  mode                  [1 | 2 | 3]
                        Choose the running mode of the program.
                                1--Homology-based method;
                                2--Structure-based method;
                                3--Hybrid of homology-based and structure-based method.
  input_filename        input genome assembly path
  output_filename       output files path

optional arguments:
  -h, --help                 show this help message and exit
  -l, --length_factor        Threshold of the local alignment length relative to the BLAST query length (default: 0.3)
  -c, --copy_number_factor   Threshold of the copy number that determines the SINE boundary (default: 0.15)
  -s, --shift                Maximum threshold of the boundary shift (default: 80)
  -g, --gap                  Maximum threshold of the truncated gap (default: 10)
  -minc, --copy_number       Minimum threshold of the copy number for each element (default: 20)
  -b, --boundary             Output SINE seed boundaries based on TSD or MSA (default: msa)
  -f, --figure               Output the SINE seed MSA figures and copy number profiles (y/n) (default: n)
  -r, --non_redundant        Annotate SINE in the whole genome based on the non-redundant library (y/n) (default: y)
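
If you drive AnnoSINE from a script, a small helper that assembles the command line can catch typos before a long run starts. The sketch below is only an illustration: `build_annosine_command` is a hypothetical helper, the option names are copied from the list above, and it assumes each option takes its value as a separate token.

```python
# Long option names as documented in the argument list above.
DOCUMENTED_OPTIONS = {
    "length_factor", "copy_number_factor", "shift", "gap",
    "copy_number", "boundary", "figure", "non_redundant",
}


def build_annosine_command(mode, genome_path, output_dir, **options):
    """Build an argv list following the documented usage line:
    python3 AnnoSINE.py [options] <mode> <input_filename> <output_filename>
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2, or 3")
    unknown = set(options) - DOCUMENTED_OPTIONS
    if unknown:
        raise ValueError(f"undocumented option(s): {sorted(unknown)}")
    cmd = ["python3", "AnnoSINE.py"]
    # Assumption: every option is passed as "--name value".
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    cmd += [str(mode), genome_path, output_dir]
    return cmd


if __name__ == "__main__":
    print(build_annosine_command(
        3, "../Testing/A.thaliana_Chr4.fasta", "../Output_Files",
        copy_number=20, boundary="msa",
    ))
```

The resulting list can be passed to `subprocess.run(cmd, cwd="AnnoSINE/bin", check=True)`, where the working directory is again an assumption based on the commands shown in this README.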

Inputs

Genome sequence (FASTA format).
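
Before a long run it can be worth checking that the input really is well-formed FASTA. The following is a minimal sketch of such a check, not part of AnnoSINE; `fasta_lengths` is a hypothetical helper that reports the length of each record and rejects obviously malformed input.

```python
def fasta_lengths(lines):
    """Map each FASTA record name to its sequence length.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Raises ValueError on sequence data before the first header,
    on an empty header, or on a duplicated record name.
    """
    lengths = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty FASTA header")
            name = fields[0]  # record name is the first word of the header
            if name in lengths:
                raise ValueError(f"duplicate record name: {name}")
            lengths[name] = 0
        elif name is None:
            raise ValueError("sequence data before first header")
        else:
            lengths[name] += len(line)
    return lengths


if __name__ == "__main__":
    example = [">Chr4 example record", "ACGTAC", "GGTA", ">Chr5", "NNN"]
    print(fasta_lengths(example))
```

For a real genome you would call it as `fasta_lengths(open(path))`, with `path` pointing at your own assembly.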

Outputs

Intermediate Files

Testing

You can test AnnoSINE with one chromosome of Arabidopsis thaliana (it takes about 6 minutes).

cd ./AnnoSINE/bin
python3 AnnoSINE.py 3 ../Testing/A.thaliana_Chr4.fasta ../Output_Files

The results of this test run are saved in Output_Files.

Citations

Please cite the paper if you use this code:

Yang Li, Ning Jiang, Yanni Sun, AnnoSINE: a short interspersed nuclear elements annotation tool for plant genomes, Plant Physiology, Volume 188, Issue 2, February 2022, Pages 955–970, https://doi.org/10.1093/plphys/kiab524