SINE Annotation Tool for Plant Genomes
AnnoSINE is a SINE annotation tool for plant genomes. The program is designed to generate high-quality, non-redundant SINE libraries for genome annotation. Its annotation performance is benchmarked against the manually curated SINE library of the Oryza sativa genome.
<!--AnnoSINE has eight major modules. The first identifies putative SINE candidates by applying a hidden Markov model (HMM)-based homology search, a structure-based de novo search, or a combination of the two. This step is usually sensitive but can output many false SINE candidates. In the 2nd step, it searches for target site duplications (TSDs) in the flanking regions to further verify each SINE candidate. As the TSD is a significant feature of SINEs, this step is highly effective in removing non-SINEs. Although the TSD search could be conducted at a later stage of the pipeline, removing false positives earlier saves computational time in the downstream analysis. In the 3rd step, it examines the copy number and the alignment of SINE copies to remove sequences with low copy numbers or shifted, fragmented, or extended alignments. In addition, it can identify some lineage-specific differences, such as the length of the 3' end, using the alignment profile. In the 4th step, it decides the superfamily of each candidate SINE sequence and removes candidates that are highly similar to known non-coding RNAs, because sequences that are highly identical to RNAs are false positives. In the 5th step, it removes candidates with a large proportion of tandem repeats. In the 6th step, it removes other TEs by detecting inverted repeats adjacent to TSDs. These steps focus on identifying complete SINEs (i.e., seed sequences) in the query genome. Redundant seeds are filtered out to generate the SINE library. In the last step, once the non-redundant seed sequences are obtained, it applies RepeatMasker to identify the remaining SINEs and complete the whole-genome SINE annotation.-->
To use AnnoSINE, you need to install the tools listed below.
cd ./AnnoSINE/bin
pip3 install -r requirements.txt
python3 AnnoSINE.py [options] <mode> <input_filename> <output_filename>
positional arguments:
mode [1 | 2 | 3]
Choose the running mode of the program.
1--Homology-based method;
2--Structure-based method;
3--Hybrid of homology-based and structure-based method.
input_filename input genome assembly path
output_filename output files path
optional arguments:
-h, --help show this help message and exit
-l, --length_factor Threshold of the local alignment length relative to the BLAST query length (default: 0.3)
-c, --copy_number_factor Threshold of the copy number that determines the SINE boundary (default: 0.15)
-s, --shift Maximum threshold of the boundary shift (default: 80)
-g, --gap Maximum threshold of the truncated gap (default: 10)
-minc, --copy_number Minimum threshold of the copy number for each element (default: 20)
-b, --boundary Output SINE seed boundaries based on TSD or MSA (default: msa)
-f, --figure Output the SINE seed MSA figures and copy number profiles (y/n) (default: n)
-r, --non_redundant Annotate SINEs in the whole genome based on the non-redundant library (y/n) (default: y)
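The options above can be combined into a single command line. The following is a minimal sketch, not part of AnnoSINE itself: `build_annosine_command` and `DEFAULTS` are hypothetical names introduced here for illustration, the defaults are copied from the help text above, and the sketch assumes that each long option takes one value and that the command is run from the AnnoSINE/bin directory.

```python
import shlex

# Defaults copied from the help text above.
DEFAULTS = {
    "length_factor": 0.3,
    "copy_number_factor": 0.15,
    "shift": 80,
    "gap": 10,
    "copy_number": 20,
    "boundary": "msa",
    "figure": "n",
    "non_redundant": "y",
}


def build_annosine_command(mode, genome, out_dir, **overrides):
    """Hypothetical helper: return the argument list for one AnnoSINE run."""
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    cmd = ["python3", "AnnoSINE.py"]
    # Only pass options that differ from the documented defaults.
    for name, value in overrides.items():
        if value != DEFAULTS[name]:
            cmd += [f"--{name}", str(value)]
    # Positional arguments follow the options, as in the usage line.
    cmd += [str(mode), genome, out_dir]
    return cmd


cmd = build_annosine_command(
    3, "../Testing/A.thaliana_Chr4.fasta", "../Output_Files",
    copy_number=10, figure="y",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the run; printing it first makes it easy to confirm the options before starting a long analysis.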
Genome sequence (FASTA format).
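Before starting a run, it can help to confirm that the input really is a FASTA file. The sketch below is an optional pre-check and is not part of AnnoSINE; `summarize_fasta` is a hypothetical helper that assumes an uncompressed, plain-text FASTA file and only checks that headers start with `>`.

```python
import os
import tempfile


def summarize_fasta(path):
    """Hypothetical pre-check: return (record count, total sequence length)."""
    n_records = 0
    total_bases = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                n_records += 1
            elif n_records == 0:
                raise ValueError("sequence data found before the first '>' header")
            else:
                total_bases += len(line)
    if n_records == 0:
        raise ValueError("no FASTA records found")
    return n_records, total_bases


# Demonstration on a tiny temporary file (made-up sequence, for illustration only).
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">chr_demo\nACGTACGT\nACGT\n")
print(summarize_fasta(tmp.name))  # (1, 12)
os.remove(tmp.name)
```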
You can test AnnoSINE with one chromosome of Arabidopsis thaliana (it takes about 6 minutes).
cd ./AnnoSINE/bin
python3 AnnoSINE.py 3 ../Testing/A.thaliana_Chr4.fasta ../Output_Files
Results from running AnnoSINE on the test data are saved in Output_Files.
Please cite the paper if you use this code:
Yang Li, Ning Jiang, Yanni Sun, AnnoSINE: a short interspersed nuclear elements annotation tool for plant genomes, Plant Physiology, Volume 188, Issue 2, February 2022, Pages 955–970, https://doi.org/10.1093/plphys/kiab524