zhangrengang / TEsorter

TEsorter: an accurate and fast method to classify LTR-retrotransposons in plant genomes
https://doi.org/10.1093/hr/uhac017
GNU General Public License v3.0

Add SINE hmms #29

Closed oushujun closed 2 years ago

oushujun commented 2 years ago

Hi Ren-Gang,

There are about 88 SINE families reported by this study. They have included HMMs for these families. Are they included in the GyDB or other collections already? If not, it may be an enhancement to include them in TEsorter. Thank you.

Best, Shujun

zhangrengang commented 2 years ago

Shujun, I will add them soon. These HMMs are built from nucleotide sequences, so they will not be merged with the other, protein-based databases.

zhangrengang commented 2 years ago

@oushujun, SINEs are now supported in the latest GitHub version by running:

TEsorter -db sine rice6.9.5.liban

The summary of output:

            Positive    Negative
SINE              26          17
non-SINE          30        2358

So the sensitivity = 26/(26+17) = 0.60 and the precision = 26/(26+30) = 0.46. This method does not seem to perform as well as AnnoSINE.
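
For reference, the two figures above follow directly from the confusion-matrix counts. A minimal Python sketch, assuming rows are the library annotation and columns are the TEsorter call (the function name `sensitivity_precision` is only illustrative and is not part of TEsorter):

```python
def sensitivity_precision(tp, fn, fp):
    """Return (sensitivity, precision) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of annotated SINEs that were recovered
    precision = tp / (tp + fp)    # share of SINE calls that are annotated SINEs
    return sensitivity, precision


# Counts from the rice6.9.5 run: 26 SINEs found, 17 missed,
# 30 non-SINE entries called SINE.
sens, prec = sensitivity_precision(tp=26, fn=17, fp=30)
print(f"sensitivity = {sens:.2f}, precision = {prec:.2f}")
# prints: sensitivity = 0.60, precision = 0.46
```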

oushujun commented 2 years ago

@zhangrengang, the SINE sequences were re-curated by the AnnoSINE authors and updated in the rice library: https://github.com/oushujun/riceTElib. About 200 new SINE sequences were added to the library, and some helitron sequences containing SINE fragments were cleaned. Can you incorporate this new library into TEsorter? Thank you.

Shujun

zhangrengang commented 2 years ago

@oushujun, I have added the library v7.0.0 into the folder TEsorter/test/. With this library, the summary of output is as follows:

            Positive    Negative
SINE             230           9
non-SINE          32        2356

So the sensitivity = 230/(230+9) = 0.96 and the precision = 230/(230+32) = 0.88.

The false positive cases are listed below. They can be controlled with the e-value threshold (e.g. by tightening it to 1e-6).

Os0190_INT#LTR/Copia    TEsorter        CDS     805     912     0.1     +       .       ID=Os0190_INT#LTR/Copia|SINE1_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=23.7;evalue=0.0001;probability=0.76
Os0222#DNAnona/MULE     TEsorter        CDS     188     321     0.08    +       .       ID=Os0222#DNAnona/MULE|SINE-1_ECa;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=38.5;evalue=1.3e-05;probability=0.44
Os0343#DNAnona/CACTA    TEsorter        CDS     516     602     0.04    +       .       ID=Os0343#DNAnona/CACTA|SINE2-1a_SBi;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=32.7;evalue=0.00054;probability=0.71
Os0521#DNAnona/hAT      TEsorter        CDS     1       101     0.15    +       .       ID=Os0521#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=58.7;evalue=9.6e-06;probability=0.7
Os0545#DNAnona/hAT      TEsorter        CDS     1       91      0.2     +       .       ID=Os0545#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=67.8;evalue=9.6e-08;probability=0.78
Os0563#DNAnona/hAT      TEsorter        CDS     119     264     0.09    +       .       ID=Os0563#DNAnona/hAT|SINE1_SO;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=59.7;evalue=8.5e-06;probability=0.53
Os0623#DNAnona/MULE     TEsorter        CDS     267     386     0.1     +       .       ID=Os0623#DNAnona/MULE|SINE1_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=58.3;evalue=0.00014;probability=0.54
Os0701#DNAnona/hAT      TEsorter        CDS     1       176     0.08    +       .       ID=Os0701#DNAnona/hAT|SINE1_SO;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=81.0;evalue=4.4e-05;probability=0.5
Os0848#DNAnona/MULE     TEsorter        CDS     216     335     0.09    +       .       ID=Os0848#DNAnona/MULE|SINE1_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=55.8;evalue=0.00035;probability=0.68
Os0902#DNAnona/hAT      TEsorter        CDS     1       100     0.16    +       .       ID=Os0902#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=59.5;evalue=3.7e-06;probability=0.65
Os0909#MITE/Stow        TEsorter        CDS     59      151     0.08    +       .       ID=Os0909#MITE/Stow|SINE2-3_OS;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=25.1;evalue=0.00014;probability=0.63
Os0926#LINE/unknown     TEsorter        CDS     1092    1222    0.03    +       .       ID=Os0926#LINE/unknown|AtSB6;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=22.8;evalue=0.00087;probability=0.81
Os0986#DNAnona/MULE     TEsorter        CDS     251     344     0.05    +       .       ID=Os0986#DNAnona/MULE|SINE2-1_SBi;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=22.0;evalue=7e-05;probability=0.91
Os0987#DNAnona/hAT      TEsorter        CDS     48      157     0.06    +       .       ID=Os0987#DNAnona/hAT|SINE1_SO;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=42.7;evalue=0.00044;probability=0.78
Os1087#DNAnona/Helitron TEsorter        CDS     43      130     0.12    +       .       ID=Os1087#DNAnona/Helitron|SINE16_OS;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=41.6;evalue=0.00052;probability=0.77
Os1103#DNAnona/hAT      TEsorter        CDS     2       101     0.13    +       .       ID=Os1103#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=70.2;evalue=7.7e-05;probability=0.63
Os1106#DNAnona/Tourist  TEsorter        CDS     157     336     0.03    +       .       ID=Os1106#DNAnona/Tourist|SINE-1_ATr;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=31.7;evalue=0.00078;probability=0.43
Os1181#DNAnona/hAT      TEsorter        CDS     7       105     0.11    +       .       ID=Os1181#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=71.1;evalue=0.00022;probability=0.77
Os1296#DNAnona/hAT      TEsorter        CDS     4       112     0.12    +       .       ID=Os1296#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=76.9;evalue=0.00013;probability=0.63
Os1418#DNAnona/hAT      TEsorter        CDS     2       104     0.1     +       .       ID=Os1418#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=58.7;evalue=0.00059;probability=0.64
Os1523#DNAauto/hAT      TEsorter        CDS     1       148     0.06    +       .       ID=Os1523#DNAauto/hAT|SINE1_SO;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=62.6;evalue=0.0006;probability=0.53
Os1615#DNAnona/hAT      TEsorter        CDS     1       158     0.11    +       .       ID=Os1615#DNAnona/hAT|SINE1_SO;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=53.6;evalue=2.1e-07;probability=0.63
Os2008#DNAnona/hAT      TEsorter        CDS     3       91      0.13    +       .       ID=Os2008#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=58.7;evalue=4.6e-05;probability=0.75
Os2057#DNAnona/hAT      TEsorter        CDS     144     304     0.06    +       .       ID=Os2057#DNAnona/hAT|BoSB13;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=25.9;evalue=0.00041;probability=0.84
Os2523#DNAnona/hAT      TEsorter        CDS     1       112     0.08    +       .       ID=Os2523#DNAnona/hAT|SINE1_SO;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=67.3;evalue=4e-05;probability=0.56
Os2861#DNAnona/CACTG    TEsorter        CDS     21      145     0.08    +       .       ID=Os2861#DNAnona/CACTG|SINE1_SO;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=58.3;evalue=4.5e-05;probability=0.51
Os3087_INT#LTR/Gypsy    TEsorter        CDS     677     823     0.12    +       .       ID=Os3087_INT#LTR/Gypsy|SINE1_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=76.9;evalue=2.2e-05;probability=0.45
Os3264#DNAauto/MLE      TEsorter        CDS     165     247     0.08    +       .       ID=Os3264#DNAauto/MLE|BoSB14A;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=26.3;evalue=0.00034;probability=0.74
Os3423#DNAnona/hAT      TEsorter        CDS     1050    1189    0.09    +       .       ID=Os3423#DNAnona/hAT|SINE1_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=78.2;evalue=0.00054;probability=0.75
Os3507_LTR#LTR/Gypsy    TEsorter        CDS     31      229     0.03    +       .       ID=Os3507_LTR#LTR/Gypsy|SINE2-1_OS;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=40.3;evalue=0.001;probability=0.71
Os3527_ICR#DNAnona/hAT  TEsorter        CDS     1       100     0.16    +       .       ID=Os3527_ICR#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=70.2;evalue=3.5e-06;probability=0.72
Os3565#DNAnona/hAT      TEsorter        CDS     252     474     0.03    +       .       ID=Os3565#DNAnona/hAT|SINE2-1_STu;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=34.7;evalue=1.2e-05;probability=0.53
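
To illustrate how a stricter e-value cutoff would thin out hits like the ones listed above, here is a minimal post-filtering sketch in Python. It is an assumption-laden illustration, not TEsorter's own implementation: the helper names `evalue_of` and `filter_by_evalue` are hypothetical, and the three sample records are copied from the list above.

```python
def evalue_of(gff_line):
    """Read the evalue=... attribute from the 9th column of a GFF3 line."""
    # Split on any whitespace at most 8 times so column 9 stays in one piece.
    attributes = gff_line.split(None, 8)[8].strip()
    for field in attributes.split(";"):
        key, _, value = field.partition("=")
        if key == "evalue":
            return float(value)
    raise ValueError("no evalue attribute in: " + gff_line)


def filter_by_evalue(gff_lines, max_evalue=1e-6):
    """Keep only the hits whose e-value is at or below max_evalue."""
    return [line for line in gff_lines if evalue_of(line) <= max_evalue]


# Three of the false-positive records from the listing (tab-separated).
hits = [
    "Os0190_INT#LTR/Copia\tTEsorter\tCDS\t805\t912\t0.1\t+\t.\t"
    "ID=Os0190_INT#LTR/Copia|SINE1_MT;Name=SINE-SINE;gene=SINE;clade=SINE;"
    "coverage=23.7;evalue=0.0001;probability=0.76",
    "Os0521#DNAnona/hAT\tTEsorter\tCDS\t1\t101\t0.15\t+\t.\t"
    "ID=Os0521#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;"
    "coverage=58.7;evalue=9.6e-06;probability=0.7",
    "Os0545#DNAnona/hAT\tTEsorter\tCDS\t1\t91\t0.2\t+\t.\t"
    "ID=Os0545#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;"
    "coverage=67.8;evalue=9.6e-08;probability=0.78",
]

kept = filter_by_evalue(hits, max_evalue=1e-6)
print(len(kept), "of", len(hits), "sample hits pass the 1e-6 cutoff")
```

Of the 32 false positives listed, only Os0545 (9.6e-08) and Os1615 (2.1e-07) have an e-value at or below 1e-6. How the same cutoff affects the true positives is not shown in this thread.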

oushujun commented 2 years ago

@zhangrengang Thank you for the prompt update! The non-SINE negative category should not be that high. The v7 library only has 2627 sequences. Can you double check? Thanks

zhangrengang commented 2 years ago

@oushujun, you are right. I have revised the number in the last comment to 2356.
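
As a quick consistency check, the four revised cells should add up to the library size quoted above. A small Python sketch (the dictionary keys are just labels chosen for illustration):

```python
# Revised counts for the v7.0.0 library, as reported in this thread.
counts = {
    "SINE/positive": 230,
    "SINE/negative": 9,
    "non-SINE/positive": 32,
    "non-SINE/negative": 2356,
}
library_size = 2627  # number of sequences in the v7 library, per the comment above

total = sum(counts.values())
print(total, total == library_size)
# prints: 2627 True
```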

yuzhenpeng commented 2 years ago

Hi, Ren-Gang,

I wonder if this upgrade works for animal genome annotations.

Thanks.

Zhenpeng

zhangrengang commented 2 years ago

@yuzhenpeng Hi, according to AnnoSINE, these SINE sequences are derived from plants, so I think it should not work well with animals.

yuzhenpeng commented 2 years ago

Thank you for your response. By the way, if I want to annotate or classify animal TEs, such as LINEs and SINEs, do you have any suggestions?

@zhangrengang

zhangrengang commented 2 years ago

@yuzhenpeng, in my opinion, the RepeatModeler + RepeatMasker pipeline is OK.

yuzhenpeng commented 2 years ago

But RepeatModeler does not seem to identify LINEs or SINEs accurately. The proportion of the genome it assigns to them seems a little too high or too low. @zhangrengang

zhangrengang commented 2 years ago

@yuzhenpeng sorry, I have no better solution.

@oushujun Shujun, do you have any suggestions?

yuzhenpeng commented 2 years ago

Thanks. I found a solution in https://github.com/oushujun/EDTA/issues/231. However, identifying SINEs still seems difficult.