Shujun, I will add it soon. These HMMs are based on nucleotide sequences, and will not be merged with other protein-based databases.
@oushujun, SINEs are now supported in the latest GitHub version by running:
TEsorter -db sine rice6.9.5.liban
The summary of output:

| | Positive | Negative |
|---|---|---|
| SINE | 26 | 17 |
| non-SINE | 30 | 2358 |
So the sensitivity = 26/(26+17) = 0.60 and the precision = 26/(26+30) = 0.46. This method does not seem to perform well compared with AnnoSINE.
@zhangrengang, the SINE sequences were re-curated by the AnnoSINE authors and updated in the rice library: https://github.com/oushujun/riceTElib. About 200 new SINE sequences were added to the library, and some Helitron sequences containing SINE fragments were cleaned. Can you incorporate this new library into TEsorter? Thank you.
Shujun
@oushujun, I have added the library v7.0.0 into the folder TEsorter/test/. With this library, the summary of output is as follows:

| | Positive | Negative |
|---|---|---|
| SINE | 230 | 9 |
| non-SINE | 32 | 2356 |
So the sensitivity = 230/(230+9) = 0.96 and the precision = 230/(230+32) = 0.88.
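As a quick check of the figures quoted in this thread, here is a minimal Python sketch that recomputes sensitivity and precision from the two confusion matrices. The function name and the dictionary keys (`rice6.9.5`, `v7.0.0`) are only illustrative labels, not part of TEsorter.

```python
def sensitivity_precision(tp, fn, fp):
    """Return (sensitivity, precision) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of true SINEs that were detected
    precision = tp / (tp + fp)    # share of SINE calls that are true SINEs
    return sensitivity, precision

# Counts copied from the two summary tables in this thread.
runs = {
    "rice6.9.5": {"tp": 26, "fn": 17, "fp": 30},
    "v7.0.0": {"tp": 230, "fn": 9, "fp": 32},
}

results = {name: sensitivity_precision(**counts) for name, counts in runs.items()}
for name, (sens, prec) in results.items():
    print(f"{name}: sensitivity={sens:.2f} precision={prec:.2f}")
```

This prints 0.60 / 0.46 for the older library and 0.96 / 0.88 for v7.0.0, matching the values given in the comments.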
The false positive cases are listed below. Most of them can be removed by tightening the e-value threshold (e.g. to 1e-6).
Os0190_INT#LTR/Copia TEsorter CDS 805 912 0.1 + . ID=Os0190_INT#LTR/Copia|SINE1_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=23.7;evalue=0.0001;probability=0.76
Os0222#DNAnona/MULE TEsorter CDS 188 321 0.08 + . ID=Os0222#DNAnona/MULE|SINE-1_ECa;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=38.5;evalue=1.3e-05;probability=0.44
Os0343#DNAnona/CACTA TEsorter CDS 516 602 0.04 + . ID=Os0343#DNAnona/CACTA|SINE2-1a_SBi;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=32.7;evalue=0.00054;probability=0.71
Os0521#DNAnona/hAT TEsorter CDS 1 101 0.15 + . ID=Os0521#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=58.7;evalue=9.6e-06;probability=0.7
Os0545#DNAnona/hAT TEsorter CDS 1 91 0.2 + . ID=Os0545#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=67.8;evalue=9.6e-08;probability=0.78
Os0563#DNAnona/hAT TEsorter CDS 119 264 0.09 + . ID=Os0563#DNAnona/hAT|SINE1_SO;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=59.7;evalue=8.5e-06;probability=0.53
Os0623#DNAnona/MULE TEsorter CDS 267 386 0.1 + . ID=Os0623#DNAnona/MULE|SINE1_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=58.3;evalue=0.00014;probability=0.54
Os0701#DNAnona/hAT TEsorter CDS 1 176 0.08 + . ID=Os0701#DNAnona/hAT|SINE1_SO;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=81.0;evalue=4.4e-05;probability=0.5
Os0848#DNAnona/MULE TEsorter CDS 216 335 0.09 + . ID=Os0848#DNAnona/MULE|SINE1_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=55.8;evalue=0.00035;probability=0.68
Os0902#DNAnona/hAT TEsorter CDS 1 100 0.16 + . ID=Os0902#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=59.5;evalue=3.7e-06;probability=0.65
Os0909#MITE/Stow TEsorter CDS 59 151 0.08 + . ID=Os0909#MITE/Stow|SINE2-3_OS;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=25.1;evalue=0.00014;probability=0.63
Os0926#LINE/unknown TEsorter CDS 1092 1222 0.03 + . ID=Os0926#LINE/unknown|AtSB6;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=22.8;evalue=0.00087;probability=0.81
Os0986#DNAnona/MULE TEsorter CDS 251 344 0.05 + . ID=Os0986#DNAnona/MULE|SINE2-1_SBi;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=22.0;evalue=7e-05;probability=0.91
Os0987#DNAnona/hAT TEsorter CDS 48 157 0.06 + . ID=Os0987#DNAnona/hAT|SINE1_SO;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=42.7;evalue=0.00044;probability=0.78
Os1087#DNAnona/Helitron TEsorter CDS 43 130 0.12 + . ID=Os1087#DNAnona/Helitron|SINE16_OS;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=41.6;evalue=0.00052;probability=0.77
Os1103#DNAnona/hAT TEsorter CDS 2 101 0.13 + . ID=Os1103#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=70.2;evalue=7.7e-05;probability=0.63
Os1106#DNAnona/Tourist TEsorter CDS 157 336 0.03 + . ID=Os1106#DNAnona/Tourist|SINE-1_ATr;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=31.7;evalue=0.00078;probability=0.43
Os1181#DNAnona/hAT TEsorter CDS 7 105 0.11 + . ID=Os1181#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=71.1;evalue=0.00022;probability=0.77
Os1296#DNAnona/hAT TEsorter CDS 4 112 0.12 + . ID=Os1296#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=76.9;evalue=0.00013;probability=0.63
Os1418#DNAnona/hAT TEsorter CDS 2 104 0.1 + . ID=Os1418#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=58.7;evalue=0.00059;probability=0.64
Os1523#DNAauto/hAT TEsorter CDS 1 148 0.06 + . ID=Os1523#DNAauto/hAT|SINE1_SO;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=62.6;evalue=0.0006;probability=0.53
Os1615#DNAnona/hAT TEsorter CDS 1 158 0.11 + . ID=Os1615#DNAnona/hAT|SINE1_SO;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=53.6;evalue=2.1e-07;probability=0.63
Os2008#DNAnona/hAT TEsorter CDS 3 91 0.13 + . ID=Os2008#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=58.7;evalue=4.6e-05;probability=0.75
Os2057#DNAnona/hAT TEsorter CDS 144 304 0.06 + . ID=Os2057#DNAnona/hAT|BoSB13;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=25.9;evalue=0.00041;probability=0.84
Os2523#DNAnona/hAT TEsorter CDS 1 112 0.08 + . ID=Os2523#DNAnona/hAT|SINE1_SO;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=67.3;evalue=4e-05;probability=0.56
Os2861#DNAnona/CACTG TEsorter CDS 21 145 0.08 + . ID=Os2861#DNAnona/CACTG|SINE1_SO;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=58.3;evalue=4.5e-05;probability=0.51
Os3087_INT#LTR/Gypsy TEsorter CDS 677 823 0.12 + . ID=Os3087_INT#LTR/Gypsy|SINE1_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=76.9;evalue=2.2e-05;probability=0.45
Os3264#DNAauto/MLE TEsorter CDS 165 247 0.08 + . ID=Os3264#DNAauto/MLE|BoSB14A;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=26.3;evalue=0.00034;probability=0.74
Os3423#DNAnona/hAT TEsorter CDS 1050 1189 0.09 + . ID=Os3423#DNAnona/hAT|SINE1_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=78.2;evalue=0.00054;probability=0.75
Os3507_LTR#LTR/Gypsy TEsorter CDS 31 229 0.03 + . ID=Os3507_LTR#LTR/Gypsy|SINE2-1_OS;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=40.3;evalue=0.001;probability=0.71
Os3527_ICR#DNAnona/hAT TEsorter CDS 1 100 0.16 + . ID=Os3527_ICR#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=70.2;evalue=3.5e-06;probability=0.72
Os3565#DNAnona/hAT TEsorter CDS 252 474 0.03 + . ID=Os3565#DNAnona/hAT|SINE2-1_STu;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=34.7;evalue=1.2e-05;probability=0.53
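The e-value filtering suggested above can be tried on the GFF3 output after the fact. The following is a rough sketch, assuming each record carries an `evalue=` attribute as in the lines listed here. The helper `passes_cutoff` and the 1e-6 default are illustrative, not a TEsorter option, and only three records from the list are used as sample input.

```python
import re

# Matches the evalue=... attribute as it appears in the records above.
EVALUE_RE = re.compile(r"evalue=([^;\s]+)")

def passes_cutoff(gff_line, max_evalue=1e-6):
    """Return True if the record's e-value is at or below max_evalue."""
    match = EVALUE_RE.search(gff_line)
    if match is None:
        return False  # no e-value attribute: treat the record as not passing
    return float(match.group(1)) <= max_evalue

# Three false-positive records copied from the list above.
sample = [
    "Os0190_INT#LTR/Copia TEsorter CDS 805 912 0.1 + . ID=Os0190_INT#LTR/Copia|SINE1_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=23.7;evalue=0.0001;probability=0.76",
    "Os0545#DNAnona/hAT TEsorter CDS 1 91 0.2 + . ID=Os0545#DNAnona/hAT|SHANSINE_MT;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=67.8;evalue=9.6e-08;probability=0.78",
    "Os1615#DNAnona/hAT TEsorter CDS 1 158 0.11 + . ID=Os1615#DNAnona/hAT|SINE1_SO;Name=SINE-SINE;gene=SINE;clade=SINE;coverage=53.6;evalue=2.1e-07;probability=0.63",
]

kept = [line for line in sample if passes_cutoff(line)]
for line in kept:
    print(line.split()[0])  # sequence ID of each record that survives the cutoff
```

Of the 32 records listed, only Os0545 and Os1615 have e-values at or below 1e-6, so this cutoff would drop 30 of the false positives. How many of the 230 true positives it would also drop cannot be told from the output shown here.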
@zhangrengang Thank you for the prompt update! The non-SINE negative count should not be that high. The v7 library only has 2627 sequences. Can you double-check? Thanks
@oushujun, you are right. I have revised the number in my last comment to 2356.
Hi, Ren-Gang,
I wonder if this upgrade works for animal genome annotations.
Thanks.
Zhenpeng
@yuzhenpeng Hi, according to AnnoSINE, these SINE sequences are derived from plants, so I think it should not work well with animals.
Thank you for your response. By the way, do you have any suggestions for annotating or classifying animal TEs, such as LINEs and SINEs?
@zhangrengang
@yuzhenpeng, in my opinion, the RepeatModeler + RepeatMasker pipeline is OK.
But RepeatModeler does not seem to identify LINEs or SINEs accurately. The proportion of the genome it assigns to them seems a little too high or too low. @zhangrengang
@yuzhenpeng sorry, I have no better solution.
@oushujun Shujun, do you have any suggestions?
Thanks. I found the solution at https://github.com/oushujun/EDTA/issues/231. However, identifying SINEs still seems difficult.
Hi Ren-Gang,
There are about 88 SINE families reported by this study. They have included HMMs for these families. Are they included in the GyDB or other collections already? If not, it may be an enhancement to include them in TEsorter. Thank you.
Best, Shujun