zhangrengang / TEsorter

TEsorter: an accurate and fast method to classify LTR-retrotransposons in plant genomes
https://doi.org/10.1093/hr/uhac017
GNU General Public License v3.0

TEsorter find TE-related gene in BUSCO datasets #7

Closed baozg closed 4 years ago

baozg commented 5 years ago

Hi, Rengang,

I used TEsorter to look for potential TEs among BUSCO genes (to validate the EDTA masking result). I tested two BUSCO gene sets, one from animals and one from plants, and found that about 1% of the BUSCO genes are classified as TE-related. Is this a BUSCO issue or a TEsorter issue?

Here are the commands I used:

# The input FASTA files are BUSCO/odb9/tetrapoda_odb9/ancestral and /data/database/BUSCO/odb10/eudicotyledons_odb10/ancestral

python /data/software/TEsorter/TEsorter.py -db rexdb -st prot -p 12 eudicotyledons.odb10.fa
python /data/software/TEsorter/TEsorter.py -db rexdb -st prot -p 12 tetrapoda.obd9.fa

Here are the results from TEsorter:

# eudicotyledons 2121 genes
#TE    Order     Superfamily  Clade            Complete  Strand  Domains
12416  LTR       Copia        unknown          no        +       GAG|Ty1-outgroup
13331  LTR       Copia        Alesia           no        +       INT|Alesia
15674  LTR       Copia        Ikeros           no        +       GAG|Ikeros
1703   LTR       Gypsy        chromo-unclass   no        +       RH|chromo-unclass
17251  LINE      unknown      unknown          unknown   +       ENDO|LINE
18920  LINE      unknown      unknown          unknown   +       ENDO|LINE
19331  LTR       Gypsy        Chlamyvir        no        +       PROT|Chlamyvir
2319   LTR       Copia        Tork             no        +       GAG|Tork
23637  LTR       Gypsy        chromo-outgroup  no        +       CHD|chromo-outgroup
24863  LTR       Bel-Pao      unknown          no        +       GAG|Bel-Pao
298    LTR       Gypsy        Athila           no        +       RT|Athila
3490   Maverick  unknown      unknown          unknown   +       ATPase|Maverick
370    LTR       Gypsy        TatIII           no        +       INT|TatIII
39406  LTR       Gypsy        unknown          no        +       GAG|Ty3_gypsy
4185   Maverick  unknown      unknown          unknown   +       ATPase|Maverick
4235   Maverick  unknown      unknown          unknown   +       ATPase|Maverick
42761  LTR       Copia        Gymco-I          no        +       GAG|Gymco-I
5202   LINE      unknown      unknown          unknown   +       ENDO|LINE
5492   Maverick  unknown      unknown          unknown   +       ATPase|Maverick
5537   Maverick  unknown      unknown          unknown   +       ATPase|Maverick
6911   LINE      unknown      unknown          unknown   +       ENDO|LINE
7178   Maverick  unknown      unknown          unknown   +       ATPase|Maverick
7311   Maverick  unknown      unknown          unknown   +       ATPase|Maverick
75     Maverick  unknown      unknown          unknown   +       ATPase|Maverick

# tetrapoda.obd9 3950 genes
#TE          Order           Superfamily    Clade            Complete  Strand  Domains
EOG09070046  LTR             Gypsy          Galadriel        no        +       CHD|Galadriel
EOG0907005G  Maverick        unknown        unknown          unknown   +       ATPase|Maverick
EOG090700NV  Maverick        unknown        unknown          unknown   +       ATPase|Maverick
EOG090700P2  Maverick        unknown        unknown          unknown   +       ATPase|Maverick
EOG090700WL  LTR             Bel-Pao        unknown          no        +       GAG|Bel-Pao
EOG0907011V  TIR             hAT            unknown          unknown   +       TPase|hAT
EOG090701KJ  Maverick        unknown        unknown          unknown   +       ATPase|Maverick HEL2|Helitron
EOG0907023Z  LTR             Copia          Ikeros           no        +       GAG|Ikeros
EOG090702M8  LTR             Gypsy          Tcn1             no        +       GAG|Tcn1
EOG090702OV  LTR             Gypsy          chromo-outgroup  no        +       CHD|chromo-outgroup
EOG090702OY  Maverick        unknown        unknown          unknown   +       ATPase|Maverick
EOG090702YF  Maverick        unknown        unknown          unknown   +       ATPase|Maverick
EOG09070311  LINE            unknown        unknown          unknown   +       ENDO|LINE
EOG090703FS  Helitron        unknown        unknown          unknown   +       HEL2|Helitron
EOG090703IH  LINE            unknown        unknown          unknown   +       ENDO|LINE
EOG090703K5  LTR             Copia          unknown          no        +       GAG|Ty1_copia
EOG090703L1  LTR             Gypsy          CRM              no        +       CHD|CRM
EOG090703QV  LINE            unknown        unknown          unknown   +       ENDO|LINE
EOG090703U9  LTR             Gypsy          chromo-unclass   no        +       RH|chromo-unclass
EOG090703X6  LTR             Gypsy          unknown          no        +       RH|Ty3_gypsy
EOG090703ZG  LTR             Gypsy          unknown          no        +       INT|Ty3_gypsy
EOG0907060L  Maverick        unknown        unknown          unknown   +       ATPase|Maverick
EOG0907061O  LINE            unknown        unknown          unknown   +       ENDO|LINE
EOG090706TH  LTR             Copia          TAR              no        +       RH|TAR
EOG090707EW  TIR             PIF_Harbinger  unknown          unknown   +       TPase|PIF_Harbinger
EOG090707UK  LTR             Copia          Gymco-III        no        +       GAG|Gymco-III
EOG0907089R  LTR             Copia          Gymco-I          no        +       PROT|Gymco-I
EOG0907097Q  mixture         mixture        unknown          unknown   +       ATPase|Maverick RT|non-chromo-outgroup
EOG090709EF  LTR             Retrovirus     unknown          unknown   +       RH|Retrovirus
EOG090709MD  LINE            unknown        unknown          unknown   +       ENDO|LINE
EOG09070B54  LTR             Bel-Pao        unknown          no        +       PROT|Bel-Pao
EOG09070B89  Maverick        unknown        unknown          unknown   +       PROT|Maverick
EOG09070B8U  Maverick        unknown        unknown          unknown   +       ATPase|Maverick
EOG09070D5R  LTR             Bel-Pao        unknown          no        +       GAG|Bel-Pao
EOG09070EMS  LTR             Gypsy          chromo-outgroup  no        +       CHD|chromo-outgroup
EOG09070FOZ  pararetrovirus  unknown        unknown          unknown   +       RT|pararetrovirus

zhangrengang commented 5 years ago

@baozg, these might be false positives from TEsorter. You can check the score and e-value of each hit in the *.dom.gff3 or *.dom.tsv file. Are the scores very low and the e-values very high? If so, I suggest tightening the criteria, for example with -eval 1e-6, when identifying TE domains in host protein-coding genes.

For TE sequences, the default criteria are OK, because the sequences were already identified and filtered by other tools, which makes them more likely to be true TEs.
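
As a rough illustration of the e-value check suggested above, the hits in a tab-separated table can be filtered after the run, without rerunning TEsorter. This is only a sketch: the helper `filter_by_evalue`, the column name `evalue`, and the example rows are assumptions made for illustration, not the documented layout of the *.dom.tsv file, so check the header of your own file first.

```python
import csv
import io


def filter_by_evalue(lines, max_evalue=1e-6, evalue_column="evalue"):
    """Keep rows of a tab-separated table whose e-value is <= max_evalue.

    `lines` is any iterable of text lines whose first line is a header.
    `evalue_column` is an ASSUMED column name; adjust it to the real header.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    if evalue_column not in (reader.fieldnames or []):
        raise KeyError(f"column {evalue_column!r} not found in {reader.fieldnames}")
    return [row for row in reader if float(row[evalue_column]) <= max_evalue]


# Made-up rows for demonstration only; these are not real TEsorter output.
EXAMPLE = (
    "id\tevalue\tscore\n"
    "geneA\t1e-30\t150.2\n"
    "geneB\t0.002\t12.1\n"
    "geneC\t1e-6\t40.0\n"
)

kept = filter_by_evalue(io.StringIO(EXAMPLE))
print([row["id"] for row in kept])
```

With a real file, the same function could be called as `filter_by_evalue(open("your_file.dom.tsv"))`, where the file name is a placeholder. Hits that pass only a loose threshold are the ones most likely to be false positives on host genes.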