C-grapes commented 3 years ago

Hi Rengang, I encountered several problems when using TEsorter. I found many entries labeled "mixture" in my $.rexdb-plant.cls.tsv result file. What kind of TE do they belong to? Using the EDTA result file $.mod.EDTA.TEanno.gff3, I extracted the FASTA sequences of the annotated repeats and used them as the input file for TEsorter. However, many of the TEsorter annotations contradict EDTA. How can I judge which ones are accurate? Also, will overlapping sequences in the input file affect the TEsorter annotation results?

TE                                            Order    Superfamily  Clade    Complete  Strand  Domains
LTR_retrotransposon::Chr01:12217604-12218469  mixture  mixture      unknown  unknown   -       RT|pararetrovirus RH|non-chromo-outgroup
LTR_retrotransposon::Chr01:13239443-13240460  mixture  mixture      unknown  unknown   -       RT|pararetrovirus RH|non-chromo-outgroup
LTR_retrotransposon::Chr01:14185467-14186333  mixture  mixture      unknown  unknown   -       RT|pararetrovirus RH|non-chromo-outgroup
LTR_retrotransposon::Chr01:14434069-14435285  LTR      Gypsy        mixture  no        +       RT|Retand RH|Ogre
LTR_retrotransposon::Chr01:14960982-14962159  mixture  mixture      unknown  unknown   +       RT|pararetrovirus RH|non-chromo-outgroup
LTR_retrotransposon::Chr01:21438967-21440147  mixture  mixture      unknown  unknown   +       RT|pararetrovirus RH|non-chromo-outgroup
LTR_retrotransposon::Chr01:21771486-21772792  mixture  mixture      unknown  unknown   -       RT|pararetrovirus RH|chromo-outgroup
LTR_retrotransposon::Chr01:2440997-2441947    mixture  mixture      unknown  unknown   +       RT|pararetrovirus RH|non-chromo-outgroup
LTR_retrotransposon::Chr01:25935874-25936670  mixture  mixture      unknown  unknown   +       RT|pararetrovirus RH|non-chromo-outgroup
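One way to get an overview of a table like the one above is to tally how often each combination of Order, Superfamily and Domains occurs. The following is a minimal Python sketch, assuming the *.cls.tsv file is tab-separated with the seven columns shown above and an optional header line starting with `TE` or `#TE`; the function name `domain_signatures` is hypothetical and not part of TEsorter.

```python
import csv
import io
from collections import Counter


def domain_signatures(tsv_text):
    """Count (Order, Superfamily, Domains) combinations in cls.tsv-style text.

    Assumes tab-separated columns in the order:
    TE, Order, Superfamily, Clade, Complete, Strand, Domains.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        # Skip blank or short lines and the header line ("TE" or "#TE").
        if len(row) < 7 or row[0].lstrip("#") == "TE":
            continue
        _te, order, superfamily, _clade, _complete, _strand, domains = row[:7]
        counts[(order, superfamily, domains)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "TE\tOrder\tSuperfamily\tClade\tComplete\tStrand\tDomains\n"
        "LTR_retrotransposon::Chr01:12217604-12218469\tmixture\tmixture\tunknown\tunknown\t-\t"
        "RT|pararetrovirus RH|non-chromo-outgroup\n"
        "LTR_retrotransposon::Chr01:14434069-14435285\tLTR\tGypsy\tmixture\tno\t+\t"
        "RT|Retand RH|Ogre\n"
    )
    for signature, n in domain_signatures(sample).most_common():
        print(n, *signature, sep="\t")
```

A signature that recurs at many loci is a hint that the "mixture" call reflects a consistent element structure rather than a one-off artifact, although it does not by itself say what the element is.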

Best wishes! putao

C-grapes commented 3 years ago

The command I use is: TEsorter 1.tesort.fa -db rexdb-plant -p 35

zhangrengang commented 3 years ago

In general, a mixture has different domains that belong to different TE taxa. However, these elements may not be true mixtures, because they are located at multiple loci yet share the same structure (RT|pararetrovirus + RH|non-chromo-outgroup). They may be novel elements whose evolutionary position lies between pararetrovirus and LTR/Gypsy/non-chromo-outgroup, which is not covered by the rexdb-plant database. It is better to evaluate them by phylogenetic analyses (https://github.com/zhangrengang/TEsorter#further-phylogenetic-analyses): do the RT and RH domains each form a similar, distinct clade on the tree?

Regarding the contradictions with EDTA: are these just different names for the same taxa, e.g. DNA/DTA in EDTA = TIR/hAT in TEsorter (for TE nomenclature, see Wicker T, Sabot F, Hua-Van A, et al. A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet., 2007, 8(12): 973–982)? Unfortunately, TE nomenclature has not been unified.

TEsorter expects one element per sequence. If sequences overlap each other, the results may be affected. But why do they overlap?
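To check whether an apparent EDTA/TEsorter disagreement is only a naming difference, the labels can be harmonized before comparison. The following is a minimal Python sketch; only the DNA/DTA = TIR/hAT pair comes from this thread. The other entries are assumptions based on the three-letter codes of Wicker et al. and on the superfamily names I believe TEsorter reports, so verify them against your own output files. The names `EDTA_TO_TESORTER` and `same_taxon` are hypothetical.

```python
# Assumed mapping from EDTA "Class/Superfamily" labels to TEsorter
# "Order/Superfamily" labels. Only the first entry is stated in this thread;
# the rest are assumptions to be checked against real output.
EDTA_TO_TESORTER = {
    "DNA/DTA": "TIR/hAT",
    "DNA/DTC": "TIR/EnSpm_CACTA",
    "DNA/DTH": "TIR/PIF_Harbinger",
    "DNA/DTM": "TIR/MuDR_Mutator",
    "DNA/DTT": "TIR/Tc1_Mariner",
    "LTR/Gypsy": "LTR/Gypsy",
    "LTR/Copia": "LTR/Copia",
}


def same_taxon(edta_label, tesorter_order, tesorter_superfamily):
    """Return True/False if the labels can be compared, or None if they cannot.

    None means the EDTA label has no entry in the mapping (for example an
    unclassified element), so no agreement call is made.
    """
    expected = EDTA_TO_TESORTER.get(edta_label)
    if expected is None:
        return None
    return expected == f"{tesorter_order}/{tesorter_superfamily}"


if __name__ == "__main__":
    print(same_taxon("DNA/DTA", "TIR", "hAT"))
    print(same_taxon("DNA/DTA", "LTR", "Gypsy"))
    print(same_taxon("Unknown", "LTR", "Gypsy"))
```

Returning `None` for unmapped labels keeps "cannot compare" separate from "disagree", which matters when many EDTA entries are unclassified.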

oushujun commented 3 years ago

Both structure-based and library-based TE annotation can produce misannotations; how many depends on how well we understand TE structures and on the ability to distinguish real elements from false predictions. You may want to check out the EDTA --evaluate 1 option and the EDTA paper to learn more about annotation consistency. The other reason is nested insertion. If TE A is inserted into TE B, you may find the HMM profiles of both TE A and TE B in the sequence of TE B, and TEsorter may then call this a mixture. For this reason, I don't think TEsorter is suitable for reclassifying fragmented TEs in a genome.

C-grapes commented 3 years ago

@oushujun @zhangrengang Thank you for your patience. To further identify the unknown sequences from EDTA, I used TEsorter to re-annotate the sequences in the $genome.mod.EDTA.TEanno.gff file, which contains many overlapping sequences. I saw a related discussion at https://github.com/oushujun/EDTA/issues/174. Following the method mentioned there, I extracted the Unknown and LTR/unknown sequences from the $.mod.EDTA.TEanno.split.gff3 file, which has only a few overlapping sequences, and ran them through TEsorter for annotation. Is this approach feasible?
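To see how many input sequences overlap before running TEsorter, the coordinates embedded in the sequence IDs can be compared directly. The following is a minimal Python sketch, assuming IDs of the form `name::chrom:start-end` as in the table earlier in this thread, and treating coordinates as half-open so that intervals that merely touch are not reported. If your coordinates are 1-based and inclusive, change `<` to `<=`. The function names are hypothetical.

```python
import re

# Optional "name::" prefix, then chrom:start-end (format assumed from the IDs above).
ID_PATTERN = re.compile(r"^(?:.*::)?(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_id(seq_id):
    """Split an ID such as 'LTR_retrotransposon::Chr01:100-200' into (chrom, start, end)."""
    match = ID_PATTERN.match(seq_id)
    if match is None:
        raise ValueError(f"unrecognized sequence ID: {seq_id!r}")
    return match["chrom"], int(match["start"]), int(match["end"])


def find_overlaps(seq_ids):
    """Return (earlier_id, later_id) pairs where the later interval starts inside
    the interval that currently extends furthest to the right on that chromosome."""
    parsed = sorted(parse_id(s) + (s,) for s in seq_ids)
    overlaps = []
    current = None  # (chrom, end, seq_id) of the furthest-reaching interval so far
    for chrom, start, end, seq_id in parsed:
        if current is not None and current[0] == chrom and start < current[1]:
            overlaps.append((current[2], seq_id))
        if current is None or current[0] != chrom or end > current[1]:
            current = (chrom, end, seq_id)
    return overlaps


if __name__ == "__main__":
    ids = [
        "LTR_retrotransposon::Chr01:100-200",
        "LTR_retrotransposon::Chr01:150-300",
        "LTR_retrotransposon::Chr01:400-500",
    ]
    for earlier, later in find_overlaps(ids):
        print(earlier, "overlaps", later)
```

The sweep reports each overlapping sequence once, paired with the interval it falls into, rather than listing every pairwise overlap; that is enough to decide which sequences to drop or merge before classification.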

oushujun commented 3 years ago

If you want to further classify the unknowns, you may do so for the sequences in the TElib.fa file that EDTA generated, then supply the new library via --curatedlib to rerun the final and annotation steps of EDTA.

C-grapes commented 3 years ago

@oushujun Thank you! I will try it now. I still have some doubts about the above discussion. You mentioned that TEsorter is not suitable for reclassifying fragmented TEs in the genome. Does this mean that simply running the TEs that EDTA could not classify through TEsorter is not a good method?

C-grapes commented 3 years ago

I am so sorry, I misunderstood what you meant, and now I understand. Thank you for your patience!

zhangrengang / TEsorter

What kind of transposon does the mixture belong to #26
