zhangrengang / TEsorter

TEsorter: an accurate and fast method to classify LTR-retrotransposons in plant genomes
https://doi.org/10.1093/hr/uhac017
GNU General Public License v3.0

GYDB database classification #13

Closed OluchiAroh closed 3 years ago

OluchiAroh commented 4 years ago

Hi,

Thanks for this awesome tool. I am trying to annotate the genome of an invertebrate. I used LTR_retriever to identify intact elements, and I am now trying to use TEsorter to classify them into families. Because I am working with an invertebrate, I used the gydb database bundled with the software. However, the output classified most of the elements as a_clade and b_clade. I searched the GyDB database to figure out what these names mean, but found no matching results.

Do you have any idea what a_clade and b_clade mean?

zhangrengang commented 4 years ago

The names (a_clade and b_clade) were derived from HMM profiles of GyDB, including:

AP_a_clade.hmm
AP_b_clade.hmm
GAG_a_clade.hmm
GAG_b_clade.hmm
INT_a_clade.hmm
INT_b_clade.hmm
RNaseH_a_clade.hmm
RNaseH_b_clade.hmm
RT_a_clade.hmm
RT_b_clade.hmm

The information on these clades can be found in database/GyDB2.hmm.info:

Element   Genbank_accession                Clade    Cluster_or_genus  Branch_or_class  Family     System             Host                           URL
Cer4      NC_003284(nuc17635677-17630597)  B-clade  Mag               Branch_2         Ty3/Gypsy  LTR_retroelements  Caenorhabditis_elegans         http://www.gydb.org/index.php/Element:Cer4
Cer5      U88169(nuc9519-14600)            B-clade  Mag               Branch_2         Ty3/Gypsy  LTR_retroelements  Caenorhabditis_elegans         http://www.gydb.org/index.php/Element:Cer5
Cer6      Z38112(nuc5690-591)              B-clade  Mag               Branch_2         Ty3/Gypsy  LTR_retroelements  Caenorhabditis_elegans         http://www.gydb.org/index.php/Element:Cer6
CFG1      124055237                        A-clade  Mag               Branch_2         Ty3/Gypsy  LTR_retroelements  Chlamys_farresi                http://www.gydb.org/index.php/Element:CFG1
DRM       46848209                         A-clade  Mag               Branch_2         Ty3/Gypsy  LTR_retroelements  Danio_rerio                    http://www.gydb.org/index.php/Element:DRM
Gulliver  AF243513                         B-clade  Mag               Branch_2         Ty3/Gypsy  LTR_retroelements  Schistosoma_japonicum          http://www.gydb.org/index.php/Element:Gulliver
Hydra2-1  204647056                        A-clade  Mag               Branch_2         Ty3/Gypsy  LTR_retroelements  Hydra_magnipapillata           http://www.gydb.org/index.php/Element:Hydra2-1
Mag       X17219                           A-clade  Mag               Branch_2         Ty3/Gypsy  LTR_retroelements  Bombyx_mori                    http://www.gydb.org/index.php/Element:Mag
SPM       115913323                        A-clade  Mag               Branch_2         Ty3/Gypsy  LTR_retroelements  Strongylocentrotus_purpuratus  http://www.gydb.org/index.php/Element:SPM

Each of these clades contains a few elements, and each element has a URL with details. If you want to classify your sequences at the element level, you may have to build a phylogenetic tree.
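A minimal sketch of looking up which reference elements belong to a clade, using a few rows copied from the excerpt above. It assumes the info file is tab-separated with the header shown, and that `A-clade` in the file corresponds to `a_clade` in the HMM names; the helper names are hypothetical and are not part of TEsorter.

```python
import csv
import io

# Four rows copied from the GyDB2.hmm.info excerpt, written with explicit tabs.
SAMPLE_INFO = (
    "Element\tGenbank_accession\tClade\tCluster_or_genus\tBranch_or_class\t"
    "Family\tSystem\tHost\tURL\n"
    "Cer4\tNC_003284(nuc17635677-17630597)\tB-clade\tMag\tBranch_2\t"
    "Ty3/Gypsy\tLTR_retroelements\tCaenorhabditis_elegans\t"
    "http://www.gydb.org/index.php/Element:Cer4\n"
    "Gulliver\tAF243513\tB-clade\tMag\tBranch_2\t"
    "Ty3/Gypsy\tLTR_retroelements\tSchistosoma_japonicum\t"
    "http://www.gydb.org/index.php/Element:Gulliver\n"
    "CFG1\t124055237\tA-clade\tMag\tBranch_2\t"
    "Ty3/Gypsy\tLTR_retroelements\tChlamys_farresi\t"
    "http://www.gydb.org/index.php/Element:CFG1\n"
    "Mag\tX17219\tA-clade\tMag\tBranch_2\t"
    "Ty3/Gypsy\tLTR_retroelements\tBombyx_mori\t"
    "http://www.gydb.org/index.php/Element:Mag\n"
)


def normalize_clade(name):
    # Assumption: 'A-clade' (info file) and 'a_clade' (HMM name) are the same clade.
    return name.strip().lower().replace("-", "_")


def elements_by_clade(handle):
    """Map each normalized clade name to a list of (element, host) pairs."""
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        clade = normalize_clade(row["Clade"])
        table.setdefault(clade, []).append((row["Element"], row["Host"]))
    return table


clades = elements_by_clade(io.StringIO(SAMPLE_INFO))
for element, host in clades["a_clade"]:
    print(element, host)
```

To run this on the real file, pass an open file handle for database/GyDB2.hmm.info instead of the `io.StringIO` sample. The real file may contain rows or columns that this sketch does not anticipate.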

OluchiAroh commented 4 years ago

Thanks so much for your help!

OluchiAroh commented 4 years ago

Hi,

Sorry to bother you again, but I have another question. Below is a line of output from the .tsv file. The protein domains match three different HMM profiles, yet the retrotransposon is classified as gmr1. I would love to know the basis for this call, since there are three possible classifications for this retrotransposon.

scaffold_2380:8273..15089|Gypsy LTR Gypsy gmr1 no + RT|reina RNaseH|gmr1 INT|osvaldo

Thanks!

zhangrengang commented 4 years ago

Are there other gmr1 domains in addition to RNaseH|gmr1? If not, you can send me the sequence of scaffold_2380:8273..15089|Gypsy so that I can debug it.

OluchiAroh commented 4 years ago

There is no other gmr1 domain output.

Below is the sequence

scaffold_2380:8273..15089|Gypsy TGTAGCAGGGTGATACCATGTTAATCGCTAGGCGTCACAACCAGCTAATTAGTATTTTTATTTTGAACAGCTGACGTACCAGCCTCGTCCTGGGGGAACGCGAACGCCGATTTCGACAAATGCGTAGCCTAAATTGAGGTCGAAAATGATATAAAGCGGCGTGCTACATTTTGATGTATTTTACTATCAACTGATAGACGTTCGACACTCTGCCGCCTCGCCTTTAGCCTCGCTTTCCGCTTTTCGCCTTTCGCCTTATAGGCGTATTAGTCTCCGTTCTTCGTATTTGTAATAAAATATTACAACCGTACCCAAATCGTGTGGTATAACTGATTGACGTTAGACGCCCTACCACGATGGTGTCAGAAGTGGGATGTGATGACGCGACGAGACCACGAAGACGAGGCGAGGTGGCGGAGACGTAGAGTCGAGTGAGGAGCCTGGACCCTGATGACGACCTCGAGCGACTCAAGAGTGACATTTCCGGTTTAACACTATTCGGCCGCGACGATGACTAGCCCCGACATGCGACGATCGAGCCACGTGACGTGCCGTGACGCTAAGCGACTTAGATCCTGACTGCATTGACAAACGATTGAATTTTGAATTGATACCCATTCACTGTATAATTAGGAACTATATTGTGTATCCTGTCTGTGTAGCTATAGCAGCATTATAGAGATAGCCAGTATCAGAATATTTTTTTAGAGGGACTTTTTGACATTTGCGTGTTAGTTAGGCATGTTTTCTGAGTAAGATGTATCTGTAGTTGTGAGGGTGCAGAGTAGTCCTGTAGATACGAATGTTATTCATGGGTATAACGGGTATATAACTTTTGTTCCGTATTGGGAATTTGGTATACATCTGGGTTAATACTTAGTAGGGCAGTCGCAACTAGCAAGCAGACATGAGTATTGACGGGGCCACTGGAATTGTGCCACCCGACACGGTTCAGGGTGACTTGCAGGACCAACTGGTCACACTGCGAGAGGAGCTGTCACAGCTGCGGCTCGCAACCAATGACCGTCCACGCGTTGCGTACGTACCGCGCGAGCGGAAGGTGAGACCTTTCAGTGGACGTGGTGAGGTGACAGTGCAGGATTTTCTTGCAGATGTCGAGAGTCTATTTAGGGCACGAGCGATGTCTGAAGACGAGCAGTGTGACTTCATGACGTCACACCTCGAGGGTGATGCCAGACAGGAGATCCGGTTCCGGCCTATGGAGGAGGTGCGCACCGCGACGTGTATCAAGCGGATTCTGCTGGAAGTGTTTGGCGAGAAGAGGTCGGTATCGCAGTTGCAGGAACTCTTCTATCTTCGTAAACAGGCTGACGGGGAGGGGATCCGTAGTTTTGCAGGTGCTCTGCAGCAACTGATGGAAGCGCTTTTGCGCAAGGACGGACACGCCGTCAACCAGCCTGACCGTGTGATCGCCGAACACTTCGCAGAGGGCCTGAAAGACCGCGTACTACGACGCGAGGTGAAGCAACAGCACCGGCTGCGACCGCAGATGAGCTTCGTGGAACTGAGGGAAGAGGCCATTCAGTGGTCCGAGGAGGAAGAGAGGCCGCCGCAAGCCAGTCGACGACAGACTGCTGGTGCATTTGAGCTGGTCGCCGACGCAGGGGAAGCAGGTAACGTTACAGATCTGCAACGAGTGCTCTCAGAGTTAGAGAAACAGCAGCACGTCATCGAGGCGCTGACTGAGCGTGTGGAGAGGATGGAAGGTGTTGCGACAGGTAGGACGGGTGGACGACGGTCCAACACTCCACAGAGGCCCTGGTCGGACCAGCGACCCGCATCTGCGCCGCAGAACTTGCAGACACGACGGCCATTTCAGGAGCGACGCGCATTTCAGGGACCCTCGATGCAGTACACACCAGACGGGCAGCCAATTTGTTACCGTTGTGGTGGAGTTGGCCACCTGGGCAGGCAGTGTCGCAATGAGGCGAACACTGGGCCACGCCG
GCCCCAGACGGAGGCTTGTCCACCGGGTGTGCAGACGACGGCCCCTACGGCACACAGACGAGAGCAGCCAGACGATGTTAGAGAGAACACGCATCCTTTAAACTGACGCCGCTCGTTACCAAGAGTCGGGTAGACGAGCTAGTGCGTGCAGGCTCACACGCTGCCGCGCCAACTGACGCGACCGTTGGTACTTGCCCTACGGCAACTGTCTTGATGGGAGGTGTGGAGGTCAATTGCTTACTTGACACAGGTTCCCAAGTAAGTACTATCACTGAGTCTTTCTTTCGTAAGTACATTGCTCGAAACGACCGAGAGCTGGCGTCTTGCAGTTGGTTGAAGCTCACCGCTGCAAACGGGTTAGATATACCGTACGTTGGGTACGTCGAGTTGGACGTGACGACCATGGGTCAGACCATACCTGGCAGAGGTATCCTGGTTGTCTGCGATTCCCCTGATACAGATCAGTGCGCGCATAAGAAGAGGTGTCCTGGCATTCTGGGAATGAATGTCATGGGAGAGCTTGACCTTGCTCTATGGAAACGATCCCTGAACACAGGTAGTTACAGAGCAGAGTCGGTGATGGCGTCGGTCATAGACGCGAAGCAACGGGACCGGAGAGGATTTACTCGTGTGGCAGGGTGGAGCGCAGTGCGAATCCCAGCGCAGTCTTGTGCCACGTTGATGGTTTCCGGTTTGCAGAGACAGGCTGATGCGACATGGGTCGTTGTAGAACCCATTGTGGGGGCACTTCCGGTGGAGGTGTCATTGTGTAGTACTTTAACTAGAGTACACCAGGGTGCTGCACCTGTGCAGGTAGTCAACACCAGCAGAGAAGACGTGTGGCTGAAGCCGAGGATGCGAATAGGTATCTATCAGACCATCGCGCACGTCAGTGAAGAGGACAGAGTATGTTTTGAGCGAGTCAGAGTGTCTGAGGAGTACGTGAGCAGGACTGAGGACCGACCGCCAGGTGTAGTCGACACGCGTGAGGTACTAACCAGTACACTCGATGTCGGGCCACACGTCACCGCGGAGCAGGTTGCTCAACTGAGAGATCTCATCTCGCGGAACGAACGAGCATTCGCCCGAGATGATGACGACCTTGGCTATACAGACTTAGCCACGCACACGATAAGGACCACGACTGAGGTACCCGTACGACAACCTTTCAGAAGGATTCCGCCTACACAGTATCAAGAGGTGAAAGAGCACATCAAGGATCTTCTGAACAGAGGTGTTATCAGAGAGAGCCATTCGGCATGGGCTTCGCCGATCGTCCTGGTCAGGAAGAAAGACGATTCGCTGCGCATGTGCGTGGACTATCGGCTGCTCAATGCGAAGACCCACCGAGACGCATTTCCGCTGCCGCGCATTGACGAGAGTTTTGACGCCATGAGCGGCGCAAACTGGTTTACCACATTGGATCTGGCTTCAGGATACCACCAGATCGCGATGTCGGAGGACGACCGAGAGAAGACGGCCTTCACCACACCTATGGGGCTGTACGAATTCAACCGTATGCCGTTCGGGCTCTGTAACAGTCCCGCGACCTTCCAGAGGTTGATGCAGAGGTGCTTTGGAGATCTGTGTTACCAGACGGTACTCTGTTACCTTGATGACGTTATCATCTTCTCCAAGACGTTCGACAGCCACCTAGCTCAGTTGGAGCAAGTGCTACAGCGACTGCAGAGGATTGGACTGAAGCTGAAGACAAGCAAGTGTCACTTCTTACAGCGGGAGGTGCTATACCTAGGACACCATGTGTCGGCCGACGGCATTGCCACAGATCCGGCAAAGATCGATGCTGTGAAAGACTGGGCCGTGCCGAACACAGTGAAACAGCTTCGCTCCTTTGTAGGATTCGCGTCCTACTACCGTAAGTATGTGCAAGGGTTCAGTATGATTGCAGCGCCGCTACACGATCTAGTGACCAGTGTCAACAAGGAAATGAAGGAGAAGGGACGGTTGAGAGGAGCCTTCAGGGACAGTTGGAGTCCAGA
GTGTGACGGAGCTTTCAAGGTTTTGAAGAACGCGCTCTGCTCAGCTCCAGTTCTCGGGTATGCCCAGTACCAACGACCATTCATAGTGGAGACGGACGCCAGCCACCTGGGTCTCGGCGCTGTCCTGTCTCAAGAACAGGAAGGACGTCGTGTCGTCATCGCGTACGCGAGTCGCAGGTTGCGACCACCCGAACGCCAGTATAGCTCGATGAAGCTAGAGATGCTAGCACTTAAGTGGGCGGTCACCACGAAGTTCCGTGCCGACCTTTACGGCGGAGAGTTCGTCATCTACACTGACAACAACCCGTTGAAGTACCTGAAGACCGCCAAGTTGGGAGCCATCGAGCAACGATGGGCAGCTGAGCTGGCCCCGTTCAACTTCACGATCGAGTATCGAGCTGGACGTTCAAACGGAAACGCAGATGCCTTGAGTCGACAGCAACTCGACTACGAGCCCGGGGCAACCGACTTCGACGAGGAAGACGACGTGCAGCAGATATGCGCAACATATGCATCGTGCACGCTACTCCCGCGAGAACTACGCCAACAGATAGAGAAGAATGACAGAGTACGACAAGAAGTCCAGGTTGAAGTAAAGGAGATAGAGGTACAGTCAACCCAGTCACCTGCCTTCCCAGCGTTGTCACCTGAAGAACTTGCAGAGTTGCAGAGGAAAGACCCAGACATCAGCAAGTTCCTGATGTACTGGAGGGTCGGCGAGAAGCCGAAGAAGTGTGCACAGAAGGGTGTGACACAACTGCTGAGACAGTGGCACAAGATGAGGGAAGAGAATGGGGTTTTGTACCGCAAGGTGAGAGATCCACAGGGTGGAGAGGTTAAGCAGCTAGTTCTACCGAGTGTGCTGAAGCACCGACTTCTAGAAGCGGTGCACGACAAGATGGGTCACCAGGGTGGTGATAGGACGAGCAGCCTGGCGTGGCACCGTTGTTACTGGCCAGGTATGCACAGAGAGATTGATGACTACGTGAAGAACTGTGCGAGGTGCACACTAGCGAAGATGCCGCGACAGAAGACACACGCGCCGATGGGACACCTTCTAGCATCGCGTCCCTTAGAGGTCTTGGCGGTGGATTACACACAGTTGGAGAAGGCAGCAGATGGACGTGAGAGTGTTCTGGTACTGACAGACGTGTTCACCAAATTCGCTTGGGCTGTACCAGCACGTGACCAGAAGGCGAACACGACGGCGCGACTATTGGTGAGAGAGTGGTTTCAGCGTTACGGAGTTCCACAGAGGATCCATTCTGACAGAGGCCGGAATTTCGAGAGTGCGACTATTGGAGAGCTGTGCAAGCTGTACGGCATCGAGAAGTCAAGGACTACGGCATACCATCCAGAGGGGAACGGGCAGTGTGAAAGGTTTAACCGGACTTTACACGACTTGTTAAGGACCCTTCCGCCGGCACAGAAGCGAAGATGGACAGAGCATCTGCCCGAGCTTTGCAACGCCTACAATGCGACGCCACATGCGTCGACGGGATATTCGCCACACTACTTGTTGTTTGGGCGCGATCCCTGGCTGCCGATAGACGCATTCCTAGGGCAGGAGTTGAGGGAGGACGATTGTGACAGCTCGATCGACGGATGGCTTGCCACACATCAGAAGAGACTGCAAGACGCGTATTCTCACGCAAGCGAACGACTCGGCGCTGCAGCAGAAAGTCGTAACCTGCCTCACGAGAGATTGAAGATGAGTGAGCCGCTGACACTTGGTTGCAGGGTTTACCAACGCAACAGACTACCAGGACGCAACAAAATGCAGGATGCATGGAGCCCGGAGGTGTACAAGGTAGTACGGATACCCACGGAGCCTGGTGGACCTTACCACATCGAGAGGGCAGACGGGTTCGGCCGACCTCAGACAGTCTGCAGGAAGAACATCCAGCCAGCGGGTCCAGGAGCGATTGTTCCGACTCCAGTGGCAGTCACTCCAGAGGTCAAAGACGACGAACGCTCTGATGAGAGCTCTGACGACGAGG
AGGAGGAGGAGGATTCCCTTGTCGCGATCGTTCCAGTGGGTCGTCCGACGATAGCTCCAGCTTCAGTGGTCCAGCCGATTGCTGAGGAGGTTTTGACGGGAGAGGAGATTCACGATGAGCCTGCCGTGATACCGGTATCGATACCAGTACCCGCACCTAGGAGATCGGGCAGGATGCGGAAGAAGGTGGTACGCTTCGAAGAAGTCGACACGCAAGTACCTACACCGGCCCCGCGGCGATGCAACGTCATTCCTGAACCGTGTGGGGGGGACCAATTGACCCCCATCATGACTAGTTTGTTTGTAGCGATGGCTCGCAGTGGTGTGGTCTTTTCGGGAAGCATTAACCGGGACGGTTAAGAGTTAGTTGGGGAGGACCCAGGAGGAATTTTTAGAATTTAGATACGCAGGAATTTGGGAGAACCCTAGAGTTCCAGCGGGATTTTTTTTTTTTTATATAGTAACCGGGACGGTTAAGATTTAGTTGGGGAGGATGTAGCAGGGTGATACCATGTTAATCGCTAGGCGTCACCACCAGCTAATTAGTATTTTTATTTTGAACAGCTGACGTACCAGCCTCGTCCTGGGGGAACGCGAACGCCGATTTCGACAAATGCGTAGCCTAAATTGAGGTCGAAAATGATATAAAGCGGCGTGCTACATTTTGATGTATTTTACTATCAACTGATAGACGTTCGACACTCTGCCGCCTCGCCTTTAGCCTCGCTTTCCGCTTTTCGCCTTTCGCCTTATAGGCGTATTAGTCTCCGTTCTTCGTATTTGTAATAAAATATTACAAACGTACCCAAATCGTGTGGTGTAACTGATTGACGTTAGACGCCCTACCACA

zhangrengang commented 4 years ago

There was a bug in how mixed elements are identified with gydb. I have fixed it. The output is now:

scaffold_2380:8273..15089|Gypsy LTR     Gypsy   mixture no      +       RT|reina RNaseH|gmr1 INT|osvaldo

OluchiAroh commented 4 years ago

Do I have to rerun TEsorter to see this change?

zhangrengang commented 4 years ago

Yes.

OluchiAroh commented 4 years ago

Thanks for clearing that up. I still have a question about how this classification is done. Below are two more instances (after the bug fix). Each has four domains assigned to three different clades, and they have been classified as 17_6 and a_clade respectively. What is the basis for this classification?

scaffold_166:133076..137407|Gypsy LTR Gypsy 17_6 no + AP|tat RT|17_6 RNaseH|17_6 INT|b_clade
scaffold_443:208989..213730|Gypsy LTR Gypsy a_clade no + GAG|a_clade RT|cer2-3 RNaseH|17_6 INT|a_clade

zhangrengang commented 4 years ago

These use the most common clade among the domains: 17_6 and a_clade respectively. We assume that mixtures are rare. The two elements above are probably not mixtures, but are somewhat diverged from 17_6 and a_clade respectively. The relationship should be clearer on a phylogenetic tree.
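The rule described here can be sketched as a simple vote over the per-domain clade labels. This is an illustration consistent with the three output lines in this thread, not TEsorter's actual code. In particular, treating a tie for the top count as `mixture` is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def parse_domains(field):
    """Split 'RT|reina RNaseH|gmr1' into [('RT', 'reina'), ('RNaseH', 'gmr1')]."""
    return [tuple(token.split("|", 1)) for token in field.split()]


def call_clade(domains):
    """Return the most common clade, or 'mixture' when no clade wins outright.

    Assumption: a tie for the highest count is reported as 'mixture'.
    """
    ranked = Counter(clade for _, clade in domains).most_common()
    if not ranked:
        return "unknown"
    top_clade, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return "mixture"
    return top_clade


examples = {
    "scaffold_166:133076..137407|Gypsy": "AP|tat RT|17_6 RNaseH|17_6 INT|b_clade",
    "scaffold_443:208989..213730|Gypsy": "GAG|a_clade RT|cer2-3 RNaseH|17_6 INT|a_clade",
    "scaffold_2380:8273..15089|Gypsy": "RT|reina RNaseH|gmr1 INT|osvaldo",
}
for element, field in examples.items():
    print(element, call_clade(parse_domains(field)))
```

Under these assumptions the first two elements get 17_6 and a_clade because each label appears twice among four domains, while the third has three different labels with one vote each and is reported as a mixture.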

OluchiAroh commented 4 years ago

About the tree: can you please explain what data I can use to build it?

I tried extracting one domain and combining it with the same domain from the GyDB database; however, I'm not sure whether this will show the relationship and possible classification of these elements.

zhangrengang commented 4 years ago

With that method, you can use one or more domains to build a tree, or build multiple trees. Elements of the same clade should cluster together, and that clustering should be stable across the trees.
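One way to record the "stable across multiple trees" check is sketched below. It assumes you have already inspected each domain tree and noted, per element, which reference clade it groups with. The placements in the example are made up for illustration and are not real results, and the function name is hypothetical.

```python
def consistent_calls(calls_by_tree):
    """Summarize per-tree placements.

    calls_by_tree: {tree_name: {element_id: reference_clade}}
    Returns {element_id: (clade or 'unstable', number_of_trees_seen)}.
    """
    seen = {}
    for calls in calls_by_tree.values():
        for element, clade in calls.items():
            seen.setdefault(element, []).append(clade)

    summary = {}
    for element, clades in seen.items():
        # An element is stable only if every tree that contains it agrees.
        label = clades[0] if len(set(clades)) == 1 else "unstable"
        summary[element] = (label, len(clades))
    return summary


# Hypothetical placements read off three single-domain trees.
placements = {
    "RT": {"scaffold_166:133076..137407": "17_6", "scaffold_443:208989..213730": "a_clade"},
    "RNaseH": {"scaffold_166:133076..137407": "17_6", "scaffold_443:208989..213730": "17_6"},
    "INT": {"scaffold_166:133076..137407": "17_6"},
}
for element, (label, n_trees) in consistent_calls(placements).items():
    print(element, label, n_trees)
```

An element missing from a tree (for example because it lacks that domain) is simply not counted for that tree, so the number of trees is reported alongside the label.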

OluchiAroh commented 4 years ago

Any idea how I can classify these clusters to family level? That is, how can I name these clusters based on the existing classification?

zhangrengang commented 4 years ago

I have no further ideas.

OluchiAroh commented 4 years ago

Hi, sorry to disturb you again. I just found out that of the 210 intact LTR elements from LTR_retriever that I used as input for TEsorter, only 179 were classified. Any idea why the rest weren't classified?

I am really confused.

zhangrengang commented 4 years ago

They might have lost their protein domains, or GyDB might not cover their domains.
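To see which input elements are missing from the classification, the IDs in the input FASTA can be compared with the first column of the classification .tsv. The sketch below assumes that the first whitespace-delimited token of each FASTA header matches the first column of the .tsv exactly and that header lines in the .tsv start with `#`. The third ID in the sample is hypothetical.

```python
def fasta_ids(lines):
    """Return the first token of every FASTA header line, in input order."""
    return [line[1:].split()[0] for line in lines if line.startswith(">") and line[1:].strip()]


def classified_ids(lines):
    """Return the set of IDs in the first column of a classification table."""
    ids = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the assumed '#' header line
        ids.add(line.split()[0])
    return ids


def unclassified(fasta_lines, tsv_lines):
    done = classified_ids(tsv_lines)
    return [seq_id for seq_id in fasta_ids(fasta_lines) if seq_id not in done]


# Small in-memory stand-ins for the input FASTA and the .tsv output.
sample_fasta = [
    ">scaffold_166:133076..137407|Gypsy",
    "ACGT",
    ">scaffold_443:208989..213730|Gypsy",
    "ACGT",
    ">scaffold_9:100..5100|unknown",  # hypothetical ID, for illustration only
    "ACGT",
]
sample_tsv = [
    "#TE\tOrder\tSuperfamily\tClade\tComplete\tStrand\tDomains",
    "scaffold_166:133076..137407|Gypsy\tLTR\tGypsy\t17_6\tno\t+\tAP|tat RT|17_6 RNaseH|17_6 INT|b_clade",
    "scaffold_443:208989..213730|Gypsy\tLTR\tGypsy\ta_clade\tno\t+\tGAG|a_clade RT|cer2-3 RNaseH|17_6 INT|a_clade",
]
print(unclassified(sample_fasta, sample_tsv))
```

With real data, read both files with `open(...)` and pass the file handles in place of the sample lists. The resulting list of IDs can then be checked by hand for missing or diverged protein domains.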