Closed OluchiAroh closed 3 years ago
The names (a_clade and b_clade) were derived from HMM profiles of GyDB, including:
AP_a_clade.hmm
AP_b_clade.hmm
GAG_a_clade.hmm
GAG_b_clade.hmm
INT_a_clade.hmm
INT_b_clade.hmm
RNaseH_a_clade.hmm
RNaseH_b_clade.hmm
RT_a_clade.hmm
RT_b_clade.hmm
The information on these clades can be found in database/GyDB2.hmm.info:
Element Genbank_accession Clade Cluster_or_genus Branch_or_class Family System Host URL
Cer4 NC_003284(nuc17635677-17630597) B-clade Mag Branch_2 Ty3/Gypsy LTR_retroelements Caenorhabditis_elegans http://www.gydb.org/index.php/Element:Cer4
Cer5 U88169(nuc9519-14600) B-clade Mag Branch_2 Ty3/Gypsy LTR_retroelements Caenorhabditis_elegans http://www.gydb.org/index.php/Element:Cer5
Cer6 Z38112(nuc5690-591) B-clade Mag Branch_2 Ty3/Gypsy LTR_retroelements Caenorhabditis_elegans http://www.gydb.org/index.php/Element:Cer6
CFG1 124055237 A-clade Mag Branch_2 Ty3/Gypsy LTR_retroelements Chlamys_farresi http://www.gydb.org/index.php/Element:CFG1
DRM 46848209 A-clade Mag Branch_2 Ty3/Gypsy LTR_retroelements Danio_rerio http://www.gydb.org/index.php/Element:DRM
Gulliver AF243513 B-clade Mag Branch_2 Ty3/Gypsy LTR_retroelements Schistosoma_japonicum http://www.gydb.org/index.php/Element:Gulliver
Hydra2-1 204647056 A-clade Mag Branch_2 Ty3/Gypsy LTR_retroelements Hydra_magnipapillata http://www.gydb.org/index.php/Element:Hydra2-1
Mag X17219 A-clade Mag Branch_2 Ty3/Gypsy LTR_retroelements Bombyx_mori http://www.gydb.org/index.php/Element:Mag
SPM 115913323 A-clade Mag Branch_2 Ty3/Gypsy LTR_retroelements Strongylocentrotus_purpuratus http://www.gydb.org/index.php/Element:SPM
Each of these clades includes a few elements, and each element has a URL with more detail. If you want to classify your sequences at the element level, you might have to build a phylogenetic tree.
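To see which reference elements sit behind a_clade and b_clade, the info table can be grouped by its Clade column. The following is a minimal sketch, assuming the file is tab-separated with the header row shown above; the function name `elements_by_clade` and the three-row sample are illustrative only, and in practice you would pass the opened database/GyDB2.hmm.info file instead.

```python
import csv
from collections import defaultdict


def elements_by_clade(lines):
    """Group element names by the Clade column of a GyDB-style info table.

    `lines` is any iterable of text lines (an open file works).
    Assumption: columns are separated by tabs.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row["Clade"]].append(row["Element"])
    return dict(groups)


# Header plus three rows copied from the listing above (tabs assumed).
sample = [
    "Element\tGenbank_accession\tClade\tCluster_or_genus\tBranch_or_class"
    "\tFamily\tSystem\tHost\tURL",
    "Cer4\tNC_003284(nuc17635677-17630597)\tB-clade\tMag\tBranch_2\tTy3/Gypsy"
    "\tLTR_retroelements\tCaenorhabditis_elegans"
    "\thttp://www.gydb.org/index.php/Element:Cer4",
    "DRM\t46848209\tA-clade\tMag\tBranch_2\tTy3/Gypsy\tLTR_retroelements"
    "\tDanio_rerio\thttp://www.gydb.org/index.php/Element:DRM",
    "Mag\tX17219\tA-clade\tMag\tBranch_2\tTy3/Gypsy\tLTR_retroelements"
    "\tBombyx_mori\thttp://www.gydb.org/index.php/Element:Mag",
]

for clade, elements in elements_by_clade(sample).items():
    print(clade, elements)
```

Note that the table writes the clades as A-clade and B-clade, while the HMM file names use a_clade and b_clade.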
Thanks so much for your help!
Hi,
Sorry to bother you again. I have another question. Below is a line of output from the .tsv file. The protein domains come from 3 different hmm files, yet the retrotransposon is classified as gmr1. I would love to know the basis for this conclusion, since there are 3 possible classifications for this particular retrotransposon.
scaffold_2380:8273..15089|Gypsy LTR Gypsy gmr1 no + RT|reina RNaseH|gmr1 INT|osvaldo
Thanks!
Are there any other gmr1 domain(s) in addition to RNaseH|gmr1? If not, you can send me the sequence of scaffold_2380:8273..15089|Gypsy so that I can debug.
There is no other gmr1 domain output.
Below is the sequence
scaffold_2380:8273..15089|Gypsy TGTAGCAGGGTGATACCATGTTAATCGCTAGGCGTCACAACCAGCTAATTAGTATTTTTATTTTGAACAGCTGACGTACCAGCCTCGTCCTGGGGGAACGCGAACGCCGATTTCGACAAATGCGTAGCCTAAATTGAGGTCGAAAATGATATAAAGCGGCGTGCTACATTTTGATGTATTTTACTATCAACTGATAGACGTTCGACACTCTGCCGCCTCGCCTTTAGCCTCGCTTTCCGCTTTTCGCCTTTCGCCTTATAGGCGTATTAGTCTCCGTTCTTCGTATTTGTAATAAAATATTACAACCGTACCCAAATCGTGTGGTATAACTGATTGACGTTAGACGCCCTACCACGATGGTGTCAGAAGTGGGATGTGATGACGCGACGAGACCACGAAGACGAGGCGAGGTGGCGGAGACGTAGAGTCGAGTGAGGAGCCTGGACCCTGATGACGACCTCGAGCGACTCAAGAGTGACATTTCCGGTTTAACACTATTCGGCCGCGACGATGACTAGCCCCGACATGCGACGATCGAGCCACGTGACGTGCCGTGACGCTAAGCGACTTAGATCCTGACTGCATTGACAAACGATTGAATTTTGAATTGATACCCATTCACTGTATAATTAGGAACTATATTGTGTATCCTGTCTGTGTAGCTATAGCAGCATTATAGAGATAGCCAGTATCAGAATATTTTTTTAGAGGGACTTTTTGACATTTGCGTGTTAGTTAGGCATGTTTTCTGAGTAAGATGTATCTGTAGTTGTGAGGGTGCAGAGTAGTCCTGTAGATACGAATGTTATTCATGGGTATAACGGGTATATAACTTTTGTTCCGTATTGGGAATTTGGTATACATCTGGGTTAATACTTAGTAGGGCAGTCGCAACTAGCAAGCAGACATGAGTATTGACGGGGCCACTGGAATTGTGCCACCCGACACGGTTCAGGGTGACTTGCAGGACCAACTGGTCACACTGCGAGAGGAGCTGTCACAGCTGCGGCTCGCAACCAATGACCGTCCACGCGTTGCGTACGTACCGCGCGAGCGGAAGGTGAGACCTTTCAGTGGACGTGGTGAGGTGACAGTGCAGGATTTTCTTGCAGATGTCGAGAGTCTATTTAGGGCACGAGCGATGTCTGAAGACGAGCAGTGTGACTTCATGACGTCACACCTCGAGGGTGATGCCAGACAGGAGATCCGGTTCCGGCCTATGGAGGAGGTGCGCACCGCGACGTGTATCAAGCGGATTCTGCTGGAAGTGTTTGGCGAGAAGAGGTCGGTATCGCAGTTGCAGGAACTCTTCTATCTTCGTAAACAGGCTGACGGGGAGGGGATCCGTAGTTTTGCAGGTGCTCTGCAGCAACTGATGGAAGCGCTTTTGCGCAAGGACGGACACGCCGTCAACCAGCCTGACCGTGTGATCGCCGAACACTTCGCAGAGGGCCTGAAAGACCGCGTACTACGACGCGAGGTGAAGCAACAGCACCGGCTGCGACCGCAGATGAGCTTCGTGGAACTGAGGGAAGAGGCCATTCAGTGGTCCGAGGAGGAAGAGAGGCCGCCGCAAGCCAGTCGACGACAGACTGCTGGTGCATTTGAGCTGGTCGCCGACGCAGGGGAAGCAGGTAACGTTACAGATCTGCAACGAGTGCTCTCAGAGTTAGAGAAACAGCAGCACGTCATCGAGGCGCTGACTGAGCGTGTGGAGAGGATGGAAGGTGTTGCGACAGGTAGGACGGGTGGACGACGGTCCAACACTCCACAGAGGCCCTGGTCGGACCAGCGACCCGCATCTGCGCCGCAGAACTTGCAGACACGACGGCCATTTCAGGAGCGACGCGCATTTCAGGGACCCTCGATGCAGTACACACCAGACGGGCAGCCAATTTGTTACCGTTGTGGTGGAGTTGGCCACCTGGGCAGGCAGTGTCGCAATGAGGCGAACACTGGGCCACGCCG
GCCCCAGACGGAGGCTTGTCCACCGGGTGTGCAGACGACGGCCCCTACGGCACACAGACGAGAGCAGCCAGACGATGTTAGAGAGAACACGCATCCTTTAAACTGACGCCGCTCGTTACCAAGAGTCGGGTAGACGAGCTAGTGCGTGCAGGCTCACACGCTGCCGCGCCAACTGACGCGACCGTTGGTACTTGCCCTACGGCAACTGTCTTGATGGGAGGTGTGGAGGTCAATTGCTTACTTGACACAGGTTCCCAAGTAAGTACTATCACTGAGTCTTTCTTTCGTAAGTACATTGCTCGAAACGACCGAGAGCTGGCGTCTTGCAGTTGGTTGAAGCTCACCGCTGCAAACGGGTTAGATATACCGTACGTTGGGTACGTCGAGTTGGACGTGACGACCATGGGTCAGACCATACCTGGCAGAGGTATCCTGGTTGTCTGCGATTCCCCTGATACAGATCAGTGCGCGCATAAGAAGAGGTGTCCTGGCATTCTGGGAATGAATGTCATGGGAGAGCTTGACCTTGCTCTATGGAAACGATCCCTGAACACAGGTAGTTACAGAGCAGAGTCGGTGATGGCGTCGGTCATAGACGCGAAGCAACGGGACCGGAGAGGATTTACTCGTGTGGCAGGGTGGAGCGCAGTGCGAATCCCAGCGCAGTCTTGTGCCACGTTGATGGTTTCCGGTTTGCAGAGACAGGCTGATGCGACATGGGTCGTTGTAGAACCCATTGTGGGGGCACTTCCGGTGGAGGTGTCATTGTGTAGTACTTTAACTAGAGTACACCAGGGTGCTGCACCTGTGCAGGTAGTCAACACCAGCAGAGAAGACGTGTGGCTGAAGCCGAGGATGCGAATAGGTATCTATCAGACCATCGCGCACGTCAGTGAAGAGGACAGAGTATGTTTTGAGCGAGTCAGAGTGTCTGAGGAGTACGTGAGCAGGACTGAGGACCGACCGCCAGGTGTAGTCGACACGCGTGAGGTACTAACCAGTACACTCGATGTCGGGCCACACGTCACCGCGGAGCAGGTTGCTCAACTGAGAGATCTCATCTCGCGGAACGAACGAGCATTCGCCCGAGATGATGACGACCTTGGCTATACAGACTTAGCCACGCACACGATAAGGACCACGACTGAGGTACCCGTACGACAACCTTTCAGAAGGATTCCGCCTACACAGTATCAAGAGGTGAAAGAGCACATCAAGGATCTTCTGAACAGAGGTGTTATCAGAGAGAGCCATTCGGCATGGGCTTCGCCGATCGTCCTGGTCAGGAAGAAAGACGATTCGCTGCGCATGTGCGTGGACTATCGGCTGCTCAATGCGAAGACCCACCGAGACGCATTTCCGCTGCCGCGCATTGACGAGAGTTTTGACGCCATGAGCGGCGCAAACTGGTTTACCACATTGGATCTGGCTTCAGGATACCACCAGATCGCGATGTCGGAGGACGACCGAGAGAAGACGGCCTTCACCACACCTATGGGGCTGTACGAATTCAACCGTATGCCGTTCGGGCTCTGTAACAGTCCCGCGACCTTCCAGAGGTTGATGCAGAGGTGCTTTGGAGATCTGTGTTACCAGACGGTACTCTGTTACCTTGATGACGTTATCATCTTCTCCAAGACGTTCGACAGCCACCTAGCTCAGTTGGAGCAAGTGCTACAGCGACTGCAGAGGATTGGACTGAAGCTGAAGACAAGCAAGTGTCACTTCTTACAGCGGGAGGTGCTATACCTAGGACACCATGTGTCGGCCGACGGCATTGCCACAGATCCGGCAAAGATCGATGCTGTGAAAGACTGGGCCGTGCCGAACACAGTGAAACAGCTTCGCTCCTTTGTAGGATTCGCGTCCTACTACCGTAAGTATGTGCAAGGGTTCAGTATGATTGCAGCGCCGCTACACGATCTAGTGACCAGTGTCAACAAGGAAATGAAGGAGAAGGGACGGTTGAGAGGAGCCTTCAGGGACAGTTGGAGTCCAGA
GTGTGACGGAGCTTTCAAGGTTTTGAAGAACGCGCTCTGCTCAGCTCCAGTTCTCGGGTATGCCCAGTACCAACGACCATTCATAGTGGAGACGGACGCCAGCCACCTGGGTCTCGGCGCTGTCCTGTCTCAAGAACAGGAAGGACGTCGTGTCGTCATCGCGTACGCGAGTCGCAGGTTGCGACCACCCGAACGCCAGTATAGCTCGATGAAGCTAGAGATGCTAGCACTTAAGTGGGCGGTCACCACGAAGTTCCGTGCCGACCTTTACGGCGGAGAGTTCGTCATCTACACTGACAACAACCCGTTGAAGTACCTGAAGACCGCCAAGTTGGGAGCCATCGAGCAACGATGGGCAGCTGAGCTGGCCCCGTTCAACTTCACGATCGAGTATCGAGCTGGACGTTCAAACGGAAACGCAGATGCCTTGAGTCGACAGCAACTCGACTACGAGCCCGGGGCAACCGACTTCGACGAGGAAGACGACGTGCAGCAGATATGCGCAACATATGCATCGTGCACGCTACTCCCGCGAGAACTACGCCAACAGATAGAGAAGAATGACAGAGTACGACAAGAAGTCCAGGTTGAAGTAAAGGAGATAGAGGTACAGTCAACCCAGTCACCTGCCTTCCCAGCGTTGTCACCTGAAGAACTTGCAGAGTTGCAGAGGAAAGACCCAGACATCAGCAAGTTCCTGATGTACTGGAGGGTCGGCGAGAAGCCGAAGAAGTGTGCACAGAAGGGTGTGACACAACTGCTGAGACAGTGGCACAAGATGAGGGAAGAGAATGGGGTTTTGTACCGCAAGGTGAGAGATCCACAGGGTGGAGAGGTTAAGCAGCTAGTTCTACCGAGTGTGCTGAAGCACCGACTTCTAGAAGCGGTGCACGACAAGATGGGTCACCAGGGTGGTGATAGGACGAGCAGCCTGGCGTGGCACCGTTGTTACTGGCCAGGTATGCACAGAGAGATTGATGACTACGTGAAGAACTGTGCGAGGTGCACACTAGCGAAGATGCCGCGACAGAAGACACACGCGCCGATGGGACACCTTCTAGCATCGCGTCCCTTAGAGGTCTTGGCGGTGGATTACACACAGTTGGAGAAGGCAGCAGATGGACGTGAGAGTGTTCTGGTACTGACAGACGTGTTCACCAAATTCGCTTGGGCTGTACCAGCACGTGACCAGAAGGCGAACACGACGGCGCGACTATTGGTGAGAGAGTGGTTTCAGCGTTACGGAGTTCCACAGAGGATCCATTCTGACAGAGGCCGGAATTTCGAGAGTGCGACTATTGGAGAGCTGTGCAAGCTGTACGGCATCGAGAAGTCAAGGACTACGGCATACCATCCAGAGGGGAACGGGCAGTGTGAAAGGTTTAACCGGACTTTACACGACTTGTTAAGGACCCTTCCGCCGGCACAGAAGCGAAGATGGACAGAGCATCTGCCCGAGCTTTGCAACGCCTACAATGCGACGCCACATGCGTCGACGGGATATTCGCCACACTACTTGTTGTTTGGGCGCGATCCCTGGCTGCCGATAGACGCATTCCTAGGGCAGGAGTTGAGGGAGGACGATTGTGACAGCTCGATCGACGGATGGCTTGCCACACATCAGAAGAGACTGCAAGACGCGTATTCTCACGCAAGCGAACGACTCGGCGCTGCAGCAGAAAGTCGTAACCTGCCTCACGAGAGATTGAAGATGAGTGAGCCGCTGACACTTGGTTGCAGGGTTTACCAACGCAACAGACTACCAGGACGCAACAAAATGCAGGATGCATGGAGCCCGGAGGTGTACAAGGTAGTACGGATACCCACGGAGCCTGGTGGACCTTACCACATCGAGAGGGCAGACGGGTTCGGCCGACCTCAGACAGTCTGCAGGAAGAACATCCAGCCAGCGGGTCCAGGAGCGATTGTTCCGACTCCAGTGGCAGTCACTCCAGAGGTCAAAGACGACGAACGCTCTGATGAGAGCTCTGACGACGAGG
AGGAGGAGGAGGATTCCCTTGTCGCGATCGTTCCAGTGGGTCGTCCGACGATAGCTCCAGCTTCAGTGGTCCAGCCGATTGCTGAGGAGGTTTTGACGGGAGAGGAGATTCACGATGAGCCTGCCGTGATACCGGTATCGATACCAGTACCCGCACCTAGGAGATCGGGCAGGATGCGGAAGAAGGTGGTACGCTTCGAAGAAGTCGACACGCAAGTACCTACACCGGCCCCGCGGCGATGCAACGTCATTCCTGAACCGTGTGGGGGGGACCAATTGACCCCCATCATGACTAGTTTGTTTGTAGCGATGGCTCGCAGTGGTGTGGTCTTTTCGGGAAGCATTAACCGGGACGGTTAAGAGTTAGTTGGGGAGGACCCAGGAGGAATTTTTAGAATTTAGATACGCAGGAATTTGGGAGAACCCTAGAGTTCCAGCGGGATTTTTTTTTTTTTATATAGTAACCGGGACGGTTAAGATTTAGTTGGGGAGGATGTAGCAGGGTGATACCATGTTAATCGCTAGGCGTCACCACCAGCTAATTAGTATTTTTATTTTGAACAGCTGACGTACCAGCCTCGTCCTGGGGGAACGCGAACGCCGATTTCGACAAATGCGTAGCCTAAATTGAGGTCGAAAATGATATAAAGCGGCGTGCTACATTTTGATGTATTTTACTATCAACTGATAGACGTTCGACACTCTGCCGCCTCGCCTTTAGCCTCGCTTTCCGCTTTTCGCCTTTCGCCTTATAGGCGTATTAGTCTCCGTTCTTCGTATTTGTAATAAAATATTACAAACGTACCCAAATCGTGTGGTGTAACTGATTGACGTTAGACGCCCTACCACA
There was a bug in how mixed elements were identified with gydb. I have fixed it. Now the output would be:
scaffold_2380:8273..15089|Gypsy LTR Gypsy mixture no + RT|reina RNaseH|gmr1 INT|osvaldo
Do I have to rerun TEsorter to see this change?
Yes.
Thanks for clearing that up. I still have a question about how this classification is done. Below are two more instances (after the bug fix). Each has domains assigned to 3 different clades, and the elements have been classified as 17_6 and a_clade respectively. What is the basis for this classification?
scaffold_166:133076..137407|Gypsy LTR Gypsy 17_6 no + AP|tat RT|17_6 RNaseH|17_6 INT|b_clade
scaffold_443:208989..213730|Gypsy LTR Gypsy a_clade no + GAG|a_clade RT|cer2-3 RNaseH|17_6 INT|a_clade
Both use the most common clade among their domains: 17_6 and a_clade respectively. We assume that mixtures should be rare. The two elements above are probably not mixtures but are somewhat divergent from 17_6 and a_clade respectively. The relationship should be clearer on a phylogenetic tree.
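The "most common clade" rule can be illustrated with a short sketch. This is not TEsorter's actual code; the function name `assign_clade` and the tie rule (a tie for first place is reported as "mixture") are assumptions chosen to reproduce the three output lines quoted in this thread.

```python
from collections import Counter


def assign_clade(domains):
    """Pick a clade from domain labels such as "RT|17_6".

    Assumed rule: return the clade seen most often; if two or more
    clades tie for first place, report "mixture".
    """
    clades = [label.split("|", 1)[1] for label in domains]
    if not clades:
        return "unknown"
    ranked = Counter(clades).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "mixture"
    return ranked[0][0]


# Domain labels copied from the output lines in this thread.
examples = {
    "scaffold_2380:8273..15089|Gypsy": ["RT|reina", "RNaseH|gmr1", "INT|osvaldo"],
    "scaffold_166:133076..137407|Gypsy": ["AP|tat", "RT|17_6", "RNaseH|17_6", "INT|b_clade"],
    "scaffold_443:208989..213730|Gypsy": ["GAG|a_clade", "RT|cer2-3", "RNaseH|17_6", "INT|a_clade"],
}

for name, labels in examples.items():
    print(name, assign_clade(labels))
```

Under this assumed rule, 17_6 and a_clade each win with two of four domains, while the first element has three different clades with one domain each and so falls back to "mixture".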
About the tree, can you please explain which data I can use to build it?
I tried extracting one domain and combining it with the same domain from the GyDB database; however, I'm not sure whether this will show the relationship and possible classification of these elements.
With that method, you can use one or more domains to build a single tree, or build multiple trees. Elements of the same clade should cluster together and stay stable across the trees.
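As a rough sketch of the data preparation step, the snippet below merges the sequences of one domain from your elements with the same domain from the reference set, and tags each name so the two sources can be told apart on the tree. The function names, the `query_`/`ref_` prefixes, and the toy sequences are all hypothetical placeholders; the merged FASTA would then go to whichever alignment and tree-building programs you prefer.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}; name is the first header word."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def merge_for_tree(query, reference, query_tag="query_", ref_tag="ref_"):
    """Combine two {name: sequence} dicts, prefixing names by source."""
    merged = {query_tag + n: s for n, s in query.items()}
    merged.update({ref_tag + n: s for n, s in reference.items()})
    return merged


def to_fasta(records):
    """Serialise {name: sequence} back to FASTA text."""
    return "".join(f">{n}\n{s}\n" for n, s in records.items())


# Toy placeholder sequences, not real RT domains.
my_rt = read_fasta(">elem1 some description\nMKV\nLLA\n")
gydb_rt = read_fasta(">Mag\nMKVLIA\n>Cer4\nMRVLLA\n")
print(to_fasta(merge_for_tree(my_rt, gydb_rt)))
```

Repeating this per domain (RT, INT, RNaseH, and so on) gives one input file per tree, which makes it easy to check whether an element lands in the same cluster each time.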
Any idea how I can classify these clusters at the family level? That is, how can I name the clusters based on the existing classification?
I have no further ideas.
Hi, sorry to disturb you again. I just found out that of the 210 intact LTR elements from LTR_retriever that I used as input for TEsorter, only 179 were classified. Any idea why the rest weren't classified?
I am really confused.
They might have lost their protein domains, or GyDB might not cover their domains.
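To list exactly which input elements were left out, the sequence IDs in the input FASTA can be compared with the first column of the classification .tsv. This is a minimal sketch, assuming the .tsv is tab-separated, that its first column holds the sequence ID as in the output lines quoted in this thread, and that any header line starts with "#"; the function names and the second sample ID are hypothetical.

```python
def fasta_ids(lines):
    """Return sequence IDs (first word after '>') in input order."""
    return [line[1:].split()[0] for line in lines if line.startswith(">")]


def classified_ids(tsv_lines):
    """Return the set of IDs in the first tab-separated column.

    Assumption: blank lines and lines starting with '#' carry no record.
    """
    ids = set()
    for line in tsv_lines:
        if not line.strip() or line.startswith("#"):
            continue
        ids.add(line.rstrip("\n").split("\t")[0])
    return ids


def unclassified(fasta_lines, tsv_lines):
    """IDs present in the FASTA input but absent from the .tsv."""
    done = classified_ids(tsv_lines)
    return [seq_id for seq_id in fasta_ids(fasta_lines) if seq_id not in done]


# In practice, pass open file handles; small in-memory stand-ins here.
fasta = [
    ">scaffold_2380:8273..15089|Gypsy\n",
    "TGTAGCAGGG\n",
    ">hypothetical_element_without_domains\n",
    "ACGTACGT\n",
]
tsv = [
    "#TE\tOrder\tSuperfamily\tClade\tComplete\tStrand\tDomains\n",
    "scaffold_2380:8273..15089|Gypsy\tLTR\tGypsy\tmixture\tno\t+\t"
    "RT|reina RNaseH|gmr1 INT|osvaldo\n",
]
print(unclassified(fasta, tsv))
```

The resulting IDs are the candidates to inspect for degraded or missing domains, or for domains that the chosen database does not cover.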
Hi,
Thanks for this awesome tool. I am trying to annotate the genome of an invertebrate. I used LTR_retriever to identify intact elements, and I am currently trying to use TEsorter to classify them into families. I used the gydb database bundled with the software, since I am working with an invertebrate. However, the output classified most of the elements as a_clade and b_clade. I have searched the gydb database to figure out what these mean, but there weren't any matching results.
Do you have any idea what a_clade and b_clade mean?