How to extract the CDS sequences of all LTRs?

Wenwen012345 commented 2 years ago

Hello, @zhangrengang

I am trying to extract the CDS sequences of all LTR retrotransposons, but I don't know how to do it. I am following https://onlinelibrary.wiley.com/doi/10.1111/jse.12850, but the method described there is not very detailed. The original text reads as follows: "2.6 Syntenic LTR retrotransposons analysis We firstly extracted the coding sequences of all LTR retrotransposons from three chromosome-level genomes, PN40024, V. ripara and V. amurensis genomes. Then the coding sequences were translated into amino acid sequences by using TBtoolsv1.098 (Chen et al., 2020)....."

I plan to use the Python version of MCscan for multi-species LTR collinearity analysis, which takes CDS files as input (see https://github.com/tanghaibao/jcvi/wiki/MCscan-(Python-version), e.g. "grape.bed grape.cds peach.bed peach.cds"). This process also requires a BED file. However, the dom.gff3 file generated by TEsorter does not seem to cover the full CDS of each element; it contains only the positions of individual domains, and these do not add up to a complete CDS region. Take the following example:

CM030788.1 TEsorter CDS 10392696 10393004 48.1 + 1 ID=CM030788.1:10391873..10397114|Copia|Class_I/LTR/Ty1_copia/Ale:Ty1-GAG;gene=GAG;clade=Ale;evalue=1.3e-14;coverage=100.0;probability=0.87
CM030788.1 TEsorter CDS 10393602 10393808 79.2 + 1 ID=CM030788.1:10391873..10397114|Copia|Class_I/LTR/Ty1_copia/Ale:Ty1-PROT;gene=PROT;clade=Ale;evalue=2.7e-24;coverage=100.0;probability=0.99
CM030788.1 TEsorter CDS 10394043 10394627 274.0 + 1 ID=CM030788.1:10391873..10397114|Copia|Class_I/LTR/Ty1_copia/Ale:Ty1-INT;gene=INT;clade=Ale;evalue=7e-84;coverage=100.0;probability=0.98
CM030788.1 TEsorter CDS 10395538 10396209 346.8 + 2 ID=CM030788.1:10391873..10397114|Copia|Class_I/LTR/Ty1_copia/Ale:Ty1-RT;gene=RT;clade=Ale;evalue=6.5e-106;coverage=84.0;probability=0.98
CM030788.1 TEsorter CDS 10396473 10396850 191.8 + 1 ID=CM030788.1:10391873..10397114|Copia|Class_I/LTR/Ty1_copia/Ale:Ty1-RH;gene=RH;clade=Ale;evalue=4.4e-59;coverage=100.0;probability=0.99

Each record gives only the position of a single domain, so even when the domains are pieced together the CDS sequence is not complete. Neither the dom.faa nor the cls.lib file seems to contain the complete protein sequence translated from the CDS, even after concatenation. So I don't know what to do. I would appreciate your advice. Sorry, I'm new to bioinformatics, and I was hoping you could give me some pointers.

zhangrengang commented 2 years ago

You may just use the concatenated domain CDS sequences for each LTR, or the region that starts at the first domain and ends at the end of the last domain (CM030788.1 10392696 - 10396850), or the entire LTR sequence (CM030788.1:10391873..10397114).
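As a rough illustration of these three choices, here is a minimal Python sketch that derives the coordinates for each one from the domain records quoted in this thread. It assumes that the element ID is the text before the first "|" in the ID attribute and that it has the form chrom:start..end, which depends on how the input sequences were named. The function names `parse_domains` and `three_options` are hypothetical helpers, not part of TEsorter, and the span is taken to run to the end of the last domain.

```python
import re
from collections import defaultdict

# Domain records from the dom.gff3 excerpt in this thread; attributes are
# trimmed to ID and gene for brevity. Real GFF3 files are tab-separated.
GFF_TEXT = """\
CM030788.1 TEsorter CDS 10392696 10393004 48.1 + 1 ID=CM030788.1:10391873..10397114|Copia|Class_I/LTR/Ty1_copia/Ale:Ty1-GAG;gene=GAG
CM030788.1 TEsorter CDS 10393602 10393808 79.2 + 1 ID=CM030788.1:10391873..10397114|Copia|Class_I/LTR/Ty1_copia/Ale:Ty1-PROT;gene=PROT
CM030788.1 TEsorter CDS 10394043 10394627 274.0 + 1 ID=CM030788.1:10391873..10397114|Copia|Class_I/LTR/Ty1_copia/Ale:Ty1-INT;gene=INT
CM030788.1 TEsorter CDS 10395538 10396209 346.8 + 2 ID=CM030788.1:10391873..10397114|Copia|Class_I/LTR/Ty1_copia/Ale:Ty1-RT;gene=RT
CM030788.1 TEsorter CDS 10396473 10396850 191.8 + 1 ID=CM030788.1:10391873..10397114|Copia|Class_I/LTR/Ty1_copia/Ale:Ty1-RH;gene=RH
"""


def parse_domains(gff_text):
    """Group domain records by the element ID embedded in the ID attribute."""
    elements = defaultdict(list)
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(None, 8)  # handles tabs or spaces
        start, end, strand = int(fields[3]), int(fields[4]), fields[6]
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        element_id = attrs["ID"].split("|", 1)[0]  # assumption: text before first "|"
        elements[element_id].append((start, end, strand, attrs.get("gene")))
    return elements


def three_options(element_id, domains):
    """Return 1-based inclusive coordinates for the three suggested regions."""
    domains = sorted(domains, key=lambda d: d[:2])
    # Assumption: element IDs look like chrom:start..end
    chrom, el_start, el_end = re.match(r"(.+):(\d+)\.\.(\d+)$", element_id).groups()
    return {
        "domains": [(s, e) for s, e, _, _ in domains],         # option 1: domains only
        "span": (domains[0][0], max(d[1] for d in domains)),   # option 2: first to last domain
        "element": (chrom, int(el_start), int(el_end)),        # option 3: whole LTR-RT
    }


for element_id, domains in parse_domains(GFF_TEXT).items():
    print(element_id, three_options(element_id, domains))
```

The first option keeps only sequence that TEsorter matched to a domain, while the second and third trade some non-coding sequence for a single contiguous interval per element, which is simpler to turn into a BED record.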

oushujun commented 2 years ago

@wensulin93 this seems to be a generic task, you may try agat: https://www.biostars.org/p/9465973/

Wenwen012345 commented 2 years ago

Excellent advice. I'm lucky that you saw this post! @oushujun

Wenwen012345 commented 2 years ago

Hello, I thought about this question carefully yesterday, and my observations might help you optimize your software. Because the genome annotation of the selected species is incomplete (a lot of information is missing from the genomic GFF3 file), extracting the CDS is not an easy task. I looked at the gene structure of one LTR. The sequences listed in the corresponding .dom.gff3 file do not form a complete CDS when concatenated, but they cover most of it. The parts not covered are mainly the beginning (including the ATG) and the end. However, for the LTR I examined, the genome GFF3 file does not give the exact CDS boundaries either, and in some cases skips the region entirely. This shows that the GFF file for the genome is incomplete. Therefore, unless something new comes up, I will concatenate the CDS sequences of all domains and perform the synteny analysis with MCscan. The ultimate goal is the synteny analysis, and I think the concatenated domain CDS sequences will also give me suitable synteny data. @zhangrengang
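For reference, concatenating the domain regions of one element could look roughly like the sketch below. It is only an illustration under stated assumptions: intervals are 1-based and inclusive as in GFF3, the chromosome sequence is already loaded as a Python string, and the BED line uses a six-column layout (chrom, 0-based start, end, name, score, strand), which should be checked against what the synteny tool expects. `concat_domains` and `bed_line` are hypothetical helper names, and the toy sequence is made up.

```python
# Translation table for reverse-complementing DNA (N stays N).
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def concat_domains(chrom_seq, intervals, strand):
    """Join domain regions (1-based, inclusive) in genomic order.

    For elements on the minus strand the joined sequence is
    reverse-complemented so that it reads 5' to 3'.
    """
    joined = "".join(chrom_seq[start - 1:end] for start, end in sorted(intervals))
    return reverse_complement(joined) if strand == "-" else joined


def bed_line(chrom, start, end, name, strand):
    """Format a six-column BED line; BED starts are 0-based, ends are exclusive."""
    return f"{chrom}\t{start - 1}\t{end}\t{name}\t0\t{strand}"


# Toy example with a made-up 12 bp "chromosome" and two domain intervals.
toy_seq = "AAACCCGGGTTT"
toy_intervals = [(1, 3), (7, 9)]
print(concat_domains(toy_seq, toy_intervals, "+"))
print(concat_domains(toy_seq, toy_intervals, "-"))
print(bed_line("chr1", 1, 9, "toy_element", "+"))
```

One record per element, named identically in the BED file and the FASTA file, is what keeps the two inputs consistent; whether the BED interval should cover only the domain span or the whole element is a choice left to the analysis.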

Wenwen012345 commented 2 years ago

I tried the concatenate_domains.py script, but I got an error message like this:

Traceback (most recent call last):
  File "/home/manager/miniconda3/bin/concatenate_domains.py", line 33, in <module>
    sys.exit(load_entry_point('TEsorter==1.4.1', 'console_scripts', 'concatenate_domains.py')())
  File "/home/manager/miniconda3/bin/concatenate_domains.py", line 25, in importlib_load_entry_point
    return next(matches).load()
StopIteration

@zhangrengang

zhangrengang commented 2 years ago

@wensulin93 Yes, TEsorter does not define the EXACT start, end, or division positions of domains. But these are not necessary for synteny analyses, so previously I gave you three solutions: the first contains only the domains, the second contains almost the full GAG-POL except that the start and end regions may be incomplete, and the third contains the full CDS but also includes non-coding regions.

Regarding the issue of concatenate_domains.py, how do you use the script? It should be used like:

concatenate_domains.py rice6.9.5.liban.rexdb.cls.pep RH RT INT > rice6.9.5.liban.rexdb.cls.pep_RT_RH_INT.aln

In this example, the RH, RT and INT domains are aligned separately and then concatenated together. LTR-RTs that do not contain all three domains are excluded.
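The exclusion rule can be illustrated with a small sketch. This is not the code of concatenate_domains.py: the real script also aligns each domain before concatenation so that columns are comparable across elements, which this sketch skips. The function name `concatenate_complete`, the element names and the peptide strings are all made up for illustration.

```python
def concatenate_complete(domain_seqs, wanted):
    """Concatenate the wanted domains for elements that have all of them.

    domain_seqs maps element name -> {domain name: sequence}.
    Elements missing any wanted domain are dropped.
    """
    result = {}
    for element, domains in domain_seqs.items():
        if all(name in domains for name in wanted):
            # Join in the order the domains were requested.
            result[element] = "".join(domains[name] for name in wanted)
    return result


# Hypothetical toy input: two elements, one of which lacks INT.
toy = {
    "element_A": {"RH": "MKV", "RT": "LLT", "INT": "GGS"},
    "element_B": {"RH": "MKI", "RT": "LLS"},
}
print(concatenate_complete(toy, ["RH", "RT", "INT"]))
```

Requesting fewer domains keeps more elements but gives shorter concatenated sequences, so the choice of domains is a trade-off between sample size and signal.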

Wenwen012345 commented 2 years ago

Hello, I think I may have solved the problem. TEsorter was previously installed with conda (version 1.3.0, via "conda install tesorter"). I then looked for the "concatenate_domains.py" script and found it in a "python-scripts" folder inside the conda directory. So I ran the command:
~/miniconda3/pkgs/tesorter-1.3.0-py_0/python-scripts/concatenate_domains.py 2_oute.rexdb-plant.cls.pep RT > rt2.aln

The shell reported: "zsh: /home/manager/miniconda3/pkgs/tesorter-1.3.0-py_0/python-scripts/concatenate_domains.py: bad interpreter: /opt/conda/condabld/tesorter_1604435924377/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_place: no such file or directory"

I then realized that the first line of the script (the shebang) pointed to an interpreter that does not exist on my machine, so I changed it to "#!/usr/bin/env python".

Then I got this error:
Traceback (most recent call last):
  File "/home/manager/miniconda3/pkgs/tesorter-1.3.0-py_0/python-scripts/concatenate_domains.py", line 7, in <module>
    from .RunCmdsMP import run_cmd
ImportError: attempted relative import with no known parent package

I could not find "RunCmdsMP" next to the script, but eventually found it in the "modules" folder. It turns out that only the copy of "concatenate_domains.py" in the modules folder works. So I ran:
~/miniconda3/pkgs/tesorter-1.3.0-py_0/site-packages/TEsorter/modules/concatenate_domains.py 2_oute.rexdb-plant.cls.pep RT > rt2.aln

The result file was produced successfully. The conda installation caused the problem. I hope this can serve as a reference. Thank you anyway!

zhangrengang / TEsorter

How to extract the CDS sequences of all LTRs? #34