zhangrengang / TEsorter

TEsorter: an accurate and fast method to classify LTR-retrotransposons in plant genomes
https://doi.org/10.1093/hr/uhac017
GNU General Public License v3.0

How to identify the homology (synteny) of LTRs? #44

Closed Wenwen012345 closed 1 year ago

Wenwen012345 commented 1 year ago

Dear @zhangrengang

TEsorter is a very good tool that we used in our recent research.

In our recent study we would like to identify collinearity (synteny) of LTRs across species, although it is not clear to me whether doing so is necessarily meaningful. We have run into some trouble: the reviewers of our manuscript pointed out several issues, among them that our approach to identifying collinearity (synteny) of LTRs was inappropriate. Our rough approach was based on section 2.6 of (https://onlinelibrary.wiley.com/doi/10.1111/jse.12850), although we now think that the method in that article is also problematic.

The reviewer's opinion is roughly: "The way to identify LTR synteny is problematic. Since LTR retrotransposons can create many identical copies in non-syntenic regions, using BLASTAll to identify the most identical LTR sequence would not help to identify the syntenic locus of the query LTR sequence ... The authors need to verify the flanking sequence of the syntenic LTRs and make sure they are also identical between species compared and may also need to use genes to anchor the calling for syntenic LTRs."

Our main purpose is to determine whether the LTRs originate from transmission between species or from duplication within the genome.

The main questions are:
  1. How should we obtain the flanking sequences of LTRs, and how long should the flanking sequences be?
  2. We do not have a well-established method to obtain the flanking sequences of LTRs, because the number of LTRs is very large and many flanking sequences may overlap substantially. I am torn about whether it is necessary to do this at all.
  3. I tried to find relevant literature, but apart from the article mentioned above I have rarely seen collinearity of LTRs being identified. I therefore doubt the feasibility and significance of the analysis. Is there any relevant literature you would recommend?
  4. Our purpose is stated above, but I am not sure that revealing the collinearity of LTRs will support our conjecture.

These are the main sources of our doubts and confusion, and we would appreciate any suggestions. Thank you so much!

zhangrengang commented 1 year ago

Yes, literature on identifying collinearity of LTRs is very rare, and I have none to recommend. The method in the reference you mentioned is traditionally used for analysing synteny of coding genes. I think the reviewer is arguing that your method produces many false positives (but not false negatives?), so you need to verify the flanking sequences. You just need to extract the flanking sequences of the syntenic LTRs you have identified and compare them to filter out potential false positives; you need not consider overlaps between the flanking sequences. You can also use dot plots to visualize the synteny if it resembles the synteny of coding genes and has only 1:1 syntenic depth between species. I do not quite understand your purpose, but I think identifying syntenic LTRs is only one of the steps towards it.
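The reviewer's suggestion of anchoring candidate LTR pairs on genes could be sketched roughly as follows. This is only an illustration under stated assumptions: the gene lists, the `orthologs` set (for example taken from an existing gene-level synteny analysis) and all function names are hypothetical, and genes on each chromosome are assumed to be sorted by start position.

```python
from bisect import bisect_left, bisect_right


def nearest_genes(genes, start, end):
    """Return (upstream_id, downstream_id) of the genes flanking [start, end].

    genes: list of (gene_start, gene_end, gene_id) on one chromosome,
    sorted by gene_start. None is returned where no flanking gene exists.
    """
    starts = [g[0] for g in genes]
    # First gene that starts after the element ends.
    i_down = bisect_right(starts, end)
    down = genes[i_down][2] if i_down < len(genes) else None
    # Last gene that ends before the element starts.
    i_up = bisect_left(starts, start) - 1
    while i_up >= 0 and genes[i_up][1] >= start:
        i_up -= 1
    up = genes[i_up][2] if i_up >= 0 else None
    return up, down


def is_gene_anchored(ltr_a, ltr_b, genes_a, genes_b, orthologs):
    """True if both flanking genes of the two elements are orthologous pairs.

    ltr_a, ltr_b: (start, end) of the candidate elements in species A and B.
    orthologs: set of (gene_id_in_A, gene_id_in_B) pairs.
    The flipped case allows for an inverted syntenic block.
    """
    up_a, down_a = nearest_genes(genes_a, *ltr_a)
    up_b, down_b = nearest_genes(genes_b, *ltr_b)
    same = (up_a, up_b) in orthologs and (down_a, down_b) in orthologs
    flipped = (up_a, down_b) in orthologs and (down_a, up_b) in orthologs
    return same or flipped
```

Requiring both neighbours to match is a strict choice; a more tolerant variant could look at several genes on each side.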

Wenwen012345 commented 1 year ago

Thanks for the reply. @zhangrengang

At that time we used the CDS results of LTRs (from TEsorter) to conduct the synteny analysis. I was wondering whether it would be useful to use the entire LTR retrotransposon to explore synteny instead, because the flanking sequence of the CDS is essentially the repeat region and TSD region of the LTRs (from LTR_Retriever).

###
CM024953.1  LTR_retriever   repeat_region   7699972 7711396 .   -   .   ID=repeat_region_46;Name=CM024953.1:7699977..7711391;Classification=LTR/Gypsy;Sequence_ontology=SO:0000657;ltr_identity=0.9765;Method=structural;motif=TGCA;tsd=AGAGG
CM024953.1  LTR_retriever   target_site_duplication 7699972 7699976 .   -   .   ID=lTSD_46;Parent=repeat_region_46;Name=CM024953.1:7699977..7711391;Classification=LTR/Gypsy;Sequence_ontology=SO:0000434;ltr_identity=0.9765;Method=structural;motif=TGCA;tsd=AGAGG
CM024953.1  LTR_retriever   long_terminal_repeat    7699977 7700784 .   -   .   ID=lLTR_46;Parent=repeat_region_46;Name=CM024953.1:7699977..7711391;Classification=LTR/Gypsy;Sequence_ontology=SO:0000286;ltr_identity=0.9765;Method=structural;motif=TGCA;tsd=AGAGG
CM024953.1  LTR_retriever   Gypsy_LTR_retrotransposon   7699977 7711391 .   -   .   ID=LTRRT_46;Parent=repeat_region_46;Name=CM024953.1:7699977..7711391;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;ltr_identity=0.9765;Method=structural;motif=TGCA;tsd=AGAGG
CM024953.1  LTR_retriever   long_terminal_repeat    7710582 7711391 .   -   .   ID=rLTR_46;Parent=repeat_region_46;Name=CM024953.1:7699977..7711391;Classification=LTR/Gypsy;Sequence_ontology=SO:0000286;ltr_identity=0.9765;Method=structural;motif=TGCA;tsd=AGAGG
CM024953.1  LTR_retriever   target_site_duplication 7711392 7711396 .   -   .   ID=rTSD_46;Parent=repeat_region_46;Name=CM024953.1:7699977..7711391;Classification=LTR/Gypsy;Sequence_ontology=SO:0000434;ltr_identity=0.9765;Method=structural;motif=TGCA;tsd=AGAGG

LTR_Retriever GFF3 result file

For example, I get the coordinates of the LTRs from the GFF3 file in the LTR_Retriever results and define the start and end points. I then use a sequence extraction tool (such as TBtools) to obtain the full-length LTR retrotransposon sequences (about 10,000 bp each) from the genome, and investigate collinearity between species. I don't know whether this is feasible. Further, if flanking sequences are required to verify collinearity, can the flanking sequences and LTR retrotransposons be used as a whole to explore synteny? What I mean is combining each LTR retrotransposon and its flanking sequences into one large sequence and then exploring collinearity.

In addition, a technical question. To obtain flanking sequences, can one extract 5,000 bp upstream and downstream of an LTR retrotransposon? For example, take the start and end positions in the GFF3 file of LTRs, extend them by 5,000 bp on each side (start - 5,000 to end + 5,000), and extract those regions with a sequence extraction tool. Is that right?
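The coordinate arithmetic described here can be sketched as below. It is a minimal example, assuming a tab-separated GFF3 file like the LTR_Retriever output above, 1-based inclusive coordinates, and a known chromosome length; the function names and the default `flank=5000` are illustrative choices, not part of any tool.

```python
def parse_ltr_rt(gff3_lines):
    """Yield (seqid, start, end, strand, feature_id) for LTR-RT features.

    Only features whose type ends with 'LTR_retrotransposon' are kept
    (e.g. Gypsy_LTR_retrotransposon); TSD and LTR sub-features are skipped.
    """
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '###' separators
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or not cols[2].endswith("LTR_retrotransposon"):
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        yield cols[0], int(cols[3]), int(cols[4]), cols[6], attrs.get("ID", "")


def flank_windows(start, end, chrom_len, flank=5000):
    """Return (left, right) flanking windows, clipped to the chromosome.

    Each window is a (start, end) tuple in 1-based inclusive coordinates,
    or None when the element touches the chromosome end on that side.
    """
    left = (max(1, start - flank), start - 1) if start > 1 else None
    right = (end + 1, min(chrom_len, end + flank)) if end < chrom_len else None
    return left, right
```

For the element `LTRRT_46` above (7699977..7711391), this gives the windows 7694977..7699976 and 7711392..7716391, provided the chromosome is long enough. Because the windows start right next to the element, they include the 5 bp TSDs.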

Wenwen012345 commented 1 year ago

Also, I think my main purpose is to explore whether the LTR retrotransposons in these species derive from replication before or after speciation. If an LTR retrotransposon arose before speciation, the copies in the different species should be orthologous. If an LTR retrotransposon replicated after speciation, the copies should be paralogous. What is interesting about our results is that, in the species we studied, a large proportion of the LTRs appear to have formed after species divergence (as shown in the figure below, where different lines represent different species; the burst of LTRs appears to come after species divergence).

image
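One common way to place an insertion before or after a divergence time is to convert the identity between the two LTRs of an element into an age with T = K / (2r), where K is the Jukes-Cantor corrected distance and r the substitution rate. The sketch below assumes this formula; the default rate of 1.3e-8 substitutions per site per year is a value often used for rice and is only a placeholder here, and the function names are hypothetical.

```python
import math


def insertion_time(ltr_identity, rate=1.3e-8):
    """Estimate insertion time in years from the identity of the two LTRs.

    ltr_identity: fraction of identical sites (e.g. 0.9765 from the GFF3).
    rate: substitutions per site per year (assumed; species-dependent).
    """
    d = 1.0 - ltr_identity
    if d >= 0.75:
        raise ValueError("divergence too high for the Jukes-Cantor correction")
    k = -0.75 * math.log(1.0 - 4.0 * d / 3.0)  # Jukes-Cantor distance
    return k / (2.0 * rate)


def relative_to_speciation(ltr_identity, divergence_years, rate=1.3e-8):
    """Label an element as inserted before or after the species split."""
    t = insertion_time(ltr_identity, rate)
    return "post-speciation" if t < divergence_years else "pre-speciation"
```

With these assumptions, an element with `ltr_identity=0.9765` comes out at a little under one million years. The result scales inversely with the chosen rate, so the rate should be justified for the species studied.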

zhangrengang commented 1 year ago
  1. You may use the entire LTR retrotransposon to explore synteny, but you then need to use the flanking sequences of the entire LTR retrotransposon.
  2. Flanking sequences and LTR retrotransposons can be used as a whole to explore synteny. However, you should filter the BLAST hits to keep only pairs with hits both between flanking sequences and between LTR retrotransposons, not pairs with hits only between LTR retrotransposons.
  3. Yes, 5,000 bp upstream and downstream of an LTR retrotransposon can be extracted.
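The filter in point 2 can be sketched as follows, assuming each fused sequence is laid out as left flank, element, right flank, and that BLAST coordinates are 1-based on the fused sequences. The layout tuples, the function names and the `min_bp=200` threshold are assumptions made for illustration.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of positions shared by two inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def flank_bp(start, end, left_len, elem_len, total_len):
    """Bases of a hit that fall in the left or right flank."""
    lo, hi = sorted((start, end))  # BLAST reports minus-strand hits reversed
    left = overlap(lo, hi, 1, left_len)
    right = overlap(lo, hi, left_len + elem_len + 1, total_len)
    return left + right


def element_bp(start, end, left_len, elem_len, total_len):
    """Bases of a hit that fall inside the LTR retrotransposon itself."""
    lo, hi = sorted((start, end))
    return overlap(lo, hi, left_len + 1, left_len + elem_len)


def pair_supported(hits, q_layout, s_layout, min_bp=200):
    """Keep a pair only if flanks match flanks AND element matches element.

    hits: iterable of (qstart, qend, sstart, send) for one query/subject pair.
    q_layout, s_layout: (left_len, elem_len, total_len) of each fused sequence.
    """
    flank_ok = any(
        flank_bp(qs, qe, *q_layout) >= min_bp and flank_bp(ss, se, *s_layout) >= min_bp
        for qs, qe, ss, se in hits
    )
    elem_ok = any(
        element_bp(qs, qe, *q_layout) >= min_bp and element_bp(ss, se, *s_layout) >= min_bp
        for qs, qe, ss, se in hits
    )
    return flank_ok and elem_ok
```

A pair whose hits fall only inside the element is rejected, which is the case the reviewer was concerned about.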
Wenwen012345 commented 1 year ago

Thank you for your kind help.

zhangrengang commented 1 year ago

I have another idea. You may aim to identify species-specific or non-specific LTRs. In our study on the hawthorn genome, we used our SubPhaser tool to identify species-specific and non-specific TEs or LTRs between closely related species. This method first identifies species-specific repeated k-mers, then identifies species-specific TEs/LTRs, and then compares their insertion times and phylogenies. You may compare this method with your synteny-based method.
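To make the k-mer idea concrete, here is a toy illustration of finding k-mers that are enriched in one species relative to another. It is not SubPhaser's implementation; the function names, the pseudocount and the thresholds `min_count` and `min_fold` are all assumptions chosen for the example.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count all k-mers (without N) in a list of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def specific_kmers(counts_a, counts_b, min_count=3, min_fold=2.0):
    """Split repeated k-mers into those enriched in species A or in species B.

    Frequencies are normalised by the total k-mer count of each species and
    a pseudocount of 1 avoids division by zero for absent k-mers.
    """
    total_a = sum(counts_a.values()) or 1
    total_b = sum(counts_b.values()) or 1
    spec_a, spec_b = set(), set()
    for kmer in set(counts_a) | set(counts_b):
        ca, cb = counts_a[kmer], counts_b[kmer]  # Counter gives 0 if absent
        if ca + cb < min_count:
            continue  # ignore k-mers that are not repeated
        fa = (ca + 1) / total_a
        fb = (cb + 1) / total_b
        if fa / fb >= min_fold:
            spec_a.add(kmer)
        elif fb / fa >= min_fold:
            spec_b.add(kmer)
    return spec_a, spec_b
```

Elements could then be labelled by the proportion of species-specific k-mers they carry. For real genomes a dedicated k-mer counter would be needed, since this version keeps every k-mer in memory.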

Wenwen012345 commented 1 year ago

Thank you for your kind help.

Thank you again for your kind help.

Wenwen012345 commented 1 year ago

Hello @zhangrengang

I tried this method today. One problem is how to quickly find the corresponding chromosomes between two species. Fortunately, for two of the species I study, homologous chromosome information is available in the existing literature (as shown in the figure below). But how can we find homologous chromosomes of two species more quickly? Is there a faster method than the slower BLAST-MCScanX approach?

image

In addition, yesterday I tried the approach of extracting the DNA sequences of the LTRs together with 5,000 bp of upstream and downstream sequence and fusing them together. The results show that none of the LTRs are collinear; all are reported as singletons. We may have to do some other assessments.

zhangrengang commented 1 year ago

@Wenwen012345 minimap2 is much faster for closely related species, such as species of the same genus. I think a lack of collinearity is abnormal for very closely related species, but normal for distant species.
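One way to turn a minimap2 whole-genome alignment into a chromosome correspondence is to sum the matching bases per chromosome pair in the PAF output and take the best target for each query chromosome. The sketch below assumes standard PAF columns; the file names in the comment, the `min_mapq` threshold and the function name are illustrative.

```python
from collections import defaultdict

# The PAF input could come from a command such as (file names are placeholders):
#   minimap2 -x asm5 speciesB.fa speciesA.fa > A_vs_B.paf


def best_chrom_pairs(paf_lines, min_mapq=20):
    """Map each query chromosome to the target with the most matching bases.

    PAF columns used (1-based): 1 query name, 6 target name,
    10 number of residue matches, 12 mapping quality.
    Returns {query_name: (target_name, matching_bases)}.
    """
    aligned = defaultdict(int)
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12 or int(f[11]) < min_mapq:
            continue  # skip malformed or low-confidence alignments
        aligned[(f[0], f[5])] += int(f[9])
    best = {}
    for (query, target), bases in aligned.items():
        if query not in best or bases > best[query][1]:
            best[query] = (target, bases)
    return best
```

The `asm5` preset is intended for low-divergence assemblies; `asm10` or `asm20` may suit more divergent species. Chromosomes with large translocations or fusions will not reduce to a single best partner, so the per-pair totals are worth inspecting as well.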

Wenwen012345 commented 1 year ago

Hi @zhangrengang, thank you for your recommendation! In addition, I checked the BLASTn file used as input to MCScanX (produced with the method mentioned above) and found that many of the alignments were only several hundred bp long (the fourth column in the figure below), whereas the fused "LTR + 5,000 bp upstream and downstream" sequences we extracted are mostly longer than 15,000 bp. Given the parameters MCScanX runs with, these pairs would probably not be considered collinear. Perhaps the DNA fragments carry many mutations (such as SNPs and other base changes), so that only small fragments match in the BLASTn results. Yet the two species I used should be relatively close (the divergence time is estimated to be less than 3 Mya).
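To quantify how little of each fused sequence is covered by these short hits, the aligned query intervals can be merged per sequence pair and divided by the query length. This is a minimal sketch assuming BLAST tabular output (`-outfmt 6`, default columns) and a dictionary of query lengths; the function names are hypothetical.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total length of the union of inclusive intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted((min(a, b), max(a, b)) for a, b in intervals):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def query_coverage(blast_lines, query_lengths):
    """Fraction of each query covered by hits to each subject.

    Columns used (1-based): 1 qseqid, 2 sseqid, 7 qstart, 8 qend.
    Returns {(qseqid, sseqid): covered_fraction}.
    """
    spans = defaultdict(list)
    for line in blast_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue
        spans[(f[0], f[1])].append((int(f[6]), int(f[7])))
    return {
        pair: merged_length(ivs) / query_lengths[pair[0]]
        for pair, ivs in spans.items()
    }
```

Two overlapping hits of a few hundred bp on a 15,000 bp fused sequence give a coverage of only a few percent, which makes it easy to see which pairs are supported by more than an isolated fragment.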

image

In short, we may need to re-evaluate the method, as well as the feasibility of the research content. Or we might just use SubPhaser directly.