How to merge the TEsorter repeat libraries

zhangrengang / TEsorter

TEsorter: an accurate and fast method to classify LTR-retrotransposons in plant genomes

https://doi.org/10.1093/hr/uhac017

GNU General Public License v3.0

How to merge the TEsorter repeat libraries #52

Open manoharbisht1998 opened 5 months ago

manoharbisht1998 commented 5 months ago

Hey, thanks for the tool. How can I merge the output library of TEsorter with the RepeatModeler repeat library to run RepeatMasker? Also, can I use the TEsorter output library directly as input to RepeatMasker?

zhangrengang commented 5 months ago

Yes. In the output library *.cls.lib, the sequences are identical to the input, but their IDs have been updated with the new classifications.

manoharbisht1998 commented 5 months ago

Okay, thanks for answering the second part of my question. But I still have doubts about merging the two libraries. RepeatModeler provides a consensus library with far fewer sequences than the input genome FASTA, whereas TEsorter outputs the same number of sequences as the input genome FASTA. So I am wondering: can I merge both libraries into one and then cluster the merged library using a tool like CD-HIT?

zhangrengang commented 5 months ago

I do not understand. Are you using the -genome option to screen a whole genome with TEsorter? Otherwise, you should not input a genome FASTA, but a TE FASTA identified by e.g. RepeatModeler.

manoharbisht1998 commented 5 months ago

Thank you for the prompt reply. Yes, I used the -genome option to screen for TEs in my genome. However, I was not aware that we can also input the library obtained from RepeatModeler. Now I will run TEsorter on the repeat library obtained from RepeatModeler, with -db rexdb-plant (as my species is a plant). The result can then be fed downstream to RepeatMasker. Please correct me if I have missed anything.

zhangrengang commented 5 months ago

You are right. Please note that the -genome option does not produce a TE library like RepeatModeler does; instead it outputs annotations (*.dom.gff3) and sequences (*.dom.faa) of TE protein domains across the whole genome.
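The workflow agreed on above (RepeatModeler library, then TEsorter, then RepeatMasker) might look roughly like the sketch below. The input file names, the `run` and `classify_and_mask` helper names, and the `DRY_RUN` switch are made up for illustration. The output name `<input basename>.<db>.cls.lib` is an assumption based on the file names shown elsewhere in this thread. By default the sketch only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: classify a RepeatModeler TE library with TEsorter, then use the
# re-labelled library with RepeatMasker. Helper names and DRY_RUN are hypothetical.

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

classify_and_mask() {
    local te_lib="$1"        # TE library from RepeatModeler (not a genome)
    local genome="$2"        # genome assembly to mask
    local db="rexdb-plant"   # database mentioned in the thread for plants
    local cls_lib

    # Assumption: TEsorter writes <input basename>.<db>.cls.lib in the current directory.
    cls_lib="$(basename "$te_lib").${db}.cls.lib"

    run TEsorter "$te_lib" -db "$db"
    run RepeatMasker -lib "$cls_lib" "$genome"
}

# Example (prints the two commands without running them):
# classify_and_mask consensi.fa my_genome.fa
```

Setting `DRY_RUN=0` would execute the commands instead of printing them; check the actual output file name in your working directory before relying on the assumed one.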

manoharbisht1998 commented 5 months ago

Okay. I am using TEsorter v1.4.6, and I did get the *.cls.lib file when using the -genome option.

zhangrengang commented 5 months ago

That is strange. How did you install it? Is it the latest version from GitHub?

manoharbisht1998 commented 5 months ago

I installed it in a conda environment.

zhangrengang commented 5 months ago

I tested the conda version, but it outputs only four files:

$ TEsorter -genome rice6.9.5.liban -fw
$ ls
rice6.9.5.liban.rexdb.domtbl
rice6.9.5.liban.rexdb.dom.gff3
rice6.9.5.liban.rexdb.dom.faa
rice6.9.5.liban.rexdb.dom.tsv

manoharbisht1998 commented 5 months ago

Oh, it must be because I did not specify my genome with the -genome parameter; instead I ran:

$ TEsorter my_genome.fa -p 50 -prob 0.9

which means TEsorter treated it as a repeat library by default, I guess.

zhangrengang commented 5 months ago

Yes.
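Since a positional FASTA argument is treated as a TE library, a quick look at the number of records and the total sequence length can help spot a genome passed in by mistake. The helper below is a minimal sketch; the name `fasta_stats` is made up, and no threshold for "too large" is implied.

```shell
fasta_stats() {
    # Print "<number of records> <total bases>" for a FASTA file.
    awk '/^>/ { n++; next }
         { gsub(/[ \t\r]/, ""); bases += length($0) }
         END { print n + 0, bases + 0 }' "$1"
}

# Example:
# fasta_stats my_genome.fa
# fasta_stats consensi.fa
```

A RepeatModeler consensus library would be expected to have far fewer total bases than the genome it was built from.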

manoharbisht1998 commented 5 months ago

Further on this: I ran TEsorter with the RepeatModeler output consensi.fa, and it took only one minute to produce the output *.cls.lib, with the following summary on screen:

Order           Superfamily    # of Sequences  # of Clade Sequences  # of Clades  # of full Domains
LTR             Copia          75              72                    8            3
LTR             Gypsy          108             80                    6            20
pararetrovirus  unknown        7               0                     0            0
LINE            unknown        22              0                     0            0
TIR             EnSpm_CACTA    4               0                     0            0
TIR             MuDR_Mutator   6               0                     0            0
TIR             PIF_Harbinger  5               0                     0            0
TIR             hAT            5               0                     0            0

Now I am wondering whether the pipeline worked or not.

zhangrengang commented 5 months ago

It worked. TEsorter is fast for a small TE library.
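One way to double-check a run like this is to tally the classifications directly from the headers of the *.cls.lib file. The sketch below assumes each header's first field has the form `>ID#Classification`, as in the header quoted later in this thread; the function name `count_classes` is made up.

```shell
count_classes() {
    # Count FASTA headers per classification (the text after "#" in the first field).
    # Headers without a "#" are reported as "NA".
    awk '/^>/ {
             split($1, parts, "#")
             cls = (parts[2] == "" ? "NA" : parts[2])
             count[cls]++
         }
         END { for (k in count) print k "\t" count[k] }' "$1" | LC_ALL=C sort
}

# Example:
# count_classes consensi.fa.rexdb-plant.cls.lib
```

The totals should add up to the number of sequences in the input library, since TEsorter keeps every input sequence in *.cls.lib.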

manoharbisht1998 commented 5 months ago

Hi, I have run RepeatMasker, and I am getting more repeats classified as "unknown", which I want to reduce. I am attaching the RepeatMasker output for my genome from both workflows: RepeatModeler ---> RepeatMasker, and RepeatModeler ---> TEsorter ---> RepeatMasker. Do you have any suggestions on how I can reduce the number of "unknown" TEs? I am also attaching the headers of the *.cls.lib file that I obtained from TEsorter and used as input to RepeatMasker.

>1_Unknown#Unknown 1_Unknown ( RepeatScout Family Size = 4356, Final Multiple Alignment Size = 100, Localized to 2506 out of 2617 contigs )
AAATATGAAATAAATAAAAATAATACATGGAAATGGAAAATACNGATTATTTAATTANTA

[Attachment: Result]

zhangrengang commented 5 months ago

You may use the union set of non-unknown TEs from RepeatModeler and TEsorter.

manoharbisht1998 commented 5 months ago

I did not quite follow. Are you suggesting that I take only those sequences that are annotated in both the RepeatModeler output and the TEsorter output (which we obtain by running TEsorter on the RepeatModeler library)?

zhangrengang commented 5 months ago

I mean you may replace the unknown classifications by TEsorter with the known classifications by RepeatModeler, like:

less rice6.9.5.liban.rexdb.cls.lib | awk '{if ($1~"#Unknown"){cls=$1; $1=">"$2; $2=cls}{print}}'

It is just to reduce the number of "unknown" TEs.
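For readers who want to see what the one-liner above does field by field, here is a commented variant wrapped in a function. It assumes, as the one-liner does, that the second header field holds the original sequence ID with its RepeatModeler classification. The function name `prefer_known` and the extra guard (swap only when the second field carries a non-Unknown class) are additions for illustration and are not part of the original command.

```shell
prefer_known() {
    # For headers that TEsorter labelled "#Unknown", use the second field
    # (assumed: original ID with the RepeatModeler class) as the sequence ID.
    awk '/^>/ && $1 ~ "#Unknown$" && index($2, "#") > 0 && $2 !~ "#Unknown$" {
             cls = $1         # TEsorter ID, still carrying its leading ">"
             $1 = ">" $2      # original ID becomes the new sequence ID
             $2 = cls         # old ID is kept as the first description word
         }
         { print }' "$1"
}

# Example:
# prefer_known rice6.9.5.liban.rexdb.cls.lib > merged.lib
```

With the guard, a header such as `>fam1#Unknown fam1#LTR/Gypsy` becomes `>fam1#LTR/Gypsy >fam1#Unknown`, while headers whose second field has no known class are left unchanged. The output name `merged.lib` is only a placeholder.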

manoharbisht1998 commented 5 months ago

Okay, Thanks!