Open manoharbisht1998 opened 5 months ago
Yes. In the output library *.cls.lib, the sequences are identical to the input, but their IDs have been updated with the new classifications.
Okay, thanks for answering the second part of my question. But I still have doubts about merging the two libraries. RepeatModeler provides a consensus library with far fewer sequences than the input genome FASTA, whereas TEsorter outputs the same number of sequences as the input genome FASTA. So I am wondering: can I merge both libraries into one and then cluster the merged library using a tool like CD-HIT?
I do not understand. Are you using the -genome option to screen a whole genome with TEsorter? Otherwise, you should not input a genome FASTA, but a TE FASTA identified by e.g. RepeatModeler.
Thank you for the prompt reply. Yes, I used the -genome option to screen for TEs in my genome. However, I was not aware that we can also input the library obtained from RepeatModeler. Now I will run TEsorter on the repeat library obtained from RepeatModeler, with -db rexdb-plant (as my species is a plant). The result can then be fed downstream to RepeatMasker. Please correct me if I am missing anything.
You are right. Please note that the -genome option does not produce a TE library like RepeatModeler does; it outputs annotations (*.dom.gff3) and sequences (*.dom.faa) of TE protein domains across the whole genome.
Okay. I am using TEsorter v1.4.6, and I did get the *.cls.lib file when using the -genome option.
That is strange. How did you install it? Is it the latest version from GitHub?
I installed it in a conda environment.
I tested the conda version, but it outputs only four files:
$ TEsorter -genome rice6.9.5.liban -fw
$ ls
rice6.9.5.liban.rexdb.domtbl
rice6.9.5.liban.rexdb.dom.gff3
rice6.9.5.liban.rexdb.dom.faa
rice6.9.5.liban.rexdb.dom.tsv
Oh, it must be because I did not pass my genome with the -genome parameter; instead I ran: TEsorter my_genome.fa -p 50 -prob 0.9. That means TEsorter took it as a repeat library by default, I guess.
Yes.
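The two modes discussed above can be sketched as a pair of shell functions. The function names and the default thread count are hypothetical; the flags (-genome, -db rexdb-plant, -p) and the output suffixes are only those quoted in this thread, not a complete account of TEsorter's options.

```shell
# Hypothetical wrappers; only flags quoted in this thread are used.

# Classify an existing TE library (e.g. the RepeatModeler consensus FASTA).
# Per the thread, this mode writes *.cls.lib with reclassified sequence IDs.
classify_te_library() {
    lib=$1
    threads=${2:-4}   # default of 4 threads is an assumption
    TEsorter "$lib" -db rexdb-plant -p "$threads"
}

# Screen a whole genome for TE protein domains.
# Per the thread, this mode writes *.dom.gff3 and *.dom.faa, not a TE library.
screen_genome_domains() {
    genome=$1
    threads=${2:-4}   # default of 4 threads is an assumption
    TEsorter -genome "$genome" -db rexdb-plant -p "$threads"
}
```

Passing a genome FASTA to the first function reproduces the situation above: TEsorter treats every input sequence as a TE and returns a *.cls.lib with as many sequences as the genome.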
Further on this: I ran TEsorter on the RepeatModeler output consensi.fa, and it took only one minute to write *.cls.lib, with the following summary on screen:

Order           Superfamily     # of Sequences  # of Clade Sequences  # of Clades  # of full Domains
LTR             Copia           75              72                    8            3
LTR             Gypsy           108             80                    6            20
pararetrovirus  unknown         7               0                     0            0
LINE            unknown         22              0                     0            0
TIR             EnSpm_CACTA     4               0                     0            0
TIR             MuDR_Mutator    6               0                     0            0
TIR             PIF_Harbinger   5               0                     0            0
TIR             hAT             5               0                     0            0

Now I am wondering: did the pipeline work or not?
It works. It is fast for a small TE library.
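One way to double-check a library after a run is to tally the classification part of each FASTA header. This is a minimal sketch, assuming headers of the form >ID#Class as in the *.cls.lib files discussed here; the function name tally_classes is hypothetical.

```shell
# Hypothetical helper: count FASTA headers per classification.
# Assumes the first header field looks like ">ID#Class".
tally_classes() {
    awk '
        /^>/ {
            class = $1
            if (class ~ /#/) {
                sub(/^[^#]*#/, "", class)   # keep only the part after "#"
            } else {
                class = "(no class)"        # header without a "#" tag
            }
            n[class]++
        }
        END { for (c in n) printf "%s\t%d\n", c, n[c] }
    ' "$@" | LC_ALL=C sort
}
```

It reads from the files given as arguments, or from standard input if none are given, so it can be run on the library before and after TEsorter to compare how many entries remain Unknown.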
Hi, I have run RepeatMasker, and I am getting many repeats classified as "unknown", which I want to reduce. I am attaching the RepeatMasker output for my genome from both RepeatModeler ---> RepeatMasker and RepeatModeler ---> TEsorter ---> RepeatMasker. Do you have any suggestions on how I can reduce the number of "unknown" TEs? I am also attaching the headers of the *.cls.lib file that I obtained from TEsorter and used as input to RepeatMasker.
1_Unknown#Unknown 1_Unknown ( RepeatScout Family Size = 4356, Final Multiple Alignment Size = 100, Localized to 2506 out of 2617 contigs ) AAATATGAAATAAATAAAAATAATACATGGAAATGGAAAATACNGATTATTTAATTANTA
You may use the union set of non-unknown TEs from RepeatModeler and TEsorter.
I did not quite follow. Are you suggesting taking only those sequences that are annotated by both RepeatModeler and TEsorter (the latter run on the RepeatModeler library)?
I mean you may replace the unknown classifications by TEsorter with the known classifications by RepeatModeler, like:
less rice6.9.5.liban.rexdb.cls.lib | awk '{if ($1~"#Unknown"){cls=$1; $1=">"$2; $2=cls}{print}}'
It is just to reduce the number of "unknown" TEs.
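The awk one-liner above can also be written as a slightly more defensive variant. This is a sketch under an assumption that the one-liner itself implies: each header has the form >ID#TEsorterClass ORIGINAL_ID#RepeatModelerClass, followed by an optional description. The function name is hypothetical. Unlike the one-liner, it touches only header lines, swaps only when the second field carries a known (non-Unknown) class, and keeps a single leading ">".

```shell
# Hypothetical helper: for entries TEsorter left as Unknown, fall back to
# the original ID (second header field) when that ID has a known class.
restore_known_classes() {
    awk '
        /^>/ && NF >= 2 && $1 ~ /#Unknown$/ && $2 ~ /#/ && $2 !~ /#Unknown$/ {
            tesorter_id = substr($1, 2)   # TEsorter ID without the ">"
            $1 = ">" $2                   # original ID becomes the header ID
            $2 = tesorter_id              # keep the TEsorter ID for reference
        }
        { print }
    ' "$@"
}
```

Headers where both tools report Unknown, or where the second field has no "#" tag, pass through unchanged, as do all sequence lines.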
Okay, Thanks!
Hey, thanks for the tool. How can I merge the output library of TEsorter with the RepeatModeler repeat library to run RepeatMasker? Also, can I directly input the output library of TEsorter into RepeatMasker?