zhangrengang / TEsorter

TEsorter: an accurate and fast method to classify LTR-retrotransposons in plant genomes
https://doi.org/10.1093/hr/uhac017
GNU General Public License v3.0

get_full_seqs in LTR_retriever.py generates empty sequences for some entries that should have sequences. #53

Closed: LeeYEAH2 closed this issue 7 months ago

LeeYEAH2 commented 7 months ago

Hello again! I was updating my genome ERV annotation because of a new version of the genome, but this time I found that get_full_seqs didn't generate all the sequences that LTR_retriever listed in the pass.list. I'm pretty sure that get_full_seqs detected all the sequences, as shown below:

LeeYEAH2 commented 7 months ago

Chr01:100489068..100485858#LTR/unknown

Chr01:100991415..100985473#LTR/Retrovirus

Chr01:102009458..102011936#LTR/unknown TGTGGGGGCGGAGCCCCAGAGAGTAGTTTCCAGGCTCTCGGCCTCACACAGACAGGTGCTGGCTCAGGTAGTAAATGGCCAACTGTGATTGCATGGCCATCAGCTGTGGCTAGTTGGCCGTCAGCTGTAACCAGTGAGCCATTGGCCACAATATAATTGCTGTGGCTAAGGAGAGAGAGAAAGAAGGATGGGGCTAGCAAGGAGATGGCGGCTGGGCTGGCAAGCGTGGATGGCGGTTTGCAGACAGTGTGTATCCAGCCTCCAGTGAGAGTATAGTGCCGCCAGAGAGAATAAAGTGGTATGACTCCCCTACCTATGGCTCCGTGGGTGTTCCTTTTTGGCCTCACCATATCCTGCGTTCTTGTGTGGGGAGCGGGACCGGAGACCCTGCAGGCCACCCCGCATGACACATGGCGCAGCGAGCAGGGTCCCCAACATGACACATGGCGTAGTCGGCAGGATATGGTGCCGGCCAAAGCTCTCCGAAGGGCGGTGGAGCAGTTTGTGTGTATGAACACTCAGTCTGAGGAAGACCAGGAGGAGCAGCTGCCGGAGAGCTGGACCCTCGTGGAGGGGTGGGAGGACGTGGACGGTTCTCCCACCAGCACAGGAAAAGCCATGCAGCTGCTGGAGAGATGGAGCCCCGTGGAGAGGTGGAAAGATGTGGACGGTTCCCCAACCAGCACAGGCCGGAGAAAGCGAAGGTCGTTGCAGCCCTGTGGGCCGGGGAAGTTCCTGCTCAGGCAGCCCGGATGCAGGACTTGTAGTCCCAGGAGGTAACGCTTGCTGAGTCCTCCGTGGGAGATGAGGGTGAGGTCAAGGTCGTCCCTCACCCCCAGGACAGCCCTGGTGAATGACTATGGACTATGGGGAATTGCCTTCCATCCCTAATTTAATGGACTGCTTGACTGTTTGTTTGGGAACTGTTGTTAGTGGAACTGGGGGATATTTGCTTTTGTCTCTTGACTGGCCGCCATTGAGAATATGTAAGCACCTTGATTGTGAGTCGCTGTTGTTCCAGCAGGGTACCCTGAGAGGCACAGAGAGAGTGGCAGTGCGCTGAGAGGTCTAGCTGTGCCCTGAGAAGCCCTGGCTGTGTCCAGGAAGTGTGGCTGTGCCCACAGAGAGTGGTGGTGCCCTGAGAAGCCCTGGCTGTGTCCAGGAAGTGTGGCTGTTCCCAGAGAAAGTGGCAGTGCCCTGAGAAATCCTAGCTGTGCCCGGGGAAACTAGTGGTACCCTAAGAAGTCCAGGAAGTCTGGCAGTGCCCAGGAGTACTGGTGGTGCCCTGAGAAACCCTGGCTGTGTCCGGGAAGACAGGTGGTACCCTAAGAAGTCCAGGAAGTCTGGCAGTGCCCAGGAGTACTGGTGGTGCCCTGAGAAACCCTGGCTGTGTCCGGGAAGACAGGTGGTACCCTAAGAAGTCCAGGAAGTCTGGCTGTGCCCAGGAGTACTGGTGGTGCCCTGAGAAACCCTGGCTGTGTCCGGGAAGACAGGTGGTACCCTAAGAAGTCCAGGAAGTCTGGCAGTGCCCAGGAGTACTGGTGGTGCCCTGAGAAACCCTGGCTGTGTCCGGGAAGACAGGTGGTACCCTAAGAAGTCCAGGAAGTCTGGCTGTGCCCAGGAGTACTGGTGGTGCCCTGAGAAACCCTGGCTGTGTCCGGGAAGACAGGTGGTACCCTAAGAAGTCCAGGAAGTCTGGCAGTGCCCAGGAGTACTGGTGGTGCCCTGAGAAACCCTGGCTGTGTCCGGGAAGACAGGTGGTACCCTAAGAAGTCCAGGAAGTCTGGCAGTGCCCAGGAGTACTGGTGGTGCCCTGAGAAACCCTGGCTGTGTCCGGGAAGACAGGTGGTACCCTAAGAAGTCCAGGAAGTCTGGCCGTGCCCAGAAGCACTGGTGGTGCCCTGAGAAACCCTGGCTGTGCCCAGAGAGCCTGGCTGTGTCCAGGATATTGCATTCACCCC
CAGGCTCCTCGCACAGGTCCCCTCGCGGAAGACGCTTGTCGCGTAGATTGTGAGGCGAGAGCCTGTAGGGGTGGAGTGTGGGGGCGGAGCCCCAGAGAGTAGTTTCCAGGCTCTCGGCCTCACATAGACAGGTGCTGGCTCAGGTAGTAAATGGCCAACTGTGATTGCATGGCCATCAGCTGTGGCTAGTTGGCCGTCAGCTGTAACCAGTGAGCCATTGGCCACAATATAATTGCTGTGGCTAAGGAGAGAGAGAAAGAAGGATGGGGCTAGCAAGGAGATGGCGGCTGGGCTGGCAAGCGTGGATGGCGGTTTGCAGACAGTGTGTATCCAGCCTCCAGTGAGAGTATAGTGCCGCCAGAGAGAATAAAGTGGTATGACTCCCCTACCTATGGCTCCGTGGGTGTTCCTTTTTGGCCTCACCATATCCTGCGTTCTTGTGTGGGGAGCGGGACCGGAGACCCTGCAGGCCACCCCGCATGACACATGGCGCAGCGAGCAGGGTCCCCAACATGACA

LeeYEAH2 commented 7 months ago

Does get_full_seqs automatically filter out some low-quality sequences? Or would it be appropriate for me to generate the FASTA sequences from the 2 pass.list files myself and then use TEsorter to classify them?

zhangrengang commented 7 months ago

get_full_seqs does not filter out low-quality sequences, but it does discard sequences with the same location (see the example below). You can certainly generate the FASTA sequences from the 2 pass.list files yourself. Below are the counts for an example:

$ wc -l *pass.list
    622 genome.fasta.mod.nmtf.pass.list
  16305 genome.fasta.mod.pass.list
  16927 total
$ grep -c ">" intact_ltr.fa
16296
$ cat *pass.list | cut -f1 | grep -v "#" | sort |uniq | wc -l
16296
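
To see which locations are listed more than once across the two pass.list files, the shell pipeline above can be mirrored in Python. This is only a rough sketch, not code from TEsorter: the function names `count_locations` and `duplicated` are hypothetical, and it assumes the files are tab-separated with the location in the first column and that header entries contain "#", as the `cut -f1 | grep -v "#"` command implies. Unlike the shell pipeline, it also skips blank lines.

```python
from collections import Counter


def count_locations(paths):
    """Count how often each first-column location occurs across pass.list files.

    Assumes tab-separated files; first-column values containing '#'
    (e.g. header lines) and blank lines are skipped.
    """
    counts = Counter()
    for path in paths:
        with open(path) as handle:
            for line in handle:
                loc = line.rstrip("\n").split("\t", 1)[0]
                if not loc or "#" in loc:
                    continue
                counts[loc] += 1
    return counts


def duplicated(counts):
    """Return the sorted locations that occur more than once."""
    return sorted(loc for loc, n in counts.items() if n > 1)
```

With the example above, `len(count_locations([...]))` on the two pass.list files would be expected to match the 16296 unique locations, and `duplicated(...)` would list the locations that appear more than once.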
LeeYEAH2 commented 7 months ago

Yep, the locations of the empty sequences do overlap with existing sequences; I guess that's the problem. Thanks a lot!