zhangrengang / TEsorter

TEsorter: an accurate and fast method to classify LTR-retrotransposons in plant genomes
https://doi.org/10.1093/hr/uhac017
GNU General Public License v3.0
85 stars 19 forks source link

Crash when special characters occurred in sequence IDs #46

Closed oushujun closed 1 year ago

oushujun commented 1 year ago

Hi Ren-Gang,

I encountered errors when running TEsorter (v1.3) on sequences with special headers. Here is one reproducible example:

>repeat1-1#Unknown GTCATCAAAGTCTAACACGGTACAGTGCCGGTCTATCAGGATTATAGGAC CGGAGATGGACCTAGTCCCTCTTTCCTCGGTCTTCAGACCTGAACACCGC CCTAATAGACAACGAATTGCAGGGCCTCTGGTTCACCCAAACGATAGTGG AGACGGACCCGCTCTCCCCGGGTTTCCCCAACAACGTCTTCCTGGTCAGG AAAGGAGGGCGGTTATCGCCCGGTAGTAACTTGAGAAATCTCTCGTAATT ACCGTATACAGTGACGACATTCTGCGACACAAAGGGTGAGGTTGATTATC TACTTGGACGACCTACTCTTAATGGCTCGTACCCCTCGCCTGGCGAACGA ACACGCCTCCGGGACAGTCCCCCATAGATCACGGGGAATCTATCCTCTCC

Save the above sequence in test.fa and run: TEsorter test.fa

2023-05-02 23:19:38,911 -WARNING- Grid computing is not available because DRMAA not configured properly: Could not find drmaa library. Please specify its full path using the environment variable DRMAA_LIBRARY_PATH 2023-05-02 23:19:38,918 -INFO- VARS: {'sequence': 'test.fa', 'hmm_database': 'rexdb', 'seq_type': 'nucl', 'prefix': 'test.fa.rexdb', 'force_write_hmmscan': False, 'processors': 4, 'tmp_dir': './tmp', 'min_coverage': 20, 'max_evalue': 0.001, 'disable_pass2': False, 'pass2_rule': '80-80-80', 'no_library': False, 'no_reverse': False, 'no_cleanup': False, 'p2_identity': 80.0, 'p2_coverage': 80.0, 'p2_length': 80.0} 2023-05-02 23:19:38,918 -INFO- checking dependencies: 2023-05-02 23:19:38,929 -INFO- hmmer 3.3.1 OK 2023-05-02 23:19:38,994 -INFO- blastn 2.10.0+ OK 2023-05-02 23:19:38,994 -INFO- check database rexdb 2023-05-02 23:19:38,994 -INFO- db path: /work/LAS/mhufford-lab/oushujun/bin/miniconda2/envs/EDTA/lib/python3.6/site-packages/TEsorter/database 2023-05-02 23:19:38,994 -INFO- db file: REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm 2023-05-02 23:19:38,995 -INFO- REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm OK 2023-05-02 23:19:38,995 -INFO- Start classifying pipeline 2023-05-02 23:19:39,010 -INFO- total 1 sequences 2023-05-02 23:19:39,010 -INFO- translating test.fa in six frames /home/oushujun/las/bin/miniconda2/envs/EDTA/lib/python3.6/site-packages/Bio/Seq.py:2338: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future. BiopythonWarning, 2023-05-02 23:19:39,013 -INFO- HMM scanning against /work/LAS/mhufford-lab/oushujun/bin/miniconda2/envs/EDTA/lib/python3.6/site-packages/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm 2023-05-02 23:19:39,014 -INFO- use existed non-empty test.fa.rexdb.domtbl and skip hmmscan 2023-05-02 23:19:39,014 -INFO- generating gene anntations Traceback (most recent call last): File "/home/oushujun/las/bin/miniconda2/envs/EDTA/bin/TEsorter", line 10, in sys.exit(main()) File "/home/oushujun/las/bin/miniconda2/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 1014, in main pipeline(Args()) File "/home/oushujun/las/bin/miniconda2/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 167, in pipeline maxeval = args.max_evalue, File "/home/oushujun/las/bin/miniconda2/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 919, in LTRlibAnn prefix=prefix, seqtype=seqtype, mincov=mincov, maxeval=maxeval) File "/home/oushujun/las/bin/miniconda2/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 801, in hmm2best gseq = d_seqs[rc.qname].seq[rc.envstart-1:rc.envend] KeyError: 'repeat1#Unknown/Unknown|aa1'

zhangrengang commented 1 year ago

Hi Shujun, Please update TEsorter (the lastest conda version is v1.4.6). With the version 1.4, it is ok:

2023-05-03 15:37:42,995 -INFO- Command: /media/40T/wlx/zrg/share/home/app/.local/python3/bin/TEsorter test.fa
2023-05-03 15:37:42,996 -INFO- VARS: {'sequence': 'test.fa', 'hmm_database': 'rexdb', 'db_hmm': None, 'seq_type': 'nucl', 'prefix': None, 'force_write_hmmscan': False, 'processors': 4, 'tmp_dir': './tmp-63472734-e985-11ed-8106-4cd98fb9bbe7', 'min_coverage': 20, 'max_evalue': 0.001, 'min_probability': 0.5, 'no_cleanup': False, 'disable_pass2': False, 'pass2_rule': '80-80-80', 'no_library': False, 'no_reverse': False, 'genome': False, 'win_size': 270000, 'win_ovl': 30000, 'p2_identity': 80.0, 'p2_coverage': 80.0, 'p2_length': 80.0}
2023-05-03 15:37:42,996 -INFO- checking dependencies:
2023-05-03 15:37:43,098 -INFO- hmmer    3.3     OK
2023-05-03 15:37:44,391 -INFO- blastn   2.13.0+ OK
2023-05-03 15:37:44,392 -INFO- check database /media/40T/wlx/zrg/share/home/app/.local/python3/lib/python3.6/site-packages/TEsorter-1.4.1-py3.6.egg/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm
2023-05-03 15:37:44,392 -INFO- db file: /media/40T/wlx/zrg/share/home/app/.local/python3/lib/python3.6/site-packages/TEsorter-1.4.1-py3.6.egg/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm
2023-05-03 15:37:44,392 -INFO- REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm    OK
2023-05-03 15:37:44,392 -INFO- Start classifying pipeline (ELEMENT mode)
2023-05-03 15:37:44,472 -INFO- total 1 sequences
2023-05-03 15:37:44,473 -INFO- HMM scanning against `/media/40T/wlx/zrg/share/home/app/.local/python3/lib/python3.6/site-packages/TEsorter-1.4.1-py3.6.egg/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm`
2023-05-03 15:37:44,521 -INFO- Start Pool with 4 process(es)
2023-05-03 15:37:44,522 -INFO- translating `./tmp-63472734-e985-11ed-8106-4cd98fb9bbe7/chunk.1.fasta` in six frames
/media/40T/wlx/zrg/share/home/app/.local/python3/lib/python3.6/site-packages/Bio/Seq.py:2609: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  BiopythonWarning)
2023-05-03 15:37:44,627 -INFO- Start Pool with 4 process(es)
2023-05-03 15:37:44,628 -INFO- run CMD: `hmmscan --nobias --notextw --noali --domtblout ./tmp-63472734-e985-11ed-8106-4cd98fb9bbe7/chunk.1.fasta.aa.domtbl /media/40T/wlx/zrg/share/home/app/.local/python3/lib/python3.6/site-packages/TEsorter-1.4.1-py3.6.egg/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm ./tmp-63472734-e985-11ed-8106-4cd98fb9bbe7/chunk.1.fasta.aa > /dev/null`
2023-05-03 15:37:44,829 -INFO- generating gene anntations
2023-05-03 15:37:44,830 -INFO- 1 sequences classified by HMM
2023-05-03 15:37:44,830 -INFO- see protein domain sequences in `test.fa.rexdb.dom.faa` and annotation gff3 file in `test.fa.rexdb.dom.gff3`
2023-05-03 15:37:44,831 -INFO- classifying the unclassified sequences by searching against the classified ones
2023-05-03 15:37:44,886 -INFO- using the 80-80-80 rule
2023-05-03 15:37:44,886 -INFO- run CMD: `makeblastdb -in ./tmp-63472734-e985-11ed-8106-4cd98fb9bbe7/pass1_classified.fa -dbtype nucl -out ./tmp-63472734-e985-11ed-8106-4cd98fb9bbe7/pass1_classified.fa`
2023-05-03 15:37:46,226 -INFO- run CMD: `blastn -query ./tmp-63472734-e985-11ed-8106-4cd98fb9bbe7/pass1_unclassified.fa -db ./tmp-63472734-e985-11ed-8106-4cd98fb9bbe7/pass1_classified.fa -out ./tmp-63472734-e985-11ed-8106-4cd98fb9bbe7/pass1_unclassified.fa.blastout -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs qcovhsp sstrand' -num_threads 4 `
2023-05-03 15:37:47,403 -INFO- 0 sequences classified in pass 2
2023-05-03 15:37:47,404 -INFO- total 1 sequences classified.
2023-05-03 15:37:47,404 -INFO- see classified sequences in `test.fa.rexdb.cls.tsv`
2023-05-03 15:37:47,404 -INFO- writing library for RepeatMasker in `test.fa.rexdb.cls.lib`
2023-05-03 15:37:47,404 -INFO- writing classified protein domains in `test.fa.rexdb.cls.pep`
2023-05-03 15:37:47,404 -INFO- Summary of classifications:
Order           Superfamily      # of Sequences# of Clade Sequences    # of Clades# of full Domains
DIRS            unknown                       1              0              0              0
2023-05-03 15:37:47,404 -INFO- cleaning the temporary directory ./tmp-63472734-e985-11ed-8106-4cd98fb9bbe7
2023-05-03 15:37:47,405 -INFO- Pipeline done.
oushujun commented 1 year ago

I think the issue might be due to preexisting TEsorter results matching the input name but the input content has been changed, so the existing results do not match with the new content. After deleting those results, TEsorter runs normally.

zhangrengang commented 1 year ago

Yes, mismatches of sequence ID between input fasta and preexisting hmmscan outfile (*.domtbl) can result in this issue. It can be avoided by deleting the *.domtbl file or using -fw option. I will add a warning for this issue.