zhangrengang / TEsorter

TEsorter: an accurate and fast method to classify LTR-retrotransposons in plant genomes
https://doi.org/10.1093/hr/uhac017
GNU General Public License v3.0

Edge-case crash when parsing gff3 #16

Closed. philippbayer closed this issue 3 years ago

philippbayer commented 3 years ago

Thank you for this software. I'm getting this error:

2020-09-10 20:07:45,719 -INFO- generating gene anntations
Traceback (most recent call last):
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/bin/TEsorter", line 10, in <module>
    sys.exit(main())
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/lib/python3.8/site-packages/TEsorter/app.py", line 976, in main
    pipeline(Args())
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/lib/python3.8/site-packages/TEsorter/app.py", line 171, in pipeline
    for rc in Classifier(gff, db=args.hmm_database, fout=fc):
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/lib/python3.8/site-packages/TEsorter/app.py", line 391, in classify
    for rc in self.parse():
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/lib/python3.8/site-packages/TEsorter/app.py", line 380, in parse
    line = LTRgffLine(line)
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/lib/python3.8/site-packages/TEsorter/app.py", line 609, in __init__
    super(LTRgffLine, self).__init__(line)
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/lib/python3.8/site-packages/TEsorter/app.py", line 604, in __init__
    self.attributes = self.parse(self.attributes)
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/lib/python3.8/site-packages/TEsorter/app.py", line 606, in parse
    return dict(kv.split('=') for kv in attributes.split(';'))
ValueError: dictionary update sequence element #0 has length 3; 2 is required

The error occurs while TEsorter is generating the '.rexdb-plant.cls.tsv' file. My command is TEsorter -db rexdb-plant -p 28 Lee.pan.renamed.numericID.fa.mod.EDTA.TElib.fa. I get the same error with both the version checked out from GitHub (commit 2189f63dbf4f289a6dfc466ea6d625711824f94f) and the conda version.

I think there is one '=' too many somewhere in the .rexdb-plant.dom.gff3 file, probably from an unusually named RexDB TE that only sometimes pops up. I'm investigating (or am I looking at the wrong file?).

zhangrengang commented 3 years ago

There is probably an unexpected "=" in the sequence ID. I have fixed the bug, so you can try the latest version. Alternatively, you can change kv.split('=') to kv.split('=', 1) in your /group/pawsey0149/pbayer/anaconda3/envs/tesorter/lib/python3.8/site-packages/TEsorter/app.py file and try again.
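
For readers who want to see why the one-argument change helps, here is a minimal sketch of the parsing step taken from the traceback. The function name parse_attributes is hypothetical and this is not TEsorter's actual code, only the same expression in isolation; the attribute string is the one from the offending GFF3 line reported in this thread.

```python
def parse_attributes(attributes):
    """Parse a GFF3 attribute column into a dict.

    Each 'key=value' pair is split on the FIRST '=' only, so a value
    that itself contains '=' (here, the sequence ID) stays intact.
    """
    return dict(kv.split('=', 1) for kv in attributes.split(';'))


# Attribute column of the offending line reported in this thread.
attrs = (
    "ID=name=DHH_uuu_Gm1-1|Class_II/Subclass_2/Helitron:Helitron-HEL2;"
    "gene=HEL2;clade=Helitron;evalue=7.2e-108;coverage=58.8;probability=0.97"
)

# Unpatched expression: 'ID=name=DHH...' splits into three parts,
# which dict() cannot treat as a key/value pair.
try:
    dict(kv.split('=') for kv in attrs.split(';'))
    unpatched_failed = False
except ValueError:
    unpatched_failed = True

# Patched expression: the first '=' separates key from value.
parsed = parse_attributes(attrs)
print(parsed["ID"])
```

With maxsplit set to 1, everything after the first "=" is kept as the value, so the ID survives unchanged and the remaining keys parse as before.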

philippbayer commented 3 years ago

Thanks! That should fix it.

This is the offending line by the way:

name=DHH_uuu_Gm1-1 TEsorter CDS 1795 2637 354.8 - 0 ID=name=DHH_uuu_Gm1-1|Class_II/Subclass_2/Helitron:Helitron-HEL2;gene=HEL2;clade=Helitron;evalue=7.2e-108;coverage=58.8;probability=0.97

which comes from the SoyTE.fasta from soybase, where the very first fasta record is

>name=DHH_uuu_Gm1-1 Reference=Du et al. 2010 BMC Genomics 2010, 11:113 Class=II Sub_Class=II Order=Helitron Super_Family=Helitron Family=Ukn Description= Chromosome=Gm01:103518..115895

Your fix will fix it.
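
For anyone stuck on an older version, a rough pre-check of the input FASTA can flag such IDs before running TEsorter. This is only a sketch: it assumes the sequence ID is the first whitespace-delimited token of the header (which matches the GFF3 line above), and the function name ids_with_reserved_chars, the shortened header, and the second record are made up for illustration.

```python
import io


def ids_with_reserved_chars(handle, reserved="=;"):
    """Yield FASTA sequence IDs that contain GFF3 attribute separators.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    for line in handle:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue  # header with no ID at all
        seq_id = fields[0]
        if any(char in seq_id for char in reserved):
            yield seq_id


# Illustrative input: the first header is a shortened form of the SoyTE
# record quoted above, the second is a hypothetical well-behaved record.
fasta = io.StringIO(
    ">name=DHH_uuu_Gm1-1 Reference=Du et al. 2010 Class=II\n"
    "ACGTACGT\n"
    ">plain_id some description\n"
    "ACGTACGT\n"
)

flagged = list(ids_with_reserved_chars(fasta))
print(flagged)
```

Only "=" or ";" inside the ID token is flagged; the same characters in the rest of the header are ignored, since under the assumption above they never reach the attribute column.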