Closed mictadlo closed 4 years ago
I could reproduce the error exactly as you describe.
It appears to come from the `annotation` features, which do not have an ID tag or anything similar, so the code continues until it breaks a few lines later.
I'll fix this in the next update.
Meanwhile, you could just remove those lines (maybe with `grep -v annotation`), and I think it should work. You might also want to add the `-K` flag.
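If `grep -v annotation` turns out to be too broad (it drops any line that contains the word anywhere, including in attribute text), a more targeted filter is possible. The following is only a sketch, assuming the offending features carry `annotation` in the type column (column 3) of a tab-separated GFF; the function name `drop_feature_type` is made up for illustration and is not part of genomeGTFtools.

```python
def drop_feature_type(lines, feature_type="annotation"):
    """Yield GFF lines, skipping features whose type column (column 3) matches.

    Assumption: the input is tab-separated GFF and the unwanted features
    are identified by their type column, not by attribute text.
    """
    for line in lines:
        if line.startswith("#"):
            # Keep comments and directives such as ##gff-version.
            yield line
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 3 and columns[2] == feature_type:
            # Skip the feature type that lacks an ID-like tag.
            continue
        yield line


if __name__ == "__main__":
    import sys

    # Usage: python drop_feature_type.py < input.gff3 > filtered.gff3
    sys.stdout.writelines(drop_feature_type(sys.stdin))
```

Unlike the `grep` one-liner, this keeps gene or CDS lines that merely mention "annotation" in their attributes.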
Hi,
Thank you for the `grep -v annotation` and `-K` suggestions. However, I got a lot of warnings:
> python /work/waterhouse_team/apps/genomeGTFtools/blast2genomegff.py -p blastp -b labVSviridiplantae_hybrid_blastp.out -g NbRNASeqAll.sorted-proper.add-rg.bam.dedup.bam.gtf.fasta.transdecoder.genome.FixVSaugustus.hints_utrAGAT.newID.no_remark.gff3 -d uniprot-viridiplantae-reviewed-yes-isoforms.fasta --gff-delimiter "." -G -K -S -x > labVSviridiplantae_hybrid_blastp-K.gff
# Parsing target sequences from uniprot-viridiplantae-reviewed-yes-isoforms.fasta Mon Feb 17 14:15:42 2020
# Found 42684 sequences Mon Feb 17 14:15:42 2020
# Parsing gff from NbRNASeqAll.sorted-proper.add-rg.bam.dedup.bam.gtf.fasta.transdecoder.genome.FixVSaugustus.hints_utrAGAT.newID.no_remark.gff3 Mon Feb 17 14:15:42 2020
# exon features WILL BE IGNORED
# CDS features WILL BE USED as exons
# gene name and strand will be read for each exon
# Counted 3458537 lines and 38 comments Mon Feb 17 14:15:52 2020
# Counted 770162 exons for 211206 inferred transcripts
# blast program is blastp, multiplying coordinates by 3
# Starting BLAST parsing on labVSviridiplantae_hybrid_blastp.out Mon Feb 17 14:15:52 2020
WARNING: cannot finish protein at 1288 for 978 in [(7570562, 7571542), (7585902, 7586066)]
WARNING: no intervals for RDO2_ARATH in NBlab01G03730.1
WARNING: cannot finish protein at 10207325 for 1086 in [(10206930, 10207358)]
WARNING: cannot finish protein at 1567 for 177 in [(10206930, 10207358)]
WARNING: no intervals for OSK4_ORYSJ in NBlab01G04850.1
WARNING: cannot finish protein at 1567 for 177 in [(10206930, 10207358)]
WARNING: no intervals for OSK4_ORYSI in NBlab01G04850.1
WARNING: cannot finish protein at 20599745 for 630 in [(20599104, 20599802)]
WARNING: cannot finish protein at 20599745 for 630 in [(20599104, 20599802)]
WARNING: cannot finish protein at 20599721 for 630 in [(20599104, 20599802)]
WARNING: cannot finish protein at 20599724 for 630 in [(20599104, 20599802)]
WARNING: cannot finish protein at 20599754 for 630 in [(20599104, 20599802)]
WARNING: cannot finish protein at 31245071 for 150 in [(31240669, 31240839), (31242704, 31242772), (31242868, 31242933), (31243164, 31243436), (31244202, 31244296), (31245071, 31245125)]
WARNING: cannot finish protein at 4981 for 3924 in [(34020995, 34021771)]
WARNING: no intervals for POLX_TOBAC in NBlab01G14800.1
WARNING: cannot finish protein at 1093 for 774 in [(34020995, 34021771)]
WARNING: no intervals for PPR51_ARATH in NBlab01G14800.1
WARNING: cannot finish protein at 844 for 3822 in [(34011445, 34011765)]
WARNING: no intervals for POLX_TOBAC in NBlab01G14800.2
WARNING: cannot finish protein at 34011765 for 465 in [(34011445, 34011765)]
WARNING: cannot finish protein at 721 for 882 in [(43963612, 43963894), (43964077, 43964096)]
WARNING: no intervals for CID7_ARATH in NBlab01G19500.1
WARNING: cannot finish protein at 1147 for 117 in [(43963612, 43963894), (43964077, 43964096)]
WARNING: no intervals for CID5_ARATH in NBlab01G19500.1
WARNING: cannot finish protein at 45602279 for 891 in [(45602195, 45602518)]
WARNING: cannot finish protein at 45602288 for 891 in [(45602195, 45602518)]
WARNING: cannot finish protein at 45602195 for 891 in [(45602195, 45602518)]
WARNING: cannot finish protein at 45602282 for 891 in [(45602195, 45602518)]
WARNING: cannot finish protein at 45602282 for 891 in [(45602195, 45602518)]
WARNING: cannot finish protein at 48276155 for 1929 in [(48275544, 48276158)]
WARNING: cannot finish protein at 48276101 for 1923 in [(48275544, 48276158)]
WARNING: cannot finish protein at 48276155 for 1923 in [(48275544, 48276158)]
WARNING: cannot finish protein at 48276155 for 1923 in [(48275544, 48276158)]
WARNING: cannot finish protein at 48276155 for 1923 in [(48275544, 48276158)]
WARNING: cannot finish protein at 48285736 for 2238 in [(48285473, 48285736), (48285823, 48285864)]
WARNING: cannot finish protein at 48285721 for 2232 in [(48285473, 48285736), (48285823, 48285864)]
WARNING: cannot finish protein at 48285736 for 2232 in [(48285473, 48285736), (48285823, 48285864)]
WARNING: cannot finish protein at 48285736 for 2232 in [(48285473, 48285736), (48285823, 48285864)]
WARNING: cannot finish protein at 48285736 for 2232 in [(48285473, 48285736), (48285823, 48285864)]
WARNING: cannot finish protein at 48277912 for 1347 in [(48277905, 48277912), (48278744, 48278812), (48281362, 48281662), (48281850, 48282335), (48283005, 48283127), (48283701, 48283775)]
WARNING: cannot finish protein at 48277912 for 1347 in [(48277905, 48277912), (48278744, 48278812), (48281362, 48281662), (48281850, 48282335), (48283005, 48283127), (48283701, 48283775)]
WARNING: cannot finish protein at 48277912 for 1347 in [(48277905, 48277912), (48278744, 48278812), (48281362, 48281662), (48281850, 48282335), (48283005, 48283127), (48283701, 48283775)]
WARNING: cannot finish protein at 48277912 for 1347 in [(48277905, 48277912), (48278744, 48278812), (48281362, 48281662), (48281850, 48282335), (48283005, 48283127), (48283701, 48283775)]
WARNING: cannot finish protein at 48277912 for 1347 in [(48277905, 48277912), (48278744, 48278812), (48281362, 48281662), (48281850, 48282335), (48283005, 48283127), (48283701, 48283775)]
WARNING: cannot finish protein at 48385027 for 159 in [(48375652, 48376383), (48377104, 48377197), (48377363, 48377445), (48377551, 48377645), (48377731, 48377800), (48379036, 48379129), (48379937, 48380021), (48380126, 48380161), (48380844, 48380997), (48381082, 48381216), (48383200, 48383270), (48383361, 48383439), (48383540, 48383655), (48383774, 48383921), (48384370, 48384439), (48385027, 48385181)]
...
WARNING: no intervals for CBSX2_ARATH in NBlab19G50830.1
WARNING: cannot finish protein at 98975592 for 645 in [(98971905, 98971943), (98972413, 98972574), (98973551, 98973655), (98973763, 98973843), (98974001, 98975207), (98975592, 98975758)]
WARNING: cannot finish protein at 98975592 for 645 in [(98971905, 98971943), (98972413, 98972574), (98973551, 98973655), (98973763, 98973843), (98974001, 98975207), (98975592, 98975758)]
WARNING: cannot finish protein at 98975592 for 603 in [(98971905, 98971943), (98972413, 98972574), (98973551, 98973655), (98973763, 98973843), (98974001, 98975207), (98975592, 98975758)]
WARNING: cannot finish protein at 98975592 for 480 in [(98971905, 98971943), (98972413, 98972574), (98973551, 98973655), (98973763, 98973843), (98974001, 98975207), (98975592, 98975758)]
WARNING: cannot finish protein at 98975592 for 645 in [(98971905, 98971943), (98972413, 98972574), (98973551, 98973655), (98973763, 98973843), (98974001, 98975207), (98975592, 98975758)]
WARNING: cannot finish protein at 101194139 for 21 in [(101193946, 101194139), (101194220, 101194337), (101194959, 101195117), (101195223, 101195417), (101196310, 101196390)]
WARNING: cannot finish protein at 101194139 for 21 in [(101193946, 101194139), (101194220, 101194337), (101194959, 101195117), (101195223, 101195417), (101196310, 101196390)]
WARNING: cannot finish protein at 101194139 for 21 in [(101193946, 101194139), (101194220, 101194337), (101194959, 101195117), (101195223, 101195417), (101196310, 101196390)]
WARNING: cannot finish protein at 101194139 for 21 in [(101193946, 101194139), (101194220, 101194337), (101194959, 101195117), (101195223, 101195417), (101196310, 101196390)]
WARNING: cannot finish protein at 101194139 for 21 in [(101193946, 101194139), (101194220, 101194337), (101194959, 101195117), (101195223, 101195417), (101196310, 101196390)]
WARNING: cannot finish protein at 1258 for 297 in [(104698785, 104699084)]
WARNING: no intervals for STY46_ARATH in NBlab19G57510.1
WARNING: cannot finish protein at 1258 for 303 in [(104698785, 104699084)]
WARNING: no intervals for STY17_ARATH in NBlab19G57510.1
WARNING: cannot finish protein at 1258 for 297 in [(104698785, 104699084)]
WARNING: no intervals for STY8_ARATH in NBlab19G57510.1
WARNING: cannot finish protein at 1258 for 306 in [(104698785, 104699084)]
WARNING: no intervals for P2C31_ARATH in NBlab19G57510.1
WARNING: cannot finish protein at 1258 for 291 in [(104698785, 104699084)]
WARNING: no intervals for P2C04_ORYSJ in NBlab19G57510.1
WARNING: cannot finish protein at 108401104 for 690 in [(108400711, 108401319)]
WARNING: cannot finish protein at 108401104 for 852 in [(108400711, 108401319)]
WARNING: cannot finish protein at 108401104 for 741 in [(108400711, 108401319)]
WARNING: cannot finish protein at 108401104 for 798 in [(108400711, 108401319)]
WARNING: cannot finish protein at 108401107 for 741 in [(108400711, 108401319)]
WARNING: cannot finish protein at 114550175 for 480 in [(114549041, 114549505), (114549616, 114549719), (114549824, 114550069), (114550175, 114550256)]
WARNING: cannot finish protein at 114550175 for 480 in [(114549041, 114549505), (114549616, 114549719), (114549824, 114550069), (114550175, 114550256)]
WARNING: cannot finish protein at 114550175 for 489 in [(114549041, 114549505), (114549616, 114549719), (114549824, 114550069), (114550175, 114550256)]
WARNING: cannot finish protein at 114550175 for 450 in [(114549041, 114549505), (114549616, 114549719), (114549824, 114550069), (114550175, 114550256)]
WARNING: cannot finish protein at 114550175 for 468 in [(114549041, 114549505), (114549616, 114549719), (114549824, 114550069), (114550175, 114550256)]
WARNING: cannot finish protein at 1150 for 408 in [(118646907, 118647245)]
WARNING: no intervals for P2C38_ARATH in NBlab19G63290.1
WARNING: cannot finish protein at 1150 for 411 in [(118646907, 118647245)]
WARNING: no intervals for P2C48_ARATH in NBlab19G63290.1
WARNING: cannot finish protein at 1150 for 405 in [(118646907, 118647245)]
WARNING: no intervals for P2C79_ARATH in NBlab19G63290.1
WARNING: cannot finish protein at 1150 for 405 in [(118646907, 118647245)]
WARNING: no intervals for P2C28_ORYSJ in NBlab19G63290.1
WARNING: cannot finish protein at 1150 for 405 in [(118646907, 118647245)]
WARNING: no intervals for P2C46_ARATH in NBlab19G63290.1
WARNING: cannot finish protein at 1345 for 612 in [(122888006, 122888470)]
WARNING: no intervals for POLX_TOBAC in NBlab19G65350.1
WARNING: cannot finish protein at 502 for 633 in [(122888006, 122888470)]
WARNING: no intervals for POLR1_ARATH in NBlab19G65350.1
WARNING: cannot finish protein at 502 for 840 in [(122888006, 122888470)]
WARNING: no intervals for POLR2_ARATH in NBlab19G65350.1
WARNING: cannot finish protein at 124080444 for 261 in [(124080169, 124080444), (124082245, 124082370), (124082448, 124082693), (124086447, 124086566), (124087146, 124087459), (124087609, 124087734), (124087837, 124087936), (124091710, 124091844), (124092699, 124093097)]
WARNING: cannot finish protein at 124080444 for 276 in [(124080169, 124080444), (124082245, 124082370), (124082448, 124082693), (124086447, 124086566), (124087146, 124087459), (124087609, 124087734), (124087837, 124087936), (124091710, 124091844), (124092699, 124093097)]
# Removed 11310 hits by shortness
# Removed 44 hits by bitscore
# Removed 37 hits by evalue
# Removed 0 hits that exceeded query max
# Found 303185 hits for 83708 queries Mon Feb 17 14:15:58 2020
# Wrote 1318958 domain intervals Mon Feb 17 14:15:58 2020
# WARNING: 1348 matches have hits extending beyond gene bounds Mon Feb 17 14:15:58 2020
Thank you in advance,
Michal
The warnings come from problems in the GFF annotation.
`WARNING: no intervals for` means that no intervals are found in the GFF.
`WARNING: cannot finish protein` means that the blast hits would extend beyond the transcripts or proteins that are predicted. This would usually affect every protein if something were systematically wrong, such as using blastx instead of blastp. Since it's just 1348 matches, I'm not sure what is wrong.
It looks like 300000 of them still worked. Maybe you can check in a browser if they make sense compared to the gene models.
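To make the "extends beyond the transcript" case concrete, here is a rough sketch of how a protein hit could be projected onto CDS intervals. It is not the code from `blast2genomegff.py`; it assumes forward-strand features, 1-based inclusive coordinates, and the same "multiply by 3" conversion the log mentions. The function name and the example intervals are invented for illustration.

```python
def map_protein_span(exons, prot_start, prot_end):
    """Project a 1-based inclusive protein span onto genomic CDS intervals.

    Returns a list of genomic (start, end) pieces, or None when the span
    runs past the end of the annotated intervals (the situation that a
    warning like "cannot finish protein" would describe).
    Assumption: forward strand, intervals are 1-based and inclusive.
    """
    offset = (prot_start - 1) * 3                 # nucleotides to skip in the spliced CDS
    remaining = (prot_end - prot_start + 1) * 3   # nucleotides still to place
    pieces = []
    for start, end in sorted(exons):
        exon_len = end - start + 1
        if offset >= exon_len:
            # The hit starts after this interval; skip it entirely.
            offset -= exon_len
            continue
        piece_start = start + offset
        piece_end = min(end, piece_start + remaining - 1)
        pieces.append((piece_start, piece_end))
        remaining -= piece_end - piece_start + 1
        offset = 0
        if remaining == 0:
            return pieces
    # Ran out of intervals before the whole hit was placed.
    return None


# Hypothetical gene model: two CDS intervals, 200 nt in total (66 full codons).
exons = [(100, 199), (300, 399)]
print(map_protein_span(exons, 1, 10))    # fits inside the first interval
print(map_protein_span(exons, 30, 40))   # spans the intron
print(map_protein_span(exons, 60, 70))   # runs past the last interval
```

If the gene model is truncated or the protein was predicted from a different model, the third case is what you would expect to see, which is why comparing against the gene models in a browser is a sensible check.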
Hi, thank you. I fixed my GFF3 file and it works now.
Michal
Great!
Hi, I used the latest code and got `UnboundLocalError: local variable 'geneid' referenced before assignment`.
The following parameters were used:
This is my new GFF3 file:
What did I miss?
Thank you in advance,
Michal
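For readers who hit the same message: in Python, an `UnboundLocalError` of this kind generally means a variable was assigned only inside a conditional branch that never ran. The snippet below is a generic illustration of that pattern and of one defensive alternative. It is not the actual code of `blast2genomegff.py`, and the function names are hypothetical; it only assumes, as suggested earlier in the thread, that a feature without an ID-like tag could be the trigger.

```python
import re


def get_gene_id_unsafe(attributes):
    """Mimic the failure mode: geneid is set only if an ID= tag exists."""
    if "ID=" in attributes:
        geneid = re.search(r"ID=([^;]+)", attributes).group(1)
    # With no ID= tag, geneid was never assigned, so this line raises
    # UnboundLocalError.
    return geneid


def get_gene_id(attributes):
    """Defensive variant: return None when the attribute column has no ID."""
    match = re.search(r"(?:^|;)ID=([^;]+)", attributes)
    return match.group(1) if match else None


for attrs in ("ID=NBlab01G03730.1;Parent=NBlab01G03730", "Note=no identifier here"):
    try:
        print(get_gene_id_unsafe(attrs))
    except UnboundLocalError as err:
        print("unsafe lookup failed:", type(err).__name__)
    print(get_gene_id(attrs))
```

A quick way to check whether this applies is to look for feature lines in the GFF3 whose ninth column lacks the tag the script expects.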