wrf / genomeGTFtools

convert various features into a GFF-like file for use in genome browsers

UnboundLocalError: local variable 'geneid' referenced before assignment #7

Closed mictadlo closed 4 years ago

mictadlo commented 4 years ago

Hi, I used the latest code and got UnboundLocalError: local variable 'geneid' referenced before assignment. The following parameters were used:

python /apps/genomeGTFtools/blast2genomegff.py -p blastp -b labVSviridiplantae_hybrid_blastp.out -g NbRNASeqAll.sorted-proper.add-rg.bam.dedup.bam.gtf.fasta.transdecoder.genome.FixVSaugustus.hints_utrAGAT.newID.gff3 -d uniprot-viridiplantae-reviewed-yes-isoforms.fasta --gff-delimiter "." -G -S -x > labVSviridiplantae_hybrid_blastp.gff

# Parsing target sequences from uniprot-viridiplantae-reviewed-yes-isoforms.fasta  Fri Feb 14 13:50:22 2020
# Found 42684 sequences  Fri Feb 14 13:50:23 2020
# Parsing gff from NbRNASeqAll.sorted-proper.add-rg.bam.dedup.bam.gtf.fasta.transdecoder.genome.FixVSaugustus.hints_utrAGAT.newID.gff3  Fri Feb 14 13:50:23 2020
# CDS features WILL BE USED as exons
# gene name and strand will be read for each exon
Traceback (most recent call last):
  File "/work/waterhouse_team/apps/genomeGTFtools/blast2genomegff.py", line 462, in <module>
    main(sys.argv[1:],sys.stdout)
  File "/work/waterhouse_team/apps/genomeGTFtools/blast2genomegff.py", line 456, in main
    geneintervals, genestrand, genescaffold =  gtf_to_intervals(args.genes, args.cds_exons, args.skip_exons, args.transdecoder, args.no_genes, args.gff_delimiter)
  File "/work/waterhouse_team/apps/genomeGTFtools/blast2genomegff.py", line 121, in gtf_to_intervals
    geneid = geneid.rsplit(genesplit,1)[0]
UnboundLocalError: local variable 'geneid' referenced before assignment

This is my new GFF3 file:

##gff-version 3
##sequence-region NbV1Ch01 1 187295484
NbV1Ch01        annotation      remark  1       187295484       .       .       .       gff-version=3;sequence-region=%28%27NbV1Ch01%27%2C 0%2C 187295484%29
NbV1Ch01        transdecoder    gene    98185   99705   .       -       .       ID=NBlab01G00010
NbV1Ch01        transdecoder    mRNA    98185   99705   .       -       .       ID=NBlab01G00010.1;Parent=NBlab01G00010
NbV1Ch01        transdecoder    exon    98185   98571   .       -       .       ID=NBlab01G00010.1.exon4;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    exon    98679   98844   .       -       .       ID=NBlab01G00010.1.exon3;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    exon    99134   99325   .       -       .       ID=NBlab01G00010.1.exon2;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    exon    99417   99705   .       -       .       ID=NBlab01G00010.1.exon1;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    CDS     98186   98571   .       -       2       ID=NBlab01G00010.1.cds;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    CDS     98679   98844   .       -       0       ID=NBlab01G00010.1.cds;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    CDS     99134   99325   .       -       0       ID=NBlab01G00010.1.cds;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    CDS     99417   99704   .       -       0       ID=NBlab01G00010.1.cds;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    five_prime_UTR  99705   99705   .       -       .       ID=NBlab01G00010.1.5UTRg1;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    three_prime_UTR 98185   98185   .       -       .       ID=NBlab01G00010.1.3UTRg1;Parent=NBlab01G00010.1
NbV1Ch01        AUGUSTUS        gene    109665  112554  0.04    -       .       ID=NBlab01G00020
NbV1Ch01        AUGUSTUS        mRNA    109665  112554  0.04    -       .       ID=NBlab01G00020.1;Parent=NBlab01G00020
NbV1Ch01        AUGUSTUS        exon    109665  110489  .       -       .       ID=NBlab01G00020.1.exon1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        exon    110608  111042  .       -       .       ID=NBlab01G00020.1.exon2;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        exon    111592  111844  .       -       .       ID=NBlab01G00020.1.exon3;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        exon    112128  112554  .       -       .       ID=NBlab01G00020.1.exon4;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        CDS     109839  110489  0.69    -       0       ID=NBlab01G00020.1.CDS1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        CDS     110608  111042  0.21    -       0       ID=NBlab01G00020.1.CDS2;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        CDS     111592  111844  0.23    -       1       ID=NBlab01G00020.1.CDS3;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        CDS     112128  112450  0.95    -       0       ID=NBlab01G00020.1.CDS4;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        five_prime_utr  112451  112554  0.26    -       .       ID=NBlab01G00020.1.5UTR1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        intron  110490  110607  0.68    -       .       ID=NBlab01G00020.1.intron1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        intron  111043  111591  0.22    -       .       ID=NBlab01G00020.1.intron2;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        intron  111845  112127  0.49    -       .       ID=NBlab01G00020.1.intron3;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        start_codon     112448  112450  .       -       0       ID=NBlab01G00020.1.SCodon1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        stop_codon      109839  109841  .       -       0       ID=NBlab01G00020.1.ECodon1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        three_prime_utr 109665  109838  0.25    -       .       ID=NBlab01G00020.1.3UTR1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        transcription_end_site  109665  109665  .       -       .       ID=NBlab01G00020.1.TES1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        transcription_start_site        112554  112554  .       -       .       ID=NBlab01G00020.1.TSS1;Parent=NBlab01G00020.1

What did I miss?

Thank you in advance,

Michal

wrf commented 4 years ago

I could reproduce the error exactly as you describe. It appears to come from the annotation features, which do not have an ID tag or anything similar, so the code continues until it breaks a few lines later.
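The failure mode described above can be reduced to a few lines. This is only a minimal sketch, not the actual code in blast2genomegff.py: the function names `gene_id_buggy` and `gene_id_guarded` and the regular expression are assumptions made for illustration. It shows how a variable that is bound only when an ID tag is present raises UnboundLocalError on the remark line, and how an explicit guard avoids it.

```python
import re

# Hypothetical pattern: an ID tag at the start of the attributes or after a ';'
ID_PATTERN = re.compile(r"(?:^|;)ID=([^;]+)")


def gene_id_buggy(attributes, genesplit="."):
    """Illustrative reduction of the bug: geneid is bound only if an ID tag matches."""
    match = ID_PATTERN.search(attributes)
    if match:
        geneid = match.group(1)
    # With no ID tag, geneid was never assigned, so this raises UnboundLocalError
    geneid = geneid.rsplit(genesplit, 1)[0]
    return geneid


def gene_id_guarded(attributes, genesplit="."):
    """Return the ID with its last delimited field removed, or None if there is no ID tag."""
    match = ID_PATTERN.search(attributes)
    if match is None:
        return None  # caller can skip features such as the 'remark' line
    return match.group(1).rsplit(genesplit, 1)[0]


remark = "gff-version=3;sequence-region=%28%27NbV1Ch01%27%2C 0%2C 187295484%29"
cds = "ID=NBlab01G00010.1.cds;Parent=NBlab01G00010.1"

print(gene_id_guarded(remark))
print(gene_id_guarded(cds))
```

With the guarded version, a caller can simply skip any feature for which the result is None.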

I'll fix this in the next update.

Meanwhile, you could just remove those lines (maybe grep -v annotation), and I think it should work. You might also want to add the -K flag.
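Note that grep -v annotation drops every line containing the word "annotation" anywhere, including in the attributes column. A slightly stricter filter checks only the source column (column 2). The sketch below assumes a tab-delimited GFF; the function name `drop_source` and the sample lines are made up for illustration.

```python
def drop_source(lines, source="annotation"):
    """Yield GFF lines, skipping features whose source column equals `source`."""
    for line in lines:
        if line.startswith("#"):
            # keep directives and comments such as '##gff-version 3'
            yield line
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) > 1 and columns[1] == source:
            continue  # drop e.g. the 'annotation remark' feature
        yield line


# Small tab-delimited sample modeled on the file in this thread
gff_lines = [
    "##gff-version 3\n",
    "NbV1Ch01\tannotation\tremark\t1\t187295484\t.\t.\t.\tgff-version=3\n",
    "NbV1Ch01\ttransdecoder\tgene\t98185\t99705\t.\t-\t.\tID=NBlab01G00010\n",
]
kept = list(drop_source(gff_lines))
print("".join(kept), end="")
```

For a real file, the same generator can be fed an open file handle and its output written to a new GFF.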

mictadlo commented 4 years ago

Hi, thank you for the grep -v annotation suggestion and the -K option. However, I got a lot of warnings:

> python /work/waterhouse_team/apps/genomeGTFtools/blast2genomegff.py -p blastp -b labVSviridiplantae_hybrid_blastp.out -g NbRNASeqAll.sorted-proper.add-rg.bam.dedup.bam.gtf.fasta.transdecoder.genome.FixVSaugustus.hints_utrAGAT.newID.no_remark.gff3 -d uniprot-viridiplantae-reviewed-yes-isoforms.fasta --gff-delimiter "." -G -K -S -x > labVSviridiplantae_hybrid_blastp-K.gff
# Parsing target sequences from uniprot-viridiplantae-reviewed-yes-isoforms.fasta  Mon Feb 17 14:15:42 2020
# Found 42684 sequences  Mon Feb 17 14:15:42 2020
# Parsing gff from NbRNASeqAll.sorted-proper.add-rg.bam.dedup.bam.gtf.fasta.transdecoder.genome.FixVSaugustus.hints_utrAGAT.newID.no_remark.gff3  Mon Feb 17 14:15:42 2020
# exon features WILL BE IGNORED
# CDS features WILL BE USED as exons
# gene name and strand will be read for each exon
# Counted 3458537 lines and 38 comments  Mon Feb 17 14:15:52 2020
# Counted 770162 exons for 211206 inferred transcripts
# blast program is blastp, multiplying coordinates by 3
# Starting BLAST parsing on labVSviridiplantae_hybrid_blastp.out  Mon Feb 17 14:15:52 2020
WARNING: cannot finish protein at 1288 for 978 in [(7570562, 7571542), (7585902, 7586066)]
WARNING: no intervals for RDO2_ARATH in NBlab01G03730.1
WARNING: cannot finish protein at 10207325 for 1086 in [(10206930, 10207358)]
WARNING: cannot finish protein at 1567 for 177 in [(10206930, 10207358)]
WARNING: no intervals for OSK4_ORYSJ in NBlab01G04850.1
WARNING: cannot finish protein at 1567 for 177 in [(10206930, 10207358)]
WARNING: no intervals for OSK4_ORYSI in NBlab01G04850.1
WARNING: cannot finish protein at 20599745 for 630 in [(20599104, 20599802)]
WARNING: cannot finish protein at 20599745 for 630 in [(20599104, 20599802)]
WARNING: cannot finish protein at 20599721 for 630 in [(20599104, 20599802)]
WARNING: cannot finish protein at 20599724 for 630 in [(20599104, 20599802)]
WARNING: cannot finish protein at 20599754 for 630 in [(20599104, 20599802)]
WARNING: cannot finish protein at 31245071 for 150 in [(31240669, 31240839), (31242704, 31242772), (31242868, 31242933), (31243164, 31243436), (31244202, 31244296), (31245071, 31245125)]
WARNING: cannot finish protein at 4981 for 3924 in [(34020995, 34021771)]
WARNING: no intervals for POLX_TOBAC in NBlab01G14800.1
WARNING: cannot finish protein at 1093 for 774 in [(34020995, 34021771)]
WARNING: no intervals for PPR51_ARATH in NBlab01G14800.1
WARNING: cannot finish protein at 844 for 3822 in [(34011445, 34011765)]
WARNING: no intervals for POLX_TOBAC in NBlab01G14800.2
WARNING: cannot finish protein at 34011765 for 465 in [(34011445, 34011765)]
WARNING: cannot finish protein at 721 for 882 in [(43963612, 43963894), (43964077, 43964096)]
WARNING: no intervals for CID7_ARATH in NBlab01G19500.1
WARNING: cannot finish protein at 1147 for 117 in [(43963612, 43963894), (43964077, 43964096)]
WARNING: no intervals for CID5_ARATH in NBlab01G19500.1
WARNING: cannot finish protein at 45602279 for 891 in [(45602195, 45602518)]
WARNING: cannot finish protein at 45602288 for 891 in [(45602195, 45602518)]
WARNING: cannot finish protein at 45602195 for 891 in [(45602195, 45602518)]
WARNING: cannot finish protein at 45602282 for 891 in [(45602195, 45602518)]
WARNING: cannot finish protein at 45602282 for 891 in [(45602195, 45602518)]
WARNING: cannot finish protein at 48276155 for 1929 in [(48275544, 48276158)]
WARNING: cannot finish protein at 48276101 for 1923 in [(48275544, 48276158)]
WARNING: cannot finish protein at 48276155 for 1923 in [(48275544, 48276158)]
WARNING: cannot finish protein at 48276155 for 1923 in [(48275544, 48276158)]
WARNING: cannot finish protein at 48276155 for 1923 in [(48275544, 48276158)]
WARNING: cannot finish protein at 48285736 for 2238 in [(48285473, 48285736), (48285823, 48285864)]
WARNING: cannot finish protein at 48285721 for 2232 in [(48285473, 48285736), (48285823, 48285864)]
WARNING: cannot finish protein at 48285736 for 2232 in [(48285473, 48285736), (48285823, 48285864)]
WARNING: cannot finish protein at 48285736 for 2232 in [(48285473, 48285736), (48285823, 48285864)]
WARNING: cannot finish protein at 48285736 for 2232 in [(48285473, 48285736), (48285823, 48285864)]
WARNING: cannot finish protein at 48277912 for 1347 in [(48277905, 48277912), (48278744, 48278812), (48281362, 48281662), (48281850, 48282335), (48283005, 48283127), (48283701, 48283775)]
WARNING: cannot finish protein at 48277912 for 1347 in [(48277905, 48277912), (48278744, 48278812), (48281362, 48281662), (48281850, 48282335), (48283005, 48283127), (48283701, 48283775)]
WARNING: cannot finish protein at 48277912 for 1347 in [(48277905, 48277912), (48278744, 48278812), (48281362, 48281662), (48281850, 48282335), (48283005, 48283127), (48283701, 48283775)]
WARNING: cannot finish protein at 48277912 for 1347 in [(48277905, 48277912), (48278744, 48278812), (48281362, 48281662), (48281850, 48282335), (48283005, 48283127), (48283701, 48283775)]
WARNING: cannot finish protein at 48277912 for 1347 in [(48277905, 48277912), (48278744, 48278812), (48281362, 48281662), (48281850, 48282335), (48283005, 48283127), (48283701, 48283775)]
WARNING: cannot finish protein at 48385027 for 159 in [(48375652, 48376383), (48377104, 48377197), (48377363, 48377445), (48377551, 48377645), (48377731, 48377800), (48379036, 48379129), (48379937, 48380021), (48380126, 48380161), (48380844, 48380997), (48381082, 48381216), (48383200, 48383270), (48383361, 48383439), (48383540, 48383655), (48383774, 48383921), (48384370, 48384439), (48385027, 48385181)]
...
WARNING: no intervals for CBSX2_ARATH in NBlab19G50830.1
WARNING: cannot finish protein at 98975592 for 645 in [(98971905, 98971943), (98972413, 98972574), (98973551, 98973655), (98973763, 98973843), (98974001, 98975207), (98975592, 98975758)]
WARNING: cannot finish protein at 98975592 for 645 in [(98971905, 98971943), (98972413, 98972574), (98973551, 98973655), (98973763, 98973843), (98974001, 98975207), (98975592, 98975758)]
WARNING: cannot finish protein at 98975592 for 603 in [(98971905, 98971943), (98972413, 98972574), (98973551, 98973655), (98973763, 98973843), (98974001, 98975207), (98975592, 98975758)]
WARNING: cannot finish protein at 98975592 for 480 in [(98971905, 98971943), (98972413, 98972574), (98973551, 98973655), (98973763, 98973843), (98974001, 98975207), (98975592, 98975758)]
WARNING: cannot finish protein at 98975592 for 645 in [(98971905, 98971943), (98972413, 98972574), (98973551, 98973655), (98973763, 98973843), (98974001, 98975207), (98975592, 98975758)]
WARNING: cannot finish protein at 101194139 for 21 in [(101193946, 101194139), (101194220, 101194337), (101194959, 101195117), (101195223, 101195417), (101196310, 101196390)]
WARNING: cannot finish protein at 101194139 for 21 in [(101193946, 101194139), (101194220, 101194337), (101194959, 101195117), (101195223, 101195417), (101196310, 101196390)]
WARNING: cannot finish protein at 101194139 for 21 in [(101193946, 101194139), (101194220, 101194337), (101194959, 101195117), (101195223, 101195417), (101196310, 101196390)]
WARNING: cannot finish protein at 101194139 for 21 in [(101193946, 101194139), (101194220, 101194337), (101194959, 101195117), (101195223, 101195417), (101196310, 101196390)]
WARNING: cannot finish protein at 101194139 for 21 in [(101193946, 101194139), (101194220, 101194337), (101194959, 101195117), (101195223, 101195417), (101196310, 101196390)]
WARNING: cannot finish protein at 1258 for 297 in [(104698785, 104699084)]
WARNING: no intervals for STY46_ARATH in NBlab19G57510.1
WARNING: cannot finish protein at 1258 for 303 in [(104698785, 104699084)]
WARNING: no intervals for STY17_ARATH in NBlab19G57510.1
WARNING: cannot finish protein at 1258 for 297 in [(104698785, 104699084)]
WARNING: no intervals for STY8_ARATH in NBlab19G57510.1
WARNING: cannot finish protein at 1258 for 306 in [(104698785, 104699084)]
WARNING: no intervals for P2C31_ARATH in NBlab19G57510.1
WARNING: cannot finish protein at 1258 for 291 in [(104698785, 104699084)]
WARNING: no intervals for P2C04_ORYSJ in NBlab19G57510.1
WARNING: cannot finish protein at 108401104 for 690 in [(108400711, 108401319)]
WARNING: cannot finish protein at 108401104 for 852 in [(108400711, 108401319)]
WARNING: cannot finish protein at 108401104 for 741 in [(108400711, 108401319)]
WARNING: cannot finish protein at 108401104 for 798 in [(108400711, 108401319)]
WARNING: cannot finish protein at 108401107 for 741 in [(108400711, 108401319)]
WARNING: cannot finish protein at 114550175 for 480 in [(114549041, 114549505), (114549616, 114549719), (114549824, 114550069), (114550175, 114550256)]
WARNING: cannot finish protein at 114550175 for 480 in [(114549041, 114549505), (114549616, 114549719), (114549824, 114550069), (114550175, 114550256)]
WARNING: cannot finish protein at 114550175 for 489 in [(114549041, 114549505), (114549616, 114549719), (114549824, 114550069), (114550175, 114550256)]
WARNING: cannot finish protein at 114550175 for 450 in [(114549041, 114549505), (114549616, 114549719), (114549824, 114550069), (114550175, 114550256)]
WARNING: cannot finish protein at 114550175 for 468 in [(114549041, 114549505), (114549616, 114549719), (114549824, 114550069), (114550175, 114550256)]
WARNING: cannot finish protein at 1150 for 408 in [(118646907, 118647245)]
WARNING: no intervals for P2C38_ARATH in NBlab19G63290.1
WARNING: cannot finish protein at 1150 for 411 in [(118646907, 118647245)]
WARNING: no intervals for P2C48_ARATH in NBlab19G63290.1
WARNING: cannot finish protein at 1150 for 405 in [(118646907, 118647245)]
WARNING: no intervals for P2C79_ARATH in NBlab19G63290.1
WARNING: cannot finish protein at 1150 for 405 in [(118646907, 118647245)]
WARNING: no intervals for P2C28_ORYSJ in NBlab19G63290.1
WARNING: cannot finish protein at 1150 for 405 in [(118646907, 118647245)]
WARNING: no intervals for P2C46_ARATH in NBlab19G63290.1
WARNING: cannot finish protein at 1345 for 612 in [(122888006, 122888470)]
WARNING: no intervals for POLX_TOBAC in NBlab19G65350.1
WARNING: cannot finish protein at 502 for 633 in [(122888006, 122888470)]
WARNING: no intervals for POLR1_ARATH in NBlab19G65350.1
WARNING: cannot finish protein at 502 for 840 in [(122888006, 122888470)]
WARNING: no intervals for POLR2_ARATH in NBlab19G65350.1
WARNING: cannot finish protein at 124080444 for 261 in [(124080169, 124080444), (124082245, 124082370), (124082448, 124082693), (124086447, 124086566), (124087146, 124087459), (124087609, 124087734), (124087837, 124087936), (124091710, 124091844), (124092699, 124093097)]
WARNING: cannot finish protein at 124080444 for 276 in [(124080169, 124080444), (124082245, 124082370), (124082448, 124082693), (124086447, 124086566), (124087146, 124087459), (124087609, 124087734), (124087837, 124087936), (124091710, 124091844), (124092699, 124093097)]
# Removed 11310 hits by shortness
# Removed 44 hits by bitscore
# Removed 37 hits by evalue
# Removed 0 hits that exceeded query max
# Found 303185 hits for 83708 queries  Mon Feb 17 14:15:58 2020
# Wrote 1318958 domain intervals  Mon Feb 17 14:15:58 2020
# WARNING: 1348 matches have hits extending beyond gene bounds  Mon Feb 17 14:15:58 2020
  1. How can I fix those warnings?
  2. How do those warnings affect the protein structure, and will the whole protein still be written to the GFF3 file?

Thank you in advance,

Michal

wrf commented 4 years ago

The warnings come from problems in the GFF annotation. The message "WARNING: no intervals for" means that no intervals are found in the GFF.

The message "WARNING: cannot finish protein" means that the blast hits would extend beyond the transcripts or proteins that are predicted. If something were systematically wrong, like using blastx instead of blastp, this would usually affect every protein. Since it's just 1348, I'm not sure what is wrong.
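To illustrate what "extending beyond the predicted transcript" means, here is a minimal sketch that walks a hit across a list of CDS intervals. It is not the tool's actual algorithm: the function name `map_hit_to_exons`, the 0-based offset convention, and the reading of the two numbers in the warning as an offset and a length are all assumptions. Coordinates are in nucleotides, as they would be after multiplying blastp positions by 3.

```python
def map_hit_to_exons(offset, length, exons):
    """Project a hit onto genomic coordinates.

    offset: 0-based number of coding nucleotides before the hit starts
    length: hit length in nucleotides
    exons:  sorted list of inclusive (start, end) intervals
    Returns a list of (start, end) pieces, or None if the exons run out first.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    pieces = []
    remaining = length
    skip = offset
    for start, end in exons:
        exon_len = end - start + 1
        if skip >= exon_len:
            skip -= exon_len  # hit begins after this exon
            continue
        piece_start = start + skip
        skip = 0
        piece_end = min(end, piece_start + remaining - 1)
        pieces.append((piece_start, piece_end))
        remaining -= piece_end - piece_start + 1
        if remaining == 0:
            return pieces
    return None  # ran out of exons: the hit cannot be finished


exons = [(100, 109), (200, 209)]          # 20 coding nucleotides in total
print(map_hit_to_exons(5, 10, exons))     # fits within the two exons
print(map_hit_to_exons(5, 30, exons))     # longer than the remaining CDS
```

Under that assumed reading, a single interval such as (10206930, 10207358) covers only 429 nucleotides, so a hit placed at offset 1567 could never be mapped onto it.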

It looks like about 300000 of them still worked. Maybe you can check in a browser if they make sense compared to the gene models.

mictadlo commented 4 years ago

Hi, thank you. I fixed my GFF3 file and it works now.

Michal

wrf commented 4 years ago

Great!