wrf / genomeGTFtools

convert various features into a GFF-like file for use in genome browsers
69 stars 27 forks source link

Errors in blast2genomegff.py #15

Closed TakashiKoyama closed 3 years ago

TakashiKoyama commented 3 years ago

Hi,

I made a gene model with BRAKER and would like to add gene names using blast2genomegff.py. I, however, have had two types of error, which depends on file format (gtf or gff3). If I used gtf, I got error as follows.

$ blast2genomegff.py -b braker04_augustus.hints_utr.uniprot_sprot.blastp.tab -p BLASTP -d uniprot_sprot.fasta -g augustus.hints_utr.gtf > augustus.hints_utr.blastp.uniprot_sprot.gff
# Parsing target sequences from uniprot_sprot.fasta  Fri Mar 19 21:36:27 2021
# Found 564277 sequences  Fri Mar 19 21:36:37 2021
# Parsing gff from augustus.hints_utr.gtf  Fri Mar 19 21:36:37 2021
Traceback (most recent call last):
  File "/usr/local/genome/bin/blast2genomegff.py", line 489, in <module>
    main(sys.argv[1:],sys.stdout)
  File "/usr/local/genome/bin/blast2genomegff.py", line 483, in main
    geneintervals, genestrand, genescaffold =  gtf_to_intervals(args.genes, args.cds_exons, args.skip_exons, args.transdecoder, args.no_genes, args.gff_delimiter)
  File "/usr/local/genome/bin/blast2genomegff.py", line 134, in gtf_to_intervals
    genestrand[geneid] = strand
UnboundLocalError: local variable 'geneid' referenced before assignment

If I used gff3, another error was obtained.

$ blast2genomegff.py -b braker04_augustus.hints_utr.uniprot_sprot.blastp.tab -p BLASTP -d uniprot_sprot.fasta -g augustus.hints_utr.gff3 > augustus.hints_utr.blastp.uniprot_sprot.gff
# Parsing target sequences from uniprot_sprot.fasta  Fri Mar 19 21:38:07 2021
# Found 564277 sequences  Fri Mar 19 21:38:15 2021
# Parsing gff from augustus.hints_utr.gff3  Fri Mar 19 21:38:15 2021
# Counted 2407291 lines and 0 comments  Fri Mar 19 21:38:23 2021
# Ignored 1150218 other features in the GFF
# Counted 567177 exons for 116687 inferred transcripts
# blast program is blastp, multiplying coordinates by 3
# Starting BLAST parsing on braker04_augustus.hints_utr.uniprot_sprot.blastp.tab  Fri Mar 19 21:38:23 2021
# Removed 761276 hits by shortness
# Removed 0 hits by bitscore
# Removed 0 hits by evalue
# Removed 0 hits that exceeded query max
# Found 0 hits for 0 queries  Fri Mar 19 21:38:26 2021
# WARNING: did not write any intervals, check options -D -F or -G for mismatch between IDs in GFF and blast table

My files are as follows.

Any suggestions would be appreciated.

wrf commented 3 years ago

Hello, There is an expectation of a specific format in the GFF, and as very few programs give the same output, there is potential for many problems.

Between the two, it is probably easier to use the augustus GFF3.

It looks like the blast results are keeping the accession number, but it may not process the fasta file in the same way (as you entered -d uniprot_sprot.fasta). My guess is that you are getting 0 intervals because it cannot find the proteins in the database. Normally, blast would keep the entire header sp|P31946|1433B_HUMAN, and the program expects this to be the same. All of the subsequent processing also assumes this format, because that does not require any changes to the SwissProt file directly from the database.

Can you give me the output of head uniprot_sprot.fasta? Thanks

TakashiKoyama commented 3 years ago

Thank you for your quick reply. Here is the output.

$ head uniprot_sprot.fasta 
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-001R PE=4 SV=1
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS
EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD
AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL
EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD
SFRKIYTDLGWKFTPL
>sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-002L PE=4 SV=1
MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCAR
IKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSL
AERYCMRGVKNTAGELVSRVSSDADPAGGWCRKWYSAHRGPDQDAALGSFCIKNPGAADC

I blasted as follow.

$ blastp -query augustus.hints_utr.aa -db uniprot_sprot -num_threads 40 -outfmt 6 -max_target_seqs 5 > braker04_augustus.hints_utr.uniprot_sprot.blastp.tab
wrf commented 3 years ago

I suspect this is due to a problem of being unable to find the accessions of the subject -b in the database fasta file -d. Normally, when I run blast, the accession of the subject would be reported as sp|Q6GZX4|001R_FRG3G, rather than just Q6GZX4. The program is expecting that your fasta file will have an protein with the exact ID Q6GZX4. I'm not sure what happened here, perhaps something to shorten the name during the makeblastdb step.

Which version of blast are you using?

I just put an update of the code that should clarify the error. Can you rerun your original command with the updated version?

TakashiKoyama commented 3 years ago

I'm using blast 2.10.1.

I made the db by

$ makeblastdb -in uniprot_sprot.fasta -out uniprot_sprot -dbtype prot -parse_seqids

-perse_seqids may be the cause.

Alternatively, perhaps I can produce expected accessions by

$ blastp -query augustus.hints_utr.aa -db uniprot_sprot -num_threads 80 -outfmt "6 qseqid sallseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" -max_target_seqs 5

I will let you know how it works later.

TakashiKoyama commented 3 years ago

Hi,

I re-blasted with option -outfmt "6 qseqid sallseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" and got tab-delimited file as follows.

$ head braker04_augustus.hints_utr.uniprot_sprot.blastp.tab2
g21100.t1       sp|Q90663|SEM3D_CHICK   53.097  113     52      1       92      203     41      153     7.49e-34        130
g21100.t1       sp|O95025|SEM3D_HUMAN   54.128  109     49      1       92      199     53      161     2.96e-33        129
g21100.t1       sp|Q8BH34|SEM3D_MOUSE   51.327  113     54      1       92      203     53      165     5.50e-33        128
g21100.t1       sp|Q9W6G6|SEM3D_DANRE   47.934  121     61      1       87      205     51      171     1.34e-29        118
g21100.t1       sp|Q9W7J1|SE3AA_DANRE   42.478  113     64      1       92      203     40      152     1.41e-26        109
g21100.t2       sp|Q9W6G6|SEM3D_DANRE   44.156  77      41      1       87      161     51      127     4.15e-11        63.9
g21100.t2       sp|Q90663|SEM3D_CHICK   44.286  70      38      1       92      160     41      110     4.93e-11        63.5
g21100.t2       sp|O95025|SEM3D_HUMAN   45.714  70      37      1       92      160     53      122     1.03e-10        62.8
g21100.t2       sp|Q8BH34|SEM3D_MOUSE   42.857  70      39      1       92      160     53      122     1.17e-10        62.4
g21100.t2       sp|Q99985|SEM3C_HUMAN   45.283  53      29      0       108     160     54      106     1.66e-07        53.1

Subject IDs in the blast output is exactly same as in db fasta.

$ grep 'sp|Q90663|SEM3D_CHICK' uniprot_sprot.fasta 
>sp|Q90663|SEM3D_CHICK Semaphorin-3D OS=Gallus gallus OX=9031 GN=SEMA3D PE=2 SV=1

But I got another error in blast2genomegff.py even using this.

$ blast2genomegff.py -b braker04_augustus.hints_utr.uniprot_sprot.blastp.tab2 -p BLASTP -d uniprot_sprot.fasta -g augustus.hints_utr.gff3 > augustus.hints_utr.blastp.uniprot_sprot.gff
# Parsing target sequences from uniprot_sprot.fasta  Sat Mar 20 03:25:35 2021
# Found 564277 sequences  Sat Mar 20 03:25:43 2021
# Parsing gff from augustus.hints_utr.gff3  Sat Mar 20 03:25:43 2021
# Counted 2407291 lines and 0 comments  Sat Mar 20 03:25:51 2021
# Ignored 1150218 other features in the GFF
# Counted 567177 exons for 116687 inferred transcripts
# blast program is blastp, multiplying coordinates by 3
# Starting BLAST parsing on braker04_augustus.hints_utr.uniprot_sprot.blastp.tab2  Sat Mar 20 03:25:51 2021
WARNING: cannot finish protein at 274 for 336 in []
WARNING: no intervals for spQ90663SEM3D_CHICK in g21100.t1
WARNING: cannot finish protein at 274 for 324 in []
WARNING: no intervals for spO95025SEM3D_HUMAN in g21100.t1
WARNING: cannot finish protein at 274 for 336 in []
WARNING: no intervals for spQ8BH34SEM3D_MOUSE in g21100.t1
WARNING: cannot finish protein at 259 for 357 in []
WARNING: no intervals for spQ9W6G6SEM3D_DANRE in g21100.t1
WARNING: cannot finish protein at 274 for 336 in []
WARNING: no intervals for spQ9W7J1SE3AA_DANRE in g21100.t1
WARNING: cannot finish protein at 259 for 225 in []
WARNING: no intervals for spQ9W6G6SEM3D_DANRE in g21100.t2
WARNING: cannot finish protein at 1 for 330 in []
WARNING: no intervals for spQ9W6G6SEM3D_DANRE in g21101.t1
WARNING: cannot finish protein at 1 for 330 in []
WARNING: no intervals for spQ90663SEM3D_CHICK in g21101.t1
WARNING: cannot finish protein at 1 for 330 in []
WARNING: no intervals for spQ8BH34SEM3D_MOUSE in g21101.t1
WARNING: cannot finish protein at 1 for 330 in []
WARNING: no intervals for spO95025SEM3D_HUMAN in g21101.t1
WARNING: cannot finish protein at 1 for 330 in []
WARNING: no intervals for spQ62177SEM3B_MOUSE in g21101.t1
WARNING: cannot finish protein at 1 for 243 in []
WARNING: no intervals for spQ9W6G6SEM3D_DANRE in g21101.t2
WARNING: cannot finish protein at 1 for 243 in []
WARNING: no intervals for spQ90663SEM3D_CHICK in g21101.t2
WARNING: cannot finish protein at 1 for 243 in []
WARNING: no intervals for spQ8BH34SEM3D_MOUSE in g21101.t2
WARNING: cannot finish protein at 1 for 243 in []
WARNING: no intervals for spO95025SEM3D_HUMAN in g21101.t2
WARNING: cannot finish protein at 4 for 240 in []
WARNING: no intervals for spO08665SEM3A_MOUSE in g21101.t2
.
.
.
.
.
WARNING: cannot finish protein at 700 for 666 in []
WARNING: no intervals for spQ23023UNC51_CAEEL in g57895.t5
WARNING: cannot finish protein at 160 for 243 in []
WARNING: no intervals for spQ96LY2CC74B_HUMAN in g57896.t1
WARNING: cannot finish protein at 556 for 369 in []
WARNING: no intervals for spQ96LY2CC74B_HUMAN in g57896.t1
WARNING: cannot finish protein at 160 for 243 in []
WARNING: no intervals for spQ96AQ1CC74A_HUMAN in g57896.t1
WARNING: cannot finish protein at 556 for 369 in []
WARNING: no intervals for spQ96AQ1CC74A_HUMAN in g57896.t1
WARNING: cannot finish protein at 4 for 912 in []
WARNING: no intervals for spP46109CRKL_HUMAN in g57897.t1
WARNING: cannot finish protein at 4 for 912 in []
WARNING: no intervals for spQ5U2U2CRKL_RAT in g57897.t1
WARNING: cannot finish protein at 4 for 912 in []
WARNING: no intervals for spP47941CRKL_MOUSE in g57897.t1
WARNING: cannot finish protein at 1 for 909 in []
WARNING: no intervals for spP87378CRK_XENLA in g57897.t1
WARNING: cannot finish protein at 1 for 909 in []
WARNING: no intervals for spP46108CRK_HUMAN in g57897.t1
WARNING: cannot finish protein at 19 for 324 in []
WARNING: no intervals for spP14380YTX1_XENLA in g57899.t1
WARNING: cannot finish protein at 607 for 300 in []
WARNING: no intervals for spQ8R0T2ZN768_MOUSE in g57899.t1
# Removed 270340 hits by shortness
# Removed 1258 hits by bitscore
# Removed 91919 hits by evalue
# Removed 40000 hits that exceeded query max
# Found 397759 hits for 72299 queries  Sat Mar 20 03:26:05 2021
# WARNING: did not write any intervals, check options -D -F or -G for mismatch between IDs in GFF and blast table
# WARNING: 357759 matches have hits extending beyond gene bounds  Sat Mar 20 03:26:05 2021

Should I start from makeblastdb?

wrf commented 3 years ago

I think there are some hardcoded things to remove symbols. Try it with option -S

TakashiKoyama commented 3 years ago

Still have same error even with -S.

wrf commented 3 years ago

Since you are using SwissProt, I find for some other downstream analyses, that it is useful to include --add-description and --add-accession as well

wrf commented 3 years ago

I probably should program defaults for AUGUSTUS... can you try with -x -K as well?

TakashiKoyama commented 3 years ago

It didn't work. Mmm...

TakashiKoyama commented 3 years ago

I meant, still have same error.

wrf commented 3 years ago

ah ok, I can see that I did not program this in a smart way, in the meantime, can you try with options -S -x -K -F "." ?

TakashiKoyama commented 3 years ago

It doesn't change much.

$ blast2genomegff.py -b braker04_augustus.hints_utr.uniprot_sprot.blastp.tab2 -p BLASTP -d uniprot_sprot.fasta -g augustus.hints_utr.gff3 --add-description --add-accession -S -x -K -F "." > augustus.hints_utr.blastp.uniprot_sprot.gff
# Parsing target sequences from uniprot_sprot.fasta  Sat Mar 20 21:01:48 2021
# Taking length and descriptions from sequences
# Found 564277 sequences  Sat Mar 20 21:02:01 2021
# Parsing gff from augustus.hints_utr.gff3  Sat Mar 20 21:02:01 2021
# exon features WILL BE IGNORED
# CDS features WILL BE USED as exons
# Counted 2407291 lines and 0 comments  Sat Mar 20 21:02:10 2021
# Ignored 1150218 other features in the GFF
# Counted 510343 exons for 116687 inferred transcripts
# blast program is blastp, multiplying coordinates by 3
# Starting BLAST parsing on braker04_augustus.hints_utr.uniprot_sprot.blastp.tab2  Sat Mar 20 21:02:10 2021
WARNING: cannot get scaffold for g21100.t1
WARNING: cannot get scaffold for g21100.t1
WARNING: cannot get scaffold for g21100.t1
WARNING: cannot get scaffold for g21100.t1
WARNING: cannot get scaffold for g21100.t1
WARNING: cannot get scaffold for g21100.t2
WARNING: cannot get scaffold for g21101.t1
WARNING: cannot get scaffold for g21101.t1
WARNING: cannot get scaffold for g21101.t1
WARNING: cannot get scaffold for g21101.t1, will not print further warnings
# Counted 761276 lines and kept 397759 hits
# Removed 270340 hits by shortness
# Removed 1258 hits by bitscore
# Removed 91919 hits by evalue
# Removed 40000 hits that exceeded query max
# Found 397759 hits for 72299 queries  Sat Mar 20 21:02:13 2021
# WARNING: did not write any intervals, check options -D -F or -G for mismatch between IDs in GFF and blast table
# WARNING: could not find scaffold for 357759 hits  Sat Mar 20 21:02:13 2021

Sorry for making trouble...

wrf commented 3 years ago

Can you try with -S -x -K -F "." -G ?

TakashiKoyama commented 3 years ago

It works!

$ blast2genomegff.py -b braker04_augustus.hints_utr.uniprot_sprot.blastp.tab2 -p BLASTP -d uniprot_sprot.fasta -g augustus.hints_utr.gff3 --add-description --add-accession -S -x -K -F "." -G > augustus.hints_utr.blastp.uniprot_sprot.gff
# Parsing target sequences from uniprot_sprot.fasta  Sat Mar 20 22:32:07 2021
# Taking length and descriptions from sequences
# Found 564277 sequences  Sat Mar 20 22:32:20 2021
# Parsing gff from augustus.hints_utr.gff3  Sat Mar 20 22:32:20 2021
# exon features WILL BE IGNORED
# CDS features WILL BE USED as exons
# gene name and strand will be read for each exon
# Counted 2407291 lines and 0 comments  Sat Mar 20 22:32:29 2021
# Ignored 1150218 other features in the GFF
# Counted 510343 exons for 116687 inferred transcripts
# blast program is blastp, multiplying coordinates by 3
# Starting BLAST parsing on braker04_augustus.hints_utr.uniprot_sprot.blastp.tab2  Sat Mar 20 22:32:29 2021
# Counted 761276 lines and kept 397759 hits
# Removed 270340 hits by shortness
# Removed 1258 hits by bitscore
# Removed 91919 hits by evalue
# Removed 40000 hits that exceeded query max
# Found 397759 hits for 72299 queries  Sat Mar 20 22:32:50 2021
# Wrote 1878164 match intervals  Sat Mar 20 22:32:50 2021

However, resultant gff is quite different from the one I expected.

$ head -30 augustus.hints_utr.blastp.uniprot_sprot.gff 
chr01   BLASTP  protein_match   13909   15161   130.0   +       .       ID=SEM3D_CHICK.g21100.t1.1;Target=SEM3D_CHICK 41 153 +;same_sense=1;Gaps=1;Mismatch=52;Evalue=7.49e-34;Description=Semaphorin-3D;Accession=Q90663
chr01   BLASTP  match_part      13909   14061   130.0   +       .       Parent=SEM3D_CHICK.g21100.t1.1
chr01   BLASTP  match_part      14196   14258   130.0   +       .       Parent=SEM3D_CHICK.g21100.t1.1
chr01   BLASTP  match_part      15042   15161   130.0   +       .       Parent=SEM3D_CHICK.g21100.t1.1
chr01   BLASTP  protein_match   13909   15149   129.0   +       .       ID=SEM3D_HUMAN.g21100.t1.1;Target=SEM3D_HUMAN 53 161 +;same_sense=1;Gaps=1;Mismatch=49;Evalue=2.96e-33;Description=Semaphorin-3D;Accession=O95025
chr01   BLASTP  match_part      13909   14061   129.0   +       .       Parent=SEM3D_HUMAN.g21100.t1.1
chr01   BLASTP  match_part      14196   14258   129.0   +       .       Parent=SEM3D_HUMAN.g21100.t1.1
chr01   BLASTP  match_part      15042   15149   129.0   +       .       Parent=SEM3D_HUMAN.g21100.t1.1
chr01   BLASTP  protein_match   13909   15161   128.0   +       .       ID=SEM3D_MOUSE.g21100.t1.1;Target=SEM3D_MOUSE 53 165 +;same_sense=1;Gaps=1;Mismatch=54;Evalue=5.5e-33;Description=Semaphorin-3D;Accession=Q8BH34
chr01   BLASTP  match_part      13909   14061   128.0   +       .       Parent=SEM3D_MOUSE.g21100.t1.1
chr01   BLASTP  match_part      14196   14258   128.0   +       .       Parent=SEM3D_MOUSE.g21100.t1.1
chr01   BLASTP  match_part      15042   15161   128.0   +       .       Parent=SEM3D_MOUSE.g21100.t1.1
chr01   BLASTP  protein_match   12466   15167   118.0   +       .       ID=SEM3D_DANRE.g21100.t1.1;Target=SEM3D_DANRE 51 171 +;same_sense=1;Gaps=1;Mismatch=61;Evalue=1.34e-29;Description=Semaphorin-3D;Accession=Q9W6G6
chr01   BLASTP  match_part      12466   12478   118.0   +       .       Parent=SEM3D_DANRE.g21100.t1.1
chr01   BLASTP  match_part      13907   14061   118.0   +       .       Parent=SEM3D_DANRE.g21100.t1.1
chr01   BLASTP  match_part      14196   14258   118.0   +       .       Parent=SEM3D_DANRE.g21100.t1.1
chr01   BLASTP  match_part      15042   15167   118.0   +       .       Parent=SEM3D_DANRE.g21100.t1.1
chr01   BLASTP  protein_match   13909   15161   109.0   +       .       ID=SE3AA_DANRE.g21100.t1.1;Target=SE3AA_DANRE 40 152 +;same_sense=1;Gaps=1;Mismatch=64;Evalue=1.41e-26;Description=Semaphorin-3aa;Accession=Q9W7J1
chr01   BLASTP  match_part      13909   14061   109.0   +       .       Parent=SE3AA_DANRE.g21100.t1.1
chr01   BLASTP  match_part      14196   14258   109.0   +       .       Parent=SE3AA_DANRE.g21100.t1.1
chr01   BLASTP  match_part      15042   15161   109.0   +       .       Parent=SE3AA_DANRE.g21100.t1.1
chr01   BLASTP  protein_match   12466   14252   63.9    +       .       ID=SEM3D_DANRE.g21100.t2.2;Target=SEM3D_DANRE 51 127 +;same_sense=1;Gaps=1;Mismatch=41;Evalue=4.15e-11;Description=Semaphorin-3D;Accession=Q9W6G6
chr01   BLASTP  match_part      12466   12478   63.9    +       .       Parent=SEM3D_DANRE.g21100.t2.2
chr01   BLASTP  match_part      13907   14061   63.9    +       .       Parent=SEM3D_DANRE.g21100.t2.2
chr01   BLASTP  match_part      14196   14252   63.9    +       .       Parent=SEM3D_DANRE.g21100.t2.2
chr01   BLASTP  protein_match   19773   21728   168.0   +       .       ID=SEM3D_DANRE.g21101.t1.3;Target=SEM3D_DANRE 366 475 +;same_sense=1;Gaps=0;Mismatch=34;Evalue=2.09e-48;Description=Semaphorin-3D;Accession=Q9W6G6
chr01   BLASTP  match_part      19773   19880   168.0   +       .       Parent=SEM3D_DANRE.g21101.t1.3
chr01   BLASTP  match_part      21507   21728   168.0   +       .       Parent=SEM3D_DANRE.g21101.t1.3
chr01   BLASTP  protein_match   19773   21728   166.0   +       .       ID=SEM3D_CHICK.g21101.t1.2;Target=SEM3D_CHICK 350 459 +;same_sense=1;Gaps=0;Mismatch=35;Evalue=1.01e-47;Description=Semaphorin-3D;Accession=Q90663
chr01   BLASTP  match_part      19773   19880   166.0   +       .       Parent=SEM3D_CHICK.g21101.t1.2

I expected that the resultant gff was similar to the original gff but gene names are included. But in this output, I do not see gene structure information (such as exon, utr, etc) Is this output usual?

wrf commented 3 years ago

Great that it works! Sorry for all the steps. Yes that is usual, since these are not gene predictions, they are matches. The point is to show how much of a target protein is matched by the gene model.

TakashiKoyama commented 3 years ago

So the usual usage of this output gff in GenomeViewer, such as IGV, is importing both of the gff (here, BRAKER-produce gff and output gff)?