Closed TakashiKoyama closed 3 years ago
Hello, There is an expectation of a specific format in the GFF, and as very few programs give the same output, there is potential for many problems.
Between the two, it is probably easier to use the augustus GFF3.
It looks like the blast results are keeping the accession number, but it may not process the fasta file in the same way (as you entered -d uniprot_sprot.fasta
). My guess is that you are getting 0 intervals because it cannot find the proteins in the database.
Normally, blast would keep the entire header sp|P31946|1433B_HUMAN
, and the program expects this to be the same. All of the subsequent processing also assumes this format, because that does not require any changes to the SwissProt file directly from the database.
Can you give me the output of head uniprot_sprot.fasta
?
Thanks
Thank you for your quick reply. Here is the output.
$ head uniprot_sprot.fasta
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-001R PE=4 SV=1
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS
EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD
AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL
EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD
SFRKIYTDLGWKFTPL
>sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-002L PE=4 SV=1
MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCAR
IKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSL
AERYCMRGVKNTAGELVSRVSSDADPAGGWCRKWYSAHRGPDQDAALGSFCIKNPGAADC
I blasted as follow.
$ blastp -query augustus.hints_utr.aa -db uniprot_sprot -num_threads 40 -outfmt 6 -max_target_seqs 5 > braker04_augustus.hints_utr.uniprot_sprot.blastp.tab
I suspect this is due to a problem of being unable to find the accessions of the subject -b
in the database fasta file -d
.
Normally, when I run blast, the accession of the subject would be reported as sp|Q6GZX4|001R_FRG3G
, rather than just Q6GZX4
. The program is expecting that your fasta file will have an protein with the exact ID Q6GZX4
.
I'm not sure what happened here, perhaps something to shorten the name during the makeblastdb
step.
Which version of blast are you using?
I just put an update of the code that should clarify the error. Can you rerun your original command with the updated version?
I'm using blast 2.10.1.
I made the db by
$ makeblastdb -in uniprot_sprot.fasta -out uniprot_sprot -dbtype prot -parse_seqids
-perse_seqids
may be the cause.
Alternatively, perhaps I can produce expected accessions by
$ blastp -query augustus.hints_utr.aa -db uniprot_sprot -num_threads 80 -outfmt "6 qseqid sallseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" -max_target_seqs 5
I will let you know how it works later.
Hi,
I re-blasted with option -outfmt "6 qseqid sallseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore"
and got tab-delimited file as follows.
$ head braker04_augustus.hints_utr.uniprot_sprot.blastp.tab2
g21100.t1 sp|Q90663|SEM3D_CHICK 53.097 113 52 1 92 203 41 153 7.49e-34 130
g21100.t1 sp|O95025|SEM3D_HUMAN 54.128 109 49 1 92 199 53 161 2.96e-33 129
g21100.t1 sp|Q8BH34|SEM3D_MOUSE 51.327 113 54 1 92 203 53 165 5.50e-33 128
g21100.t1 sp|Q9W6G6|SEM3D_DANRE 47.934 121 61 1 87 205 51 171 1.34e-29 118
g21100.t1 sp|Q9W7J1|SE3AA_DANRE 42.478 113 64 1 92 203 40 152 1.41e-26 109
g21100.t2 sp|Q9W6G6|SEM3D_DANRE 44.156 77 41 1 87 161 51 127 4.15e-11 63.9
g21100.t2 sp|Q90663|SEM3D_CHICK 44.286 70 38 1 92 160 41 110 4.93e-11 63.5
g21100.t2 sp|O95025|SEM3D_HUMAN 45.714 70 37 1 92 160 53 122 1.03e-10 62.8
g21100.t2 sp|Q8BH34|SEM3D_MOUSE 42.857 70 39 1 92 160 53 122 1.17e-10 62.4
g21100.t2 sp|Q99985|SEM3C_HUMAN 45.283 53 29 0 108 160 54 106 1.66e-07 53.1
Subject IDs in the blast output is exactly same as in db fasta.
$ grep 'sp|Q90663|SEM3D_CHICK' uniprot_sprot.fasta
>sp|Q90663|SEM3D_CHICK Semaphorin-3D OS=Gallus gallus OX=9031 GN=SEMA3D PE=2 SV=1
But I got another error in blast2genomegff.py
even using this.
$ blast2genomegff.py -b braker04_augustus.hints_utr.uniprot_sprot.blastp.tab2 -p BLASTP -d uniprot_sprot.fasta -g augustus.hints_utr.gff3 > augustus.hints_utr.blastp.uniprot_sprot.gff
# Parsing target sequences from uniprot_sprot.fasta Sat Mar 20 03:25:35 2021
# Found 564277 sequences Sat Mar 20 03:25:43 2021
# Parsing gff from augustus.hints_utr.gff3 Sat Mar 20 03:25:43 2021
# Counted 2407291 lines and 0 comments Sat Mar 20 03:25:51 2021
# Ignored 1150218 other features in the GFF
# Counted 567177 exons for 116687 inferred transcripts
# blast program is blastp, multiplying coordinates by 3
# Starting BLAST parsing on braker04_augustus.hints_utr.uniprot_sprot.blastp.tab2 Sat Mar 20 03:25:51 2021
WARNING: cannot finish protein at 274 for 336 in []
WARNING: no intervals for spQ90663SEM3D_CHICK in g21100.t1
WARNING: cannot finish protein at 274 for 324 in []
WARNING: no intervals for spO95025SEM3D_HUMAN in g21100.t1
WARNING: cannot finish protein at 274 for 336 in []
WARNING: no intervals for spQ8BH34SEM3D_MOUSE in g21100.t1
WARNING: cannot finish protein at 259 for 357 in []
WARNING: no intervals for spQ9W6G6SEM3D_DANRE in g21100.t1
WARNING: cannot finish protein at 274 for 336 in []
WARNING: no intervals for spQ9W7J1SE3AA_DANRE in g21100.t1
WARNING: cannot finish protein at 259 for 225 in []
WARNING: no intervals for spQ9W6G6SEM3D_DANRE in g21100.t2
WARNING: cannot finish protein at 1 for 330 in []
WARNING: no intervals for spQ9W6G6SEM3D_DANRE in g21101.t1
WARNING: cannot finish protein at 1 for 330 in []
WARNING: no intervals for spQ90663SEM3D_CHICK in g21101.t1
WARNING: cannot finish protein at 1 for 330 in []
WARNING: no intervals for spQ8BH34SEM3D_MOUSE in g21101.t1
WARNING: cannot finish protein at 1 for 330 in []
WARNING: no intervals for spO95025SEM3D_HUMAN in g21101.t1
WARNING: cannot finish protein at 1 for 330 in []
WARNING: no intervals for spQ62177SEM3B_MOUSE in g21101.t1
WARNING: cannot finish protein at 1 for 243 in []
WARNING: no intervals for spQ9W6G6SEM3D_DANRE in g21101.t2
WARNING: cannot finish protein at 1 for 243 in []
WARNING: no intervals for spQ90663SEM3D_CHICK in g21101.t2
WARNING: cannot finish protein at 1 for 243 in []
WARNING: no intervals for spQ8BH34SEM3D_MOUSE in g21101.t2
WARNING: cannot finish protein at 1 for 243 in []
WARNING: no intervals for spO95025SEM3D_HUMAN in g21101.t2
WARNING: cannot finish protein at 4 for 240 in []
WARNING: no intervals for spO08665SEM3A_MOUSE in g21101.t2
.
.
.
.
.
WARNING: cannot finish protein at 700 for 666 in []
WARNING: no intervals for spQ23023UNC51_CAEEL in g57895.t5
WARNING: cannot finish protein at 160 for 243 in []
WARNING: no intervals for spQ96LY2CC74B_HUMAN in g57896.t1
WARNING: cannot finish protein at 556 for 369 in []
WARNING: no intervals for spQ96LY2CC74B_HUMAN in g57896.t1
WARNING: cannot finish protein at 160 for 243 in []
WARNING: no intervals for spQ96AQ1CC74A_HUMAN in g57896.t1
WARNING: cannot finish protein at 556 for 369 in []
WARNING: no intervals for spQ96AQ1CC74A_HUMAN in g57896.t1
WARNING: cannot finish protein at 4 for 912 in []
WARNING: no intervals for spP46109CRKL_HUMAN in g57897.t1
WARNING: cannot finish protein at 4 for 912 in []
WARNING: no intervals for spQ5U2U2CRKL_RAT in g57897.t1
WARNING: cannot finish protein at 4 for 912 in []
WARNING: no intervals for spP47941CRKL_MOUSE in g57897.t1
WARNING: cannot finish protein at 1 for 909 in []
WARNING: no intervals for spP87378CRK_XENLA in g57897.t1
WARNING: cannot finish protein at 1 for 909 in []
WARNING: no intervals for spP46108CRK_HUMAN in g57897.t1
WARNING: cannot finish protein at 19 for 324 in []
WARNING: no intervals for spP14380YTX1_XENLA in g57899.t1
WARNING: cannot finish protein at 607 for 300 in []
WARNING: no intervals for spQ8R0T2ZN768_MOUSE in g57899.t1
# Removed 270340 hits by shortness
# Removed 1258 hits by bitscore
# Removed 91919 hits by evalue
# Removed 40000 hits that exceeded query max
# Found 397759 hits for 72299 queries Sat Mar 20 03:26:05 2021
# WARNING: did not write any intervals, check options -D -F or -G for mismatch between IDs in GFF and blast table
# WARNING: 357759 matches have hits extending beyond gene bounds Sat Mar 20 03:26:05 2021
Should I start from makeblastdb
?
I think there are some hardcoded things to remove symbols.
Try it with option -S
Still have same error even with -S
.
Since you are using SwissProt, I find for some other downstream analyses, that it is useful to include --add-description
and --add-accession
as well
I probably should program defaults for AUGUSTUS...
can you try with -x -K
as well?
It didn't work. Mmm...
I meant, still have same error.
ah ok, I can see that I did not program this in a smart way, in the meantime, can you try with options -S -x -K -F "."
?
It doesn't change much.
$ blast2genomegff.py -b braker04_augustus.hints_utr.uniprot_sprot.blastp.tab2 -p BLASTP -d uniprot_sprot.fasta -g augustus.hints_utr.gff3 --add-description --add-accession -S -x -K -F "." > augustus.hints_utr.blastp.uniprot_sprot.gff
# Parsing target sequences from uniprot_sprot.fasta Sat Mar 20 21:01:48 2021
# Taking length and descriptions from sequences
# Found 564277 sequences Sat Mar 20 21:02:01 2021
# Parsing gff from augustus.hints_utr.gff3 Sat Mar 20 21:02:01 2021
# exon features WILL BE IGNORED
# CDS features WILL BE USED as exons
# Counted 2407291 lines and 0 comments Sat Mar 20 21:02:10 2021
# Ignored 1150218 other features in the GFF
# Counted 510343 exons for 116687 inferred transcripts
# blast program is blastp, multiplying coordinates by 3
# Starting BLAST parsing on braker04_augustus.hints_utr.uniprot_sprot.blastp.tab2 Sat Mar 20 21:02:10 2021
WARNING: cannot get scaffold for g21100.t1
WARNING: cannot get scaffold for g21100.t1
WARNING: cannot get scaffold for g21100.t1
WARNING: cannot get scaffold for g21100.t1
WARNING: cannot get scaffold for g21100.t1
WARNING: cannot get scaffold for g21100.t2
WARNING: cannot get scaffold for g21101.t1
WARNING: cannot get scaffold for g21101.t1
WARNING: cannot get scaffold for g21101.t1
WARNING: cannot get scaffold for g21101.t1, will not print further warnings
# Counted 761276 lines and kept 397759 hits
# Removed 270340 hits by shortness
# Removed 1258 hits by bitscore
# Removed 91919 hits by evalue
# Removed 40000 hits that exceeded query max
# Found 397759 hits for 72299 queries Sat Mar 20 21:02:13 2021
# WARNING: did not write any intervals, check options -D -F or -G for mismatch between IDs in GFF and blast table
# WARNING: could not find scaffold for 357759 hits Sat Mar 20 21:02:13 2021
Sorry for making trouble...
Can you try with -S -x -K -F "." -G
?
It works!
$ blast2genomegff.py -b braker04_augustus.hints_utr.uniprot_sprot.blastp.tab2 -p BLASTP -d uniprot_sprot.fasta -g augustus.hints_utr.gff3 --add-description --add-accession -S -x -K -F "." -G > augustus.hints_utr.blastp.uniprot_sprot.gff
# Parsing target sequences from uniprot_sprot.fasta Sat Mar 20 22:32:07 2021
# Taking length and descriptions from sequences
# Found 564277 sequences Sat Mar 20 22:32:20 2021
# Parsing gff from augustus.hints_utr.gff3 Sat Mar 20 22:32:20 2021
# exon features WILL BE IGNORED
# CDS features WILL BE USED as exons
# gene name and strand will be read for each exon
# Counted 2407291 lines and 0 comments Sat Mar 20 22:32:29 2021
# Ignored 1150218 other features in the GFF
# Counted 510343 exons for 116687 inferred transcripts
# blast program is blastp, multiplying coordinates by 3
# Starting BLAST parsing on braker04_augustus.hints_utr.uniprot_sprot.blastp.tab2 Sat Mar 20 22:32:29 2021
# Counted 761276 lines and kept 397759 hits
# Removed 270340 hits by shortness
# Removed 1258 hits by bitscore
# Removed 91919 hits by evalue
# Removed 40000 hits that exceeded query max
# Found 397759 hits for 72299 queries Sat Mar 20 22:32:50 2021
# Wrote 1878164 match intervals Sat Mar 20 22:32:50 2021
However, resultant gff is quite different from the one I expected.
$ head -30 augustus.hints_utr.blastp.uniprot_sprot.gff
chr01 BLASTP protein_match 13909 15161 130.0 + . ID=SEM3D_CHICK.g21100.t1.1;Target=SEM3D_CHICK 41 153 +;same_sense=1;Gaps=1;Mismatch=52;Evalue=7.49e-34;Description=Semaphorin-3D;Accession=Q90663
chr01 BLASTP match_part 13909 14061 130.0 + . Parent=SEM3D_CHICK.g21100.t1.1
chr01 BLASTP match_part 14196 14258 130.0 + . Parent=SEM3D_CHICK.g21100.t1.1
chr01 BLASTP match_part 15042 15161 130.0 + . Parent=SEM3D_CHICK.g21100.t1.1
chr01 BLASTP protein_match 13909 15149 129.0 + . ID=SEM3D_HUMAN.g21100.t1.1;Target=SEM3D_HUMAN 53 161 +;same_sense=1;Gaps=1;Mismatch=49;Evalue=2.96e-33;Description=Semaphorin-3D;Accession=O95025
chr01 BLASTP match_part 13909 14061 129.0 + . Parent=SEM3D_HUMAN.g21100.t1.1
chr01 BLASTP match_part 14196 14258 129.0 + . Parent=SEM3D_HUMAN.g21100.t1.1
chr01 BLASTP match_part 15042 15149 129.0 + . Parent=SEM3D_HUMAN.g21100.t1.1
chr01 BLASTP protein_match 13909 15161 128.0 + . ID=SEM3D_MOUSE.g21100.t1.1;Target=SEM3D_MOUSE 53 165 +;same_sense=1;Gaps=1;Mismatch=54;Evalue=5.5e-33;Description=Semaphorin-3D;Accession=Q8BH34
chr01 BLASTP match_part 13909 14061 128.0 + . Parent=SEM3D_MOUSE.g21100.t1.1
chr01 BLASTP match_part 14196 14258 128.0 + . Parent=SEM3D_MOUSE.g21100.t1.1
chr01 BLASTP match_part 15042 15161 128.0 + . Parent=SEM3D_MOUSE.g21100.t1.1
chr01 BLASTP protein_match 12466 15167 118.0 + . ID=SEM3D_DANRE.g21100.t1.1;Target=SEM3D_DANRE 51 171 +;same_sense=1;Gaps=1;Mismatch=61;Evalue=1.34e-29;Description=Semaphorin-3D;Accession=Q9W6G6
chr01 BLASTP match_part 12466 12478 118.0 + . Parent=SEM3D_DANRE.g21100.t1.1
chr01 BLASTP match_part 13907 14061 118.0 + . Parent=SEM3D_DANRE.g21100.t1.1
chr01 BLASTP match_part 14196 14258 118.0 + . Parent=SEM3D_DANRE.g21100.t1.1
chr01 BLASTP match_part 15042 15167 118.0 + . Parent=SEM3D_DANRE.g21100.t1.1
chr01 BLASTP protein_match 13909 15161 109.0 + . ID=SE3AA_DANRE.g21100.t1.1;Target=SE3AA_DANRE 40 152 +;same_sense=1;Gaps=1;Mismatch=64;Evalue=1.41e-26;Description=Semaphorin-3aa;Accession=Q9W7J1
chr01 BLASTP match_part 13909 14061 109.0 + . Parent=SE3AA_DANRE.g21100.t1.1
chr01 BLASTP match_part 14196 14258 109.0 + . Parent=SE3AA_DANRE.g21100.t1.1
chr01 BLASTP match_part 15042 15161 109.0 + . Parent=SE3AA_DANRE.g21100.t1.1
chr01 BLASTP protein_match 12466 14252 63.9 + . ID=SEM3D_DANRE.g21100.t2.2;Target=SEM3D_DANRE 51 127 +;same_sense=1;Gaps=1;Mismatch=41;Evalue=4.15e-11;Description=Semaphorin-3D;Accession=Q9W6G6
chr01 BLASTP match_part 12466 12478 63.9 + . Parent=SEM3D_DANRE.g21100.t2.2
chr01 BLASTP match_part 13907 14061 63.9 + . Parent=SEM3D_DANRE.g21100.t2.2
chr01 BLASTP match_part 14196 14252 63.9 + . Parent=SEM3D_DANRE.g21100.t2.2
chr01 BLASTP protein_match 19773 21728 168.0 + . ID=SEM3D_DANRE.g21101.t1.3;Target=SEM3D_DANRE 366 475 +;same_sense=1;Gaps=0;Mismatch=34;Evalue=2.09e-48;Description=Semaphorin-3D;Accession=Q9W6G6
chr01 BLASTP match_part 19773 19880 168.0 + . Parent=SEM3D_DANRE.g21101.t1.3
chr01 BLASTP match_part 21507 21728 168.0 + . Parent=SEM3D_DANRE.g21101.t1.3
chr01 BLASTP protein_match 19773 21728 166.0 + . ID=SEM3D_CHICK.g21101.t1.2;Target=SEM3D_CHICK 350 459 +;same_sense=1;Gaps=0;Mismatch=35;Evalue=1.01e-47;Description=Semaphorin-3D;Accession=Q90663
chr01 BLASTP match_part 19773 19880 166.0 + . Parent=SEM3D_CHICK.g21101.t1.2
I expected that the resultant gff was similar to the original gff but gene names are included. But in this output, I do not see gene structure information (such as exon, utr, etc) Is this output usual?
Great that it works! Sorry for all the steps. Yes that is usual, since these are not gene predictions, they are matches. The point is to show how much of a target protein is matched by the gene model.
So the usual usage of this output gff in GenomeViewer, such as IGV, is importing both of the gff (here, BRAKER-produce gff and output gff)?
Hi,
I made a gene model with BRAKER and would like to add gene names using blast2genomegff.py. I, however, have had two types of error, which depends on file format (gtf or gff3). If I used gtf, I got error as follows.
If I used gff3, another error was obtained.
My files are as follows.
Any suggestions would be appreciated.