Closed agroppi closed 5 years ago
Hi Alexis, I can change it so the ID is the "target name" from hmmscan, and the GFF name is the "description". You might need to configure the track in jbrowse to display the "name" tag instead of the "ID" tag. Would that be better?
Hi Warren, Yes it would be great, even if I don't know if I would be able to customise my browser (Apollo) since it's in a docker ...
On the same subject, I have another problem with blast2genomegff.py
I followed the instruction here
Blastx the transcriptome against a protein set (Uniprot)
-mapping the transcripts to the genome :
gmap -D /MyGenome -d mygenome -B 5 -t 20 -n 1 transcripts.fasta.transdecoder.cds -f gff3_gene > /mygenome_gmap.gff3
Here is a sample of the result :
##gff-version 3
# Generated by GMAP version 2017-02-25 using call: gmap.avx2 -D /scratch/ag/sWAGMAN/Marouch_Gmap/Marouch_2.0 -d Marouch_2.0 -B 5 -t 20 -n 1 -f gff3_gene /scratch/ag/sWAGMAN/Marouch_Trinotate_v2.0/Marouch_transcripts.fasta.transdecoder.cds
chr4 Marouch_2.0 gene 385577 386146 . - . ID=jg100.t1.p1.path1;Name=jg100.t1.p1
chr4 Marouch_2.0 mRNA 385577 386146 . - . ID=jg100.t1.p1.mrna1;Name=jg100.t1.p1;Parent=jg100.t1.p1.path1;coverage=100.0;identity=100.0;matches=570;mismatches=0;indels=0;unknowns=0
chr4 Marouch_2.0 exon 385577 386146 100 - . ID=jg100.t1.p1.mrna1.exon1;Name=jg100.t1.p1;Parent=jg100.t1.p1.mrna1;Target=jg100.t1.p1 1 570 +
chr4 Marouch_2.0 CDS 385577 386146 100 - 0 ID=jg100.t1.p1.mrna1.cds1;Name=jg100.t1.p1;Parent=jg100.t1.p1.mrna1;Target=jg100.t1.p1 1 570 +
###
chr7 Marouch_2.0 gene 21480534 21481511 . - . ID=jg10007.t1.p1.path1;Name=jg10007.t1.p1
chr7 Marouch_2.0 mRNA 21480534 21481511 . - . ID=jg10007.t1.p1.mrna1;Name=jg10007.t1.p1;Parent=jg10007.t1.p1.path1;coverage=100.0;identity=100.0;matches=978;mismatches=0;indels=0;unknowns=0
chr7 Marouch_2.0 exon 21480534 21481511 100 - . ID=jg10007.t1.p1.mrna1.exon1;Name=jg10007.t1.p1;Parent=jg10007.t1.p1.mrna1;Target=jg10007.t1.p1 1 978 +
chr7 Marouch_2.0 CDS 21480534 21481511 100 - 0 ID=jg10007.t1.p1.mrna1.cds1;Name=jg10007.t1.p1;Parent=jg10007.t1.p1.mrna1;Target=jg10007.t1.p1 1 978 +
convert gff3 to gtf with gffread from Cufflinks :
gffread mygenome_gmap.gff3 -T -o mygenome_gmap.gtf
Here is a sample of the result :
chr4 Marouch_2.0 CDS 1061432 1061749 . + 0 transcript_id "jg270.t1.p1.mrna1"; gene_id "jg270.t1.p1.path1"; gene_name "jg270.t1.p1";
chr4 Marouch_2.0 CDS 1061869 1062500 . + 0 transcript_id "jg270.t1.p1.mrna1"; gene_id "jg270.t1.p1.path1"; gene_name "jg270.t1.p1";
chr4 Marouch_2.0 CDS 1062693 1062749 . + 1 transcript_id "jg270.t1.p1.mrna1"; gene_id "jg270.t1.p1.path1"; gene_name "jg270.t1.p1";
chr4 Marouch_2.0 CDS 1062847 1063085 . + 1 transcript_id "jg270.t1.p1.mrna1"; gene_id "jg270.t1.p1.path1"; gene_name "jg270.t1.p1";
chr4 Marouch_2.0 CDS 1063182 1063300 . + 2 transcript_id "jg270.t1.p1.mrna1"; gene_id "jg270.t1.p1.path1"; gene_name "jg270.t1.p1";
chr4 Marouch_2.0 CDS 1063460 1063546 . + 0 transcript_id "jg270.t1.p1.mrna1"; gene_id "jg270.t1.p1.path1"; gene_name "jg270.t1.p1";
chr4 Marouch_2.0 CDS 1063673 1063738 . + 0 transcript_id "jg270.t1.p1.mrna1"; gene_id "jg270.t1.p1.path1"; gene_name "jg270.t1.p1";
But when I try to convert to genomic coordinates with :
python blast2genomegff.py -b BlastX_results.tab \
-d uniprot_sprot.pep \
-g mygenome_gmap.gtf > mygenome_transcripts_BlastX.genome.gff
I have the following error :
# Parsing target sequences from /scratch/ag/sWAGMAN/Marouch_Trinotate/Trinotate_databases/uniprot_sprot.pep Wed Oct 24 10:59:11 2018
# Found 558125 sequences Wed Oct 24 10:59:17 2018
# Parsing gff from /scratch/ag/sWAGMAN/Marouch_Gmap/Marouch_gmap.gtf Wed Oct 24 10:59:17 2018
Traceback (most recent call last):
File "blast2genomegff.py", line 307, in <module>
main(sys.argv[1:],sys.stdout)
File "blast2genomegff.py", line 302, in main
geneintervals, genestrand, genescaffold = gtf_to_intervals(args.genes, args.exons, args.transdecoder, args.no_genes)
File "blast2genomegff.py", line 106, in gtf_to_intervals
geneintervals[geneid].append(boundaries)
UnboundLocalError: local variable 'geneid' referenced before assignment
I have tried with the -x option (same error) I use python 2.7.12
Thanks for your help :)
I can reproduce that error with the -x option, but it appears to parse correctly if that option isn't used. I think you should be able to use the gff from gmap without converting to gtf. I should make this clearer in the instructions.
I have launched the following commande line directly on the gff from gmap :
python blast2genomegff.py -b BlastX_results.tab \
-d uniprot_sprot.pep \
-g mygenome_gmap.gff3 > mygenome_transcripts_BlastX.genome.gff
But this time a I have a bunch of warnings :
...
WARNING: cannot retrieve strand for jg17793.t1
WARNING: cannot retrieve strand for jg18367.t1
WARNING: cannot retrieve strand for jg17561.t1
WARNING: cannot retrieve strand for jg19141.t1
WARNING: cannot retrieve strand for jg17040.t1
...
At the end I have :
# Removed 3520 hits by shortness
# Removed 7 hits by bitscore
# Removed 28 hits by evalue
# Found 30423 hits for 26936 queries Thu Oct 25 15:06:44 2018
# 464 hits are antisense Thu Oct 25 15:06:44 2018
# 464 genes have hits extending beyond gene bounds Thu Oct 25 15:06:44 2018
But the file mygenome_transcripts_BlastX.genome.gff is empty :(
Ah OK. The issue is not related to getting the strand, it is actually that this is the first time the program tries to find the name of the BLAST hit in the dict of IDs from the GFF file. It appears that it cannot, probably because the names in the blast results file differ from those in the GFF file. It might be related to how Transdecoder generates the fasta headers. This is probably fixable with -d "|". Can you send me a few lines of the blast results?
In any case, the pfam2gff name display should be fixed now.
Ah OK. The issue is not related to getting the strand, it is actually that this is the first time the program tries to find the name of the BLAST hit in the dict of IDs from the GFF file. It appears that it cannot, probably because the names in the blast results file differ from those in the GFF file. It might be related to how Transdecoder generates the fasta headers. This is probably fixable with -d "|". Can you send me a few lines of the blast results?
Here are some line from BlastX_results.tab :
g18743.t1.p1 POLR1_ARATH 27.304 293 144 4 325 996 1171 1463 9.78e-27 114
g21212.t1.p1 GALM_BOVIN 30.319 188 93 4 76 546 14 194 1.09e-20 90.9
jg1.t1.p1 MT878_ARATH 46.544 217 108 4 19 651 13 227 5.09e-60 195
jg10.t1.p1 DTX16_ARATH 60.131 459 183 0 97 1473 19 477 0.0 537
jg100.t1.p1 SEC5A_ARATH 66.220 1119 338 15 1 3324 1 1090 0.0 1421
jg1000.t1.p1 M310_ARATH 43.363 113 63 1 1 336 8 120 1.97e-27 109
jg10000.t1.p1 C87A3_ORYSJ 46.104 154 80 2 433 891 106 257 2.05e-35 144
jg10001.t1.p1 C87A3_ORYSJ 44.420 457 241 6 55 1386 57 513 4.60e-138 409
Sample from mygenome_gmap.gff3 :
##gff-version 3
# Generated by GMAP version 2017-02-25 using call: gmap.avx2 -D /scratch/ag/sWAGMAN/Marouch_Gmap/Marouch_2.0 -d Marouch_2.0 -B 5 -t 20 -n 1 -f gff3_gene /scratch/ag/sWAGMAN/Marouch_Trinotate_v2.0/Marouch_transcripts.fasta.transdecoder.cds
chr7 Marouch_2.0 gene 21480534 21481511 . - . ID=jg10007.t1.p1.path1;Name=jg10007.t1.p1
chr7 Marouch_2.0 mRNA 21480534 21481511 . - . ID=jg10007.t1.p1.mrna1;Name=jg10007.t1.p1;Parent=jg10007.t1.p1.path1;coverage=100.0;identity=100.0;matches=978;mismatches=0;indels=0;unknowns=0
chr7 Marouch_2.0 exon 21480534 21481511 100 - . ID=jg10007.t1.p1.mrna1.exon1;Name=jg10007.t1.p1;Parent=jg10007.t1.p1.mrna1;Target=jg10007.t1.p1 1 978 +
chr7 Marouch_2.0 CDS 21480534 21481511 100 - 0 ID=jg10007.t1.p1.mrna1.cds1;Name=jg10007.t1.p1;Parent=jg10007.t1.p1.mrna1;Target=jg10007.t1.p1 1 978 +
###
chr4 Marouch_2.0 gene 385577 386146 . - . ID=jg100.t1.p1.path1;Name=jg100.t1.p1
chr4 Marouch_2.0 mRNA 385577 386146 . - . ID=jg100.t1.p1.mrna1;Name=jg100.t1.p1;Parent=jg100.t1.p1.path1;coverage=100.0;identity=100.0;matches=570;mismatches=0;indels=0;unknowns=0
chr4 Marouch_2.0 exon 385577 386146 100 - . ID=jg100.t1.p1.mrna1.exon1;Name=jg100.t1.p1;Parent=jg100.t1.p1.mrna1;Target=jg100.t1.p1 1 570 +
chr4 Marouch_2.0 CDS 385577 386146 100 - 0 ID=jg100.t1.p1.mrna1.cds1;Name=jg100.t1.p1;Parent=jg100.t1.p1.mrna1;Target=jg100.t1.p1 1 570 +
Thanks !
Should be updated now, it was indeed the problem that the IDs didn't match. The program would store the ID like jg100.t1.p1.mrna1
, but the blast results were jg100.t1.p1
, so it never found the intervals. The IDs and names are different in many programs, so there isn't an easy fix.
It appears to parse correctly on the lines you sent.
You would add the option --gff-delimiter .mrna
. I think the whole command would be like:
blast2genomegff.py -b BlastX_results.tab -g mygenome_gmap.gff3 -d swissprot.fasta --gff-delimiter .mrna > blastx_v_swissprot_genome.gff
Let me know if that works.
Thanks but there is an error :
# Parsing target sequences from /scratch/ag/sWAGMAN/Marouch_Trinotate/Trinotate_databases/uniprot_sprot.pep Fri Oct 26 15:19:00 2018
Traceback (most recent call last):
File "blast2genomegff.py", line 319, in <module>
main(sys.argv[1:],sys.stdout)
File "blast2genomegff.py", line 313, in main
protlendb = make_seq_length_dict(args.database)
File "blast2genomegff.py", line 49, in make_seq_length_dict
seqid = seqrec.id.split("|")[2]
IndexError: list index out of range
Sorry, just fixed that. Try again.
Works perfectly đź‘Ť Thank you very much
I have generated genome-based coding region annotation file with Transdecoder as described here Then I have used pfam2gff .py like this :
python pfam2gff.py -g transcripts.fasta.transdecoder.genome.gff3 \ -i PFAM.out \ -T > pfam_domains.gff
But in this file the annotation from pfam are extremely shortened :
Instead of in the file
TrinotatePFAM.out
:Leading to this in a genome browser :
Is there a way to keep more informative data in the output file from pfam2gff.py ? for example
Aldose 1-epimerase
instead ofAldose_epim.1