pfam2gff genomics coordinates generate short pfam description

agroppi commented 6 years ago

I have generated genome-based coding region annotation file with Transdecoder as described here Then I have used pfam2gff .py like this : python pfam2gff.py -g transcripts.fasta.transdecoder.genome.gff3 \ -i PFAM.out \ -T > pfam_domains.gff

But in this file the annotation from pfam are extremely shortened :

chr1    hmmscan PFAM    11416724    11416749    99.5    +   .   ID=g34449.t1.p1.Aldose_epim.1;Name=PF01263.Aldose_epim.1
chr1    hmmscan PFAM    11416839    11416942    99.5    +   .   ID=g34449.t1.p1.Aldose_epim.1;Name=PF01263.Aldose_epim.1
chr4    hmmscan PFAM    3455    3615    118.3   +   .   ID=jg1.t1.p1.RVT_1.1;Name=PF00078.RVT_1.1
chr4    hmmscan PFAM    4970    5082    53.8    +   .   ID=jg1.t1.p1.rve.1;Name=PF00665.rve.1

Instead of in the file TrinotatePFAM.out :

#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
Aldose_epim          PF01263.19   301 g34449.t1.p1         -            194   3.7e-31  108.5   0.3   1   1   1.2e-32     2e-28   99.5   0.3    53   177    55   184    32   193 0.88 Aldose 1-epimerase
RVT_1                PF00078.26   222 jg1.t1.p1            -           1474   1.8e-34  119.1   0.1   1   1   1.6e-37   3.3e-34  118.3   0.1     1   222   627   787   627   787 0.98 Reverse transcriptase (RNA-dependent DNA polymerase)
rve                  PF00665.25   119 jg1.t1.p1            -           1474   6.4e-15   55.3   0.0   1   1   8.9e-18   1.9e-14   53.8   0.0     2   115  1108  1220  1107  1223 0.96 Integrase core domain
Retrotrans_gag       PF03732.16    97 jg1.t1.p1            -           1474   4.6e-12   46.0   1.9   1   2   2.2e-15   4.6e-12   46.0   1.9     1    96   164   259   164   260 0.96 Retrotransposon gag protein
Retrotrans_gag       PF03732.16    97 jg1.t1.p1            -           1474   4.6e-12   46.0   1.9   2   2       1.8   3.8e+03   -1.9   0.1    54    76   958   981   952   994 0.82 Retrotransposon gag protein

Leading to this in a genome browser :

Is there a way to keep more informative data in the output file from pfam2gff.py ? for example Aldose 1-epimerase instead of Aldose_epim.1

wrf commented 6 years ago

Hi Alexis, I can change it so the ID is the "target name" from hmmscan, and the GFF name is the "description". You might need to configure the track in jbrowse to display the "name" tag instead of the "ID" tag. Would that be better?

agroppi commented 6 years ago

Hi Warren, Yes it would be great, even if I don't know if I would be able to customise my browser (Apollo) since it's in a docker ...

On the same subject, I have another problem with blast2genomegff.py

I followed the instruction here

Blastx the transcriptome against a protein set (Uniprot) -mapping the transcripts to the genome : gmap -D /MyGenome -d mygenome -B 5 -t 20 -n 1 transcripts.fasta.transdecoder.cds -f gff3_gene > /mygenome_gmap.gff3 Here is a sample of the result :

##gff-version   3
# Generated by GMAP version 2017-02-25 using call:  gmap.avx2 -D /scratch/ag/sWAGMAN/Marouch_Gmap/Marouch_2.0 -d Marouch_2.0 -B 5 -t 20 -n 1 -f gff3_gene /scratch/ag/sWAGMAN/Marouch_Trinotate_v2.0/Marouch_transcripts.fasta.transdecoder.cds
chr4    Marouch_2.0 gene    385577  386146  .   -   .   ID=jg100.t1.p1.path1;Name=jg100.t1.p1
chr4    Marouch_2.0 mRNA    385577  386146  .   -   .   ID=jg100.t1.p1.mrna1;Name=jg100.t1.p1;Parent=jg100.t1.p1.path1;coverage=100.0;identity=100.0;matches=570;mismatches=0;indels=0;unknowns=0
chr4    Marouch_2.0 exon    385577  386146  100 -   .   ID=jg100.t1.p1.mrna1.exon1;Name=jg100.t1.p1;Parent=jg100.t1.p1.mrna1;Target=jg100.t1.p1 1 570 +
chr4    Marouch_2.0 CDS 385577  386146  100 -   0   ID=jg100.t1.p1.mrna1.cds1;Name=jg100.t1.p1;Parent=jg100.t1.p1.mrna1;Target=jg100.t1.p1 1 570 +
###
chr7    Marouch_2.0 gene    21480534    21481511    .   -   .   ID=jg10007.t1.p1.path1;Name=jg10007.t1.p1
chr7    Marouch_2.0 mRNA    21480534    21481511    .   -   .   ID=jg10007.t1.p1.mrna1;Name=jg10007.t1.p1;Parent=jg10007.t1.p1.path1;coverage=100.0;identity=100.0;matches=978;mismatches=0;indels=0;unknowns=0
chr7    Marouch_2.0 exon    21480534    21481511    100 -   .   ID=jg10007.t1.p1.mrna1.exon1;Name=jg10007.t1.p1;Parent=jg10007.t1.p1.mrna1;Target=jg10007.t1.p1 1 978 +
chr7    Marouch_2.0 CDS 21480534    21481511    100 -   0   ID=jg10007.t1.p1.mrna1.cds1;Name=jg10007.t1.p1;Parent=jg10007.t1.p1.mrna1;Target=jg10007.t1.p1 1 978 +

convert gff3 to gtf with gffread from Cufflinks : gffread mygenome_gmap.gff3 -T -o mygenome_gmap.gtf Here is a sample of the result :

chr4    Marouch_2.0 CDS 1061432 1061749 .   +   0   transcript_id "jg270.t1.p1.mrna1"; gene_id "jg270.t1.p1.path1"; gene_name "jg270.t1.p1";
chr4    Marouch_2.0 CDS 1061869 1062500 .   +   0   transcript_id "jg270.t1.p1.mrna1"; gene_id "jg270.t1.p1.path1"; gene_name "jg270.t1.p1";
chr4    Marouch_2.0 CDS 1062693 1062749 .   +   1   transcript_id "jg270.t1.p1.mrna1"; gene_id "jg270.t1.p1.path1"; gene_name "jg270.t1.p1";
chr4    Marouch_2.0 CDS 1062847 1063085 .   +   1   transcript_id "jg270.t1.p1.mrna1"; gene_id "jg270.t1.p1.path1"; gene_name "jg270.t1.p1";
chr4    Marouch_2.0 CDS 1063182 1063300 .   +   2   transcript_id "jg270.t1.p1.mrna1"; gene_id "jg270.t1.p1.path1"; gene_name "jg270.t1.p1";
chr4    Marouch_2.0 CDS 1063460 1063546 .   +   0   transcript_id "jg270.t1.p1.mrna1"; gene_id "jg270.t1.p1.path1"; gene_name "jg270.t1.p1";
chr4    Marouch_2.0 CDS 1063673 1063738 .   +   0   transcript_id "jg270.t1.p1.mrna1"; gene_id "jg270.t1.p1.path1"; gene_name "jg270.t1.p1";

But when I try to convert to genomic coordinates with :

python blast2genomegff.py -b BlastX_results.tab \
-d uniprot_sprot.pep \
-g mygenome_gmap.gtf > mygenome_transcripts_BlastX.genome.gff

I have the following error :

# Parsing target sequences from /scratch/ag/sWAGMAN/Marouch_Trinotate/Trinotate_databases/uniprot_sprot.pep Wed Oct 24 10:59:11 2018
# Found 558125 sequences Wed Oct 24 10:59:17 2018
# Parsing gff from /scratch/ag/sWAGMAN/Marouch_Gmap/Marouch_gmap.gtf Wed Oct 24 10:59:17 2018
Traceback (most recent call last):
  File "blast2genomegff.py", line 307, in <module>
    main(sys.argv[1:],sys.stdout)
  File "blast2genomegff.py", line 302, in main
    geneintervals, genestrand, genescaffold =  gtf_to_intervals(args.genes, args.exons, args.transdecoder, args.no_genes)
  File "blast2genomegff.py", line 106, in gtf_to_intervals
    geneintervals[geneid].append(boundaries)
UnboundLocalError: local variable 'geneid' referenced before assignment

I have tried with the -x option (same error) I use python 2.7.12

Thanks for your help :)

wrf commented 6 years ago

I can reproduce that error with the -x option, but it appears to parse correctly if that option isn't used. I think you should be able to use the gff from gmap without converting to gtf. I should make this clearer in the instructions.

agroppi commented 6 years ago

I have launched the following commande line directly on the gff from gmap :

python blast2genomegff.py -b BlastX_results.tab \
-d uniprot_sprot.pep \
-g mygenome_gmap.gff3 > mygenome_transcripts_BlastX.genome.gff

But this time a I have a bunch of warnings :

...
WARNING: cannot retrieve strand for jg17793.t1
WARNING: cannot retrieve strand for jg18367.t1
WARNING: cannot retrieve strand for jg17561.t1
WARNING: cannot retrieve strand for jg19141.t1
WARNING: cannot retrieve strand for jg17040.t1
...

At the end I have :

# Removed 3520 hits by shortness
# Removed 7 hits by bitscore
# Removed 28 hits by evalue
# Found 30423 hits for 26936 queries Thu Oct 25 15:06:44 2018
# 464 hits are antisense Thu Oct 25 15:06:44 2018
# 464 genes have hits extending beyond gene bounds Thu Oct 25 15:06:44 2018

But the file mygenome_transcripts_BlastX.genome.gff is empty :(

wrf commented 6 years ago

Ah OK. The issue is not related to getting the strand, it is actually that this is the first time the program tries to find the name of the BLAST hit in the dict of IDs from the GFF file. It appears that it cannot, probably because the names in the blast results file differ from those in the GFF file. It might be related to how Transdecoder generates the fasta headers. This is probably fixable with -d "|". Can you send me a few lines of the blast results?

wrf commented 6 years ago

In any case, the pfam2gff name display should be fixed now.

agroppi commented 6 years ago

Ah OK. The issue is not related to getting the strand, it is actually that this is the first time the program tries to find the name of the BLAST hit in the dict of IDs from the GFF file. It appears that it cannot, probably because the names in the blast results file differ from those in the GFF file. It might be related to how Transdecoder generates the fasta headers. This is probably fixable with -d "|". Can you send me a few lines of the blast results?

Here are some line from BlastX_results.tab :

g18743.t1.p1    POLR1_ARATH 27.304  293 144 4   325 996 1171    1463    9.78e-27    114
g21212.t1.p1    GALM_BOVIN  30.319  188 93  4   76  546 14  194 1.09e-20    90.9
jg1.t1.p1   MT878_ARATH 46.544  217 108 4   19  651 13  227 5.09e-60    195
jg10.t1.p1  DTX16_ARATH 60.131  459 183 0   97  1473    19  477 0.0 537
jg100.t1.p1 SEC5A_ARATH 66.220  1119    338 15  1   3324    1   1090    0.0 1421
jg1000.t1.p1    M310_ARATH  43.363  113 63  1   1   336 8   120 1.97e-27    109
jg10000.t1.p1   C87A3_ORYSJ 46.104  154 80  2   433 891 106 257 2.05e-35    144
jg10001.t1.p1   C87A3_ORYSJ 44.420  457 241 6   55  1386    57  513 4.60e-138   409

Sample from mygenome_gmap.gff3 :

##gff-version   3
# Generated by GMAP version 2017-02-25 using call:  gmap.avx2 -D /scratch/ag/sWAGMAN/Marouch_Gmap/Marouch_2.0 -d Marouch_2.0 -B 5 -t 20 -n 1 -f gff3_gene /scratch/ag/sWAGMAN/Marouch_Trinotate_v2.0/Marouch_transcripts.fasta.transdecoder.cds
chr7    Marouch_2.0 gene    21480534    21481511    .   -   .   ID=jg10007.t1.p1.path1;Name=jg10007.t1.p1
chr7    Marouch_2.0 mRNA    21480534    21481511    .   -   .   ID=jg10007.t1.p1.mrna1;Name=jg10007.t1.p1;Parent=jg10007.t1.p1.path1;coverage=100.0;identity=100.0;matches=978;mismatches=0;indels=0;unknowns=0
chr7    Marouch_2.0 exon    21480534    21481511    100 -   .   ID=jg10007.t1.p1.mrna1.exon1;Name=jg10007.t1.p1;Parent=jg10007.t1.p1.mrna1;Target=jg10007.t1.p1 1 978 +
chr7    Marouch_2.0 CDS 21480534    21481511    100 -   0   ID=jg10007.t1.p1.mrna1.cds1;Name=jg10007.t1.p1;Parent=jg10007.t1.p1.mrna1;Target=jg10007.t1.p1 1 978 +
###
chr4    Marouch_2.0 gene    385577  386146  .   -   .   ID=jg100.t1.p1.path1;Name=jg100.t1.p1
chr4    Marouch_2.0 mRNA    385577  386146  .   -   .   ID=jg100.t1.p1.mrna1;Name=jg100.t1.p1;Parent=jg100.t1.p1.path1;coverage=100.0;identity=100.0;matches=570;mismatches=0;indels=0;unknowns=0
chr4    Marouch_2.0 exon    385577  386146  100 -   .   ID=jg100.t1.p1.mrna1.exon1;Name=jg100.t1.p1;Parent=jg100.t1.p1.mrna1;Target=jg100.t1.p1 1 570 +
chr4    Marouch_2.0 CDS 385577  386146  100 -   0   ID=jg100.t1.p1.mrna1.cds1;Name=jg100.t1.p1;Parent=jg100.t1.p1.mrna1;Target=jg100.t1.p1 1 570 +

Thanks !

wrf commented 6 years ago

Should be updated now, it was indeed the problem that the IDs didn't match. The program would store the ID like jg100.t1.p1.mrna1, but the blast results were jg100.t1.p1, so it never found the intervals. The IDs and names are different in many programs, so there isn't an easy fix.

It appears to parse correctly on the lines you sent.

You would add the option --gff-delimiter .mrna. I think the whole command would be like:

blast2genomegff.py -b BlastX_results.tab -g mygenome_gmap.gff3 -d swissprot.fasta --gff-delimiter .mrna > blastx_v_swissprot_genome.gff

Let me know if that works.

agroppi commented 6 years ago

Thanks but there is an error :

# Parsing target sequences from /scratch/ag/sWAGMAN/Marouch_Trinotate/Trinotate_databases/uniprot_sprot.pep Fri Oct 26 15:19:00 2018
Traceback (most recent call last):
  File "blast2genomegff.py", line 319, in <module>
    main(sys.argv[1:],sys.stdout)
  File "blast2genomegff.py", line 313, in main
    protlendb = make_seq_length_dict(args.database)
  File "blast2genomegff.py", line 49, in make_seq_length_dict
    seqid = seqrec.id.split("|")[2]
IndexError: list index out of range

wrf commented 6 years ago

Sorry, just fixed that. Try again.

agroppi commented 6 years ago

Works perfectly 👍 Thank you very much

wrf / genomeGTFtools

pfam2gff genomics coordinates generate short pfam description #2