mictadlo closed this issue 3 years ago
Hi. My apologies for the slow response. This GFF file is rather complicated for the gff_recover.rb script, which tries to remove invalid annotations (if any) from liftOver's output. I think you should check liftOver's output yourself (run/TAIR10_GFF3_genes-fix1/lifted.gff3) to make sure that it is fit for downstream analysis.
Hi,
> /QRISdata/Q0231/apps/flo/gff_recover.rb run/TAIR10_GFF3_genes-fix1/lifted.gff3 | head
NbV1Ch05 TAIR10 tRNA 45037019 45037091 . + . ID=AT1G01890.1;Parent=AT1G01890;Name=AT1G01890.1;Index=1
NbV1Ch11 TAIR10 tRNA 93127111 93127183 . - . ID=AT1G02480.1;Parent=AT1G02480;Name=AT1G02480.1;Index=1
NbV1Ch02 TAIR10 tRNA 81869336 81869407 . + . ID=AT1G02600.1;Parent=AT1G02600;Name=AT1G02600.1;Index=1
NbV1Ch05 TAIR10 tRNA 97695952 97696024 . + . ID=AT1G02760.1;Parent=AT1G02760;Name=AT1G02760.1;Index=1
NbV1Ch05 TAIR10 tRNA 9913146 9913217 . - . ID=AT1G03515.1;Parent=AT1G03515;Name=AT1G03515.1;Index=1
NbV1Ch13 TAIR10 tRNA 170955340 170955411 . - . ID=AT1G03570.1;Parent=AT1G03570;Name=AT1G03570.1;Index=1
NbV1Ch15 TAIR10 tRNA 91988482 91988554 . + . ID=AT1G03640.1;Parent=AT1G03640;Name=AT1G03640.1;Index=1
NbV1Ch19 TAIR10 tRNA 17742781 17742849 . + . ID=AT1G04320.1;Parent=AT1G04320;Name=AT1G04320.1;Index=1
NbV1Ch15 TAIR10 tRNA 50880103 50880176 . + . ID=AT1G06480.1;Parent=AT1G06480;Name=AT1G06480.1;Index=1
NbV1Ch19 TAIR10 tRNA 5563896 5563968 . + . ID=AT1G06610.1;Parent=AT1G06610;Name=AT1G06610.1;Index=1
...
NbV1Ch05 TAIR10 tRNA 48760991 48761061 . + . ID=AT5G66817.1;Parent=AT5G66817;Name=AT5G66817.1;Index=1
NbV1Ch13 TAIR10 tRNA 39812638 39812709 . + . ID=AT5G67455.1;Parent=AT5G67455;Name=AT5G67455.1;Index=1
NbV1Ch06 TAIR10 gene 24097055 24097111 . + . ID=AT1G64130.1
NbV1Ch06 TAIR10 exon 24097055 24097111 . + . Parent=AT1G64130.1
NbV1Ch06 TAIR10 CDS 24097055 24097111 . + 0 Parent=AT1G64130.1,AT1G64130.1-Protein
NbV1Ch17 TAIR10 gene 18625243 18625301 . - . ID=AT2G07768.1
NbV1Ch17 TAIR10 exon 18625243 18625301 . - . Parent=AT2G07768.1
NbV1Ch17 TAIR10 CDS 18625243 18625301 . - 0 Parent=AT2G07768.1,AT2G07768.1-Protein
NbV1Ch17 TAIR10 gene 70101151 70101187 . - . ID=AT5G20570.1
NbV1Ch17 TAIR10 CDS 70101151 70101187 . - 2 Parent=AT5G20570.1,AT5G20570.1-Protein
NbV1Ch17 TAIR10 gene 70101151 70101187 . - . ID=AT5G20570.2
NbV1Ch17 TAIR10 CDS 70101151 70101187 . - 2 Parent=AT5G20570.2,AT5G20570.2-Protein
Did I miss anything?
Thank you in advance,
Michal
I would run gt gff3 -tidy on that file. If the command succeeds, the GFF is fine. If it fails, the simplest thing (although a bit time-consuming) would be to delete the problematic annotations one by one. gt (genometools) is also used by flo, so you should have it.
See the code comments between lines 75 and 89 of gff_recover.rb for the problems I identified with liftOver's output.
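If gt reports problems and you want a rough list of candidate lines before deleting anything by hand, something like the following Python sketch may help. It is not what gff_recover.rb does, and the checks are my own assumptions about what "broken" could mean after a lift: a subfeature whose Parent is missing from the file, sits on a different sequence or strand, or falls outside the parent's coordinates. The function names are hypothetical, and the input is assumed to be tab-separated GFF3.

```python
def parse_attributes(column9):
    """Turn 'ID=a;Parent=b,c' into {'ID': 'a', 'Parent': 'b,c'}."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_suspect_features(gff_lines):
    """Return (line_number, reason) pairs for features that look broken.

    Assumption: 'broken' means a dangling Parent, a parent on another
    sequence or strand, or a child lying outside its parent's span.
    """
    features = []  # (line_no, seqid, strand, start, end, attrs)
    by_id = {}
    for line_no, line in enumerate(gff_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = parse_attributes(cols[8])
        record = (line_no, cols[0], cols[6], int(cols[3]), int(cols[4]), attrs)
        features.append(record)
        if "ID" in attrs:
            by_id[attrs["ID"]] = record

    suspects = []
    for line_no, seqid, strand, start, end, attrs in features:
        for parent_id in attrs.get("Parent", "").split(","):
            if not parent_id:
                continue
            parent = by_id.get(parent_id)
            if parent is None:
                suspects.append((line_no, f"parent {parent_id} not in file"))
            elif (seqid, strand) != (parent[1], parent[2]):
                suspects.append((line_no, f"different sequence or strand than {parent_id}"))
            elif start < parent[3] or end > parent[4]:
                suspects.append((line_no, f"outside the span of {parent_id}"))
    return suspects
```

You could call it as find_suspect_features(open("lifted.gff3")) and inspect the reported line numbers. Note that a Parent value naming a feature that is absent from the file is reported too, so expect some entries that are harmless for your analysis.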
Hi, Thank you. I also tried the following steps here with unexpected results. Do you perhaps know what I missed?
Thank you in advance,
Michal
Hi,
Sorry again. Was on annual leave.
The gff_recover.rb script can only work with two levels of features: transcripts and their subfeatures. Transcripts can be annotated as mRNA, transcript, or gene (in the 3rd column).
Transcripts can have either exon or CDS as their subfeatures. If you have both exon and CDS in the GFF, then in cases where an exon and its corresponding CDS don't fully overlap (i.e., when the exon includes a UTR), it is possible that the CDS is lifted but the exon is not, and gff_recover.rb won't be able to catch this.
CDS or exon should have ID= in the 9th column in addition to Parent=. You can add IDs using genometools.
five_prime_UTR is not recognized, but is easy to add.
So you could try stripping your input GFF of everything except mRNA and CDS, and then flo should work for you.
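A minimal Python sketch of the stripping step, in case it is useful. It assumes tab-separated GFF3 and keeps only feature lines whose 3rd column is mRNA or CDS, plus the ## directives; the function name and the keep_types parameter are made up for illustration. It does not add ID= to the CDS lines and does not touch the Parent= that each mRNA still carries for its removed gene, so those would need separate handling.

```python
def keep_transcripts_and_cds(gff_lines, keep_types=("mRNA", "CDS")):
    """Return only directive/comment lines and features of the kept types.

    Assumption: columns are tab-separated and the feature type is in
    the 3rd column, as in standard GFF3.
    """
    kept = []
    for line in gff_lines:
        if line.startswith("#"):
            kept.append(line)  # keep '##gff-version 3' and other directives
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] in keep_types:
            kept.append(line)  # gene, exon and UTR lines are dropped
    return kept
```

For example, writing "".join(keep_transcripts_and_cds(open("input.gff3"))) to a new file gives the reduced annotation, where input.gff3 stands for your own file name. Running gt gff3 -tidy on the result before handing it to flo would be a sensible check.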
Other alternatives to try:
Convert the annotations to a format that liftOver natively understands, like genepred or so (see liftOver's help output).
Go through the lifted.gff file and remove the few broken annotations manually or using a custom script.
Hi,
Thank you for your response. The annotation below has genes which have multiple mRNAs. Should I remove all lines which contain gene?
Could you please let me know how to deal with exon, CDS and UTR in the GFF3 file below?
##gff-version 3
Chr1 TAIR10 gene 3631 5899 . + . ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1 TAIR10 mRNA 3631 5899 . + . ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1
Chr1 TAIR10 exon 3631 3913 . + . Parent=AT1G01010.1
Chr1 TAIR10 five_prime_UTR 3631 3759 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 3760 3913 . + 0 Parent=AT1G01010.1
Chr1 TAIR10 exon 3996 4276 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 3996 4276 . + 2 Parent=AT1G01010.1
Chr1 TAIR10 exon 4486 4605 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 4486 4605 . + 0 Parent=AT1G01010.1
Chr1 TAIR10 exon 4706 5095 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 4706 5095 . + 0 Parent=AT1G01010.1
Chr1 TAIR10 exon 5174 5326 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 5174 5326 . + 0 Parent=AT1G01010.1
Chr1 TAIR10 exon 5439 5899 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 5439 5630 . + 0 Parent=AT1G01010.1
Chr1 TAIR10 three_prime_UTR 5631 5899 . + . Parent=AT1G01010.1
Chr1 TAIR10 gene 5928 8737 . - . ID=AT1G01020;Note=protein_coding_gene;Name=AT1G01020
Chr1 TAIR10 mRNA 5928 8737 . - . ID=AT1G01020.1;Parent=AT1G01020;Name=AT1G01020.1
Chr1 TAIR10 five_prime_UTR 8667 8737 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 8571 8666 . - 0 Parent=AT1G01020.1
Chr1 TAIR10 exon 8571 8737 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 8417 8464 . - 0 Parent=AT1G01020.1
Chr1 TAIR10 exon 8417 8464 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 8236 8325 . - 0 Parent=AT1G01020.1
Chr1 TAIR10 exon 8236 8325 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 7942 7987 . - 0 Parent=AT1G01020.1
Chr1 TAIR10 exon 7942 7987 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 7762 7835 . - 2 Parent=AT1G01020.1
Chr1 TAIR10 exon 7762 7835 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 7564 7649 . - 0 Parent=AT1G01020.1
Chr1 TAIR10 exon 7564 7649 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 7384 7450 . - 1 Parent=AT1G01020.1
Chr1 TAIR10 exon 7384 7450 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 7157 7232 . - 0 Parent=AT1G01020.1
Chr1 TAIR10 exon 7157 7232 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 6915 7069 . - 2 Parent=AT1G01020.1
Chr1 TAIR10 three_prime_UTR 6437 6914 . - . Parent=AT1G01020.1
Chr1 TAIR10 exon 6437 7069 . - . Parent=AT1G01020.1
Chr1 TAIR10 three_prime_UTR 5928 6263 . - . Parent=AT1G01020.1
Chr1 TAIR10 exon 5928 6263 . - . Parent=AT1G01020.1
Chr1 TAIR10 mRNA 6790 8737 . - . ID=AT1G01020.2;Parent=AT1G01020;Name=AT1G01020.2
Chr1 TAIR10 five_prime_UTR 8667 8737 . - . Parent=AT1G01020.2
Chr1 TAIR10 CDS 8571 8666 . - 0 Parent=AT1G01020.2
Chr1 TAIR10 exon 8571 8737 . - . Parent=AT1G01020.2
Chr1 TAIR10 CDS 8417 8464 . - 0 Parent=AT1G01020.2
Chr1 TAIR10 exon 8417 8464 . - . Parent=AT1G01020.2
Chr1 TAIR10 CDS 8236 8325 . - 0 Parent=AT1G01020.2
Chr1 TAIR10 exon 8236 8325 . - . Parent=AT1G01020.2
Chr1 TAIR10 CDS 7942 7987 . - 0 Parent=AT1G01020.2
Chr1 TAIR10 exon 7942 7987 . - . Parent=AT1G01020.2
Chr1 TAIR10 CDS 7762 7835 . - 2 Parent=AT1G01020.2
Chr1 TAIR10 exon 7762 7835 . - . Parent=AT1G01020.2
Chr1 TAIR10 CDS 7564 7649 . - 0 Parent=AT1G01020.2
Chr1 TAIR10 exon 7564 7649 . - . Parent=AT1G01020.2
Chr1 TAIR10 CDS 7315 7450 . - 1 Parent=AT1G01020.2
Chr1 TAIR10 three_prime_UTR 7157 7314 . - . Parent=AT1G01020.2
Chr1 TAIR10 exon 7157 7450 . - . Parent=AT1G01020.2
Chr1 TAIR10 three_prime_UTR 6790 7069 . - . Parent=AT1G01020.2
Chr1 TAIR10 exon 6790 7069 . - . Parent=AT1G01020.2
Thank you in advance,
Michal
Hi, I tried to lift the below TAIR10 annotation:
Next, I did
While running flo I got:
What did I miss?
Thank you in advance,
Michal