wurmlab / flo

Same species annotation lift over pipeline.

gt gff3: error: Parent ... was not defined (via "ID=") #28

Closed mictadlo closed 3 years ago

mictadlo commented 5 years ago

Hi, I tried to lift the below TAIR10 annotation:

> head TAIR10_GFF3_genes.gff
Chr1    TAIR10  chromosome  1   30427671    .   .   .   ID=Chr1;Name=Chr1
Chr1    TAIR10  gene    3631    5899    .   +   .   ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1    TAIR10  mRNA    3631    5899    .   +   .   ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1    TAIR10  protein 3760    5630    .   +   .   ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1    TAIR10  exon    3631    3913    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  five_prime_UTR  3631    3759    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3760    3913    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    3996    4276    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3996    4276    .   +   2   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    4486    4605    .   +   .   Parent=AT1G01010.1

Next, I did

> gff_remove_feats.rb chromosome TAIR10_GFF3_genes.gff > TAIR10_GFF3_genes-fix1.gff |head
Chr1    TAIR10  gene    3631    5899    .   +   .   ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1    TAIR10  mRNA    3631    5899    .   +   .   ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1    TAIR10  protein 3760    5630    .   +   .   ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1    TAIR10  exon    3631    3913    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  five_prime_UTR  3631    3759    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3760    3913    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    3996    4276    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3996    4276    .   +   2   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    4486    4605    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 4486    4605    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;

While running flo I got:

> mkdir run/TAIR10_GFF3_genes-fix1
liftOver -gff /QRISdata/Q0231/flo/tair10/TAIR10_GFF3_genes-fix1.gff run/liftover.chn run/TAIR10_GFF3_genes-fix1/lifted.gff3 run/TAIR10_GFF3_genes-fix1/unlifted.gff3
Reading liftover chains
Mapping coordinates
WARNING: -gff is not recommended.
Use 'ldHgGene -out=<file.gp>' and then 'liftOver -genePred <file.gp>'
/QRISdata/Q0231/apps/flo/gff_recover.rb run/TAIR10_GFF3_genes-fix1/lifted.gff3 2> unprocessed.gff | gt gff3 -tidy -sort -addids -retainids - > run/TAIR10_GFF3_genes-fix1/lifted_cleaned.gff
warning: line 1 in file "-" does not begin with "##gff-version" or "##gvf-version", create "##gff-version 3" line automatically
gt gff3: error: Parent "AT1G64130.1-Protein" on line 3 in file "-" was not defined (via "ID=")
rake aborted!

What did I miss?

Thank you in advance,

Michal

yeban commented 5 years ago

Hi. My apologies for the slow response. This GFF file is rather complicated for the gff_recover.rb script, which tries to remove any invalid annotations from liftOver's output. I think you should check liftOver's output (run/TAIR10_GFF3_genes-fix1/lifted.gff3) yourself to confirm that it is fit for downstream analysis.

mictadlo commented 5 years ago

Hi,

   > /QRISdata/Q0231/apps/flo/gff_recover.rb run/TAIR10_GFF3_genes-fix1/lifted.gff3 | head
    NbV1Ch05    TAIR10  tRNA    45037019    45037091    .   +   .   ID=AT1G01890.1;Parent=AT1G01890;Name=AT1G01890.1;Index=1
    NbV1Ch11    TAIR10  tRNA    93127111    93127183    .   -   .   ID=AT1G02480.1;Parent=AT1G02480;Name=AT1G02480.1;Index=1
    NbV1Ch02    TAIR10  tRNA    81869336    81869407    .   +   .   ID=AT1G02600.1;Parent=AT1G02600;Name=AT1G02600.1;Index=1
    NbV1Ch05    TAIR10  tRNA    97695952    97696024    .   +   .   ID=AT1G02760.1;Parent=AT1G02760;Name=AT1G02760.1;Index=1
    NbV1Ch05    TAIR10  tRNA    9913146 9913217 .   -   .   ID=AT1G03515.1;Parent=AT1G03515;Name=AT1G03515.1;Index=1
    NbV1Ch13    TAIR10  tRNA    170955340   170955411   .   -   .   ID=AT1G03570.1;Parent=AT1G03570;Name=AT1G03570.1;Index=1
    NbV1Ch15    TAIR10  tRNA    91988482    91988554    .   +   .   ID=AT1G03640.1;Parent=AT1G03640;Name=AT1G03640.1;Index=1
    NbV1Ch19    TAIR10  tRNA    17742781    17742849    .   +   .   ID=AT1G04320.1;Parent=AT1G04320;Name=AT1G04320.1;Index=1
    NbV1Ch15    TAIR10  tRNA    50880103    50880176    .   +   .   ID=AT1G06480.1;Parent=AT1G06480;Name=AT1G06480.1;Index=1
    NbV1Ch19    TAIR10  tRNA    5563896 5563968 .   +   .   ID=AT1G06610.1;Parent=AT1G06610;Name=AT1G06610.1;Index=1
    ...
    NbV1Ch05    TAIR10  tRNA    48760991    48761061    .   +   .   ID=AT5G66817.1;Parent=AT5G66817;Name=AT5G66817.1;Index=1
    NbV1Ch13    TAIR10  tRNA    39812638    39812709    .   +   .   ID=AT5G67455.1;Parent=AT5G67455;Name=AT5G67455.1;Index=1
    NbV1Ch06    TAIR10  gene    24097055    24097111    .   +   .   ID=AT1G64130.1
    NbV1Ch06    TAIR10  exon    24097055    24097111    .   +   .   Parent=AT1G64130.1
    NbV1Ch06    TAIR10  CDS 24097055    24097111    .   +   0   Parent=AT1G64130.1,AT1G64130.1-Protein
    NbV1Ch17    TAIR10  gene    18625243    18625301    .   -   .   ID=AT2G07768.1
    NbV1Ch17    TAIR10  exon    18625243    18625301    .   -   .   Parent=AT2G07768.1
    NbV1Ch17    TAIR10  CDS 18625243    18625301    .   -   0   Parent=AT2G07768.1,AT2G07768.1-Protein
    NbV1Ch17    TAIR10  gene    70101151    70101187    .   -   .   ID=AT5G20570.1
    NbV1Ch17    TAIR10  CDS 70101151    70101187    .   -   2   Parent=AT5G20570.1,AT5G20570.1-Protein
    NbV1Ch17    TAIR10  gene    70101151    70101187    .   -   .   ID=AT5G20570.2
    NbV1Ch17    TAIR10  CDS 70101151    70101187    .   -   2   Parent=AT5G20570.2,AT5G20570.2-Protein

Did I miss anything?

Thank you in advance,

Michal

yeban commented 5 years ago

I would run gt gff3 -tidy on that file. If the command succeeds, the GFF is fine. If it fails, the simplest approach (although a bit time-consuming) would be to delete the problematic annotations one by one. gt (genometools) is also used by flo, so you should already have it.

See the code comments between lines 75 and 89 of gff_recover.rb for the problems I identified with liftOver's output.
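To find the problematic annotations before deleting them, something like the following Python sketch could list every Parent= value that has no matching ID= anywhere in the file. It is not part of flo or genometools; the function name `undefined_parents` is made up for illustration, and it assumes tab-separated columns as in a real GFF3 file (the listings in this thread show spaces only because of how they were pasted).

```python
def undefined_parents(lines):
    """Return (line_number, parent_id) pairs whose Parent has no matching ID=."""
    defined = set()
    references = []  # every (line_number, parent_id) seen in a Parent= attribute
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a feature line
        for attribute in columns[8].split(";"):
            key, _, value = attribute.partition("=")
            if key == "ID":
                defined.add(value)
            elif key == "Parent":
                # Parent may hold several comma-separated IDs
                references.extend((number, parent) for parent in value.split(","))
    return [(number, parent) for number, parent in references if parent not in defined]


# Three lifted lines taken from the output posted above.
SAMPLE = [
    "NbV1Ch06\tTAIR10\tgene\t24097055\t24097111\t.\t+\t.\tID=AT1G64130.1",
    "NbV1Ch06\tTAIR10\texon\t24097055\t24097111\t.\t+\t.\tParent=AT1G64130.1",
    "NbV1Ch06\tTAIR10\tCDS\t24097055\t24097111\t.\t+\t0\tParent=AT1G64130.1,AT1G64130.1-Protein",
]

for line_number, parent in undefined_parents(SAMPLE):
    print(f"line {line_number}: Parent {parent!r} has no matching ID=")
```

On a real file, the same function can be fed an open file handle, e.g. `undefined_parents(open(path))`.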

mictadlo commented 5 years ago

Hi, thank you. I also tried the steps described here, with unexpected results. Do you know what I might have missed?

Thanks in advance,

Michal

yeban commented 4 years ago

Hi,

Sorry again. Was on annual leave.

The gff_recover.rb script can only work with two levels of features: transcripts and their subfeatures. Transcripts can be annotated as mRNA, transcript, or gene (in the 3rd column).

Transcripts can have either exon or CDS as their subfeatures. If the GFF has both exon and CDS, then wherever an exon and its corresponding CDS don't fully overlap (i.e., when the exon includes a UTR), the exon may fail to lift while the CDS is lifted, and gff_recover.rb won't be able to catch this.
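One way to spot such cases by hand is to look for CDS rows that are not contained in any exon of the same transcript. The Python sketch below is only an illustration, not something flo does: `cds_outside_exons` is a hypothetical name, columns are assumed to be tab-separated, and only the first ID in a multi-valued Parent= is treated as the transcript (which matches the TAIR10 lines in this thread, but is an assumption).

```python
from collections import defaultdict


def cds_outside_exons(lines):
    """Return (transcript, start, end) for each CDS not contained in an exon."""
    exons = defaultdict(list)  # transcript ID -> list of (start, end)
    cds_rows = []              # (transcript ID, start, end)
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(columns) != 9:
            continue
        if columns[2] not in ("exon", "CDS"):
            continue
        parents = []
        for attribute in columns[8].split(";"):
            if attribute.startswith("Parent="):
                parents = attribute[len("Parent="):].split(",")
        if not parents:
            continue
        transcript = parents[0]  # assumption: the transcript is listed first
        start, end = int(columns[3]), int(columns[4])
        if columns[2] == "exon":
            exons[transcript].append((start, end))
        else:
            cds_rows.append((transcript, start, end))
    return [
        (transcript, start, end)
        for transcript, start, end in cds_rows
        if not any(ex_start <= start and end <= ex_end
                   for ex_start, ex_end in exons[transcript])
    ]


# Lines taken from the lifted output posted earlier in this thread.
SAMPLE = [
    "NbV1Ch06\tTAIR10\tgene\t24097055\t24097111\t.\t+\t.\tID=AT1G64130.1",
    "NbV1Ch06\tTAIR10\texon\t24097055\t24097111\t.\t+\t.\tParent=AT1G64130.1",
    "NbV1Ch06\tTAIR10\tCDS\t24097055\t24097111\t.\t+\t0\tParent=AT1G64130.1,AT1G64130.1-Protein",
    "NbV1Ch17\tTAIR10\tgene\t70101151\t70101187\t.\t-\t.\tID=AT5G20570.1",
    "NbV1Ch17\tTAIR10\tCDS\t70101151\t70101187\t.\t-\t2\tParent=AT5G20570.1,AT5G20570.1-Protein",
]

for transcript, start, end in cds_outside_exons(SAMPLE):
    print(f"{transcript}: CDS {start}-{end} has no enclosing exon")
```

In the sample, AT5G20570.1 is reported because its CDS was lifted without any exon.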

CDS or exon features should have ID= in the 9th column in addition to Parent=. You can add IDs using genometools.
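If you would rather not rely on genometools for this, a plain Python sketch along these lines could add the missing IDs. The naming scheme `<parent>:<type>:<n>` and the function name `add_missing_ids` are invented for illustration, and the sketch gives every line its own ID. Note that the GFF3 specification also allows the segments of one multi-part CDS to share a single ID, so adjust this if your downstream tools expect that.

```python
def add_missing_ids(lines, feature_types=("exon", "CDS")):
    """Return lines with a synthetic ID= added to exon/CDS rows that lack one."""
    counters = {}  # (parent, feature type) -> running number
    result = []
    for line in lines:
        line = line.rstrip("\n")
        columns = line.split("\t")
        is_feature = not line.startswith("#") and len(columns) == 9
        if not is_feature or columns[2] not in feature_types:
            result.append(line)
            continue
        attributes = [a for a in columns[8].split(";") if a]
        if any(a.startswith("ID=") for a in attributes):
            result.append(line)  # already has an ID, leave untouched
            continue
        parent_value = next(
            (a[len("Parent="):] for a in attributes if a.startswith("Parent=")),
            "orphan",  # placeholder for rows without any Parent=
        )
        first_parent = parent_value.split(",")[0]
        key = (first_parent, columns[2])
        counters[key] = counters.get(key, 0) + 1
        attributes.insert(0, f"ID={first_parent}:{columns[2]}:{counters[key]}")
        columns[8] = ";".join(attributes)
        result.append("\t".join(columns))
    return result


SAMPLE = [
    "Chr1\tTAIR10\texon\t3631\t3913\t.\t+\t.\tParent=AT1G01010.1",
    "Chr1\tTAIR10\tCDS\t3760\t3913\t.\t+\t0\tParent=AT1G01010.1",
    "Chr1\tTAIR10\texon\t3996\t4276\t.\t+\t.\tParent=AT1G01010.1",
]

for row in add_missing_ids(SAMPLE):
    print(row)
```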

five_prime_UTR is not recognized, but is easy to add.

So you could try stripping your input GFF of everything except mRNA and CDS and then flo should work for you.
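As a rough sketch of that stripping step, the Python below keeps only mRNA and CDS rows (plus `#` header lines). The function name `keep_feature_types` is hypothetical and tab-separated columns are assumed. The kept mRNA rows still carry their Parent= attribute pointing at the removed gene rows; whether that matters depends on the tools you run afterwards.

```python
def keep_feature_types(lines, keep=("mRNA", "CDS")):
    """Return header lines plus the feature rows whose type is in `keep`."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            kept.append(line)  # preserve ##gff-version and other directives
            continue
        columns = line.split("\t")
        if len(columns) == 9 and columns[2] in keep:
            kept.append(line)
    return kept


SAMPLE = [
    "##gff-version 3",
    "Chr1\tTAIR10\tgene\t3631\t5899\t.\t+\t.\tID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010",
    "Chr1\tTAIR10\tmRNA\t3631\t5899\t.\t+\t.\tID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1",
    "Chr1\tTAIR10\texon\t3631\t3913\t.\t+\t.\tParent=AT1G01010.1",
    "Chr1\tTAIR10\tfive_prime_UTR\t3631\t3759\t.\t+\t.\tParent=AT1G01010.1",
    "Chr1\tTAIR10\tCDS\t3760\t3913\t.\t+\t0\tParent=AT1G01010.1",
]

for row in keep_feature_types(SAMPLE):
    print(row)
```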

Other alternatives to try:

  1. Convert GFF to a format that liftOver natively understands, such as genePred (see liftOver's help output).
  2. Take the lifted.gff file and remove the few broken annotations manually or using a custom script.
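
For option 2, a custom script might look something like the Python sketch below. It is only one possible approach: it removes Parent= values that are not defined by any ID= in the file, and drops a row entirely when none of its parents exist. The name `drop_dangling_parents` is made up, tab-separated columns are assumed, and the cleanup is a single pass (it does not re-check rows that pointed at a row it just dropped).

```python
def drop_dangling_parents(lines):
    """Remove undefined Parent= values; drop rows left with no valid parent."""
    rows = [line.rstrip("\n") for line in lines]

    def is_feature(row):
        return not row.startswith("#") and len(row.split("\t")) == 9

    # First pass: collect every ID defined in the file.
    defined = set()
    for row in rows:
        if is_feature(row):
            for attribute in row.split("\t")[8].split(";"):
                if attribute.startswith("ID="):
                    defined.add(attribute[len("ID="):])

    # Second pass: rewrite or drop rows with undefined parents.
    cleaned = []
    for row in rows:
        if not is_feature(row):
            cleaned.append(row)
            continue
        columns = row.split("\t")
        attributes = []
        keep_row = True
        for attribute in columns[8].split(";"):
            if not attribute:
                continue
            if attribute.startswith("Parent="):
                parents = [p for p in attribute[len("Parent="):].split(",")
                           if p in defined]
                if not parents:
                    keep_row = False  # no parent survives, drop the row
                    break
                attribute = "Parent=" + ",".join(parents)
            attributes.append(attribute)
        if keep_row:
            columns[8] = ";".join(attributes)
            cleaned.append("\t".join(columns))
    return cleaned


SAMPLE = [
    "NbV1Ch06\tTAIR10\tgene\t24097055\t24097111\t.\t+\t.\tID=AT1G64130.1",
    "NbV1Ch06\tTAIR10\texon\t24097055\t24097111\t.\t+\t.\tParent=AT1G64130.1",
    "NbV1Ch06\tTAIR10\tCDS\t24097055\t24097111\t.\t+\t0\tParent=AT1G64130.1,AT1G64130.1-Protein",
    "NbV1Ch06\tTAIR10\texon\t100\t200\t.\t+\t.\tParent=MISSING.1",  # made-up orphan row
]

for row in drop_dangling_parents(SAMPLE):
    print(row)
```

It would still be worth running gt gff3 -tidy on the result to confirm that the file is valid.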

mictadlo commented 4 years ago

Hi, thank you for your response. The annotation below has genes with multiple mRNAs. Should I remove all lines that contain gene?

Could you please let me know how to deal with exon, CDS and UTR features in the GFF3 file below?

##gff-version 3
Chr1    TAIR10  gene    3631    5899    .       +       .       ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1    TAIR10  mRNA    3631    5899    .       +       .       ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1
Chr1    TAIR10  exon    3631    3913    .       +       .       Parent=AT1G01010.1
Chr1    TAIR10  five_prime_UTR  3631    3759    .       +       .       Parent=AT1G01010.1
Chr1    TAIR10  CDS     3760    3913    .       +       0       Parent=AT1G01010.1
Chr1    TAIR10  exon    3996    4276    .       +       .       Parent=AT1G01010.1
Chr1    TAIR10  CDS     3996    4276    .       +       2       Parent=AT1G01010.1
Chr1    TAIR10  exon    4486    4605    .       +       .       Parent=AT1G01010.1
Chr1    TAIR10  CDS     4486    4605    .       +       0       Parent=AT1G01010.1
Chr1    TAIR10  exon    4706    5095    .       +       .       Parent=AT1G01010.1
Chr1    TAIR10  CDS     4706    5095    .       +       0       Parent=AT1G01010.1
Chr1    TAIR10  exon    5174    5326    .       +       .       Parent=AT1G01010.1
Chr1    TAIR10  CDS     5174    5326    .       +       0       Parent=AT1G01010.1
Chr1    TAIR10  exon    5439    5899    .       +       .       Parent=AT1G01010.1
Chr1    TAIR10  CDS     5439    5630    .       +       0       Parent=AT1G01010.1
Chr1    TAIR10  three_prime_UTR 5631    5899    .       +       .       Parent=AT1G01010.1
Chr1    TAIR10  gene    5928    8737    .       -       .       ID=AT1G01020;Note=protein_coding_gene;Name=AT1G01020
Chr1    TAIR10  mRNA    5928    8737    .       -       .       ID=AT1G01020.1;Parent=AT1G01020;Name=AT1G01020.1
Chr1    TAIR10  five_prime_UTR  8667    8737    .       -       .       Parent=AT1G01020.1
Chr1    TAIR10  CDS     8571    8666    .       -       0       Parent=AT1G01020.1
Chr1    TAIR10  exon    8571    8737    .       -       .       Parent=AT1G01020.1
Chr1    TAIR10  CDS     8417    8464    .       -       0       Parent=AT1G01020.1
Chr1    TAIR10  exon    8417    8464    .       -       .       Parent=AT1G01020.1
Chr1    TAIR10  CDS     8236    8325    .       -       0       Parent=AT1G01020.1
Chr1    TAIR10  exon    8236    8325    .       -       .       Parent=AT1G01020.1
Chr1    TAIR10  CDS     7942    7987    .       -       0       Parent=AT1G01020.1
Chr1    TAIR10  exon    7942    7987    .       -       .       Parent=AT1G01020.1
Chr1    TAIR10  CDS     7762    7835    .       -       2       Parent=AT1G01020.1
Chr1    TAIR10  exon    7762    7835    .       -       .       Parent=AT1G01020.1
Chr1    TAIR10  CDS     7564    7649    .       -       0       Parent=AT1G01020.1
Chr1    TAIR10  exon    7564    7649    .       -       .       Parent=AT1G01020.1
Chr1    TAIR10  CDS     7384    7450    .       -       1       Parent=AT1G01020.1
Chr1    TAIR10  exon    7384    7450    .       -       .       Parent=AT1G01020.1
Chr1    TAIR10  CDS     7157    7232    .       -       0       Parent=AT1G01020.1
Chr1    TAIR10  exon    7157    7232    .       -       .       Parent=AT1G01020.1
Chr1    TAIR10  CDS     6915    7069    .       -       2       Parent=AT1G01020.1
Chr1    TAIR10  three_prime_UTR 6437    6914    .       -       .       Parent=AT1G01020.1
Chr1    TAIR10  exon    6437    7069    .       -       .       Parent=AT1G01020.1
Chr1    TAIR10  three_prime_UTR 5928    6263    .       -       .       Parent=AT1G01020.1
Chr1    TAIR10  exon    5928    6263    .       -       .       Parent=AT1G01020.1
Chr1    TAIR10  mRNA    6790    8737    .       -       .       ID=AT1G01020.2;Parent=AT1G01020;Name=AT1G01020.2
Chr1    TAIR10  five_prime_UTR  8667    8737    .       -       .       Parent=AT1G01020.2
Chr1    TAIR10  CDS     8571    8666    .       -       0       Parent=AT1G01020.2
Chr1    TAIR10  exon    8571    8737    .       -       .       Parent=AT1G01020.2
Chr1    TAIR10  CDS     8417    8464    .       -       0       Parent=AT1G01020.2
Chr1    TAIR10  exon    8417    8464    .       -       .       Parent=AT1G01020.2
Chr1    TAIR10  CDS     8236    8325    .       -       0       Parent=AT1G01020.2
Chr1    TAIR10  exon    8236    8325    .       -       .       Parent=AT1G01020.2
Chr1    TAIR10  CDS     7942    7987    .       -       0       Parent=AT1G01020.2
Chr1    TAIR10  exon    7942    7987    .       -       .       Parent=AT1G01020.2
Chr1    TAIR10  CDS     7762    7835    .       -       2       Parent=AT1G01020.2
Chr1    TAIR10  exon    7762    7835    .       -       .       Parent=AT1G01020.2
Chr1    TAIR10  CDS     7564    7649    .       -       0       Parent=AT1G01020.2
Chr1    TAIR10  exon    7564    7649    .       -       .       Parent=AT1G01020.2
Chr1    TAIR10  CDS     7315    7450    .       -       1       Parent=AT1G01020.2
Chr1    TAIR10  three_prime_UTR 7157    7314    .       -       .       Parent=AT1G01020.2
Chr1    TAIR10  exon    7157    7450    .       -       .       Parent=AT1G01020.2
Chr1    TAIR10  three_prime_UTR 6790    7069    .       -       .       Parent=AT1G01020.2
Chr1    TAIR10  exon    6790    7069    .       -       .       Parent=AT1G01020.2

Thank you in advance,

Michal