wurmlab / flo

Same species annotation lift over pipeline.

Flo failed at the GenomeTools section #32

Closed: Homap closed this issue 3 years ago

Homap commented 4 years ago

Hello,

I ran flo on my data to convert the gff coordinates from one assembly version to the other. I have the files, lifted.gff3 and unlifted.gff3. The lifted.gff3 looks fine in terms of the size comparison with the original gff3.

However, at the end, I get the following error:

liftOver -gff GCF_000698965.1_ASM69896v1_genomic.flo.gff run/liftover.chn run/GCF_000698965.1_ASM69896v1_genomic.flo/lifted.gff3 run/GCF_000698965.1_ASM69896v1_genomic.flo/unlifted.gff3
Reading liftover chains
Mapping coordinates
WARNING: -gff is not recommended.
Use 'ldHgGene -out=<file.gp>' and then 'liftOver -genePred <file.gp>'
/crex/proj/uppstore2017180/private/homap/ostrich_Z_diversity/src/flo/gff_recover.rb run/GCF_000698965.1_ASM69896v1_genomic.flo/lifted.gff3 2> unprocessed.gff | gt gff3 -tidy -sort -addids -retainids - > run/GCF_000698965.1_ASM69896v1_genomic.flo/lifted_cleaned.gff
warning: line 1 in file "-" does not begin with "##gff-version" or "##gvf-version", create "##gff-version 3" line automatically
gt gff3: error: line 1 in file "-" does not contain 9 tab (\t) separated fields
rake aborted!
Command failed with status (1): [/crex/proj/uppstore2017180/private/homap/o...]
/crex/proj/uppstore2017180/private/homap/ostrich_Z_diversity/src/flo/Rakefile:60:in `block (2 levels) in <top (required)>'
/crex/proj/uppstore2017180/private/homap/ostrich_Z_diversity/src/flo/Rakefile:40:in `each'
/crex/proj/uppstore2017180/private/homap/ostrich_Z_diversity/src/flo/Rakefile:40:in `block in <top (required)>'
Tasks: TOP => default
(See full trace by running task with --trace)

I was wondering how I could resolve this issue?

yeban commented 4 years ago

You are likely getting this error because the output of the gff_recover.rb script is empty. This script is run on lifted.gff3 to remove any nonsensical annotations, such as genes mapped to different scaffolds. The filtered output is then piped to GenomeTools for validation.

If you can share the lifted.gff3 file, I might be able to guess why the output of gff_recover.rb is empty.
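In the meantime, one quick check is the condition named in the gt error message: whether every feature line has exactly 9 tab-separated fields. The following is a minimal Python sketch, not part of flo; the function name `find_malformed_lines` is hypothetical.

```python
import sys


def find_malformed_lines(lines, expected_fields=9):
    """Return (line_number, field_count) for feature lines that do not
    have exactly `expected_fields` tab-separated fields."""
    problems = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        # Blank lines and lines starting with '#' (comments, directives)
        # are not feature lines, so they are skipped.
        if not stripped or stripped.startswith("#"):
            continue
        count = len(stripped.split("\t"))
        if count != expected_fields:
            problems.append((number, count))
    return problems


# Optional command-line use: python check_fields.py lifted.gff3
# (the script name is an assumption for illustration)
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for number, count in find_malformed_lines(handle):
            print(f"line {number}: {count} fields")
```

If this reports nothing for lifted.gff3, the field count is not the problem and the filtering step itself is the next place to look.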

Homap commented 4 years ago

Thanks a lot for your prompt reply. Please find the lifted GFF attached. I am also trying to write some Python scripts to clean it myself, but of course it would be really wonderful if you could have a look as well.

Thank you,
Homa

lifted.gff3.gz

yeban commented 4 years ago

So the gff_recover.rb script is failing because you have tabs in your 9th column. Tabs within a column must be escaped; see the GFF3 spec.
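For illustration, here is a minimal Python sketch of that escaping. It is not part of flo, and the function name `escape_attribute_tabs` is hypothetical. It assumes the first eight columns are tab-free, so that any extra tabs on a line belong to the 9th (attributes) column, and it uses `%09`, the percent-encoding of a tab that the GFF3 spec prescribes.

```python
def escape_attribute_tabs(line):
    """Escape tabs inside the 9th column of a GFF3 feature line as %09.

    Assumption: columns 1-8 contain no stray tabs, so everything after
    the 8th tab is treated as the attributes column.
    """
    stripped = line.rstrip("\n")
    # Leave blank lines and comment/directive lines untouched.
    if not stripped or stripped.startswith("#"):
        return line
    # Split on the first 8 tabs only; any further tabs stay in fields[8].
    fields = stripped.split("\t", 8)
    if len(fields) < 9:
        # Too few columns to be a complete feature line; do not guess.
        return line
    fields[8] = fields[8].replace("\t", "%09")
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + newline
```

Applying this to each line of lifted.gff3 and writing the results to a new file leaves well-formed lines unchanged, so it should be safe to run over the whole file before any further cleanup.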

Even without the tabs, gff_recover.rb would largely be unable to work with your GFF, as it contains too many feature types that the script does not recognise. I wrote the script for our simpler use case: transcripts and their coding sequences. Writing your own script to clean up lifted.gff3 is thus a good idea.
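A reasonable first step for such a script is to tally the feature types (column 3) present in the file, to see what it would need to handle. The following is a minimal Python sketch under that assumption; the function name `count_feature_types` is hypothetical and nothing here reflects how gff_recover.rb itself works.

```python
from collections import Counter


def count_feature_types(lines):
    """Count how often each feature type (GFF3 column 3) occurs."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only lines with at least three columns have a type column.
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts
```

Calling `count_feature_types(open("lifted.gff3"))` and printing `most_common()` on the result lists the types from most to least frequent, which shows how far the file departs from the transcript-and-CDS case.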