wurmlab / flo

Same species annotation lift over pipeline.

file "-" does not contain 9 tab (\t) separated fields #34

Closed elcortegano closed 3 years ago

elcortegano commented 4 years ago

First of all I would like to thank you for this tool. It targets a task that is extremely difficult to do for non-model organisms with other tools.

However, I am having an issue that I have not been able to solve. According to the error, my GFF file has no header and does not contain 9 tab-separated fields. But it does (file attached: [gff_file.zip](https://github.com/wurmlab/flo/files/5493835/gff_file.zip)). This is the error:

...

Processing chromosome_2
mkdir run/ref_v5.6_exons3_chromosome_2
liftOver -gff ref_v5.6_exons3_chromosome_2.gff3 run/liftover.chn run/ref_v5.6_exons3_chromosome_2/lifted.gff3 run/ref_v5.6_exons3_chromosome_2/unlifted.gff3
Reading liftover chains
Mapping coordinates
WARNING: -gff is not recommended.
Use 'ldHgGene -out=<file.gp>' and then 'liftOver -genePred <file.gp>'
/home/elcortegano/tmp/lift/flo/gff_recover.rb run/ref_v5.6_exons3_chromosome_2/lifted.gff3 2> unprocessed.gff | gt gff3 -tidy -sort -addids -retainids - > run/ref_v5.6_exons3_chromosome_2/lifted_cleaned.gff
warning: line 1 in file "-" does not begin with "##gff-version" or "##gvf-version", create "##gff-version 3" line automatically
gt gff3: error: line 1 in file "-" does not contain 9 tab (\t) separated fields
rake aborted!
Command failed with status (1): [/home/elcortegano/tmp/lift/flo/gff_recover...]
/home/elcortegano/tmp/lift/flo/Rakefile:60:in `block (2 levels) in <top (required)>'
/home/elcortegano/tmp/lift/flo/Rakefile:40:in `each'
/home/elcortegano/tmp/lift/flo/Rakefile:40:in `block in <top (required)>'
/usr/share/rubygems-integration/all/gems/rake-13.0.1/exe/rake:27:in `<top (required)>'
Tasks: TOP => default
(See full trace by running task with --trace)

This is using the gff3 file attached above, after removing annotations with gff_remove_feats.rb so that only mRNA, exon and CDS features are left, although the same error occurs with the original file.

What is wrong with the file?

Thank you

yeban commented 4 years ago

Thanks for the kind words. Given the point at which the error is raised, I think flo is complaining about the lifted-over GFF rather than the input. Can you check whether it (run/ref_v5.6_exons3_chromosome_2/lifted.gff3) is empty? This can happen if the chromosome/scaffold names in the GFF and FASTA files don't match. Could this be the case?
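One way to test the name-mismatch hypothesis is to list every sequence name used in the GFF that has no matching FASTA header. The following is a minimal sketch and not part of flo: the function name `missing_seqids` and the FASTA file name in the usage comment are made up for illustration, and it assumes the FASTA sequence name is the header text up to the first whitespace.

```shell
# Print each sequence name (GFF column 1) that has no matching FASTA header.
# $1: GFF file, $2: FASTA file. Hypothetical helper, not part of flo.
missing_seqids() {
  awk -F'\t' '
    FILENAME == ARGV[1] {              # first operand: the FASTA file
      if (/^>/) {
        name = substr($0, 2)           # drop the leading ">"
        sub(/[ \t].*/, "", name)       # keep the text up to the first whitespace
        in_fasta[name] = 1
      }
      next
    }
    /^#/ || NF != 9 { next }           # skip directives and non-feature lines
    !($1 in in_fasta) && !reported[$1]++ { print $1 }
  ' "$2" "$1"
}

# Possible usage (the FASTA file name here is an assumption):
#   [ -s run/ref_v5.6_exons3_chromosome_2/lifted.gff3 ] || echo "lifted.gff3 is empty"
#   missing_seqids ref_v5.6_exons3_chromosome_2.gff3 source_assembly.fa
```

If the function prints nothing, every sequence name in the GFF was found among the FASTA headers, which would rule out a naming mismatch as the cause.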

elcortegano commented 4 years ago

The file is not empty (attached: lifted_gff.zip). I was also using FASTA and GFF files with just one chromosome to avoid errors related to this, and the name used in the two sets of files is the same.

yeban commented 4 years ago

Not sure why lifted.gff3 only contains exons, but it would explain the error message. Flo first constructs a chain file, then runs liftOver, and then runs gff_recover.rb to curate liftOver's output. Because lifted.gff3 only contains exons, gff_recover.rb cannot make sense of it and produces empty output. Hence the error you got.
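To see which feature types are present in a lifted file, the third GFF column can be tallied. This is a minimal sketch with a made-up helper name, `feature_counts`; it only reports what is in the file and makes no attempt to reproduce what gff_recover.rb does with it.

```shell
# Print "TYPE<TAB>COUNT" for every feature type (GFF column 3), sorted by type.
# $1: GFF file. Hypothetical helper, not part of flo.
feature_counts() {
  awk -F'\t' '
    /^#/ || NF != 9 { next }           # skip directives and non-feature lines
    { n[$3]++ }                        # tally the feature type
    END { for (t in n) print t "\t" n[t] }
  ' "$1" | sort
}

# Possible usage on the file discussed above:
#   feature_counts run/ref_v5.6_exons3_chromosome_2/lifted.gff3
```

A file whose output lists only `exon` would match the situation described here, where the lifted file contains no mRNA or CDS features.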

However, the input file in your original message was called ref_v5.6_exons3_chromosome_2.gff3, while in the previous message the folder inside run/ is called ref_v5.6_exons5_chromosome_2 instead of ref_v5.6_exons3_chromosome_2, suggesting the two are different runs.

If your use case only involves lifting over exons, you can ignore this error and use lifted.gff3 as the final output.

elcortegano commented 4 years ago

Yes that is right, this was because I was removing several types of annotation in the gff (thinking they could cause an error).

However, the error is still the same with a file including not only exons but also the transcripts (mRNA) and CDS (attached gff and lifted file: gff_and_lift.zip).

yeban commented 4 years ago

Thanks for sharing the files. It works for me when I run the following command:

gff_recover.rb run/ref_v5.6_exons3_chromosome_2/lifted.gff3 2> unprocessed.gff | gt gff3 -tidy -sort -addids -retainids - > run/ref_v5.6_exons3_chromosome_2/lifted_cleaned.gff

Here's the output: lifted_cleaned.gff.gz

But I am intrigued as to why it won't work on your system. Which Mac or Linux version and which Ruby version do you have? Does it work if you split the above command into two:

gff_recover.rb run/ref_v5.6_exons3_chromosome_2/lifted.gff3 > processed.gff 2> unprocessed.gff

gt gff3 -tidy -sort -addids -retainids processed.gff > further_processed.gff

elcortegano commented 4 years ago

I'm using Ubuntu 20.04, and ruby version is 2.7.0p0.

I think these commands revealed the source of the error. After running gff_recover.rb, I got a message telling me to install the Ruby package bio.

Once done, the error that names this issue disappears, and eventually I got a lifted.gff with no errors!
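When the two steps are run separately, the intermediate file can also be inspected before it is handed to `gt`, so that a non-GFF line (such as a stray message from an earlier step) is caught with its line number. The sketch below is an illustration under assumptions: `first_malformed_line` is a made-up helper name, and the check only mirrors the 9-tab-separated-fields rule quoted in the error, not the full validation that `gt gff3` performs.

```shell
# Print "LINE_NUMBER<TAB>CONTENT" for the first data line that does not have
# exactly 9 tab-separated fields; print nothing if every data line passes.
# $1: GFF file. Hypothetical helper, not part of flo.
first_malformed_line() {
  awk -F'\t' '
    /^#/ || $0 == "" { next }              # skip directives, comments, blank lines
    NF != 9 { print NR "\t" $0; exit }     # report the first offending line and stop
  ' "$1"
}

# Possible usage between the two split commands:
#   gff_recover.rb run/ref_v5.6_exons3_chromosome_2/lifted.gff3 > processed.gff 2> unprocessed.gff
#   first_malformed_line processed.gff
#   gt gff3 -tidy -sort -addids -retainids processed.gff > further_processed.gff
```

Checking up front that the dependency loads, for example with `ruby -e 'require "bio"'`, which exits with a non-zero status when the gem is missing, would surface the same problem before any file is processed.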

yeban commented 3 years ago

Thanks, that was very helpful - I updated flo to handle errors in this step more gracefully.