williamritchie / IRFinder

Detecting intron retention from RNA-Seq experiments

Annotation files are correctly recognised #126

Closed: koustav-pal closed this issue 4 years ago

koustav-pal commented 4 years ago

Hi,

I am trying to use IRFinder and filtered the annotation file before running the analysis. I did the processing with rtracklayer, an R package for manipulating genomic tracks and information. I exported the filtered annotation as a GFF3 file, but IRFinder did not process this file correctly. Later I found out that the gtf2bed.pl script within the package supports only a small subset of the GFF3 format. This makes the package extremely difficult to work with, since the program produces no informative logs.

Better handling of annotation files and/or better logging of messages by the program would be extremely helpful.

dg520 commented 4 years ago

@koustav-pal The gtf2bed.pl file is adapted from an open-source script that is publicly available. We believe that robust handling of annotation files depends on the annotation itself following a standard format. Here we use strict standard formats for GTF and GFF3. You can find general requirements for these two formats here and an IRFinder-specific checklist here.

Unfortunately, not all external packages generate standard GFF3 and GTF files. We have to ask users to ensure the annotation is appropriately configured, since that is out of our control. We understand your frustration, but we cannot predict beforehand what kind of non-standard formats users will provide.

To mitigate the complex structure of GFF3 and GTF, IRFinder is designed to automatically download the reference from Ensembl, where everything is standardized. For genomes that are not in Ensembl, however, users have to curate an annotation that is usable by IRFinder.

koustav-pal commented 4 years ago

Hi @dg520,

While recreating the GFF3 annotation file, I specifically checked the IRFinder annotation requirements and made sure the same fields were present. However, the file was not parsed correctly. This was because the gtf2bed.pl script used in the package locates quoted strings and then parses those out. This is a suggestion for a more reliable way of parsing the field values. A more reliable and fail-safe approach would be to take any string between the field name and the ; and then strip whitespace and quote characters from these strings. This would in no way be less restrictive.
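
To illustrate the idea, here is a minimal sketch in Python. It is not the code in gtf2bed.pl (which is Perl), and the function name `parse_attributes` is hypothetical. It assumes the attribute column uses `;` as the separator, with either GTF-style `key "value"` or GFF3-style `key=value` pairs, and it does not handle a `;` inside a quoted value.

```python
def parse_attributes(field):
    """Parse the attribute column (column 9) of a GTF or GFF3 line.

    Hypothetical helper: splits on ';', then strips whitespace and
    quote characters, so quoted and unquoted values parse the same way.
    """
    attributes = {}
    for chunk in field.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after a trailing ';'
        head = chunk.split("=", 1)[0]
        if "=" in chunk and " " not in head:
            # GFF3 style: key=value
            key, value = chunk.split("=", 1)
        else:
            # GTF style: key "value" (or key value without quotes)
            parts = chunk.split(None, 1)
            if len(parts) != 2:
                # Fail loudly so the user sees which attribute is bad
                raise ValueError("cannot parse attribute: %r" % chunk)
            key, value = parts
        attributes[key.strip()] = value.strip().strip("\"'")
    return attributes


if __name__ == "__main__":
    print(parse_attributes('gene_id "G1"; transcript_id "T1";'))
    print(parse_attributes("gene_id G1; transcript_id T1"))
    print(parse_attributes("ID=gene:G1;Name=foo bar"))
```

With this approach, a file that already quotes its values is parsed exactly as before, while unquoted values are accepted too, and an attribute that cannot be split raises an error naming the offending text instead of being skipped silently.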

dg520 commented 4 years ago

@koustav-pal While we really appreciate your suggestion, we believe the current gtf2bed.pl is robust when the annotation strictly follows the standard format (e.g. attributes with quotes). With that said, I'll mark your suggestion down and see if we can implement it in the next update.