xapple / track

Provides easy read/write access to genomic tracks
http://xapple.github.com/track/
GNU General Public License v3.0

Allow single-base intervals in GFF where the start position equals the e... #6

Closed tfrayner closed 10 years ago

tfrayner commented 10 years ago

The GFF specification used by this module (http://genome.ucsc.edu/FAQ/FAQformat.html#format3) says that the end position of a feature is to be specified as inclusive. To me this indicates that a single-base interval should be specified with the same start and end positions, and that such intervals are not interpreted as null by the GFF spec. This fix merely allows for such GFF files to be read by track. Thanks very much!

xapple commented 10 years ago

Yes, the infamous numbering conventions. I've always followed the intuitive logic that a feature that starts and stops in the same place has a length of zero nucleotides. However, this is not always the intended meaning, depending on the scheme used by your data source. As you have pointed out, some providers use equal boundaries to signify a feature of one nucleotide. When this happens, you can use some of the existing convenience functions to convert data from one common convention to the other:

http://xapple.github.io/track/content/track.html#track.Track.ensembl_to_ucsc

Unfortunately I don't think merging your pull request is a good idea. It's important that the internal representation used by the track package remains constant across all parsers and operations. This standard is thoroughly described here in the documentation so as to alleviate the confusion:

http://xapple.github.io/track/content/numbering.html

Of course, you are welcome to change your copy of the code in any way you like. Be aware, though, that allowing null intervals might produce unexpected effects when manipulating the tracks and computing things such as overlaps.
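
To make the two readings concrete, here is a minimal Python sketch. The helper names are hypothetical and not part of track. It assumes only what is stated above: under one reading a feature with equal start and end has length zero, while under the inclusive-end reading it covers one base. Only the end coordinate is adjusted; any difference in how the start is numbered is not handled here.

```python
def length_half_open(start, end):
    """Length when `end` is exclusive: equal boundaries mean zero bases."""
    return end - start


def length_inclusive(start, end):
    """Length when `end` is inclusive: equal boundaries mean one base."""
    return end - start + 1


def inclusive_to_half_open(start, end):
    """Convert an inclusive-end interval by moving the end one base right."""
    return start, end + 1


def overlap_half_open(a, b):
    """Number of bases shared by two (start, end) intervals, end exclusive."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


snp = (100, 100)   # single-base feature written with equal boundaries
gene = (50, 200)

print(length_half_open(*snp))                                  # 0
print(length_inclusive(*snp))                                  # 1
print(overlap_half_open(snp, gene))                            # 0
print(overlap_half_open(inclusive_to_half_open(*snp), gene))   # 1
```

Read with an exclusive end, the single-base feature silently drops out of the overlap computation, which is the kind of unexpected effect mentioned above.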

tfrayner commented 10 years ago

Thank you very much for considering my request. I completely understand your reasoning. However, while the convenience functions might work in the general case, I do not see how one might apply them to loading a pre-existing GFF with these problematic intervals, since the Track.load() function throws an exception before one can call Track.ensembl_to_ucsc(). Thanks anyway :-)

xapple commented 10 years ago

OK, well, in these particular circumstances, when you know that the source is giving you null-interval tracks, I would suggest simply using a one-line awk command to add one to your end column:

awk 'BEGIN{FS=OFS="\t"} /^#/{print; next} {$5+=1; print}' gff_test1.gff > gff_test1_fixed.gff

You can apply this to all the files coming from that source, and then start working with them safely.
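
If you would rather stay in Python, the same fix can be sketched as follows. This is only an illustration: `shift_gff_end` is a hypothetical helper, not part of track, and it assumes tab-separated GFF lines with the end position in column 5. Comment lines and lines with fewer than five tab-separated fields are copied unchanged.

```python
def shift_gff_end(src_path, dst_path, offset=1):
    """Copy a GFF file, adding `offset` to column 5 (end) of each feature line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            # Pass through comments, blank lines and anything that does not
            # look like a feature line (fewer than five tab-separated fields).
            if line.startswith("#") or len(fields) < 5:
                dst.write(line)
                continue
            fields[4] = str(int(fields[4]) + offset)
            dst.write("\t".join(fields) + "\n")
```

Calling `shift_gff_end("gff_test1.gff", "gff_test1_fixed.gff")` should produce a file in which single-base features no longer have equal start and end positions.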