Closed by tfrayner 10 years ago
Yes, the infamous numbering conventions. I've always followed the intuitive logic that a feature that starts and stops in the same place has a length of zero nucleotides. However, this is not always the intended meaning; it depends on the scheme used by your data source. As you have pointed out, some providers use equal boundaries to signify a feature of one nucleotide. When this happens, you can use some of the existing convenience functions to convert data from one common convention to the other:
http://xapple.github.io/track/content/track.html#track.Track.ensembl_to_ucsc
Unfortunately I don't think merging your pull request is a good idea. It's important that the internal representation used by the track package remains constant across all parsers and operations. This standard is thoroughly described here in the documentation so as to alleviate the confusion:
http://xapple.github.io/track/content/numbering.html
Of course, you are welcome to change your copy of the code in any way you like. Be aware, though, that allowing null intervals might produce unexpected effects when manipulating the tracks and computing things such as overlaps.
Thank you very much for considering my request. I completely understand your reasoning. However, while the convenience functions might work in the general case, I do not see how one might apply them to loading a pre-existing GFF with these problematic intervals, since the Track.load() function throws an exception before one can call Track.ensembl_to_ucsc(). Thanks anyway :-)
OK, well, in these particular circumstances, when you know that the source is giving you null-interval tracks, I wouldn't advise against simply using a one-line awk command to add one to the end column (the fifth field):
awk 'BEGIN{FS=OFS="\t"} NF>=5 && !/^#/ {$5=$5+1} {print}' gff_test1.gff > gff_test1_fixed.gff
You can apply this to all the files coming from that source, and then start working with them safely.
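If you would rather do the same preprocessing in Python before handing the file to track, something along these lines should work. This is only a sketch: `shift_gff_end` is a hypothetical helper, not part of the track package, and it assumes a tab-delimited GFF whose fifth column is the end coordinate.

```python
import io


def shift_gff_end(lines, offset=1):
    """Yield GFF lines with the end column (5th field) increased by `offset`.

    Hypothetical helper, not part of the track package.
    """
    for line in lines:
        # Remember whether the line had a trailing newline so we can restore it.
        newline = "\n" if line.endswith("\n") else ""
        fields = line.rstrip("\n").split("\t")
        # Leave comments, track/browser lines, and malformed rows untouched.
        if line.startswith("#") or len(fields) < 5 or not fields[4].isdigit():
            yield line
            continue
        fields[4] = str(int(fields[4]) + offset)
        yield "\t".join(fields) + newline


# Small in-memory example: one header line and one single-base feature.
sample = io.StringIO(
    "##gff-version 2\n"
    "chr1\tsrc\texon\t100\t100\t.\t+\t.\tgene1\n"
)
fixed = list(shift_gff_end(sample))
```

To process a real file, open the input and output files and pass the result to `writelines`, for example `out.writelines(shift_gff_end(src))`.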
The GFF specification used by this module (http://genome.ucsc.edu/FAQ/FAQformat.html#format3) says that the end position of a feature is to be specified as inclusive. To me this indicates that a single-base interval should be specified with the same start and end positions, and that such intervals are not interpreted as null by the GFF spec. This fix merely allows for such GFF files to be read by track. Thanks very much!
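To make the two readings concrete, here is a small arithmetic sketch. The function names are illustrative only and are not taken from track or from the GFF specification. It also assumes the start coordinate is interpreted the same way in both schemes, so that only the end convention differs.

```python
def inclusive_length(start, end):
    """Length when the end position is part of the feature (inclusive end)."""
    return end - start + 1


def half_open_length(start, end):
    """Length when the end position is the first base after the feature."""
    return end - start


def inclusive_to_half_open(start, end):
    """Re-express an inclusive-end interval with an exclusive end.

    Assumption: the start coordinate means the same thing in both schemes.
    """
    return start, end + 1


# A feature written as start=100, end=100:
one_base = inclusive_length(100, 100)     # a single base under the inclusive reading
null_span = half_open_length(100, 100)    # an empty interval under the half-open reading
converted = inclusive_to_half_open(100, 100)
```

Under the inclusive reading the interval covers one base. Under the half-open reading it covers none, which is why adding one to the end column turns such a feature into a non-null interval.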