While using track to parse GTF files, I noticed that the round-trip conversion has some unexpected behaviors. More specifically:
The attribute key-value pairs are never written back as they were (the parser internally converts all of them into a '.').
The source column values of the original file are always discarded (since the name of the source field differs between the parser and the serializer).
The GTF entries are always sorted by coordinate, with no way to control this (which is nice but not always desired).
When a GTF score column has a value of '.', the parser always converts it to 0.0. The two do not mean exactly the same thing, and I would guess that users prefer to keep their original values intact.
This pull request mainly addresses those four behaviors:
Attribute key-value pairs are now stored and written as they were in the initial GTF file. I tried to follow the existing convention of storing all attribute key-value pairs as extra items in the internal pyrow object.
Source column values are now passed properly to the newly written GTF file (I changed the field name in both the parser and serializer to 'feature', following the GTF spec).
I updated how track.read passes around the order argument and changed its default value as well. The order value is now passed through as given by the caller when selections is not a single string. Also, the default value is now '' (empty string), so that users can better control the line ordering (by default it is the same as in the input file).
The GTF parser now keeps the score column value as a string.
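To illustrate the intended round-trip behavior for the attribute and score columns, here is a minimal standalone sketch. It is not the code in this pull request and does not use track's API: `parse_attributes`, `format_attributes` and `roundtrip_line` are hypothetical helpers, and the sketch assumes quoted attribute values that contain no semicolons.

```python
def parse_attributes(column):
    """Return the attribute column as an ordered list of (key, value) pairs."""
    pairs = []
    for item in column.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the trailing ';'
        key, _, value = item.partition(" ")
        pairs.append((key, value.strip().strip('"')))
    return pairs


def format_attributes(pairs):
    """Write the pairs back in their original order, one 'key "value";' each."""
    return " ".join('%s "%s";' % (key, value) for key, value in pairs)


def roundtrip_line(line):
    """Parse and re-serialize one GTF line (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    # cols[5] is the score: it is left as a string, so '.' stays '.'
    # and is never turned into 0.0.
    cols[8] = format_attributes(parse_attributes(cols[8]))
    return "\t".join(cols)
```

With this approach a line whose score is '.' and a line whose score is 0.0 both come back unchanged, and extra attributes such as exon_number are kept in their original order.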
Along with those changes, I also:
Updated the GTF parser to also check whether the attribute column keys start with gene_id and transcript_id.
Added a new GTF test case whose rows have extra attribute key-value pairs and whose score values are either 0.0 or '.'.
Added a new test in the GTF test suite for round-trip conversion (GTF -> SQL and vice versa).
Updated the old gtf_test1.sql test case to reflect the updates in the parser and serializer.
Updated the formatting of the gtf_test1.gtf test case to use tabs as the column separator instead of spaces (spaces are invalid according to the spec).
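As a rough sketch of the attribute key check mentioned above (again standalone, not the parser code itself), assuming the check means that the first two keys must be gene_id and transcript_id, in that order; `has_mandatory_attributes` is a hypothetical name:

```python
def has_mandatory_attributes(column):
    """Return True if the first two attribute keys are gene_id, transcript_id."""
    keys = [
        item.strip().split(" ", 1)[0]
        for item in column.split(";")
        if item.strip()
    ]
    return keys[:2] == ["gene_id", "transcript_id"]
```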
All GTF tests pass on my machine. There are errors in other tests, but they seem to be unrelated to my changes (they were there before I made any changes):
ERROR: runTest (track.test.format_gzip.TestWithoutExtension):
OSError: [Errno 2] No such file or directory: '/home/bow/devel/repos/watch/track/samples/gzip/features4.gzip'
ERROR: custom_boolean (track.test.manipulate.TestManips):
ImportError: No module named gMiner.manipulate
FAIL: runTest (track.test.format_bigwig.TestRoundtrip):
AssertionError: The files: '/home/bow/devel/repos/watch/track/samples/bigwig/scores2.bigwig' and '/tmp/tmpcS7V1k.bigwig' differ
Any thoughts on the changes :)?