xapple / track

Provides easy read/write access to genomic tracks
http://xapple.github.com/track/
GNU General Public License v3.0

Track manipulations assume features have names #4

Open Phlya opened 10 years ago

Phlya commented 10 years ago

Hi! I know you said you don't really maintain the package anymore, but I found something odd and thought you might want to fix it. For now I have applied a dirty hack in my installation where necessary, but without understanding the architecture of the package it is hard to do it properly.

The thing is, all the track manipulations, such as overlapping, seem to assume that features have a name and two more fields after it. You can see this in, for example, overlap.py. Here is the make_feature function from git:

def make_feature(a, b):
    return (max(a[0],b[0]),
            min(a[1],b[1]),
            make_name(a[2], b[2]),
            (a[3]+b[3])/2.0,
            a[4]==b[4] and b[4] or 0) + b[5:]

It raised an error on the line that calls make_name, because there was no a[2] or b[2] in my track: the features didn't have names.

This is how my dirty function looks and causes no trouble:

def make_feature(a, b):
    try:
        name = make_name(a[2], b[2])
    except:
        try:
            return (max(a[0],b[0]),
                    min(a[1],b[1]))
        except:
            raise ValueError
    return (max(a[0],b[0]),
            min(a[1],b[1]),
            name,
            (a[3]+b[3])/2.0,
            a[4]==b[4] and b[4] or 0) + b[5:]

The change is quite obvious, and it solved the problem, but it is not really a good way to solve it.
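A cleaner alternative to the try/except might be to pad short features with default values before combining them. The following is only a minimal sketch under stated assumptions: the field order (start, end, name, score, strand) is taken from the make_feature function above, while DEFAULTS, pad, and this version of make_name are hypothetical stand-ins, not the package's actual code.

```python
# Assumed defaults for (name, score, strand); chosen for illustration only.
DEFAULTS = ("", 0.0, 0)

def make_name(a, b):
    """Hypothetical stand-in for the package's make_name helper."""
    return " + ".join(n for n in (a, b) if n)

def pad(feature):
    """Extend a feature to at least (start, end, name, score, strand)."""
    feature = tuple(feature)
    # Append only the defaults for the fields that are missing.
    return feature + DEFAULTS[max(len(feature) - 2, 0):]

def make_feature(a, b):
    """Overlap of two features; only start and end are required."""
    n_fields = max(len(a), len(b))
    a, b = pad(a), pad(b)
    merged = (max(a[0], b[0]),
              min(a[1], b[1]),
              make_name(a[2], b[2]),
              (a[3] + b[3]) / 2.0,
              a[4] if a[4] == b[4] else 0) + b[5:]
    # Do not return more fields than the richer input feature had.
    return merged[:max(n_fields, 2)]
```

With this sketch, two features that only carry start and end produce a two-field result, and full features behave as in the original function. Note that when only one side has a score, it gets averaged with the assumed default of 0.0, which may or may not be what you want.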

Another thing, concerning fields, is that this doesn't work:

two_not_one = overlap(complement(track1), track2)

It causes this:

Traceback (most recent call last):
  File "my_script.py", line 17, in <module>
    two_not_one = overlap(complement(track1), track2)
  File "/usr/local/lib/python2.7/dist-packages/track-1.2.0-py2.7-linux-x86_64.egg/track/manipulate.py", line 206, in __call__
    rest_of_fields = [f for f in value.fields if f not in first_fields]
AttributeError: 'VirtualTrack' object has no attribute 'fields'

I had to comment out a few lines in manipulate.py to make it work, though it probably now loses the field information from tracks (I don't have any fields except for start and end, so it is not a problem for now).

It would be really nice if you could look into these issues.
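Instead of commenting lines out, a fallback for objects without a fields attribute might be enough. This is a rough sketch, not the package's code: fields_of, DEFAULT_FIELDS, and the two stand-in classes are hypothetical, and only the list comprehension is taken from the traceback above.

```python
# Assumed minimal field set for tracks that do not declare their fields.
DEFAULT_FIELDS = ("start", "end")

def fields_of(track, default=DEFAULT_FIELDS):
    """Return track.fields if it exists, otherwise the assumed default."""
    return list(getattr(track, "fields", default))

def rest_of_fields(track, first_fields):
    """Same comprehension as in the traceback, but tolerant of missing fields."""
    return [f for f in fields_of(track) if f not in first_fields]

class StoredTrack:
    """Hypothetical stand-in for a track that knows its fields."""
    fields = ["start", "end", "name", "score", "strand"]

class LazyTrack:
    """Hypothetical stand-in for an object with no fields attribute."""
    pass
```

Calling rest_of_fields(LazyTrack(), ["start", "end"]) then returns an empty list instead of raising AttributeError. A proper fix would probably make the result of complement() report its real fields.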

xapple commented 10 years ago

Yeah, the fields thing should definitely be corrected. I guess the package never got used enough to catch all these kinds of bugs. As for your track missing the names of the features, I thought I recalled having something that would add blank fields if, for instance, the track was missing names or strand information... maybe it's not working correctly.

The story behind all this is that I wrote this package two years ago when I was hired in a bioinformatics core facility. They wanted something to process and manipulate genomic tracks from different formats. This is the project we came up with. But plans changed, the team moved on to something else, I started a PhD at a different university, and it never really got used.

It's a pity because I thought it was a nice idea with some potential, and I did invest about 6 months coding it. In my opinion, the genomics field really needs a universal parser library with an SQL (or HDF5) backend and a comprehensive interface. Unfortunately I can't provide support for it anymore as I'm overwhelmed with other things. But the code is GPL licensed, so you are welcome to do whatever you want with it! A few other users exist, and @bow has contributed to the package a bit in the past. Maybe you could ask him?

It was one of my first Python projects, and I would do it differently today. I think I made the architecture a bit too complicated; now I would rather go for something that requires more lines of code from the user but is much simpler to read and maintain.

Phlya commented 10 years ago

Right, I see... It is a very sad story, I would say, because as far as I know it is the most comprehensive Python package for such things, while others can't really handle many formats. It also has very nice options for manipulating tracks, which are really useful. Of course there are things to improve, not counting bugs, but still, I like it very much.

BTW I have posted an issue in Biopython's tracker about creating or adopting an existing parser for such files and suggested using track; would be great if you commented there.

As for me contributing myself... I would love to, but I am afraid I am not experienced enough, and my contribution won't do much good to the project.

As for your last point, I think that the good side (ease of use) is also very important; that's why I am using the package myself.

xapple commented 10 years ago

Can you link to the issue in Biopython's tracker?

Phlya commented 10 years ago

https://github.com/biopython/biopython/issues/278