wireservice / agate

A Python data analysis library that is optimized for humans instead of machines.
https://agate.readthedocs.io
MIT License
1.17k stars 155 forks source link

There is no obvious way to compute "moving average" like values #415

Closed anderser closed 8 years ago

anderser commented 8 years ago

This is a feature which might fit more into agate-stats or another extension. If so, shout out and I'll move the issue.

I need to find streaks of similar values in a dataset. This method is often used in sports (finding consecutive wins/losses and determine the period with most wins/losses in a row). Another application might be finding the period/date range with days of rain.

To me this is a Computation, but I am having trouble understanding how (or if) computation can reference the previous row.

Before submitting a PR, I'll just try to explain my method:

The calculation takes two parameters: the column which is your key/unique id. Maybe a timestamp, date. The other param is the column you want to find streaks of values in.

The prepare method then loops through the rows, compares the value to the previous row and if there is a new value, increases the streak value (an integer). The resulting streak value for each row is saved in a dict with value of key column as key.

The run methods then retrieves the corresponding streak value from the dict for the current row.

A working, but maybe not "to Agate standards" code here: https://gist.github.com/anderser/12d32a25f385f8a7f6d1

Sample table

|-------------+-------------|
|        the_date | has_rain  |
|-------------+-------------|
|  1985-01-01 |       True  |
|  1985-01-02 |      False  |
|  1985-01-03 |      False  |
|  1985-01-04 |      False  |
|  1985-01-05 |       True  |
|  1985-01-06 |       True  |
|  1985-01-07 |       True  |
|  1985-01-08 |       True  |
|  1985-01-09 |       True  |
|  1985-01-10 |       True  |
|         ... |        ...  |
|-------------+-------------|

Computing the streaks

rainperday_with_streaks =  rainperday.compute([
    ('streak', Streaks('the_date','has_rain'))
])

Group the data by streaks:

grouped_by_streaks = rainperday_with_streaks.group_by('streak')

Finding the start/end date of longest streaks

longest_streaks = grouped_by_streaks.aggregate([
    ('count', agate.Length()),
    ('from_date', agate.Min('the_date')),
    ('to_date', agate.Max('the_date'))
]).order_by('count', reverse=True)

Which would give you something like this:

|---------+------------+-------------|
|  count |  from_date |      to_date  |
|---------+------------+-------------|
|      24 | 2015-02-17 | 2015-03-12  |
|      21 | 2015-05-22 | 2015-06-11  |
|      19 | 2015-01-01 | 2015-01-19  |
|      16 | 2015-11-04 | 2015-11-19  |
|      15 | 2015-07-30 | 2015-08-13  |
|      14 | 2015-11-24 | 2015-12-07  |
|      13 | 2015-07-16 | 2015-07-28  |
|      11 | 2015-04-21 | 2015-05-01  |
|      11 | 2015-09-10 | 2015-09-20  |
|      10 | 2015-05-05 | 2015-05-14  |
|     ... |        ... |        ...  |
|---------+------------+-------------|
onyxfish commented 8 years ago

Hi Anders! This is a use-case I wrestled with when I was designing the computations & aggregations. This issue also comes up with things like running averages. The way you've done it is the way I intended for it to be done, though I'll admit it produces somewhat ugly code.

The alternative is to take analysis outside of a Computation and just iterate over the rows and produce new data using ordinary Python methods. Although this is code won't look like your other agate code, I always intended it to be possible to just treat an agate table as a normal Python object. (Via table.columns and table.rows.)

If you have suggestions for alternate ways of supporting this kind of operation I'd love to hear them!

onyxfish commented 8 years ago

Flagging this as a documentation issue. Running average would make a good example for the cookbook.

anderser commented 8 years ago

I call it guerilla code. It isn't pretty, but it does the job :) Some docs on how to do things like this would be great, yes. What I struggled most with was that each row doesn't (as far as I understood) have a unique key that can be used in the new dict in order to wire all up correctly in the run method. But maybe I just made this more complicated by making a Computation as you wrote. Pure Python might be easier to do (and to read/remember later)

onyxfish commented 8 years ago

Ah ha, I see. I had totally overlooked the problem before. Okay, let me think about this more. There might be a solution involving providing the row index or name to the Computation.run...

onyxfish commented 8 years ago

Huh, yeah, this doesn't really work at all, does it? Elevating the priority of this. Amazing it took so long for this case to come up!

anderser commented 8 years ago

Well it worked for me the way I did it, but maybe that was pure luck. It never felt good messing around with that key_column_name to get a "unique" key to add the streak value to anyway. I'll dig more into the Agate code and try to think of something if I understand what is going on there

onyxfish commented 8 years ago

Yeah, the way you did it is a good hack, but certainly not an ideal solution. There really shouldn't be a requirement that the table has a unique ID at all in order to compute something like this.

onyxfish commented 8 years ago

@anderser I've just pushed a set of changes to master that modify the implementation of Computation to better support these cases. This is a backwards-incompatible change, so I'm approaching it cautiously, but I feel pretty confident it's the right solution. Mind taking a look?

Here is an example of how the new interface can solve your problem: http://agate.readthedocs.org/en/latest/cookbook/compute.html#simple-moving-average

onyxfish commented 8 years ago

I'm satisfied that this working and I'm noticing some very pleasant symmetries with other parts of agate, so I'm calling this finished. If anybody spots any remaining gaps in the implementation, please reopen!

anderser commented 8 years ago

@onyxfish This is just great! Your changes using table in run method greatly simplified my Computation (and probably enabled more others). Thanks for the swift response to my messy code/proposal.

For the record: here is the adjusted Streaks Computation: https://gist.github.com/anderser/12d32a25f385f8a7f6d1

onyxfish commented 8 years ago

Hurray! That looks great! Do you mind if I use that code as an example?

anderser commented 8 years ago

Of course you may. Please do. And change if needed.