xarray-contrib / flox

Fast & furious GroupBy operations for dask.array
https://flox.readthedocs.io
Apache License 2.0

Fix numbagg aggregations #282

Closed · dcherian closed this 1 year ago

dcherian commented 1 year ago

Closes #281

  1. Fix version check (a rough sketch follows this list).
  2. Handle fill_value for numbagg.
  3. Enable numbagg for count.
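
A minimal, hypothetical sketch of the kind of guard item 1 describes; the helper name and the minimum version are assumptions, not flox's actual code:

from packaging.version import Version

def has_compatible_numbagg(minversion="0.3.0"):  # hypothetical helper; minimum version is an assumption
    try:
        import numbagg
    except ImportError:
        return False
    # compare parsed versions, not raw strings -- string comparison is the usual bug here
    return Version(numbagg.__version__) >= Version(minversion)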

TODO:

max-sixty commented 1 year ago

OK awesome, thanks a lot; let me know if there's anything you need on the numbagg side!

dcherian commented 1 year ago

FYI I'm running into a major blocker with how Xarray handles groups with all-NaN entries, and groups with no entries. I think you should just choose numbagg explicitly using the engine kwarg and move on (see the sketch below).
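
For reference, opting in explicitly looks something like this (a minimal sketch; the toy data is made up, but engine="numbagg" is the flox engine kwarg being discussed):

import numpy as np
import flox

out = flox.groupby_reduce(
    np.array([1.0, 2.0, np.nan, 4.0]),  # values, with a NaN to exercise the skip-NaN path
    np.array([0, 0, 1, 1]),             # integer group labels
    func="nansum",
    engine="numbagg",  # select numbagg explicitly instead of relying on auto-selection
)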

max-sixty commented 1 year ago

Thanks a lot for doing all these — it looks like difficult and finicky work.

Let me know what I can do on the numbagg side to make it less of a burden on your end, particularly anything that's bad or wrong in numbagg's behavior. (I'm still happy to fix bad upstream conventions if it makes your job much easier.)

dcherian commented 1 year ago

I'm not sure you should; it's really an Xarray annoyance:

import pandas as pd
import numpy as np
from xarray import Dataset

# 10 observations at 6-hourly frequency, several of them NaN
times = pd.date_range("2000-01-01", freq="6H", periods=10)
ds = Dataset(
    {
        "bar": ("time", [1, 2, 3, np.nan, np.nan, np.nan, 4, 5, np.nan, np.nan], {"meta": "data"}),
        "time": times,
    }
)

# resampling to 3-hourly introduces windows with no observations at all
expected_time = pd.date_range("2000-01-01", freq="3H", periods=19)
expected = ds.reindex(time=expected_time)
ds.resample(time="3H").sum().bar.data
# array([ 1., nan,  2., nan,  3., nan,  0., nan,  0., nan,  0., nan,  4., nan,  5., nan,  0., nan,  0.])

^ It's NaN when there are no observations in the window, and 0 if there are only NaNs in the window. Both numpy_groupies and numbagg would just give you all 0s (i.e. the identity element), which is sensible to me.

The Xarray behaviour is really an artifact of the fact that we accumulate np.nansum([np.nan]) in windows with all-NaN observations, and then reindex to the final time vector with the default fill_value=np.nan (https://github.com/pydata/xarray/blob/feba6984aa914327408fee3c286dae15969d2a2f/xarray/core/groupby.py#L1435). I think the result above is an unintended consequence of the implementation.
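
A two-line demonstration of that mechanism, independent of the linked Xarray code:

import numpy as np
import pandas as pd

np.nansum([np.nan])  # 0.0: an all-NaN window reduces to the identity element

# empty windows appear only after reindexing, and reindex fills with NaN by default
pd.Series([0.0], index=[0]).reindex([0, 1]).to_numpy()  # array([ 0., nan])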

max-sixty commented 1 year ago

> The Xarray behaviour is really an artifact of the fact that we accumulate np.nansum([np.nan]) in windows with all-NaN observations, and then reindex to the final time vector with the default fill_value=np.nan. I think the result above is an unintended consequence of the implementation.

Great, definitely agree. Possibly we could change that.

Pandas even does the arguably more logical thing!

ds.to_pandas().resample("3H").sum()
Out[5]:
                     bar
time
2000-01-01 00:00:00  1.0
2000-01-01 03:00:00  0.0
2000-01-01 06:00:00  2.0
2000-01-01 09:00:00  0.0
2000-01-01 12:00:00  3.0
2000-01-01 15:00:00  0.0
2000-01-01 18:00:00  0.0
2000-01-01 21:00:00  0.0
2000-01-02 00:00:00  0.0
2000-01-02 03:00:00  0.0
2000-01-02 06:00:00  0.0
2000-01-02 09:00:00  0.0
2000-01-02 12:00:00  4.0
2000-01-02 15:00:00  0.0
2000-01-02 18:00:00  5.0
2000-01-02 21:00:00  0.0
2000-01-03 00:00:00  0.0
2000-01-03 03:00:00  0.0
2000-01-03 06:00:00  0.0
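
Pandas can also produce NaN here via the min_count argument to sum: with min_count=1, any window without at least one valid observation (empty or all-NaN alike) comes back as NaN.

ds.to_pandas().resample("3H").sum(min_count=1)
# windows with no non-NaN observations are now NaN instead of 0.0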

dcherian commented 1 year ago

Sweet, looks like we are actually using numbagg by default now.

I don't understand the first row for mean, but the rest are all functions I expect to send to numbagg. (The Ratio column is After/Before, so lower is faster.)

| Before [273d319e] | After [54d57d73] | Ratio | Benchmark (Parameter)                                            |
|-------------------|------------------|-------|------------------------------------------------------------------|
| 297±0.7ms  | 149±0.4ms   | 0.5   | reduce.ChunkReduce2DAllAxes.time_reduce('mean', 'bins', None)    |
| 65.0±0.3ms | 28.1±0.3ms  | 0.43  | reduce.ChunkReduce2D.time_reduce('count', 'None', None)          |
| 144±0.5ms  | 45.0±0.4ms  | 0.31  | reduce.ChunkReduce2DAllAxes.time_reduce('nanmax', 'None', None)  |
| 137±0.4ms  | 41.6±0.4ms  | 0.3   | reduce.ChunkReduce2DAllAxes.time_reduce('count', 'None', None)   |
| 117±0.1ms  | 34.0±0.1ms  | 0.29  | reduce.ChunkReduce2D.time_reduce('count', 'bins', None)          |
| 144±0.6ms  | 42.0±0.3ms  | 0.29  | reduce.ChunkReduce2DAllAxes.time_reduce('nansum', 'None', None)  |
| 139±0.3ms  | 34.2±0.04ms | 0.25  | reduce.ChunkReduce2D.time_reduce('nansum', 'bins', None)         |
| 75.2±0.2ms | 17.1±0.2ms  | 0.23  | reduce.ChunkReduce2D.time_reduce('nansum', 'None', None)         |
| 271±1ms    | 58.2±0.1ms  | 0.22  | reduce.ChunkReduce2DAllAxes.time_reduce('count', 'bins', None)   |
| 205±0.3ms  | 44.5±0.4ms  | 0.22  | reduce.ChunkReduce2DAllAxes.time_reduce('nanmean', 'None', None) |
| 276±0.9ms  | 59.2±0.4ms  | 0.21  | reduce.ChunkReduce2DAllAxes.time_reduce('nansum', 'bins', None)  |
| 102±2ms    | 20.0±0.1ms  | 0.2   | reduce.ChunkReduce2D.time_reduce('nanmax', 'None', None)         |
| 337±0.8ms  | 59.7±0.4ms  | 0.18  | reduce.ChunkReduce2DAllAxes.time_reduce('nanmean', 'bins', None) |
| 208±0.3ms  | 35.1±0.1ms  | 0.17  | reduce.ChunkReduce2D.time_reduce('nanmean', 'bins', None)        |
| 153±0.4ms  | 23.8±0.1ms  | 0.16  | reduce.ChunkReduce2D.time_reduce('nanmean', 'None', None)        |
| 276±0.5ms  | 45.1±0.4ms  | 0.16  | reduce.ChunkReduce2DAllAxes.time_reduce('nanmax', 'bins', None)  |
| 205±0.3ms  | 19.0±0.1ms  | 0.09  | reduce.ChunkReduce2D.time_reduce('nanmax', 'bins', None)         |

max-sixty commented 1 year ago

Nice!!

And I recently enabled parallelism in numbagg by default; it scales really well across multiple dims now when multiple cores are available.
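
A small sketch of what that enables; the assumptions (worth checking against numbagg's docs) are that numbagg's parallel kernels run on numba's threading layer, so numba.set_num_threads controls the worker count, and that axis accepts a tuple as in numpy:

import numba
import numpy as np
import numbagg

numba.set_num_threads(4)  # assumption: numbagg parallelism uses numba's thread pool

arr = np.random.rand(10, 200, 200)
numbagg.nansum(arr, axis=(1, 2))  # reduce over two dims at once, parallel across the first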