OK awesome, thanks a lot; let me know if there's anything you need on the numbagg side!
FYI I'm running into a major blocker with how Xarray handles groups with all-NaN entries and groups with no entries. I think you should just choose numbagg explicitly using the `engine` kwarg and move on.
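For reference, choosing the engine explicitly could look something like this with flox's `groupby_reduce` (a minimal sketch; it assumes a flox version where `engine="numbagg"` is accepted, and the toy `values`/`labels` arrays are made up for illustration):

```python
import numpy as np
from flox.core import groupby_reduce

# Hypothetical toy data: one clean group, one mixed group, one all-NaN group.
values = np.array([1.0, np.nan, 2.0, 3.0, np.nan])
labels = np.array([0, 0, 1, 1, 2])

# Opt in to the numbagg kernels explicitly instead of relying on engine heuristics.
result, groups = groupby_reduce(values, labels, func="nansum", engine="numbagg")
print(result)  # expected: [1. 5. 0.] -- the all-NaN group reduces to the identity element
```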
Thanks a lot for doing all of this; it looks like difficult and finicky work.
Let me know what I can do on the numbagg side to make it less of a burden on your end, particularly anything that's bad or wrong in numbagg's behavior (I'm still happy to fix bad upstream conventions if it makes your job much easier, though).
I'm not sure you should; it's really an Xarray annoyance:
```python
import pandas as pd
import numpy as np
from xarray import Dataset

times = pd.date_range("2000-01-01", freq="6H", periods=10)
ds = Dataset(
    {
        "bar": ("time", [1, 2, 3, np.nan, np.nan, np.nan, 4, 5, np.nan, np.nan], {"meta": "data"}),
        "time": times,
    }
)
expected_time = pd.date_range("2000-01-01", freq="3H", periods=19)
expected = ds.reindex(time=expected_time)

ds.resample(time="3H").sum().bar.data
# array([ 1., nan, 2., nan, 3., nan, 0., nan, 0., nan, 0., nan, 4., nan, 5., nan, 0., nan, 0.])
```
^ It's NaN when there are no observations in the window, and 0 if there are only NaNs in the window. Both numpy_groupies and numbagg would just give you all 0s (i.e. the identity element), which is sensible to me.
The Xarray behaviour is really an artifact of the fact that we accumulate `np.nansum([np.nan])` in windows with all-NaN observations, and then reindex with the default `fill_value=np.nan` to the final time vector:
https://github.com/pydata/xarray/blob/feba6984aa914327408fee3c286dae15969d2a2f/xarray/core/groupby.py#L1435
I think the result above is an unintended consequence of the implementation.
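To spell out the two ingredients (just a sketch of the effect, not the actual Xarray code path):

```python
import numpy as np
import pandas as pd

# Ingredient 1: an all-NaN window reduces to the identity element.
np.nansum([np.nan])  # -> 0.0

# Ingredient 2: the reduction only sees windows that contain observations; the
# result is then reindexed onto the full 3H time vector, and reindex fills the
# windows with no observations at all with NaN by default.
reduced = pd.Series([1.0, 2.0], index=pd.to_datetime(["2000-01-01 00:00", "2000-01-01 06:00"]))
full_time = pd.date_range("2000-01-01", freq="3H", periods=3)
reduced.reindex(full_time)  # -> [1.0, NaN, 2.0]; the empty 03:00 window becomes NaN
```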
Great, definitely agree. Possibly we could change that.
Pandas even does the arguably more logical thing!
```python
ds.to_pandas().resample("3H").sum()
Out[5]:
                     bar
time
2000-01-01 00:00:00  1.0
2000-01-01 03:00:00  0.0
2000-01-01 06:00:00  2.0
2000-01-01 09:00:00  0.0
2000-01-01 12:00:00  3.0
2000-01-01 15:00:00  0.0
2000-01-01 18:00:00  0.0
2000-01-01 21:00:00  0.0
2000-01-02 00:00:00  0.0
2000-01-02 03:00:00  0.0
2000-01-02 06:00:00  0.0
2000-01-02 09:00:00  0.0
2000-01-02 12:00:00  4.0
2000-01-02 15:00:00  0.0
2000-01-02 18:00:00  5.0
2000-01-02 21:00:00  0.0
2000-01-03 00:00:00  0.0
2000-01-03 03:00:00  0.0
2000-01-03 06:00:00  0.0
```
Sweet, looks like we are actually using numbagg by default now. I don't understand the first row for `mean`, but the rest are all functions I expect to send to numbagg:
| Before [273d319e] | After [54d57d73] | Ratio | Benchmark (Parameter) |
|-------------------|------------------|-------|------------------------------------------------------------------|
| 297±0.7ms | 149±0.4ms | 0.5 | reduce.ChunkReduce2DAllAxes.time_reduce('mean', 'bins', None) |
| 65.0±0.3ms | 28.1±0.3ms | 0.43 | reduce.ChunkReduce2D.time_reduce('count', 'None', None) |
| 144±0.5ms | 45.0±0.4ms | 0.31 | reduce.ChunkReduce2DAllAxes.time_reduce('nanmax', 'None', None) |
| 137±0.4ms | 41.6±0.4ms | 0.3 | reduce.ChunkReduce2DAllAxes.time_reduce('count', 'None', None) |
| 117±0.1ms | 34.0±0.1ms | 0.29 | reduce.ChunkReduce2D.time_reduce('count', 'bins', None) |
| 144±0.6ms | 42.0±0.3ms | 0.29 | reduce.ChunkReduce2DAllAxes.time_reduce('nansum', 'None', None) |
| 139±0.3ms | 34.2±0.04ms | 0.25 | reduce.ChunkReduce2D.time_reduce('nansum', 'bins', None) |
| 75.2±0.2ms | 17.1±0.2ms | 0.23 | reduce.ChunkReduce2D.time_reduce('nansum', 'None', None) |
| 271±1ms | 58.2±0.1ms | 0.22 | reduce.ChunkReduce2DAllAxes.time_reduce('count', 'bins', None) |
| 205±0.3ms | 44.5±0.4ms | 0.22 | reduce.ChunkReduce2DAllAxes.time_reduce('nanmean', 'None', None) |
| 276±0.9ms | 59.2±0.4ms | 0.21 | reduce.ChunkReduce2DAllAxes.time_reduce('nansum', 'bins', None) |
| 102±2ms | 20.0±0.1ms | 0.2 | reduce.ChunkReduce2D.time_reduce('nanmax', 'None', None) |
| 337±0.8ms | 59.7±0.4ms | 0.18 | reduce.ChunkReduce2DAllAxes.time_reduce('nanmean', 'bins', None) |
| 208±0.3ms | 35.1±0.1ms | 0.17 | reduce.ChunkReduce2D.time_reduce('nanmean', 'bins', None) |
| 153±0.4ms | 23.8±0.1ms | 0.16 | reduce.ChunkReduce2D.time_reduce('nanmean', 'None', None) |
| 276±0.5ms | 45.1±0.4ms | 0.16 | reduce.ChunkReduce2DAllAxes.time_reduce('nanmax', 'bins', None) |
| 205±0.3ms | 19.0±0.1ms | 0.09 | reduce.ChunkReduce2D.time_reduce('nanmax', 'bins', None) |
Nice!!
And I recently enabled parallel by default; it scales really well for multiple dims now when multiple cores are available.
Closes #281
TODO:
- seen_groups in factorize_