Closed tomwhite closed 6 months ago
Thanks @tomwhite ! Did you have any thoughts on how to refactor this so that we can share code between the dask and cubed paths. Is it feasible to have just the combine
stage be different between the two by applying a little adapter function that transforms between dicts and structured arrays?
Thanks for reviewing @dcherian! The _finalize_results
change is a good improvement!
Did you have any thoughts on how to refactor this so that we can share code between the dask and cubed paths. Is it feasible to have just the
combine
stage be different between the two by applying a little adapter function that transforms between dicts and structured arrays?
This might be possible, but it would obviously be more work. The logic for map-reduce, blockwise, and cohorts is combined in dask_groupby_agg
, which complicates things. Cubed has a slightly different way to do map-reduce, blockwise may be the same (not sure yet), and Cubed may not need cohorts (https://github.com/xarray-contrib/flox/issues/224#issuecomment-1607701773).
I made a release of Cubed with the changes needed by this PR. I've also tried adding the Cubed tests to CI, so we'll see if that works.
I had a look at the grouping by multiple variables (2D) case, but it's not trivial so I'd rather do it as a follow up (#353).
I'll take a look soon, I promise.
I'm the mean time it would be good to add that notebook as documentation.
I'll take a look soon, I promise.
Thanks!
I'm the mean time it would be good to add that notebook as documentation.
Where would be a good place to add it do you think?
Under tricks and stories is fine, it's got a collection of notebooks.
Longer term we can update the "Duck Array Support" page.
Also I took a look and this looks good to me. I experimented with some refactoring but I agree that it's be good to see what's needed for blockwise/cohorts before we actually refactor things.
This is a first step to implementing #224.
I added a separate code path to the Dask one, in
cubed_groupby_agg
, since it is sufficiently different (for example, the combine step in Cubed manages memory in a different way).This PR relies on https://github.com/cubed-dev/cubed/pull/442, which adds the ability to specify the size of the grouping axis.
I added a unit test based on the Dask one, which passes a few cases - but there are plenty more to support (nans, fill values, sorting, etc).
I have included a Jupyter notebook in this PR, which shouldn't be merged, but shows the code working with a cut-down version of the example in https://flox.readthedocs.io/en/latest/user-stories/climatology-hourly.html. (I haven't tried running anything at scale yet.)
Interested to get your feedback @dcherian and @TomNicholas!