xarray-contrib / flox

Fast & furious GroupBy operations for dask.array
https://flox.readthedocs.io
Apache License 2.0
124 stars 18 forks source link

Initial minimal working Cubed example for "map-reduce" #352

Closed tomwhite closed 6 months ago

tomwhite commented 7 months ago

This is a first step to implementing #224.

I added a separate code path to the Dask one, in cubed_groupby_agg, since it is sufficiently different (for example, the combine step in Cubed manages memory in a different way).

This PR relies on https://github.com/cubed-dev/cubed/pull/442, which adds the ability to specify the size of the grouping axis.

I added a unit test based on the Dask one, which passes a few cases - but there are plenty more to support (nans, fill values, sorting, etc).

I have included a Jupyter notebook in this PR, which shouldn't be merged, but shows the code working with a cut-down version of the example in https://flox.readthedocs.io/en/latest/user-stories/climatology-hourly.html. (I haven't tried running anything at scale yet.)

Interested to get your feedback @dcherian and @TomNicholas!

dcherian commented 7 months ago

Thanks @tomwhite ! Did you have any thoughts on how to refactor this so that we can share code between the dask and cubed paths. Is it feasible to have just the combine stage be different between the two by applying a little adapter function that transforms between dicts and structured arrays?

tomwhite commented 7 months ago

Thanks for reviewing @dcherian! The _finalize_results change is a good improvement!

Did you have any thoughts on how to refactor this so that we can share code between the dask and cubed paths. Is it feasible to have just the combine stage be different between the two by applying a little adapter function that transforms between dicts and structured arrays?

This might be possible, but it would obviously be more work. The logic for map-reduce, blockwise, and cohorts is combined in dask_groupby_agg, which complicates things. Cubed has a slightly different way to do map-reduce, blockwise may be the same (not sure yet), and Cubed may not need cohorts (https://github.com/xarray-contrib/flox/issues/224#issuecomment-1607701773).

tomwhite commented 7 months ago

I made a release of Cubed with the changes needed by this PR. I've also tried adding the Cubed tests to CI, so we'll see if that works.

I had a look at the grouping by multiple variables (2D) case, but it's not trivial so I'd rather do it as a follow up (#353).

dcherian commented 7 months ago

I'll take a look soon, I promise.

I'm the mean time it would be good to add that notebook as documentation.

tomwhite commented 6 months ago

I'll take a look soon, I promise.

Thanks!

I'm the mean time it would be good to add that notebook as documentation.

Where would be a good place to add it do you think?

dcherian commented 6 months ago

Under tricks and stories is fine, it's got a collection of notebooks.

Longer term we can update the "Duck Array Support" page.

dcherian commented 6 months ago

Also I took a look and this looks good to me. I experimented with some refactoring but I agree that it's be good to see what's needed for blockwise/cohorts before we actually refactor things.