xarray-contrib / flox

Fast & furious GroupBy operations for dask.array
https://flox.readthedocs.io
Apache License 2.0

Implement `method='blockwise'` for Cubed #354

Closed tomwhite closed 5 months ago

dcherian commented 7 months ago

For context, the blocker here is some sort of internal support for variable chunking in Cubed? And presumably some rechunking to regular chunking will be needed at the end?

tomwhite commented 6 months ago

> For context, the blocker here is some sort of internal support for variable chunking in Cubed? And presumably some rechunking to regular chunking will be needed at the end?

Yes, that's what I had been thinking. However, I now think it should be possible to choose the rechunk boundaries when resampling so that each output chunk has the same number of groups. For the example shown in https://flox.readthedocs.io/en/latest/implementation.html#method-blockwise, the output would have two groups per chunk, rather than (2, 2, 3, 1) groups in each chunk. (It's OK if the last chunk has fewer groups.) Slightly more data is transferred this way, but it avoids a final rechunk, and therefore a copy of the whole dataset, so I think it's worth a try.
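To make the idea concrete, here is a minimal sketch of picking input chunk boundaries so that every chunk holds the same number of complete groups. It is not flox or Cubed code: the helper name `equal_group_chunks` and its `groups_per_chunk` parameter are hypothetical, and it assumes the group codes are non-empty, sorted integers along the grouped axis (as in a resampling), so that each group is contiguous.

```python
import numpy as np


def equal_group_chunks(codes, groups_per_chunk):
    """Return chunk sizes so each chunk spans `groups_per_chunk` whole groups.

    Hypothetical helper for illustration only. Assumes `codes` is a
    non-empty, sorted 1-D sequence of integer group labels, so every
    group occupies one contiguous run. The last chunk may hold fewer groups.
    """
    codes = np.asarray(codes)
    # Index at which each group starts (a change in label marks a new group).
    starts = np.flatnonzero(np.diff(codes, prepend=codes[0] - 1))
    # Keep every `groups_per_chunk`-th group start, then close the final chunk.
    boundaries = np.append(starts[::groups_per_chunk], codes.size)
    return tuple(int(n) for n in np.diff(boundaries))


# Seven groups of uneven length, two groups per chunk.
codes = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6]
chunks = equal_group_chunks(codes, groups_per_chunk=2)
```

Rechunking the input to these sizes along the grouped axis would let a blockwise reduction emit a fixed number of groups per block (fewer in the last one), which is what makes the output chunks regular without a rechunk at the end.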

dcherian commented 6 months ago

The other way to think of this, then, is that you want equal-sized cohorts (except for the last one).