xgcm / xhistogram

Fast, flexible, label-aware histograms for numpy and xarray
https://xhistogram.readthedocs.io
MIT License
89 stars 19 forks source link

[WIP] Multidimensional bins #59

Open TomNicholas opened 3 years ago

TomNicholas commented 3 years ago

Work-in-progress attempt to implement #28 . Builds atop #49.

However my test for checking the results of a 2D bins arrays doesn't actually pass and I haven't yet worked out why :/ It generates the right shape but the wrong counts.

(There are also a couple of other failing tests but I think they are just to do with input sanitization edge cases)

@aaronspring

TomNicholas commented 3 years ago

I've realised that this feature is kind of inconvenient to try and implement in xhistogram in its current form.

xhistogram's numpy API means that all ND bins have to be passed as pure numpy arrays through xhistogram.core.histogram, even when the bins are originally given as xr.DataArrays. To use those bins (in blockwise or plain _bincount) you have to be sure which axis is which, but that requires either passing already-broadcast bins arrays (breaking the API), providing extra information as to which axes are which on the bins (again breaking the API), or automatically broadcasting the numpy arrays inside xhistogram.core.histogram (possible but error-prone and frustrating when I would prefer to simply call xr.broadcast_like). The only reason to do even do this is to satisfy xhistogram's pure-numpy API (which I'm not sure anyone even uses).

Instead, IMO it would be much easier (/a better use of time) to implement this feature in xarray directly (either in https://github.com/pydata/xarray/pull/5400 or in a later PR). That's because in a pure xarray implementation there would be no need to have an internal function accepting only unlabelled numpy arrays: as you always know which axis on your given bins array is which you can grab the indices from the dim labels as needed and pass them straight to dask.array.blockwise.

I like to think that this PR will be useful as a template for the xarray implementation though.

@aaronspring FYI

TomNicholas commented 3 years ago

For future reference this SO discussion is about whether there are clever vectorized ways to do the bincounting when the weights are >1D.