xgcm / xhistogram

Fast, flexible, label-aware histograms for numpy and xarray
https://xhistogram.readthedocs.io
MIT License
89 stars 19 forks source link

Error with multiple dask `args` #48

Closed dougiesquire closed 3 years ago

dougiesquire commented 3 years ago

As of 0.1.3, xhistogram now fails when provided multiple dask args. This is because the dask version of ravel_multi_index that was introduced into xhistogram in 0.1.3 does not actually replicate the API of the equivalent numpy function. Specifically, the numpy implementation can receive a tuple of integer arrays, whereas the dask version cannot.

We actually don't have a test in our suite for multiple dask args, which is why this bug went uncaught until now.

I've opened an issue with dask, but we probably need to provide a more immediate fix for xhistogram. @rabernat, what do you think about reverting back to using your wrapper of np.ravel_multi_index until the issue is resolved with dask?

dougiesquire commented 3 years ago

Note, a fix is being implemented in dask here: https://github.com/dask/dask/pull/7584

rabernat commented 3 years ago

Now here https://github.com/dask/dask/pull/7594.

dougiesquire commented 3 years ago

Just to re-emphasise, xhistogram is currently broken for multiple dask args due to this issue. Depending on how long we anticipate dask/dask#7594 may take to be released, would it be worth reverting back to @rabernat's wrapper of np.ravel_multi_index and releasing a temporary fix?

rabernat commented 3 years ago

xhistogram is currently broken for multiple dask args due to this issue

Can't we fix this by pinning a compatible dask version here?

https://github.com/xgcm/xhistogram/blob/master/setup.py#L23

dougiesquire commented 3 years ago

I don't think there is a dask version with a working version of ravel_multi_index unfortunately. This was only introduced in the latest version of dask.

rabernat commented 3 years ago

I don't think there is a dask version with a working version of ravel_multi_index unfortunately.

I'm a bit confused. Has xhistogram always been broken with multiple dask args? In my recollection, things once worked absolutely fine. This code has been in for a long time:

https://github.com/xgcm/xhistogram/blob/master/xhistogram/duck_array_ops.py#L39

So if we rewind our dask version to before 2021.03, won't it go back to working agin?

dougiesquire commented 3 years ago

This code has been in for a long time:

https://github.com/xgcm/xhistogram/blob/master/xhistogram/duck_array_ops.py#L39

I'm not sure this is correct. I think this line was added in #31 just a few months ago. That PR switched from using a "dask-enabled" version of ravel_multi_index that you wrote, to using the official, but broken, dask version.

AFAICT we have two options:

  1. revert back to using the approach we were using pre #31 until dask/dask#7594 is merged/release (i.e. add back in this code)
  2. wait for dask/dask#7594 to be merged/release.
rabernat commented 3 years ago

Yes @dougiesquire you're correct (and sorry for not following up sooner).

Which option do you think is best?

jrbourbeau commented 3 years ago

Quick update: https://github.com/dask/dask/pull/7594 was just merged and should be out in a dask release in ~1.5 weeks (not this Friday, but next Friday)

jrbourbeau commented 3 years ago

Just checking in, https://github.com/dask/dask/pull/7594 is now out in a dask release. Is it safe to close this issue? Perhaps we should bump xhistograms minimum dask version

dougiesquire commented 3 years ago

Just checking in, dask/dask#7594 is now out in a dask release. Is it safe to close this issue? Perhaps we should bump xhistograms minimum dask version

Thanks for the reminder @jrbourbeau. xhistrogram doesn't actually use dask.array.ravel_multi_index since the refactor to use dask.array.blockwise (#49), so there hopefully shouldn't be a need to pin a minimum dask version.