xgcm / xhistogram

Fast, flexible, label-aware histograms for numpy and xarray
https://xhistogram.readthedocs.io
MIT License
90 stars 20 forks source link

Compliance with Dask version 2021.03 ? #27

Closed knutfrode closed 3 years ago

knutfrode commented 3 years ago

First of all thank you for making this useful tool. It has worked well for me until Dask version 2021.03, after which I am getting error as below when calling histogram. I apologize that this might be not be sufficient information to reproduce.

../../opendrift/models/basemodel.py:3707: in get_density_xarray
    h = histogram(self.ds.lon, self.ds.lat, bins=[lonbin, latbin],
../../../../miniconda3/envs/opendrift/lib/python3.9/site-packages/xhistogram/xarray.py:133: in histogram
    h_data = _histogram(*args_data, weights=weights_data, bins=bins, axis=axis,
../../../../miniconda3/envs/opendrift/lib/python3.9/site-packages/xhistogram/core.py:261: in histogram
    bin_counts = _histogram_2d_vectorized(*all_args_reshaped, bins=bins,
../../../../miniconda3/envs/opendrift/lib/python3.9/site-packages/xhistogram/core.py:133: in _histogram_2d_vectorized
    bin_counts = _dispatch_bincount(bin_indices, weights, N, hist_shapes,
../../../../miniconda3/envs/opendrift/lib/python3.9/site-packages/xhistogram/core.py:81: in _dispatch_bincount
    return _bincount_2d(bin_indices, weights, N, hist_shapes)
../../../../miniconda3/envs/opendrift/lib/python3.9/site-packages/xhistogram/core.py:32: in _bincount_2d
    return bc_offset.reshape(final_shape)
../../../../miniconda3/envs/opendrift/lib/python3.9/site-packages/dask/array/core.py:1919: in reshape
    return reshape(self, shape, merge_chunks=merge_chunks)
        if reduce(mul, shape, 1) != x.size:
>           raise ValueError("total size of new array must be unchanged")
E           ValueError: total size of new array must be unchanged

../../../../miniconda3/envs/opendrift/lib/python3.9/site-packages/dask/array/reshape.py:206: ValueError
dougiesquire commented 3 years ago

Noting that @aaronspring encountered what appears to be a similar problem using dask==2021.03.0 with xhistogram inside xskillscore (https://github.com/xarray-contrib/xskillscore/issues/280)

rabernat commented 3 years ago

Ok, thanks for reporting this. Need to get a fix in asap. Any ideas?

Ideally we should be testing against xarray and dask master branches.

raybellwaves commented 3 years ago

Took a quick look at one of the failing tests in xskillscore https://github.com/xarray-contrib/xskillscore/issues/280#issuecomment-798825624

Failing at dask/array/reshape.py:206. Worth getting at with a minimal example.

raybellwaves commented 3 years ago

failing here with dask: https://github.com/xgcm/xhistogram/blob/master/xhistogram/test/test_core.py#L177

The test: https://gist.github.com/raybellwaves/fe5c941eea0a885616c3c1cb3d8627b3 errors come in on the histogram call.

rabernat commented 3 years ago

The core issue is that dask's bincount function, which we import here

https://github.com/xgcm/xhistogram/blob/a3760d3fc1e0aca29f3a5180e7cbc1c5abc8f68f/xhistogram/duck_array_ops.py#L33

is now returning dask arrays with shape (). I'm looking into this. Is there a dask PR someone can point to where this behavior was changed?

raybellwaves commented 3 years ago

maybe https://github.com/dask/dask/pull/7391 will help?

rabernat commented 3 years ago

I just reviewed that PR and I'm 99% sure that it will fix the problem.

I consider this a regression in dask, so we will not be making a workaround in xhistogram. We have two choices here:

What do those affected prefer?

dougiesquire commented 3 years ago

I'm fine with doing nothing if others are. dask releases are more or less monthly and I think we've already pinned an earlier version of dask in the last xskillscore release.

jrbourbeau commented 3 years ago

https://github.com/dask/dask/pull/7391 was just merged into Dask's main branch and I confirmed that tests here pass with the changes in that PR

aaronspring commented 3 years ago

I find a followup issue with dask==2021.04: https://github.com/xarray-contrib/xskillscore/issues/288

Its a new error message:

skillscore/tests/test_probabilistic.py:831: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
xskillscore/core/probabilistic.py:1209: in roc
    dim=dim,
xskillscore/core/contingency.py:122: in __init__
    self._table = self._get_contingency_table(dim)
xskillscore/core/contingency.py:167: in _get_contingency_table
    bin_dim_suffix="_bin",
xskillscore/core/utils.py:166: in histogram
    return xhist(*args, bins=bins, **kwargs)
/usr/share/miniconda/envs/xskillscore-minimum-tests/lib/python3.7/site-packages/xhistogram/xarray.py:146: in histogram
    block_size=block_size
/usr/share/miniconda/envs/xskillscore-minimum-tests/lib/python3.7/site-packages/xhistogram/core.py:271: in histogram
    block_size=block_size,
/usr/share/miniconda/envs/xskillscore-minimum-tests/lib/python3.7/site-packages/xhistogram/core.py:131: in _histogram_2d_vectorized
    bin_indices = ravel_multi_index(each_bin_indices, hist_shapes)
/usr/share/miniconda/envs/xskillscore-minimum-tests/lib/python3.7/site-packages/xhistogram/duck_array_ops.py:24: in f
    return getattr(module, name)(*args, **kwargs)
<__array_function__ internals>:6: in ravel_multi_index
    ???
/usr/share/miniconda/envs/xskillscore-minimum-tests/lib/python3.7/site-packages/dask/array/core.py:1525: in __array_function__
    return da_func(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

multi_index = [dask.array<digitize, shape=(1, 200), dtype=int64, chunksize=(1, 200), chunktype=numpy.ndarray>, dask.array<digitize, shape=(1, 200), dtype=int64, chunksize=(1, 200), chunktype=numpy.ndarray>]
dims = [4, 4], mode = 'raise', order = 'C'

    @wraps(np.ravel_multi_index)
    def ravel_multi_index(multi_index, dims, mode="raise", order="C"):
>       return multi_index.map_blocks(
            _ravel_multi_index_kernel,
            dtype=np.intp,
            chunks=(multi_index.shape[-1],),
            drop_axis=0,
            func_kwargs=dict(dims=dims, mode=mode, order=order),
        )
E       AttributeError: 'list' object has no attribute 'map_blocks'

xhist master doesnt solve it: https://github.com/xarray-contrib/xskillscore/pull/289

raybellwaves commented 3 years ago

xgcm tests seem ok on upstream: https://github.com/xgcm/xhistogram/actions/workflows/upstream.yml Is there something in xskillscore that isn't tested in xgcm?

aaronspring commented 3 years ago

the xhist wrapper for datasets by @dougiesquire

dougiesquire commented 3 years ago

I think I've found the issue here. It's an issue in dask that was introduced into xhistogram in the latest version.

The dask version of ravel_multi_index (introduced into xhistogram in 0.1.3) does not actually replicate the API of the equivalent numpy function. Specifically, the numpy implementation can receive a tuple of integer arrays, whereas the dask version cannot (despite saying it can). This functionality is required in xhistogram when there are multiple args.

That is, the following works:

import numpy as np
import dask.array as dsa

arr = np.array([[3,6,6],[4,5,1]]) # Example from numpy docs
da = dsa.from_array(arr)

dsa.ravel_multi_index(da, (7,7))

but the following fails with the error 'tuple' object has no attribute 'map_blocks':

dsa.ravel_multi_index((da, da), (7,7))

This is a bug in dask and I'll open an issue about it.

xhistogram doesn't currently have tests with multiple dask args, which is why this didn't get flagged earlier (we should add them!), but xskillscore does try to use xhistogram in this way (and thus fails).

@raybellwaves, @aaronspring one solution for now in xskillscore is to pin xhistogram==0.1.2.

dougiesquire commented 3 years ago

Issue opened with dask here: https://github.com/dask/dask/issues/7580

dougiesquire commented 3 years ago

Issue opened with dask here: dask/dask#7580

A fix for this has been implemented in the latest release of dask (2021.06.1). In the meantime, however, the xhistogram refactor to use dask.array.blockwise (#49) has removed our dependence on the dask.array functions referenced in this issue, so we actually no longer need this fix.

Closing the issue.