xarray-contrib / xskillscore

Metrics for verifying forecasts
https://xskillscore.readthedocs.io/en/stable/
Apache License 2.0

check chunk size on dask arrays #99

Open raybellwaves opened 4 years ago

raybellwaves commented 4 years ago

xr.apply_ufunc expects each core dimension to be a single dask array chunk.

import numpy as np
import pandas as pd
import dask.array as da
import xarray as xr
import xskillscore as xs

stores = np.arange(400)
skus = np.arange(300)
dates = pd.date_range("1/1/2020", "12/31/2021", freq="D")

data = da.random.randint(10, size=(len(dates), len(stores), len(skus)))
y = xr.DataArray(data, coords=[dates, stores, skus], dims=["DATE", "STORE", "SKU"])
yhat = y.copy()

xs.rmse(y, yhat, ['DATE', 'STORE', 'SKU']).compute().values
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ray/local/bin/anaconda3/envs/xss/lib/python3.8/site-packages/xskillscore/core/deterministic.py", line 677, in rmse
    return xr.apply_ufunc(
  File "/home/ray/local/bin/anaconda3/envs/xss/lib/python3.8/site-packages/xarray/core/computation.py", line 1058, in apply_ufunc
    return apply_dataarray_vfunc(
  File "/home/ray/local/bin/anaconda3/envs/xss/lib/python3.8/site-packages/xarray/core/computation.py", line 233, in apply_dataarray_vfunc
    result_var = func(*data_vars)
  File "/home/ray/local/bin/anaconda3/envs/xss/lib/python3.8/site-packages/xarray/core/computation.py", line 604, in apply_variable_ufunc
    result_data = func(*input_data)
  File "/home/ray/local/bin/anaconda3/envs/xss/lib/python3.8/site-packages/xarray/core/computation.py", line 586, in func
    return _apply_blockwise(
  File "/home/ray/local/bin/anaconda3/envs/xss/lib/python3.8/site-packages/xarray/core/computation.py", line 705, in _apply_blockwise
    raise ValueError(
ValueError: dimension 'DATE' on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, rechunk into a single dask array chunk along this dimension, i.e., ``.chunk({'DATE': -1})``, but beware that this may significantly increase memory usage.

The workaround is to rechunk every core dimension into a single chunk:

y = y.chunk({'DATE': -1, 'STORE': -1, 'SKU': -1})
yhat = yhat.chunk({'DATE': -1, 'STORE': -1, 'SKU': -1})

xs.rmse(y, yhat, ['DATE', 'STORE', 'SKU']).compute().values

This could be handled in xskillscore.
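As a rough sketch of what handling this inside the library might look like, the helper below merges the chunks along the requested dimensions before a metric is computed. The name `rechunk_core_dims` is hypothetical and is not part of xskillscore; the example only assumes that `DataArray.chunk` accepts `-1` for "one chunk along this dimension", as the error message above suggests.

```python
import dask.array as da
import numpy as np
import xarray as xr


def rechunk_core_dims(obj, dims):
    """Hypothetical helper: merge all chunks along each dimension in `dims`."""
    if obj.chunks is None:
        # Not dask-backed, so there is nothing to rechunk.
        return obj
    return obj.chunk({d: -1 for d in dims})


# Small stand-in for the arrays in the report, split into several chunks.
values = np.arange(8 * 6 * 4).reshape(8, 6, 4)
y = xr.DataArray(
    da.from_array(values, chunks=(4, 3, 2)),
    dims=["DATE", "STORE", "SKU"],
)

y_single = rechunk_core_dims(y, ["DATE", "STORE", "SKU"])
print(y.chunks)         # several chunks per dimension
print(y_single.chunks)  # one chunk per dimension
```

Note that merging chunks this way pulls each core dimension fully into memory for every block, which is the memory cost the error message warns about.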

aaronspring commented 4 years ago

Or the user does it as you show here. Does map_blocks make sense here? Probably not, because map_blocks would do this calculation at the block level, but all blocks are needed for it. Therefore the ValueError makes sense. I think we shouldn't cover this.
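To illustrate why a purely block-level calculation gives the wrong answer for RMSE, here is a small NumPy sketch with made-up data split into two "blocks". Averaging the per-block RMSE values does not reproduce the global RMSE; combining per-block sums of squared errors and counts does, but that is a reduction across blocks rather than an independent per-block result.

```python
import numpy as np

# Toy data: the first block has zero error, the second has an error of 2.
y = np.zeros(12)
yhat = np.concatenate([np.zeros(6), np.full(6, 2.0)])
blocks = [slice(0, 6), slice(6, 12)]

# RMSE over the whole array.
global_rmse = np.sqrt(np.mean((y - yhat) ** 2))

# Naive block-level approach: RMSE per block, then averaged.
per_block = [np.sqrt(np.mean((y[b] - yhat[b]) ** 2)) for b in blocks]
naive = np.mean(per_block)

# Combining partial sums of squared errors and counts across blocks.
sq_sums = [np.sum((y[b] - yhat[b]) ** 2) for b in blocks]
counts = [y[b].size for b in blocks]
combined = np.sqrt(sum(sq_sums) / sum(counts))

print(global_rmse, naive, combined)
```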

aaronspring commented 4 years ago

http://xarray.pydata.org/en/latest/generated/xarray.apply_ufunc.html now allows rechunking. Maybe we add this as a global config option so the user can decide.
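A minimal sketch of the rechunk option, assuming an xarray version whose `apply_ufunc` accepts `dask_gufunc_kwargs` and forwards `allow_rechunk` to dask. The functions `_rmse` and `rmse_over` are hypothetical stand-ins written for this example, not xskillscore's implementation.

```python
import dask.array as da
import numpy as np
import xarray as xr


def _rmse(a, b, axis):
    # Plain NumPy RMSE along one axis.
    return np.sqrt(np.mean((a - b) ** 2, axis=axis))


def rmse_over(a, b, dim, allow_rechunk=False):
    """Hypothetical wrapper: RMSE over `dim`, optionally letting dask rechunk."""
    extra = {}
    if allow_rechunk:
        # Only pass the option when requested, so the default path keeps
        # xarray's own check on multi-chunk core dimensions.
        extra["dask_gufunc_kwargs"] = {"allow_rechunk": True}
    return xr.apply_ufunc(
        _rmse, a, b,
        input_core_dims=[[dim], [dim]],
        kwargs={"axis": -1},  # core dimensions are moved to the last axis
        dask="parallelized",
        output_dtypes=[float],
        **extra,
    )


# DATE is split into two chunks, as in the failing example.
y = xr.DataArray(
    da.from_array(np.arange(24.0).reshape(8, 3), chunks=(4, 3)),
    dims=["DATE", "STORE"],
)
yhat = y + 1.0

try:
    rmse_over(y, yhat, "DATE")
    raised = False
except ValueError:
    raised = True

result = rmse_over(y, yhat, "DATE", allow_rechunk=True).compute()
print(raised, result.values)
```

A global config switch could then simply decide which value of `allow_rechunk` is used by default, with the same memory caveat as manual rechunking.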