xgcm / xhistogram

Fast, flexible, label-aware histograms for numpy and xarray
https://xhistogram.readthedocs.io
MIT License

divide by zero error #16

Open rabernat opened 4 years ago

rabernat commented 4 years ago
import numpy as np
from xhistogram.core import histogram

data = np.zeros(36048471)
bins = np.array([1, 2, 3])
histogram(data, bins=[bins])

gives

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-84-530a0f6cf98c> in <module>
      3 data = np.zeros(36048471)
      4 bins = np.array([1, 2, 3])
----> 5 histogram(data, bins=[bins])

/srv/conda/envs/notebook/lib/python3.7/site-packages/xhistogram/core.py in histogram(bins, axis, weights, density, block_size, *args)
    245     h = _histogram_2d_vectorized(*all_args_reshaped, bins=bins,
    246                                  weights=weights_reshaped,
--> 247                                  density=density, block_size=block_size)
    248 
    249     if h.shape[0] == 1:

/srv/conda/envs/notebook/lib/python3.7/site-packages/xhistogram/core.py in _histogram_2d_vectorized(bins, weights, density, right, block_size, *args)
    124 
    125     bin_counts = _dispatch_bincount(bin_indices, weights, N, hist_shapes,
--> 126                                     block_size=block_size)
    127 
    128     # just throw out everything outside of the bins, as np.histogram does

/srv/conda/envs/notebook/lib/python3.7/site-packages/xhistogram/core.py in _dispatch_bincount(bin_indices, weights, N, hist_shapes, block_size)
     76     # block_chunks is like a dask chunk, a tuple that divides up the first
     77     # axis of bin_indices
---> 78     block_chunks = _determine_block_chunks(bin_indices, block_size)
     79     if len(block_chunks) == 1:
     80         # single global chunk, don't need a loop over chunks

/srv/conda/envs/notebook/lib/python3.7/site-packages/xhistogram/core.py in _determine_block_chunks(bin_indices, block_size)
     64             block_size = min(_MAX_CHUNK_SIZE // N, M)
     65     assert isinstance(block_size, int)
---> 66     num_chunks = M // block_size
     67     block_chunks = num_chunks * (block_size,)
     68     residual = M % block_size

ZeroDivisionError: integer division or modulo by zero

Smaller arrays work fine.
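For comparison, the same call completes without error on a smaller array (a quick check; the 1-million-element size here is an arbitrary choice, not a known threshold):

import numpy as np
from xhistogram.core import histogram

# same call as above, but with a much smaller array; this one runs fine
small = np.zeros(1_000_000)
bins = np.array([1, 2, 3])
histogram(small, bins=[bins])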

gmacgilchrist commented 4 years ago

Hi @rabernat, I'm starting to get this error coming up often.

I'm working with model output with dimension lengths time: 31, zl: 75, yh: 576, xh: 720, selecting subsets in the time dimension before applying histogram (to gauge the speed of the calculation before running it over all times) and binning in xh, yh, and zl. If I specify block_size=len(time) it tends to work (sketched below), but hard-coding block_size seems to come with its own potential for errors and performance costs.
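Roughly what that looks like (a sketch with hypothetical names; ds stands for the model output opened as an xarray Dataset, thetao for the variable being binned, and the bin edges are arbitrary):

import numpy as np
from xhistogram.xarray import histogram

# hypothetical: ds has dims (time: 31, zl: 75, yh: 576, xh: 720)
sub = ds.isel(time=slice(0, 5))     # subset in time first
bins = np.linspace(-2, 32, 69)      # hypothetical bin edges

# binning over xh, yh and zl; hard-coding block_size avoids the error
hist = histogram(sub["thetao"], bins=[bins], dim=["xh", "yh", "zl"],
                 block_size=len(sub["time"]))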

Have you had any luck getting to the bottom of this?

CptOrange16 commented 3 years ago

Hello @gmacgilchrist and @rabernat, I am also seeing this error when working with xarray DataArrays with the dimensions <xarray.DataArray 'LST' (time: 1, x: 3201, y: 3201)>. As @gmacgilchrist mentioned, if I set block_size=data.time.size, it does not give any error. I had previously used larger arrays, with dimensions like <xarray.DataArray 'LST' (time: 980, x: 3201, y: 3201)>, and everything worked with no errors.

rabernat commented 3 years ago

I think I found that the problem went away if I chunked the arrays with dask. So that is one temporary workaround.

Are you using dask arrays, or are you calling this on a "loaded" numpy array?

CptOrange16 commented 3 years ago

@rabernat Not a dask array explicitly, but shouldn't DataArrays from xarray behave more like dask arrays? Basically I load a dataset from a netCDF file into a DataArray and then perform some simple operations like a mean, standard deviation, and such.
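For reference, this is roughly how I load the data (hypothetical file name), together with the chunked variant that would make it dask-backed as you suggest:

import numpy as np
import xarray as xr
from xhistogram.xarray import histogram

# current approach: the data ends up as plain numpy once loaded
ds = xr.open_dataset("lst.nc")                       # hypothetical file name

# workaround: open with chunks so 'LST' is backed by a dask array
ds_chunked = xr.open_dataset("lst.nc", chunks={"time": 1})

bins = np.arange(250.0, 340.0, 1.0)                  # hypothetical LST bins (K)
hist = histogram(ds_chunked["LST"], bins=[bins], dim=["x", "y"])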

jbusecke commented 3 years ago

I just ran into this, and somehow the error is triggered by the number of dimensions given in dim (which probably influences block_size somehow, I guess?). I boiled it down to this example:

import xarray as xr
import numpy as np
import dask.array as dsa
from xhistogram.xarray import histogram
da = xr.DataArray(dsa.random.random((12, 35, 576, 720), chunks=(3, 35, 576, 720)), dims=['time','z', 'y','x'])
da

(screenshot of the DataArray repr)

bins = np.arange(-0.5, 0.5, 0.1)
count = histogram(da, bins=bins, dim=['x','y', 'z'])
count

(screenshot of the lazy histogram result)

Loading the DataArray into memory triggers the error:

count.load()
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
/tmp/ipykernel_17059/2516122562.py in <module>
----> 1 count.load()

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/xarray/core/dataarray.py in load(self, **kwargs)
    927         dask.compute
    928         """
--> 929         ds = self._to_temp_dataset().load(**kwargs)
    930         new = self._from_temp_dataset(ds)
    931         self._variable = new._variable

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/xarray/core/dataset.py in load(self, **kwargs)
    863 
    864             # evaluate all the dask arrays simultaneously
--> 865             evaluated_data = da.compute(*lazy_data.values(), **kwargs)
    866 
    867             for k, data in zip(lazy_data, evaluated_data):

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/base.py in compute(*args, **kwargs)
    566         postcomputes.append(x.__dask_postcompute__())
    567 
--> 568     results = schedule(dsk, keys, **kwargs)
    569     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    570 

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, pool, **kwargs)
     77             pool = MultiprocessingPoolExecutor(pool)
     78 
---> 79     results = get_async(
     80         pool.submit,
     81         pool._max_workers,

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/local.py in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
    512                             _execute_task(task, data)  # Re-execute locally
    513                         else:
--> 514                             raise_exception(exc, tb)
    515                     res, worker_id = loads(res_info)
    516                     state["cache"][key] = res

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/local.py in reraise(exc, tb)
    323     if exc.__traceback__ is not tb:
    324         raise exc.with_traceback(tb)
--> 325     raise exc
    326 
    327 

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    221     try:
    222         task, data = loads(task_info)
--> 223         result = _execute_task(task, data)
    224         id = get_id()
    225         result = dumps((result, id))

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/core.py in <genexpr>(.0)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/optimization.py in __call__(self, *args)
    967         if not len(args) == len(self.inkeys):
    968             raise ValueError("Expected %d args, got %d" % (len(self.inkeys), len(args)))
--> 969         return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
    970 
    971     def __reduce__(self):

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/core.py in get(dsk, out, cache)
    149     for key in toposort(dsk):
    150         task = dsk[key]
--> 151         result = _execute_task(task, cache)
    152         cache[key] = result
    153     result = _execute_task(out, cache)

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/core.py in <genexpr>(.0)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    113     """
    114     if isinstance(arg, list):
--> 115         return [_execute_task(a, cache) for a in arg]
    116     elif istask(arg):
    117         func, args = arg[0], arg[1:]

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/core.py in <listcomp>(.0)
    113     """
    114     if isinstance(arg, list):
--> 115         return [_execute_task(a, cache) for a in arg]
    116     elif istask(arg):
    117         func, args = arg[0], arg[1:]

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/dask/utils.py in apply(func, args, kwargs)
     33 def apply(func, args, kwargs=None):
     34     if kwargs:
---> 35         return func(*args, **kwargs)
     36     else:
     37         return func(*args)

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/xhistogram/core.py in _bincount(weights, axis, bins, density, block_size, *all_arrays)
    234         weights_array = None
    235 
--> 236     bin_counts = _bincount_2d_vectorized(
    237         *all_arrays_reshaped,
    238         bins=bins,

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/xhistogram/core.py in _bincount_2d_vectorized(bins, weights, density, right, block_size, *args)
    183     N = reduce(lambda x, y: x * y, hist_shapes)
    184 
--> 185     bin_counts = _dispatch_bincount(
    186         bin_indices, weights, N, hist_shapes, block_size=block_size
    187     )

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/xhistogram/core.py in _dispatch_bincount(bin_indices, weights, N, hist_shapes, block_size)
    127     # block_chunks is like a dask chunk, a tuple that divides up the first
    128     # axis of bin_indices
--> 129     block_chunks = _determine_block_chunks(bin_indices, block_size)
    130     if len(block_chunks) == 1:
    131         # single global chunk, don't need a loop over chunks

/scratch/gpfs2/jbusecke/conda_tigressdata/envs/cmip6_omz/lib/python3.9/site-packages/xhistogram/core.py in _determine_block_chunks(bin_indices, block_size)
    115             block_size = min(_MAX_CHUNK_SIZE // N, M)
    116     assert isinstance(block_size, int)
--> 117     num_chunks = M // block_size
    118     block_chunks = num_chunks * (block_size,)
    119     residual = M % block_size

ZeroDivisionError: integer division or modulo by zero

Now if I just count over two instead of three dimensions, it works!

# Now without counting over z
histogram(da, bins=bins, dim=['x','y']).load()

(screenshot of the computed histogram)

As a user, that completely confuses me. Note that this did NOT go away when I chunked the array, as some earlier posts suggested. Do you think this is the same problem?

Happy to debug further, but ultimately I just thought this might be another example that could be useful as a test case for future refactors (https://github.com/xarray-contrib/xskillscore/issues/334#issuecomment-869626314).

gmacgilchrist commented 3 years ago

This is indeed the same issue I was having, which I found to be sensitive to which dimensions I was performing the histogram over and, critically, how many I left behind. If I remember correctly, the divide by zero error occurred whenever I attempted to histogram over all dimensions except one, and it would go away if I counted over more or fewer dimensions, which is consistent with what you're seeing. I didn't yet test the impact of chunking.

dougiesquire commented 3 years ago

This is really helpful info @jbusecke and @gmacgilchrist! I think this issue should be quite easy to fix once we decide what we want to do with the block_size argument.

@TomNicholas did some testing of the effect of changing this argument here: https://github.com/xgcm/xhistogram/issues/63, and there have been suggestions to just hard-code it as in numpy. I'd be happy to dive into this further, or @TomNicholas, is this something you've already resolved as part of the move of xhistogram into xarray?

iuryt commented 2 years ago

I had the same problem with a large xarray.Dataset, and it worked after chunking the data.

iuryt commented 2 years ago

Actually, it worked for me even with only one chunk.

zxdawn commented 2 years ago

PS: histogram(data1, data2, bins=[bin1, bin2], dim=['longitude', 'latitude']).sum(dim='time') works for me.

jbusecke commented 1 year ago

Just ran into this again.

This is really helpful info @jbusecke and @gmacgilchrist! I think this issue should be quite easy to fix once we decide what we want to do with the block_size argument.

@TomNicholas did some testing of the effect of changing this argument here: #63, and there have been suggestions to just hard-code it as in numpy. I'd be happy to dive into this further, or @TomNicholas, is this something you've already resolved as part of the move of xhistogram into xarray?

Was there any progress or a suggested fix here? Just ran into this again with a large dataset.

Here is the shape/chunking:

(screenshot of the array's shape and chunking)

And it again fails when I want to compute the histogram over all dimensions. I can easily get around this by taking the histogram along only the unchunked dims and then summing over time (sketched below), but a fix for this would still be very desirable IMO.
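Concretely, the workaround looks something like this (a sketch with hypothetical dim names; da stands for the array in the screenshot, chunked only along time):

import numpy as np
from xhistogram.xarray import histogram

bins = np.arange(-0.5, 0.5, 0.1)

# instead of histogramming over every dim at once (which triggers the error),
# histogram over the unchunked dims only and then sum over the chunked time dim
hist_all = histogram(da, bins=[bins], dim=["x", "y"]).sum("time")
hist_all.load()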

AxelHenningsson commented 1 year ago

+1 to fixing this.

iuryt commented 1 year ago

As @rabernat mentioned in the original post, it depends on the dataset size.

S = slice(None,None)
bins = [np.arange(-10,10,0.1), np.arange(0,2000,2)]
Ha = histogram(anomaly.isel(casts=S), anomaly.isel(casts=S).z, bins=bins)

anomaly has 3567 casts. The code above gives the divide by zero error for S=slice(None,None) and S=slice(None,3000). It doesn't error for S=slice(None,2000) or S=slice(2000,3000), so this doesn't seem to be an error caused by particular casts.

If I chunk and load, it doesn't error.

bins = [np.arange(-10,10,0.1), np.arange(0,2000,2)]
Ha = histogram(anomaly.chunk(casts=1), anomaly.chunk(casts=1).z, bins=bins).load()

I just noticed that I am not adding anything new to the problem, but will keep this post here anyway. haha

rabernat commented 1 year ago

Hi everyone! We would absolutely welcome a PR to fix this long-standing bug.

iuryt commented 1 year ago

Is anyone working on that right now or have an idea on how to fix that? I can try to help, but don't know where to start.

TomNicholas commented 1 year ago

Is anyone working on that right now or have an idea on how to fix that? I can try to help, but don't know where to start.

@iuryt the problem is in the _determine_block_chunks function. For some reason block_size is being set to 0, causing a divide by zero error in the line num_chunks = M // block_size. It doesn't make sense to have a block size of 0, so one way forward would be to add a print statement or raise an error there, then try different inputs and work out what exactly causes it to be set to 0.

Alternatively, one could just remove the block_size optimisation from _dispatch_bincount entirely, so the buggy code is never called. @rabernat, do you have any recollection of how important that bit of code actually is for performance?


In the meantime one can just turn off the block-level optimization by passing block_size=None. This skips the buggy logic entirely, and should work for any array, but comes with some (unknown) performance hit.
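For example, with the array from @jbusecke's example above, something like this should compute (a sketch; block_size=None skips the "auto" block-size heuristic so a single global chunk is used per block):

import numpy as np
import dask.array as dsa
import xarray as xr
from xhistogram.xarray import histogram

da = xr.DataArray(
    dsa.random.random((12, 35, 576, 720), chunks=(3, 35, 576, 720)),
    dims=["time", "z", "y", "x"],
)
bins = np.arange(-0.5, 0.5, 0.1)

# block_size=None bypasses the buggy heuristic in _determine_block_chunks
count = histogram(da, bins=[bins], dim=["x", "y", "z"], block_size=None)
count.load()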

lluritu commented 9 months ago

Hi, I came across this same problem and have figured out the origin, at least in my case. When block_size="auto" in _determine_block_chunks(), if N is larger than _MAX_CHUNK_SIZE, the integer division on line 115 yields zero, leading to the divide by zero later on line 117. Setting block_size=None or rechunking the data bypasses the issue. I guess the "heuristic without much basis" needs to be changed to something better.
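To illustrate, here is the arithmetic plus one possible fix, clamping the heuristic so it can never return zero (a sketch only, not a tested patch; the _MAX_CHUNK_SIZE value below is an assumption, not copied from the source):

_MAX_CHUNK_SIZE = 10_000_000  # assumed value of the module-level constant in xhistogram.core

def clamped_block_size(M, N):
    # current heuristic: min(_MAX_CHUNK_SIZE // N, M), which is 0 when N > _MAX_CHUNK_SIZE;
    # clamping to at least 1 keeps num_chunks = M // block_size well defined
    return max(1, min(_MAX_CHUNK_SIZE // N, M))

# shapes roughly matching the original report: M = 1 retained row, N ~ 3.6e7 flattened elements
print(clamped_block_size(1, 36_048_471))  # -> 1 instead of 0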

pedroaugustosmribeiro commented 7 months ago

I am also having problems with this hard-coded heuristic when using it with the xskillscore package.