I've started the refactoring to use `dask.array.blockwise`, beginning with what @rabernat did in #49.
So far all I've done is:
- [x] Rewrite to at least call `dask.array.blockwise`, even if it doesn't yet work
- [x] Write some tests for the new `_bincount_kernel` function (the one that will be called by `dask.array.blockwise`)
- [x] Reorganise the existing histogram tests slightly
I have not done:
- [ ] Write the actual underlying `_bincount_kernel` function needed to pass the new tests
- [ ] Finish writing the other new tests
- [ ] Checks and error handling
- [ ] Make sure all the keyword args work (density, weighted)
- [ ] Dask handling of bins
- [ ] Expand tests to check both numpy and dask paths
- [ ] Any chunked numpy kernel acceleration
- [ ] Any numba/cython/bottleneck kernel acceleration
- [ ] Remove all now-redundant code
I'm not totally sure I'm understanding the proposed algorithm correctly: in the numpy code path, `bincounts.sum(axis)` will only ever sum over length-1 axes, is that correct?
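To make that question concrete, here is a minimal numpy-only sketch of how I currently read it. It is not the code in this PR: `block_counts` is a hypothetical stand-in for the per-block kernel, and I'm assuming each block keeps the reduced axis as a length-1 axis so that the per-block results can be stacked and summed afterwards.

```python
import numpy as np


def block_counts(block, bins):
    """Hypothetical per-block kernel: histogram one block and keep the
    reduced axis as a leading length-1 axis."""
    counts, _ = np.histogram(block, bins=bins)
    return counts[np.newaxis, :]


bins = np.linspace(0, 1, 5)
rng = np.random.default_rng(0)
data = rng.random(1000)

# "numpy path": the whole array is a single block, so the axis that is
# summed afterwards has length 1 and the sum only drops that axis.
single = block_counts(data, bins)
total_numpy = single.sum(axis=0)

# "chunked path": one row of counts per block, so the same sum is now a
# genuine reduction across blocks.
chunks = np.split(data, 4)
stacked = np.concatenate([block_counts(c, bins) for c in chunks], axis=0)
total_chunked = stacked.sum(axis=0)
```

If that reading is right, the final sum is effectively a no-op on the numpy path and only does real work when there is more than one block along the reduced axis.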
I'm also wondering how much of the `_bincount_kernel` code can or should just be copied directly from `numpy/dask.histogramdd` (with attribution, of course). Can we just loop over the "unused_inds" of the array with the `histogramdd` algorithm to preserve the dimensions we don't wish to reduce over? Perhaps using a generalized ufunc? Or would that be inefficient?
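As a rough illustration of the trade-off I have in mind, the sketch below compares the two options for the simplest case: a 2D array where axis 0 is kept and axis 1 is reduced. Both function names are hypothetical and neither is the kernel in this PR. `hist_loop` loops over the kept axis and calls `np.histogram` each time; `hist_bincount` digitizes everything once, offsets the bin indices per row, and makes a single `np.bincount` call.

```python
import numpy as np


def hist_loop(a, bins):
    """Histogram over the last axis with one np.histogram call per kept row."""
    nbins = len(bins) - 1
    out = np.empty((a.shape[0], nbins), dtype=np.intp)
    for i in range(a.shape[0]):
        out[i], _ = np.histogram(a[i], bins=bins)
    return out


def hist_bincount(a, bins):
    """Same result from a single np.bincount call, using per-row offsets."""
    nbins = len(bins) - 1
    # np.digitize returns 0 below the range, 1..nbins inside it, and
    # nbins + 1 at or above the last edge.
    idx = np.digitize(a, bins)
    # np.histogram treats the last bin as closed on the right; match that.
    idx[a == bins[-1]] = nbins
    # Each row gets its own block of nbins + 2 slots (underflow, bins, overflow).
    nslots = nbins + 2
    offsets = np.arange(a.shape[0])[:, np.newaxis] * nslots
    flat = np.bincount((idx + offsets).ravel(), minlength=a.shape[0] * nslots)
    # Drop the underflow and overflow slots.
    return flat.reshape(a.shape[0], nslots)[:, 1:-1]


rng = np.random.default_rng(42)
a = rng.normal(size=(3, 500))
bins = np.linspace(-2, 2, 9)
a[0, 0] = bins[-1]  # exercise the closed right edge

loop_result = hist_loop(a, bins)
bincount_result = hist_bincount(a, bins)
```

The loop version is easier to reason about and could reuse the existing `histogramdd` logic, but it makes one Python-level call per kept index, which I'd expect to be slow when the kept dimensions are large. The offset version avoids the loop at the cost of a temporary index array the same size as the input.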
cc @dougiesquire @gjoseph92