TomNicholas opened 3 years ago
Thanks for doing this Tom!
My impression is that most of the heavy lifting is done by `np.digitize`, which is already an accelerated function, so there is not much room for speedup with numba.
This is great @TomNicholas, thanks. I'm curious about your `block_size` results. Larger block sizes speed things up, which I guess makes sense for small arrays. But I'm curious whether you see the same thing for larger arrays. I ask because there's a note in the `np.histogram` source code (https://github.com/numpy/numpy/blob/v1.20.0/numpy/lib/histograms.py#L824):
> We iterate over blocks here for two reasons: the first is that for large arrays, it is actually faster (for example for a 10^8 array it is 2x as fast) and it results in a memory footprint 3x lower in the limit of large arrays.
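To make the quoted note concrete, here is a minimal sketch of the blocked strategy (the function name `blocked_histogram` and the default `block_size` are illustrative, not from `np.histogram` itself): accumulate per-block counts rather than binning the whole array at once.

```python
import numpy as np

# Hypothetical sketch of blocked accumulation: histogram each block
# separately and sum the counts, so only one block's worth of
# intermediate bin indices is ever materialized at a time.
def blocked_histogram(a, bins, block_size=65536):
    counts = np.zeros(len(bins) - 1, dtype=np.intp)
    for start in range(0, a.size, block_size):
        block = a[start:start + block_size]
        counts += np.histogram(block, bins=bins)[0]
    return counts

a = np.random.default_rng(0).normal(size=1_000_000)
bins = np.linspace(-4.0, 4.0, 65)
# the blocked result matches a single full-array histogram
assert np.array_equal(blocked_histogram(a, bins),
                      np.histogram(a, bins=bins)[0])
```

The memory saving comes from never holding the full array's bin indices in memory at once, which is the "3x lower footprint" the comment refers to.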
As Ryan said, `digitize` is just a thin wrapper around `searchsorted`, which is already implemented in C.
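For reference, the equivalence is easy to check directly: for monotonically increasing bin edges, `np.digitize` gives the same indices as a right-sided `searchsorted` on the edges.

```python
import numpy as np

x = np.array([0.1, 0.5, 1.2, 3.7])
edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# digitize binary-searches the edges, exactly like searchsorted
assert np.array_equal(np.digitize(x, edges),
                      np.searchsorted(edges, x, side='right'))
```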
However, I think there is a significant optimization we could lift from `np.histogram`: when bins have uniform widths (the common case), you can skip `digitize` entirely and compute the bin a value belongs to in constant time with some basic rescaling arithmetic (as opposed to binary-searching the bins list): https://github.com/numpy/numpy/blob/main/numpy/lib/histograms.py#L814-L816. This might require relaxing our "When using dask arrays, bins must be provided as numpy array(s) of edges" error message to also allow `n_bins` + `range`.
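A minimal sketch of that uniform-bin fast path (the helper name `fast_bin_index` is made up for illustration; np.histogram's internal version also handles edge cases like NaNs and rounding near bin boundaries, which this skips):

```python
import numpy as np

# With equal-width bins, the bin index is a constant-time rescale:
# no binary search over the edges is needed.
def fast_bin_index(x, n_bins, value_range):
    lo, hi = value_range
    idx = ((x - lo) * (n_bins / (hi - lo))).astype(np.intp)
    # clip so values at the upper edge land in the last bin
    return np.clip(idx, 0, n_bins - 1)

x = np.random.default_rng(0).uniform(0.0, 10.0, size=1000)
edges = np.linspace(0.0, 10.0, 11)  # 10 uniform bins

# agrees with the binary-search route via digitize
assert np.array_equal(fast_bin_index(x, 10, (0.0, 10.0)),
                      np.digitize(x, edges) - 1)
```

This is why allowing `n_bins` + `range` (rather than only explicit edge arrays) matters: the fast path needs to know the bins are uniform.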
> there is not much room for speedup with numba
This is probably true. Instead, I think the main benefit we could get from Numba would be a more readable `_bincount` implementation. All the reshaping logic is a bit hard to follow; with Numba, I wonder if we could just call `np.histogram` in a for-loop and get similar performance? Also, Numba has its own histogram implementation, though it doesn't support weights. We could use that in some cases, use it as a reference for our own, or contribute weights support to Numba (maybe not that hard?).
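To illustrate the "histogram in a for-loop" idea, here is a pure-NumPy sketch (the function name `bincount_loop` is invented; this is not xhistogram's actual `_bincount`). Under Numba the outer loop could be jitted with `numba.njit`; it's shown without Numba here so it stays self-contained.

```python
import numpy as np

# Histogram along the last axis of an N-d array by looping over the
# leading axes, instead of the reshape/offset arithmetic _bincount uses.
def bincount_loop(values, bins):
    flat = values.reshape(-1, values.shape[-1])
    out = np.empty((flat.shape[0], len(bins) - 1), dtype=np.intp)
    for i in range(flat.shape[0]):
        out[i] = np.histogram(flat[i], bins=bins)[0]
    return out.reshape(values.shape[:-1] + (len(bins) - 1,))

data = np.random.default_rng(1).normal(size=(4, 3, 100))
bins = np.linspace(-3.0, 3.0, 13)
counts = bincount_loop(data, bins)
assert counts.shape == (4, 3, 12)
```

The loop form reads much closer to the mathematical intent: one independent histogram per (broadcast) position.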
I tried profiling the core `_bincount` function and then accelerating it with numba; a naive application only made it about 30% faster overall. Notebook here (and gist here in case the notebook cache hasn't updated yet). However, I've never used numba before and haven't done much profiling in Python, so maybe there is something else I should have tried.
I also profiled to see if the `block_size` argument has much effect. Also, I noticed that #49 introduced a regression where the `block_size` argument is no longer passed down to `_bincount`, so it doesn't actually do anything. Fixed in #62. @rabernat @gjoseph92 @jrbourbeau