TomNicholas opened 3 years ago
Thanks for doing this Tom!
My impression is that most of the heavy lifting is done by `np.digitize`, which is already an accelerated function, so there is not much room for speedup with numba.
This is great @TomNicholas, thanks. I'm curious about your `block_size` results. Larger block sizes speed things up, which I guess makes sense for small arrays. But I'm curious whether you see the same thing for larger arrays. I ask because there's a note in the `np.histogram` source code (https://github.com/numpy/numpy/blob/v1.20.0/numpy/lib/histograms.py#L824):
> We iterate over blocks here for two reasons: the first is that for large arrays, it is actually faster (for example for a 10^8 array it is 2x as fast) and it results in a memory footprint 3x lower in the limit of large arrays.
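To make the quoted note concrete, here is a minimal sketch of the blocked strategy (the function name `blocked_histogram` and the default `block_size` are illustrative, not from `np.histogram` itself): accumulate per-block counts rather than binning the whole array at once.

```python
import numpy as np

# Hypothetical sketch of blocked accumulation: histogram each block
# separately and sum the counts, so only one block's worth of
# intermediate bin indices is ever materialized at a time.
def blocked_histogram(a, bins, block_size=65536):
    counts = np.zeros(len(bins) - 1, dtype=np.intp)
    for start in range(0, a.size, block_size):
        block = a[start:start + block_size]
        counts += np.histogram(block, bins=bins)[0]
    return counts

a = np.random.default_rng(0).normal(size=1_000_000)
bins = np.linspace(-4.0, 4.0, 65)
# the blocked result matches a single full-array histogram
assert np.array_equal(blocked_histogram(a, bins),
                      np.histogram(a, bins=bins)[0])
```

The memory saving comes from never holding the full array's bin indices in memory at once, which is the "3x lower footprint" the comment refers to.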
As Ryan said, `digitize` is just a thin wrapper around `searchsorted`, which is already implemented in C.
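For reference, the equivalence is easy to check directly: for monotonically increasing bin edges, `np.digitize` gives the same indices as a right-sided `searchsorted` on the edges.

```python
import numpy as np

x = np.array([0.1, 0.5, 1.2, 3.7])
edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# digitize binary-searches the edges, exactly like searchsorted
assert np.array_equal(np.digitize(x, edges),
                      np.searchsorted(edges, x, side='right'))
```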
However, I think there is a significant optimization we could lift from `np.histogram`: when bins have uniform widths (the common case), you can skip `digitize` entirely and compute the bin a value belongs to in constant time with some basic rescaling arithmetic (as opposed to binary-searching the bins list): https://github.com/numpy/numpy/blob/main/numpy/lib/histograms.py#L814-L816. This might require relaxing our "When using dask arrays, bins must be provided as numpy array(s) of edges" error message to also allow `n_bins` + `range`.
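A minimal sketch of that uniform-bin fast path (the helper name `fast_bin_index` is made up for illustration; np.histogram's internal version also handles edge cases like NaNs and rounding near bin boundaries, which this skips):

```python
import numpy as np

# With equal-width bins, the bin index is a constant-time rescale:
# no binary search over the edges is needed.
def fast_bin_index(x, n_bins, value_range):
    lo, hi = value_range
    idx = ((x - lo) * (n_bins / (hi - lo))).astype(np.intp)
    # clip so values at the upper edge land in the last bin
    return np.clip(idx, 0, n_bins - 1)

x = np.random.default_rng(0).uniform(0.0, 10.0, size=1000)
edges = np.linspace(0.0, 10.0, 11)  # 10 uniform bins

# agrees with the binary-search route via digitize
assert np.array_equal(fast_bin_index(x, 10, (0.0, 10.0)),
                      np.digitize(x, edges) - 1)
```

This is why allowing `n_bins` + `range` (rather than only explicit edge arrays) matters: the fast path needs to know the bins are uniform.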
> there is not much room for speedup with numba
This is probably true. Instead, I think the main benefit we could get from Numba would be a more readable `_bincount` implementation. All the reshaping logic is a bit hard to follow; with Numba, I wonder if we could just call `np.histogram` in a for-loop and get similar performance? Also, Numba has its own histogram implementation, though it doesn't support weights. We could use that in some cases, use it as a reference for our own, or contribute weights support to Numba (maybe not that hard?).
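To illustrate the "histogram in a for-loop" idea, here is a pure-NumPy sketch (the function name `bincount_loop` is invented; this is not xhistogram's actual `_bincount`). Under Numba the outer loop could be jitted with `numba.njit`; it's shown without Numba here so it stays self-contained.

```python
import numpy as np

# Histogram along the last axis of an N-d array by looping over the
# leading axes, instead of the reshape/offset arithmetic _bincount uses.
def bincount_loop(values, bins):
    flat = values.reshape(-1, values.shape[-1])
    out = np.empty((flat.shape[0], len(bins) - 1), dtype=np.intp)
    for i in range(flat.shape[0]):
        out[i] = np.histogram(flat[i], bins=bins)[0]
    return out.reshape(values.shape[:-1] + (len(bins) - 1,))

data = np.random.default_rng(1).normal(size=(4, 3, 100))
bins = np.linspace(-3.0, 3.0, 13)
counts = bincount_loop(data, bins)
assert counts.shape == (4, 3, 12)
```

The loop form reads much closer to the mathematical intent: one independent histogram per (broadcast) position.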
I tried profiling the core `_bincount` function and then accelerating it with numba; a naive application only made it about 30% faster overall. Notebook here (and gist here in case the notebook cache hasn't updated yet). However, I've never used numba before and haven't done much profiling in Python, so maybe there is something else I should have tried.
I also profiled to see if the `block_size` argument has much effect. Also, I noticed that #49 introduced a regression where the `block_size` argument is no longer passed down to `_bincount`, so it doesn't actually do anything. Fixed in #62. @rabernat @gjoseph92 @jrbourbeau