The performance of means taken along a non-contiguous axis in xtensor appears to be slower than optimal. I have provided benchmarks below using a more optimized approach: it improves memory coalescing and cache hits by computing the mean in "groups" along the reduction axis rather than striding through memory. Would there be a way to implement this in xtensor to get the factor-of-2 speed-up?
See reference implementation here: https://github.com/spectre-ns/xtensor-benchmark/blob/bb2404641cfd632c459d4e91c3881ebd601b2a62/include/reduction.hpp#L14
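To illustrate the idea (this is a minimal sketch with plain `std::vector`, not the linked benchmark code or xtensor's internals): for a row-major matrix, a mean over axis 0 can walk each contiguous row once and accumulate into the output, instead of striding down each column.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mean along axis 0 of a row-major (rows x cols) matrix.
// Rather than striding down each column (one large jump per element),
// walk each contiguous row and accumulate into the output vector,
// so every load touches sequential memory and stays cache-friendly.
std::vector<double> mean_axis0_coalesced(const std::vector<double>& data,
                                         std::size_t rows, std::size_t cols) {
    std::vector<double> acc(cols, 0.0);
    for (std::size_t r = 0; r < rows; ++r) {
        const double* row = data.data() + r * cols;
        for (std::size_t c = 0; c < cols; ++c) {
            acc[c] += row[c];  // contiguous reads, sequential writes
        }
    }
    for (std::size_t c = 0; c < cols; ++c) {
        acc[c] /= static_cast<double>(rows);
    }
    return acc;
}
```

The inner loop over `c` is also trivially vectorizable, which is part of where the observed speed-up comes from.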