The performance of means taken along a non-contiguous axis in xtensor appears to be slower than optimal. I have provided benchmarks below using a more optimized approach: it improves memory coalescing and cache hits by computing the mean in "groups" along the reduction axis rather than striding through memory. Would there be a way to implement this in xtensor to get the factor-of-2 speed-up?
See reference implementation here: https://github.com/spectre-ns/xtensor-benchmark/blob/bb2404641cfd632c459d4e91c3881ebd601b2a62/include/reduction.hpp#L14
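To illustrate the idea (this is a minimal sketch with plain `std::vector`, not the linked benchmark code or xtensor's internals): for a row-major matrix, a mean over axis 0 can walk each contiguous row once and accumulate into the output, instead of striding down each column.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mean along axis 0 of a row-major (rows x cols) matrix.
// Rather than striding down each column (one large jump per element),
// walk each contiguous row and accumulate into the output vector,
// so every load touches sequential memory and stays cache-friendly.
std::vector<double> mean_axis0_coalesced(const std::vector<double>& data,
                                         std::size_t rows, std::size_t cols) {
    std::vector<double> acc(cols, 0.0);
    for (std::size_t r = 0; r < rows; ++r) {
        const double* row = data.data() + r * cols;
        for (std::size_t c = 0; c < cols; ++c) {
            acc[c] += row[c];  // contiguous reads, sequential writes
        }
    }
    for (std::size_t c = 0; c < cols; ++c) {
        acc[c] /= static_cast<double>(rows);
    }
    return acc;
}
```

The inner loop over `c` is also trivially vectorizable, which is part of where the observed speed-up comes from.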