xtensor-stack / xtensor

C++ tensors with broadcasting and lazy computing
BSD 3-Clause "New" or "Revised" License

performance issue: special convolve #1758

Open · serge-sans-paille opened this issue 4 years ago

serge-sans-paille commented 4 years ago

When benchmarking the following snippets, I get very surprising results:

#from: http://stackoverflow.com/questions/2196693/improving-numpy-performance
#pythran export specialconvolve(uint32 [][])
#setup: import numpy as np ; r = np.arange(100*10000, dtype=np.uint32).reshape(1000,1000)
#run: specialconvolve(r)

def specialconvolve(a):
    # sorry, you must pad the input yourself
    rowconvol = a[1:-1,:] + a[:-2,:] + a[2:,:]
    colconvol = rowconvol[:,1:-1] + rowconvol[:,:-2] + rowconvol[:,2:] - 9*a[1:-1,1:-1]
    return colconvol

vs.

#include "xtensor/xnoalias.hpp"
#include "xtensor/xtensor.hpp"
#include "xtensor/xarray.hpp"
#include "xtensor/xrandom.hpp"
#include "xtensor/xview.hpp"
#include "xtensor/xfixed.hpp"
#define FORCE_IMPORT_ARRAY

#include "xtensor-python/pyarray.hpp"
#include "xtensor-python/pytensor.hpp"

using namespace xt;

template<typename A>
auto specialconvolve(A const& a)
{
  // sum each interior row with the rows directly above and below it
  xtensor<uint32_t, 2> rowconvol = view(a, range(1, -1), all()) + view(a, range(xnone(), -2), all()) + view(a, range(2, xnone()), all());
  // then sum each interior column with its left/right neighbours and subtract 9x the centre cell
  xtensor<uint32_t, 2> colconvol = view(rowconvol, all(), range(1, -1)) + view(rowconvol, all(), range(xnone(), -2)) + view(rowconvol, all(), range(2, xnone())) - 9 * view(a, range(1, -1), range(1, -1));
  return colconvol;
}

pytensor<uint32_t, 2> py_specialconvolve(pytensor<uint32_t, 2> const& a)
{
  return specialconvolve(a);
}

PYBIND11_MODULE(xtensor_specialconvolve, m)
{
    import_numpy();
    m.def("specialconvolve", py_specialconvolve);
}

I get

# name            engine   best  avg   std
specialconvolve   xtensor  8176  8251  64
specialconvolve   python   4588  4792  102
serge-sans-paille commented 4 years ago

steps to reproduce:

g++ specialconvolve.cpp `python-config --cflags --ldflags --libs` -I./.npb/include -std=c++14 -O2 -shared -o xtensor_specialconvolve.so
python -m timeit -s 'import numpy as np ; r = np.arange(100*10000, dtype=np.uint32).reshape(1000,1000); from xtensor_specialconvolve import specialconvolve' 'specialconvolve(r)'
wolfv commented 4 years ago

OK, I spent a bit of time debugging this, and it's actually not a very deep issue.

The problem is that we currently don't default to a static row_major layout for the Python arrays, which makes things slow (because we don't do enough dynamic layout checking!).

I think if you change your code to ingest a pytensor<uint32_t, 2, xt::layout_type::row_major>, the speed should be roughly 1.5x faster than numpy (that's what I'm seeing locally with a pure C++ version of this benchmark: ~6.5 ms for numpy vs. ~4 ms for C++).

We should dispatch to the right assignment loops for dynamic layouts, too, though.
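
Concretely, here is a minimal sketch of that change, assuming pytensor accepts the layout as its third template parameter as described above. Only the binding signature changes; the specialconvolve template from the original snippet is reused unchanged, and the rm_tensor alias is just illustrative:

#define FORCE_IMPORT_ARRAY
#include "xtensor-python/pytensor.hpp"

using namespace xt;

// Fixing the layout at compile time lets assignment pick the contiguous
// row-major loop instead of a generic dynamic-layout fallback.
using rm_tensor = pytensor<uint32_t, 2, layout_type::row_major>;

rm_tensor py_specialconvolve(rm_tensor const& a)
{
  return specialconvolve(a); // the template from the original snippet
}

PYBIND11_MODULE(xtensor_specialconvolve, m)
{
    import_numpy();
    m.def("specialconvolve", py_specialconvolve);
}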