xtensor-stack / xtensor

C++ tensors with broadcasting and lazy computing
BSD 3-Clause "New" or "Revised" License

xtensor slower than numpy #2683

Open · quinor opened this issue 1 year ago

quinor commented 1 year ago

I'm trying to replicate this numpy function in xtensor:

import time
import numpy as np

# 2x upscale: each input pixel becomes a 2x2 block of 9/3/3/1-weighted
# averages of its diagonal neighbours (edge-padded at the borders)
def upscale(arr):
    start = time.time()
    w, h = arr.shape
    arr = np.pad(arr, 1, mode="edge") / 16
    lu, ru, ld, rd = arr[:-2, :-2], arr[:-2, 2:], arr[2:, :-2], arr[2:, 2:]

    lu, ru, ld, rd = lu*9+ru*3+ld*3+rd, lu*3+ru*9+ld+rd*3, lu*3+ru+ld*9+rd*3, lu+ru*3+ld*3+rd*9
    ret = np.stack([lu, ru, ld, rd]).reshape(2, 2, w, h).transpose(2, 0, 3, 1).reshape(w*2, h*2)
    print(f"upscale took {(time.time()-start)*1000} ms")
    return ret

My attempt:

template <class E>
inline xt::xarray<float> upscale(E&& e) noexcept
{
    ta::Timeit time("upscale");  // local scoped timing helper
    // with a pad width of 1, pad_mode::symmetric matches numpy's mode="edge";
    // the {0,0} pair leaves a possible third axis unpadded
    auto step1 = xt::pad(e, {{1,1}, {1,1}, {0,0}}, xt::pad_mode::symmetric);

    // lazy views over the four diagonally shifted windows of the padded array
    auto lu = xt::view(step1, xt::range(_, -2), xt::range(_, -2));
    auto ru = xt::view(step1, xt::range(2, _), xt::range(_, -2));
    auto ld = xt::view(step1, xt::range(_, -2), xt::range(2, _));
    auto rd = xt::view(step1, xt::range(2, _), xt::range(2, _));

    // the four output sub-pixels per input pixel; the 1/16 factor is applied below
    auto lu2 = lu*9 + ru*3 + ld*3 + rd;
    auto ru2 = lu*3 + ru*9 + ld + rd*3;
    auto ld2 = lu*3 + ru + ld*9 + rd*3;
    auto rd2 = lu + ru*3 + ld*3 + rd*9;

    // interleave the four results into a (2w, 2h) array, scale by 1/16,
    // and force evaluation into a concrete container
    auto step3 = xt::eval(xt::reshape_view(xt::stack(
        xtuple(
            xt::stack(xtuple(lu2, ru2), 1),
            xt::stack(xtuple(ld2, rd2), 1)
        ), 3), {e.shape(0)*2, e.shape(1)*2})*(1.f/16.f));

    return step3;
}
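
One variant worth trying (a sketch, not a measured fix, and it assumes a plain 2D input): evaluate the pad once and copy each window into a contiguous xt::xtensor<float, 2> before the arithmetic, so the weighted sums read linear memory instead of strided views:

// Sketch: materialize the padded array and the four shifted windows into
// contiguous tensors; whether this pays off depends on where the time goes.
xt::xtensor<float, 2> padded = xt::pad(e, {{1, 1}, {1, 1}}, xt::pad_mode::symmetric);
xt::xtensor<float, 2> lu = xt::view(padded, xt::range(_, -2), xt::range(_, -2));
xt::xtensor<float, 2> ru = xt::view(padded, xt::range(2, _), xt::range(_, -2));
xt::xtensor<float, 2> ld = xt::view(padded, xt::range(_, -2), xt::range(2, _));
xt::xtensor<float, 2> rd = xt::view(padded, xt::range(2, _), xt::range(2, _));

With contiguous inputs, the lu2/ru2/ld2/rd2 expressions become plain elementwise kernels that xsimd can vectorize; the remaining cost is then the stack/reshape interleave.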

The Python version takes around 35 ms to evaluate, while the C++ version runs in around 300 ms (both for a 2048×2048 input). This is already after I tried quite hard to optimize the code.

Why is the C++ version slower, and how do I bring it up to speed? The relevant C++ compilation command:

/usr/bin/c++
 -DXTENSOR_USE_XSIMD
 -I/home/quinor/kody/tsparter/tsparter/.
 -I/home/quinor/kody/tsparter/tsparter/include
 -I/home/quinor/kody/tsparter/build/_deps/stb-src
 -I/home/quinor/kody/tsparter/build/_deps/xtensor-src/include
 -I/home/quinor/kody/tsparter/build/_deps/xtl-src/include
 -I/home/quinor/kody/tsparter/build/_deps/xsimd-src/include
 -O3
 -DNDEBUG
 -std=gnu++17
 -march=native
 -MD
 -MT tsparter/CMakeFiles/tsparter.dir/image_filters.cc.o
 -MF CMakeFiles/tsparter.dir/image_filters.cc.o.d
 -o CMakeFiles/tsparter.dir/image_filters.cc.o
 -c /home/quinor/kody/tsparter/tsparter/image_filters.cc

JohanMabille commented 1 year ago

We have a performance issue with the current implementation of views, which produces bad assembly code. This issue is under investigation.
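
In the meantime, a kernel this small can be written as a plain indexed loop over a fixed-rank container, which sidesteps the view machinery entirely. A minimal, untested sketch, assuming a 2D float input and clamping indices to emulate numpy's mode="edge" padding:

#include <cstddef>
#include <xtensor/xtensor.hpp>

inline xt::xtensor<float, 2> upscale_loop(const xt::xtensor<float, 2>& in)
{
    const std::ptrdiff_t w = static_cast<std::ptrdiff_t>(in.shape(0));
    const std::ptrdiff_t h = static_cast<std::ptrdiff_t>(in.shape(1));
    xt::xtensor<float, 2> out({static_cast<std::size_t>(2 * w), static_cast<std::size_t>(2 * h)});

    // clamp indices into [0, n-1] to emulate numpy's mode="edge" padding
    auto at = [&](std::ptrdiff_t i, std::ptrdiff_t j) {
        i = i < 0 ? 0 : (i >= w ? w - 1 : i);
        j = j < 0 ? 0 : (j >= h ? h - 1 : j);
        return in(i, j);
    };

    for (std::ptrdiff_t i = 0; i < w; ++i)
    {
        for (std::ptrdiff_t j = 0; j < h; ++j)
        {
            // the four diagonal neighbours, matching lu/ru/ld/rd above
            const float lu = at(i - 1, j - 1);
            const float ru = at(i - 1, j + 1);
            const float ld = at(i + 1, j - 1);
            const float rd = at(i + 1, j + 1);

            out(2 * i, 2 * j) = (9 * lu + 3 * ru + 3 * ld + rd) / 16.f;
            out(2 * i, 2 * j + 1) = (3 * lu + 9 * ru + ld + 3 * rd) / 16.f;
            out(2 * i + 1, 2 * j) = (3 * lu + ru + 9 * ld + 3 * rd) / 16.f;
            out(2 * i + 1, 2 * j + 1) = (lu + 3 * ru + 3 * ld + 9 * rd) / 16.f;
        }
    }
    return out;
}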

quinor commented 1 year ago

That's unfortunate :/ view operations are the majority of what I'm currently doing...

WillowWisp commented 10 months ago

We have a performance issue with the current implementation of views, which produces bad assembly code. This issue is under investigation.

Hi, I don't know if this is relevant to this issue, but a simple transpose of a { 37, 8400 } array took 10 ms:

int boxRows = 37;
int boxCols = 8400;

std::vector<int> boxOutputArrShape = { boxRows, boxCols };

// boxOutputFloatBuffer is an existing float* owned elsewhere, hence
// xt::no_ownership(); xt::adapt wraps it without copying
xt::xarray<float> boxOutputArr = xt::adapt(boxOutputFloatBuffer,
                                           boxRows * boxCols,
                                           xt::no_ownership(),
                                           boxOutputArrShape);

// This took a good 10ms running on iPhone 12 Pro while the equivalent in OpenCV took less than 1ms
xt::xarray<float> predictionsArr = xt::transpose(boxOutputArr);

Is there anything wrong with the above snippet? Or, if this is an xt::view issue, is there an older version I can downgrade to that is faster?
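
One thing that may be worth trying (a sketch, not a verified fix): the snippet goes through dynamic-rank xt::xarray containers, and the xtensor docs generally recommend fixed-rank containers when the rank is known. Adapting the buffer with a std::array shape makes the adaptor's rank static, so the assignment loop of the transpose can be resolved at compile time:

// Sketch: same adapt + transpose, but with a fixed rank of 2 everywhere;
// boxOutputFloatBuffer / boxRows / boxCols are the same variables as above.
std::array<std::size_t, 2> shape = {static_cast<std::size_t>(boxRows),
                                    static_cast<std::size_t>(boxCols)};
auto boxOutput = xt::adapt(boxOutputFloatBuffer,
                           static_cast<std::size_t>(boxRows * boxCols),
                           xt::no_ownership(),
                           shape);
xt::xtensor<float, 2> predictions = xt::transpose(boxOutput);

Even so, some gap to OpenCV may remain, since its transpose is a hand-optimized copy routine; only a measurement on the target device can tell.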