xtensor-stack / xtensor

C++ tensors with broadcasting and lazy computing
BSD 3-Clause "New" or "Revised" License

xt::arange based for-loop slow #1089

Open bordingj opened 6 years ago

bordingj commented 6 years ago

I am comparing an xt::arange based for-loop and a classic for-loop:

xt::xtensor<float,1> copy_with_arange_loop(xt::xtensor<float,1>&& input){
    auto output = xt::empty_like(input);
    for (auto&& i : xt::arange<uint64_t>(input.size())){
        output[i] = input[i];
    }
    return output;
}

xt::xtensor<float,1> copy_with_classic_loop(xt::xtensor<float,1>&& input){
    auto output = xt::empty_like(input);
    for (uint64_t i=0; i<input.size(); i++){
        output[i] = input[i];
    }
    return output;
}

xt::pytensor<float,1> copy_with_arange_loop_py(xt::pytensor<float,1>& input){
    return copy_with_arange_loop(input);
}

xt::pytensor<float,1> copy_with_classic_loop_py(xt::pytensor<float,1>& input){
    return copy_with_classic_loop(input);
}

This gets me:

In[8]: a = np.random.rand(100000).astype(np.float32)
In[9]: %timeit copy_with_classic_loop_py(a)
97 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In[10]: %timeit copy_with_arange_loop_py(a)
289 µs ± 6.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In[11]: %timeit a.copy()*1
50.7 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We see that the xt::arange based for-loop performs poorly. I also find it disappointing that pure NumPy is a lot faster than the classic for-loop.

I am using gcc 5.4 with -O3.
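For context on what the range-for is doing: iterating over xt::arange goes through an expression iterator rather than a plain integer counter. Below is a rough standard-C++ analogue of that shape (an illustration only, not xtensor's actual iterator code; xtensor's real expression iterators also carry stepper state, which is presumably what the optimizer fails to strip away):

```cpp
#include <cstdint>
#include <vector>

// Minimal lazy index range: a rough analogue of what a range-for over
// xt::arange desugars to. Dereferencing the iterator yields the index.
struct index_range {
    std::uint64_t n;
    struct iterator {
        std::uint64_t i;
        std::uint64_t operator*() const { return i; }
        iterator& operator++() { ++i; return *this; }
        bool operator!=(const iterator& other) const { return i != other.i; }
    };
    iterator begin() const { return iterator{0}; }
    iterator end() const { return iterator{n}; }
};

// Copy using the lazy range, mirroring the shape of copy_with_arange_loop.
inline std::vector<float> copy_with_range(const std::vector<float>& input) {
    std::vector<float> output(input.size());
    for (auto i : index_range{input.size()}) {
        output[i] = input[i];
    }
    return output;
}
```

This trivial iterator optimizes away; the point is that xt::arange's iterator is a much heavier version of the same pattern.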

wolfv commented 6 years ago

Hi, can you do the for loop like this:

xt::xtensor<float,1> copy_with_classic_loop(xt::xtensor<float,1>&& input){
    auto output = xt::empty_like(input);
    auto sz = input.size();
    for (uint64_t i=0; i<sz; i++){
        output[i] = input[i];
    }
    return output;
}

Or even better, use STL algorithms:

std::copy(&input[0], &input[sz], &output[0]);

Note: to use the fast begin/end iterators in xtensor-python, you currently need to explicitly specify the layout:

xt::pytensor<double, 1, layout_type::row_major>

wolfv commented 6 years ago

Also, don't forget to compile with -O3 -march=native.

About arange: yeah, we need to improve this function, but it's not super trivial, I'm afraid. We've already improved assigning an arange to a container by a lot.

bordingj commented 6 years ago

-march=native makes copy_with_classic_loop run slower on my computer.

I get the following timings on my laptop with:

template<typename T>
auto copy_with_arange_loop(xt::xexpression<T>& expr){
    auto& input = expr.derived_cast();
    auto output = xt::empty_like(input);
    for (auto&& i : xt::arange<uint64_t>(input.size())){
        output[i] = input[i];
    }
    return output;
}

template<typename T>
auto copy_with_classic_loop(xt::xexpression<T>& expr){
    auto& input = expr.derived_cast();
    auto output = xt::empty_like(input);
    auto sz = input.size();
    for (uint64_t i=0; i<sz; i++){
        output[i] = input[i];
    }
    return output;
}

template<typename T>
auto copy_with_stl(xt::xexpression<T>& expr){
    auto& input = expr.derived_cast();
    auto output = xt::empty_like(input);
    auto sz = input.size();
    std::copy(&input[0], &input[sz], &output[0]);
    return output;
}

template<typename T>
auto copy_with_raw_pointers(xt::xexpression<T>& expr){
    auto& input = expr.derived_cast();
    auto output = xt::empty_like(input);
    auto sz = input.size();
    auto input_ptr = &input[0];
    auto output_ptr = &output[0];
    for (uint64_t i=0; i<sz; i++){
        output_ptr[i] = input_ptr[i];
    }
    return output;
}

In[1]: %timeit copy_with_arange_loop_py(a)
224 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In[2]: %timeit copy_with_classic_loop_py(a)
42.8 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In[3]: %timeit copy_with_stl_py(a)
13.4 µs ± 57.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In[4]: %timeit copy_with_raw_pointers_py(a)
17.1 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In[5]: %timeit a.copy()
13.8 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

I would have expected copy_with_classic_loop to be as fast as copy_with_raw_pointers, but I can see that the compiler auto-vectorizes the loop in copy_with_raw_pointers but not the one in copy_with_classic_loop.
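My guess (an assumption, not verified against the generated assembly) is that the pointer hoist is what makes the difference: once .data() and the size are cached in locals, the hot loop is a flat copy between two pointers with no container indexing logic in the way, which compilers reliably auto-vectorize. The same pattern in plain standard C++, without xtensor:

```cpp
#include <cstddef>
#include <vector>

// Raw-pointer variant in plain standard C++: hoist the data pointers and
// the size into locals so the loop body is just two pointer dereferences.
inline std::vector<float> copy_hoisted(const std::vector<float>& input) {
    std::vector<float> output(input.size());
    const float* in = input.data();
    float* out = output.data();
    const std::size_t n = input.size();
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i];  // no container indexing logic inside the hot loop
    }
    return output;
}
```

With gcc you can check whether a given loop was vectorized by compiling with -fopt-info-vec.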