Open bordingj opened 6 years ago
Hi. can you do the for loop like this:
xt::xtensor<float,1> copy_with_classic_loop(xt::xtensor<float,1>&& input){
auto output = xt::empty_like(input);
auto sz = input.size();
for (uint64_t i=0; i<sz; i++){
output[i] = input[i];
}
return output;
}
Or even better use stl algos:
std::copy(&input[0], &input[sz], &output[0]);
note: to use --fast-- begin / end iterators in xtensor-python, you currently need to explicitly specify the layout:
xt::pytensor<double, 1, layout_type::row_major>
also don't forget to compile with -O3 -march=native
about arange
yeah we need to improve this function but it's not super trivial I'm afraid. We've already improved assigning an arange to a container by a lot.
-march=native
makes copy_with_classic_loop
run slower on my computer.
I get the following timings on my laptop with:
template<typename T>
auto copy_with_arange_loop(xt::xexpression<T>& expr){
auto& input = expr.derived_cast();
auto output = xt::empty_like(input);
for (auto&& i : xt::arange<uint64_t>(input.size())){
output[i] = input[i];
}
return output;
}
template<typename T>
auto copy_with_classic_loop(xt::xexpression<T>& expr){
auto& input = expr.derived_cast();
auto output = xt::empty_like(input);
auto sz = input.size();
for (uint64_t i=0; i<sz; i++){
output[i] = input[i];
}
return output;
}
template<typename T>
auto copy_with_stl(xt::xexpression<T>& expr){
auto& input = expr.derived_cast();
auto output = xt::empty_like(input);
auto sz = input.size();
std::copy(&input[0], &input[sz], &output[0]);
return output;
}
template<typename T>
auto copy_with_raw_pointers(xt::xexpression<T>& expr){
auto& input = expr.derived_cast();
auto output = xt::empty_like(input);
auto sz = input.size();
auto input_ptr = &input[0];
auto output_ptr = &output[0];
for (uint64_t i=0; i<sz; i++){
output_ptr[i] = input_ptr[i];
}
return output;
}
In[1]: %timeit copy_with_arange_loop_py(a)
224 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In[2]: %timeit copy_with_classic_loop_py(a)
42.8 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In[3]: %timeit copy_with_stl_py(a)
13.4 µs ± 57.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In[4]: %timeit copy_with_raw_pointers_py(a)
17.1 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In[5]: %timeit a.copy()
13.8 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I would have expected copy_with_classic_loop
to be as fast as copy_with_raw_pointers
but I can see that the compiler auto-vectorize the loop in copy_with_raw_pointers
but not the one in copy_with_classic_loop
.
I am comparing an
xt::arange
based for-loop and a classic for-loop:Gets me
We see that the xt::arange based for-loop performs poorly. I also find it disappointing that pure numpy is alot faster than the classic for-loop.
I am using gcc 5.4 with -O3