xtensor-stack / xtensor

C++ tensors with broadcasting and lazy computing

xtensor slower than numba #1112

Closed (bordingj closed this issue 6 years ago)

bordingj commented 6 years ago

I am comparing this simple computation in Python:

import numpy as np
import numba as nb

def compute_mid_price_numpy(ask_price, bid_price):
    return (ask_price + bid_price) / 2.0

@nb.vectorize(nopython=True)
def compute_mid_price_numba(ask_price, bid_price):
    return (ask_price + bid_price) / 2.0

to the equivalent in C++ with xtensor:

template<typename E>
inline E compute_mid_price(E& ask_price, E& bid_price){
    return (ask_price + bid_price) / 2.0;
}

xt::pytensor<float, 1> compute_mid_price_xtensor(xt::pytensor<float, 1>& ask_prices,
                                                   xt::pytensor<float, 1>& bid_prices){
    return compute_mid_price(ask_prices, bid_prices);
}

xt::pytensor<double, 1, xt::layout_type::row_major> compute_mid_price_xtensor_row_major(
                                                            xt::pytensor<float, 1, xt::layout_type::row_major>& ask_prices,
                                                            xt::pytensor<float, 1, xt::layout_type::row_major>& bid_prices){
    return compute_mid_price(ask_prices, bid_prices);
}

xt::pytensor<float, 1> compute_mid_price_raw_loop(xt::pytensor<float, 1>& ask_prices,
                                                   xt::pytensor<float, 1>& bid_prices){
    auto out = xt::empty_like(ask_prices);
    auto numel = out.size();
    if (numel != bid_prices.size() ){
        throw std::runtime_error(SOURCE_ERROR("lengths must equal"));
    }
    auto out_ptr = &out[0];
    auto ask_prices_ptr = &ask_prices[0];
    auto bid_prices_ptr = &bid_prices[0];
    for (size_t i=0; i<numel; i++){
        out_ptr[i] = compute_mid_price(ask_prices_ptr[i], bid_prices_ptr[i]);
    }
    return out;
}
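
(The pybind11 binding code isn't shown in the issue; for context, it presumably looks something like the sketch below. The module name market_data and the exact binding details are assumptions, and the functions above are assumed to be declared in the same translation unit.)

#include <pybind11/pybind11.h>
#define FORCE_IMPORT_ARRAY
#include <xtensor-python/pytensor.hpp>

PYBIND11_MODULE(market_data, m) {
    xt::import_numpy();  // required once per module when using xtensor-python types
    m.def("compute_mid_price_xtensor", &compute_mid_price_xtensor);
    m.def("compute_mid_price_xtensor_row_major", &compute_mid_price_xtensor_row_major);
    m.def("compute_mid_price_raw_loop", &compute_mid_price_raw_loop);
}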

And with

import numpy as np
from trading.market_data import (compute_mid_price_xtensor, compute_mid_price_xtensor_row_major,
                                 compute_mid_price_raw_loop)
ask_price = np.random.rand(1000).astype(np.float32)
bid_price = np.random.rand(1000).astype(np.float32)

I get the following timings:

In[5]: %timeit compute_mid_price_numpy(ask_price, bid_price)
1.96 µs ± 28.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In[6]: %timeit compute_mid_price_numba(ask_price, bid_price)
670 ns ± 0.844 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In[7]: %timeit compute_mid_price_xtensor(ask_price, bid_price)
1.78 µs ± 30.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In[8]: %timeit compute_mid_price_xtensor_row_major(ask_price, bid_price)
7.99 µs ± 98.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In[9]: %timeit compute_mid_price_raw_loop(ask_price, bid_price)
771 ns ± 0.629 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

compute_mid_price_xtensor is more than twice as slow as numba in this example. compute_mid_price_xtensor_row_major is far worse, but if I switch to doubles instead of floats it gets performance similar to compute_mid_price_xtensor. Why is that? I have XTENSOR_USE_XSIMD enabled.

wolfv commented 6 years ago

Hi!

There are some issues I can spot with the code:

- (ask_price + bid_price) / 2.0 divides by a double literal, which promotes the whole float expression to double and prevents xsimd from vectorizing it with float lanes; use 2.0f (or multiply by 0.5f) instead.
- compute_mid_price_xtensor_row_major takes float tensors but declares a pytensor<double, ...> return type, which forces an extra conversion of the result; the return type should be float as well.

Can you post the results after these changes?
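
For reference, a minimal sketch of the two fixes (the same functions as above, with only the literal and the return type changed):

template<typename E>
inline E compute_mid_price(E& ask_price, E& bid_price){
    // 2.0f keeps the whole expression in single precision
    return (ask_price + bid_price) / 2.0f;
}

xt::pytensor<float, 1, xt::layout_type::row_major> compute_mid_price_xtensor_row_major(
        xt::pytensor<float, 1, xt::layout_type::row_major>& ask_prices,
        xt::pytensor<float, 1, xt::layout_type::row_major>& bid_prices){
    // the return type now matches the float inputs, so no double round-trip
    return compute_mid_price(ask_prices, bid_prices);
}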

bordingj commented 6 years ago

Changing from 2.0 to 2.0f fixes the issue.

Timings (on a different computer now), all in nanoseconds:

                      2.0 div    0.5 mul    2.0f div    0.5f mul
numba                   822        854         701         737
xtensor                1353       1140         952         846
xtensor_row_major      7662       6400        1048         976
raw loop                823        840         821         775

wolfv commented 6 years ago

Great! Did you also change the return type of xtensor_row_major to float? There shouldn't be any reason for that one to be slower than the other xtensor version.

Cheers!

wolfv commented 6 years ago

By the way, compared to numba there might be some overhead in the function dispatching inside pybind11; that could be worked around, e.g. by using Cython. But I can't say how much it is, and it becomes quite negligible for "bigger" functions.
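
One way to get a feel for that dispatch overhead (a minimal sketch; the module and function names here are made up) is to bind a function that does nothing and time it from Python:

#include <pybind11/pybind11.h>

// Does no work at all, so %timeit overhead_probe.noop() from Python
// measures only the pybind11 call/dispatch cost.
void noop() {}

PYBIND11_MODULE(overhead_probe, m) {
    m.def("noop", &noop);
}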

bordingj commented 6 years ago

Yes, I changed the return type to float.

bordingj commented 6 years ago

I just checked. The function-call overhead of these functions is comparable for pybind11 and numba: around 400 ns for numba and 450 ns for pybind11. Creating a new ndarray with empty_like is also comparable between numba and xtensor, around 100-150 ns. So the actual calculation only takes about 100 ns in the optimized case, and those three components account for most of the ~700-850 ns totals measured above.
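
For completeness, a sketch of how the allocation part could be isolated on the xtensor side (the function name alloc_only is hypothetical; it would be bound with pybind11 like the functions above):

xt::pytensor<float, 1> alloc_only(xt::pytensor<float, 1>& ask_prices){
    // Only allocates an uninitialized output of the same shape, so timing
    // this from Python approximates the empty_like cost plus the pybind11
    // argument/return conversion, with no arithmetic.
    return xt::empty_like(ask_prices);
}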