Hi!
There are some issues I can spot with the code:

- You are dividing by `2.0`, which is a `double` literal; using `2.0f` should yield faster speeds, as well as unifying the return type to `pytensor<float, 1, row_major>`.
- Instead of dividing by `2.0f`, you could also multiply with `0.5f`, which should give much faster speeds because divisions are quite expensive. This might be done implicitly by Numba / LLVM; however, our explicit SIMD dispatch might not do this transformation (@SylvainCorlay mentioned this in a private conversation).

Can you post the results after these changes?
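For illustration, a minimal sketch of both suggestions on a hypothetical kernel (not the exact function from this issue):

```cpp
#include "xtensor/xoperation.hpp"
#include "xtensor/xtensor.hpp"

// Hypothetical kernel, for illustration only.
xt::xtensor<float, 1> mid_price(const xt::xtensor<float, 1>& bid,
                                const xt::xtensor<float, 1>& ask)
{
    // (bid + ask) / 2.0  -> the double literal promotes the whole
    //                       expression to double before converting back
    // (bid + ask) / 2.0f -> stays in float, but still divides
    return (bid + ask) * 0.5f;  // float multiplication: cheapest variant
}
```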
Changing from `2.0` to `2.0f` fixes the issue.
Timings (on a different computer now), in nanoseconds:

| | numba | xtensor | xtensor_row_major | raw loop |
|---|---|---|---|---|
| `2.0` division | 822 | 1353 | 7662 | 823 |
| `0.5` multiplication | 854 | 1140 | 6400 | 840 |
| `2.0f` division | 701 | 952 | 1048 | 821 |
| `0.5f` multiplication | 737 | 846 | 976 | 775 |
Great! Did you also change the return type of `xtensor_row_major` to `float`? There shouldn't be a reason why that one should be slower than the other `xtensor` one.
Cheers!
By the way, compared to numba there might be some overhead in the function dispatching inside of pybind11; that could be worked around, e.g., by using Cython. But I can't say how much it is, and it becomes quite negligible for "bigger" functions.
Yes, I changed the return type to `float`.
I just checked. Function-call overhead for these functions is comparable between pybind11 and numba: around 400 ns for numba and 450 ns for pybind11. Creating a new ndarray with `empty_like` is also comparable in numba and in xtensor, around 100-150 ns. So the actual calculation only takes about 100 ns in the optimized case.
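As a sketch of how that dispatch overhead can be isolated (the no-op function here is a stand-in, not part of the original benchmark):

```python
import timeit

import numpy as np
from numba import njit

@njit
def noop(bid, ask):
    # Trivial body: timing this call isolates the Python -> compiled-code
    # dispatch cost from the actual computation.
    return 0.0

bid = np.random.rand(1000).astype(np.float32)
ask = np.random.rand(1000).astype(np.float32)
noop(bid, ask)  # warm up: trigger JIT compilation outside the timed region

n = 100_000
per_call = timeit.timeit(lambda: noop(bid, ask), number=n) / n
print(f"dispatch overhead: ~{per_call * 1e9:.0f} ns per call")
```

The same trick works for a pybind11 extension: bind an empty function with the same signature and time it.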
I am comparing the simple computation in Python:
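Something along these lines (a hypothetical reconstruction; the function name follows the `compute_mid_price_*` naming used below, and the loop and `empty_like` structure follow the overhead discussion above):

```python
import numpy as np
from numba import njit

@njit
def compute_mid_price(bid_prices, ask_prices):
    # Allocate the output, then compute the element-wise mid price.
    out = np.empty_like(bid_prices)
    for i in range(bid_prices.shape[0]):
        out[i] = (bid_prices[i] + ask_prices[i]) / 2.0
    return out

bid = np.random.rand(1000).astype(np.float32)
ask = np.random.rand(1000).astype(np.float32)
mid = compute_mid_price(bid, ask)  # first call triggers compilation
```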
To xtensor in C++:
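Again a hypothetical sketch (the signatures, the module name `mid_price`, and the use of the double literal `2.0` are assumptions based on the discussion above):

```cpp
#include "pybind11/pybind11.h"
#include "xtensor/xmath.hpp"
#define FORCE_IMPORT_ARRAY
#include "xtensor-python/pytensor.hpp"

// Default-layout return type; the double literal 2.0 promotes the float
// expression to double, which is the problem discussed in this thread.
xt::pytensor<float, 1> compute_mid_price_xtensor(
    const xt::pytensor<float, 1>& bid_prices,
    const xt::pytensor<float, 1>& ask_prices)
{
    return (bid_prices + ask_prices) / 2.0;
}

// Same computation with an explicit row-major return type.
xt::pytensor<float, 1, xt::layout_type::row_major>
compute_mid_price_xtensor_row_major(
    const xt::pytensor<float, 1, xt::layout_type::row_major>& bid_prices,
    const xt::pytensor<float, 1, xt::layout_type::row_major>& ask_prices)
{
    return (bid_prices + ask_prices) / 2.0;
}

PYBIND11_MODULE(mid_price, m)
{
    xt::import_numpy();
    m.def("compute_mid_price_xtensor", &compute_mid_price_xtensor);
    m.def("compute_mid_price_xtensor_row_major",
          &compute_mid_price_xtensor_row_major);
}
```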
And with these I get the following results: `compute_mid_price_xtensor` is twice as slow as numba in this example, and `compute_mid_price_xtensor_row_major` is horrible, but if I switch to doubles instead of floats I get similar performance to `compute_mid_price_xtensor`. Why is that? I have `XTENSOR_USE_XSIMD` enabled.