cerati opened 5 years ago
hey,
it's a little hard to understand what exactly is going on. I'll run perf over it and see if I can spot a place where xtensor is making a mistake. Just from the comments inline, you didn't compile with -O3?
Also, for your use case, instead of using a view it might be a better idea to use `xt::adapt` with a fixed shape, e.g. `xt::adapt(my_mem_ptr, xt::xshape<16, 16>());`
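For concreteness, here is a minimal self-contained sketch of that suggestion. It assumes the pointer-plus-fixed-shape overload of `xt::adapt` used above; `my_mem` is a hypothetical buffer standing in for whatever `my_mem_ptr` points to in the real code:

```cpp
#include <iostream>
#include <xtensor/xadapt.hpp>

int main()
{
    // Hypothetical pre-allocated buffer; in the real code this would be
    // the memory that my_mem_ptr already points to.
    double my_mem[16 * 16] = {};
    double* my_mem_ptr = my_mem;

    // Adapt the raw pointer with a compile-time shape, as suggested above.
    // With xt::xshape the shape is known at compile time, so no dynamic
    // shape/stride bookkeeping is needed at runtime.
    auto a = xt::adapt(my_mem_ptr, xt::xshape<16, 16>());

    a(0, 0) = 1.0;  // writes go straight through to the underlying buffer
    std::cout << a(0, 0) << std::endl;
    return 0;
}
```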
Compiling with -O3 and GCC 9 makes the perf differences quite a bit smaller, I think:

```
v2 -- time for NN*nrep=1600000 multiplications is 0.149402 s, i.e. per mult. [s]=9.33762e-08
plainArray_el16mx (plain loop) with align=0 -- time for NN*nrep=1600000 multiplications is 0.102228 s, i.e. per mult. [s]=6.38925e-08
plainArray_xsimd -- time for NN*nrep=1600000 multiplications is 0.129436 s, i.e. per mult. [s]=8.08975e-08
```
Thanks for following up! I did compile with -O3, see the full command at: https://github.com/cerati/xtensor-test/wiki/Results

Thanks for the suggestion, I will try GCC 9 and `xt::adapt` and report back.
Dear @cerati, do you have new insight into this?
@emmenlau, it looks like GCC 9 will be released soon, so I am waiting for that...
I don't think gcc 9 has a lot of performance improvements over gcc 8. I just said 9 because that's the version I was using on Fedora 30 beta.
Sorry for the delay. I decided to repeat the test with gcc 8.2 (gcc 9.1 is out but I have not installed it yet).
Results are linked from the same page: https://github.com/cerati/xtensor-test/wiki/Results
There are some interesting variations with respect to my previous tests (with ICC), but I still observe a factor of ~2x difference between "xtensor v0" and "array xsimd".
Not sure why this is different from your tests, @wolfv. Maybe it's the compiler version, maybe compiler options, maybe the machine?
Finally, I also added results with GCC 9.0 to the same page: https://github.com/cerati/xtensor-test/wiki/Results
In this configuration "array xsimd" is definitely the best option! But the xtensor versions are still significantly slower.
I am exploring the use of xtensor in the context of a physics project that requires real-time processing of data (and is thus particularly sensitive to code performance). In particular, I am interested in libraries that provide efficient vectorization support and, possibly, GPU portability (so I am very interested in the evolution of issue #192).
To test the performance of xtensor at a low level, I have written a simple benchmark that performs the multiplication of two 6x6 matrices, where the idea is to perform N multiplications in SIMD. My code is at https://github.com/cerati/xtensor-test, and the results are linked from the wiki: https://github.com/cerati/xtensor-test/wiki/Results.

In summary, I observe that the best results are obtained with plain arrays where matrix elements are grouped in blocks of 16 matrices (the approach named el16mx in the code, sketched below, and similar to the approach we currently use in our code). Using xsimd on plain arrays does a pretty good job, while all my tests with xtensor show much slower processing times. In other words, it looks like xtensor adds some overhead on top of xsimd. Is this expected? Is my implementation missing key features? (That would not be surprising…)
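To make the el16mx idea concrete, here is a minimal sketch of the layout (the names `D`, `B`, `Block`, and `multiply_block` are mine for illustration, not the benchmark's actual code): element (i, j) of 16 consecutive matrices is stored contiguously, so the innermost loop runs over the 16 matrices and is straightforward for the compiler to vectorize.

```cpp
#include <cstddef>

constexpr std::size_t D = 6;   // matrix dimension (6x6 matrices)
constexpr std::size_t B = 16;  // matrices per block

// a[i][j][m] holds element (i, j) of matrix m within the block, so the
// 16 values of a given element sit next to each other in memory.
using Block = float[D][D][B];

void multiply_block(const Block& a, const Block& b, Block& c)
{
    for (std::size_t i = 0; i < D; ++i)
        for (std::size_t j = 0; j < D; ++j)
        {
            float acc[B] = {0.f};
            for (std::size_t k = 0; k < D; ++k)
                for (std::size_t m = 0; m < B; ++m)  // SIMD-friendly inner loop
                    acc[m] += a[i][k][m] * b[k][j][m];
            for (std::size_t m = 0; m < B; ++m)
                c[i][j][m] = acc[m];
        }
}
```

As I understand the benchmark, the plainArray_xsimd variant uses the same kind of layout but replaces the scalar inner loop over `m` with explicit xsimd batch operations.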