xtensor-stack / xtensor

C++ tensors with broadcasting and lazy computing
BSD 3-Clause "New" or "Revised" License

Performance tests of xtensor #1530

Open cerati opened 5 years ago

cerati commented 5 years ago

I am exploring the usage of xtensor in the context of a physics project that requires real-time processing of data (and is thus particularly sensitive to code performance). In particular, I am interested in libraries providing efficient vectorization support, and possibly GPU portability (thus I am very interested in the evolution of issue #192).

In order to test the performance of xtensor at a low level, I have written a simple benchmark performing the multiplication of two 6x6 matrices, where the idea is to perform N multiplications in SIMD. My code is at https://github.com/cerati/xtensor-test, and the results are linked from the wiki: https://github.com/cerati/xtensor-test/wiki/Results. In summary, I observe that the best results are obtained with plain arrays where matrix elements are grouped in blocks of 16 matrices (the approach named el16mx in the code, similar to what we currently use in our own code). Using xsimd on plain arrays does a pretty good job, while all my tests with xtensor show much slower processing times. In other words, it looks like xtensor adds some overhead on top of xsimd. Is this expected? Is my implementation missing key features? (That would not be surprising…)
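(For context: "blocks of 16 matrices" means that element (i, j) of 16 consecutive matrices is stored contiguously, so the innermost loop over the block vectorizes naturally. Below is a minimal sketch of such a layout, with invented names and types; the actual kernels live in the repository linked above.)

```cpp
#include <cstddef>

constexpr std::size_t D = 6;   // matrix dimension
constexpr std::size_t B = 16;  // number of matrices multiplied in lock-step

// One contiguous "slab" per matrix element: slot [i*D + j][m] holds element
// (i, j) of the m-th matrix in the block, so the loop over m is unit-stride.
using Block = float[D * D][B];

void multiply_block(const Block& a, const Block& b, Block& c)
{
    for (std::size_t i = 0; i < D; ++i)
        for (std::size_t j = 0; j < D; ++j)
        {
            float acc[B] = {};
            for (std::size_t k = 0; k < D; ++k)
                for (std::size_t m = 0; m < B; ++m)  // vectorizable inner loop
                    acc[m] += a[i * D + k][m] * b[k * D + j][m];
            for (std::size_t m = 0; m < B; ++m)
                c[i * D + j][m] = acc[m];
        }
}
```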

wolfv commented 5 years ago

hey,

It's a little hard to understand exactly what is going on. I'll run perf over it and see if I can spot a place where xtensor is doing something wrong.

Just from the inline comments, it looks like you didn't compile with -O3?

wolfv commented 5 years ago

Also, for your use case, instead of using a view it might be better to use xt::adapt with a fixed shape.

E.g. xt::adapt(my_mem_ptr, xt::xshape<16, 16>());
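(Spelled out, that suggestion looks roughly like the sketch below. The pointer name and the 16x16 shape are taken from the call above; note that, depending on the xtensor version, adapting a raw pointer may instead require the ownership overload, xt::adapt(ptr, size, xt::no_ownership(), shape).)

```cpp
#include <xtensor/xadapt.hpp>
#include <xtensor/xfixed.hpp>

// Wrap an existing buffer in a fixed-shape xtensor expression. Unlike a
// dynamic view, the shape is known at compile time, so there is no runtime
// shape/stride bookkeeping on each access.
void scale_in_place(double* my_mem_ptr)
{
    auto a = xt::adapt(my_mem_ptr, xt::xshape<16, 16>());
    a *= 2.0;  // reads and writes go straight through to my_mem_ptr
}
```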

wolfv commented 5 years ago

Compiling with -O3 and GCC 9 makes the performance differences quite a bit smaller, I think:

v2 -- time for NN*nrep=1600000 multiplications is 0.149402 s, i.e. per mult. [s]=9.33762e-08
plainArray_el16mx (plain loop) with align=0 -- time for NN*nrep=1600000 multiplications is 0.102228 s, i.e. per mult. [s]=6.38925e-08
plainArray_xsimd -- time for NN*nrep=1600000 multiplications is 0.129436 s, i.e. per mult. [s]=8.08975e-08

cerati commented 5 years ago

Thanks for following up!

I did compile with -O3, see the full command at: https://github.com/cerati/xtensor-test/wiki/Results

Thanks for the suggestion, I will try GCC 9 and xt::adapt, and report back.

emmenlau commented 5 years ago

Dear @cerati , do you have new insight into this?

cerati commented 5 years ago

@emmenlau, it looks like GCC 9 will be released soon, so I am waiting for that...

wolfv commented 5 years ago

I don't think GCC 9 has a lot of performance improvements over GCC 8. I just said 9 because that's the version I was using on Fedora 30 beta.

cerati commented 5 years ago

Sorry for the delay. I decided to repeat the test with GCC 8.2 (GCC 9.1 is out, but I have not installed it yet).

Results are linked from the same page: https://github.com/cerati/xtensor-test/wiki/Results

There are some interesting variations with respect to my previous tests (with ICC), but I still observe a factor of ~2 difference between "xtensor v0" and "array xsimd".

Not sure why this is different from your tests, @wolfv. Maybe it's the compiler version, maybe compiler options, maybe the machine?

cerati commented 5 years ago

And finally, results with GCC 9.0, added to the same page: https://github.com/cerati/xtensor-test/wiki/Results

In this configuration "array xsimd" is definitely the best option! But the xtensor versions are still significantly slower.