conjam opened this issue 4 years ago
Hi @conjam, first, just in case, have you made sure that you are linking against OpenBLAS or MKL? xtensor-blas contains a C++ implementation (called FLENS) of most BLAS routines, but they are a lot less optimized than actual BLAS.
Also, if you could give us a hint on what exactly you're doing with xtensor / xtensor-blas, we might be able to help better ... One problem could be that we sometimes need to convert row-major matrices to column-major for some LAPACK operations ... that could eat performance.
First off: thanks for the quick response!
I've checked, and across both platforms (I develop on mac, regression runs on centos) libopenblas is linked into both binaries; in case that isn't enough, I also have add_definitions(-DHAVE_CBLAS=1) and set(XTENSOR_USE_XSIMD 1) in my CMakeLists (I followed the CMake guide y'all put out verbatim).
Currently in regression I only use xt::linalg::dot to compute the matrix product of 2D arrays.
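For illustration, the call pattern boils down to something like this (a minimal sketch; the shapes are invented, not my actual code):

```cpp
#include <xtensor/xarray.hpp>
#include <xtensor/xrandom.hpp>
#include <xtensor-blas/xlinalg.hpp>

int main()
{
    // Two row-major 2D arrays; the shapes here are placeholders.
    xt::xarray<double> a = xt::random::rand<double>({64, 128});
    xt::xarray<double> b = xt::random::rand<double>({128, 32});

    // For 2D arguments, dot computes the matrix product and should
    // dispatch to the linked BLAS (gemm) rather than the FLENS fallback.
    xt::xarray<double> c = xt::linalg::dot(a, b);
    return 0;
}
```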
Hi @conjam,
can you give me some more context on the slowdown, and especially your matrix / vector sizes? If you have small matrices, it's very possible that hand-written code outperforms BLAS (e.g. for 3x3 matrix-matrix or matrix-vector product).
You can get some speedup by using xtensor_fixed as a container; however, the BLAS implementation is still "dynamic" and doesn't statically know the size of your matrices.
If you want to achieve the best performance for dot products for small matrices, I would encourage you to write them by hand and use the xtensor_fixed container.
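For example, such a hand-written product could look roughly like this (an illustrative sketch on xtensor_fixed containers, not a tuned implementation):

```cpp
#include <cstddef>
#include <xtensor/xfixed.hpp>

using mat3 = xt::xtensor_fixed<double, xt::xshape<3, 3>>;
using vec3 = xt::xtensor_fixed<double, xt::xshape<3>>;

// All sizes are known at compile time, so the loop can be fully
// unrolled; no dynamic dispatch or BLAS call overhead for tiny operands.
inline vec3 matvec3(const mat3& m, const vec3& v)
{
    vec3 r;
    for (std::size_t i = 0; i < 3; ++i)
    {
        r(i) = m(i, 0) * v(0) + m(i, 1) * v(1) + m(i, 2) * v(2);
    }
    return r;
}
```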
If you have a problem with large matrices, I would appreciate more context so I can check what the problem might be: e.g. the sizes of the matrices, some code snippets, your hand-written implementation, etc.
xt::linalg::tensordot seems to execute very slowly here; not sure if this is related. However, I found this may be due to the preparatory mathematical and view operations I'm doing, since I can influence the timing by using xt::eval. Nevertheless, I have two identical algorithms, one in Python/NumPy and one in C++ (using xtensor-blas), and the C++ version is 50-100x slower than the NumPy version. The result of the calculation is identical.
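To illustrate the xt::eval effect (a sketch with invented shapes; tensordot here contracts the last axis of the first argument with the first axis of the second):

```cpp
#include <xtensor/xarray.hpp>
#include <xtensor/xrandom.hpp>
#include <xtensor/xview.hpp>
#include <xtensor-blas/xlinalg.hpp>

int main()
{
    xt::xarray<double> a = xt::random::rand<double>({32, 32, 32});
    xt::xarray<double> b = xt::random::rand<double>({32, 32, 32});

    // A lazy, non-contiguous view of a.
    auto av = xt::view(a, xt::all(), xt::range(0, 16), xt::all());

    // Materializing the view first hands tensordot a plain container,
    // instead of making it evaluate the view element by element while
    // copying into the BLAS buffer.
    xt::xarray<double> a_eval = xt::eval(av);
    auto c = xt::linalg::tensordot(a_eval, b, 1);  // contract one axis
    return 0;
}
```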
Hey all,
I've found great success using xtensor (and xtensor-blas); when I'm developing, I've seen ~15x speedups when compared to the handwritten stuff I had prior.
Regression is another story, though: in jobs that use xtensor-blas, I've seen slowdowns of as much as 30x compared to the original performance. The slowdown is most prominent in smaller unit tests that used to pass in under 500ms and now take ~17 seconds; larger tests (10+ second run time) saw a 5x-10x slowdown.
I suspect that the problem lies in OpenBLAS as a backend, and I have tried to limit the number of threads it spawns by setting OPENBLAS_NUM_THREADS=1. Limiting the thread count did help: before I did that, my system would crash during regression with pthread resource errors. Before I spend cycles profiling too deeply, I figured I'd ask: has anyone seen anything similar to this?
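For reference, the same thread limit can be applied programmatically (a sketch; it assumes OpenBLAS is the linked BLAS, since openblas_set_num_threads is an OpenBLAS-specific hook declared in its cblas.h):

```cpp
// Forward declaration of the OpenBLAS control hook; the real
// declaration lives in OpenBLAS's cblas.h.
extern "C" void openblas_set_num_threads(int num_threads);

int main()
{
    // Same effect as exporting OPENBLAS_NUM_THREADS=1 before launching,
    // but applied from inside the process, before any BLAS call runs.
    openblas_set_num_threads(1);
    // ... run the xtensor-blas workload here ...
    return 0;
}
```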