xtensor-stack / xtensor-blas

BLAS extension to xtensor
BSD 3-Clause "New" or "Revised" License

30x slowdown in regression #146

Open conjam opened 4 years ago

conjam commented 4 years ago

Hey all,

I've had great success using xtensor (and xtensor-blas); in development I've seen ~15x speedups compared to the hand-written code I had before.

Regression is another story, though: in jobs that use xtensor-blas I've seen slowdowns of as much as 30x compared to the original performance. The slowdown is most prominent in smaller unit tests that used to pass in under 500ms and now take ~17 seconds; larger tests (10+ seconds of run time) slowed down 5x-10x.

I suspect the problem lies in OpenBLAS as the backend. I tried limiting the number of threads it spawns by setting OPENBLAS_NUM_THREADS=1, and that did help; before I did that, my system would crash during regression with pthread resource errors.
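
For reference, the same cap can also be applied from inside the program; this is just a sketch, assuming libopenblas is linked directly:

```cpp
// Rough equivalent of exporting OPENBLAS_NUM_THREADS=1, done at runtime.
// This is OpenBLAS's own C API, so it only links when OpenBLAS is the BLAS.
extern "C" void openblas_set_num_threads(int num_threads);

int main()
{
    openblas_set_num_threads(1);  // pin OpenBLAS to a single thread
    // ... run the regression job ...
    return 0;
}
```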

Before I spend cycles profiling too deeply, I figured I'd ask: has anyone seen anything similar to this?

wolfv commented 4 years ago

Hi @conjam, first, just in case, have you made sure that you are linking against OpenBLAS or MKL? xtensor-blas contains a C++ implementation (called FLENS) of most BLAS routines, but they are a lot less optimized than actual BLAS.

Also, if you could give us a hint about what exactly you're doing with xtensor / xtensor-blas, we might be able to help better. One possible problem: for some LAPACK operations we need to convert row-major matrices to column-major, and that conversion can eat performance.
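
If layout conversion turns out to be the culprit, one possible workaround (a sketch, not a guaranteed fix) is to store the operands column-major from the start so no conversion is needed:

```cpp
#include <xtensor/xarray.hpp>

// xtensor containers are row-major by default; a column-major alias avoids
// the row-major -> column-major copy some LAPACK wrappers have to make.
using cm_array = xt::xarray<double, xt::layout_type::column_major>;

int main()
{
    cm_array a = {{1.0, 2.0}, {3.0, 4.0}};  // stored column-major from the start
    return 0;
}
```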

conjam commented 4 years ago

First off: thanks for the quick response!

I've checked across platforms (I develop on macOS, run regression on CentOS), and libopenblas is linked into both binaries. In case that isn't enough: I have add_definitions(-DHAVE_CBLAS=1) and set(XTENSOR_USE_XSIMD 1) in my CMakeLists.txt (I followed the CMake guide y'all put out verbatim).
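
A quick sanity check that would confirm OpenBLAS (rather than the built-in FLENS fallback) is in the binary, sketched below; openblas_get_config is OpenBLAS-specific, so this wouldn't even link against a different BLAS:

```cpp
#include <iostream>

// Declared in OpenBLAS's cblas.h; returns a build description string.
extern "C" char* openblas_get_config();

int main()
{
    std::cout << openblas_get_config() << '\n';  // e.g. "OpenBLAS 0.3.x ..."
    return 0;
}
```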

Currently, the regression jobs only use xt::linalg::dot to compute matrix products of 2D arrays.
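
In sketch form (the shapes here are placeholders; the real tests use a range of 2D sizes), the usage is just:

```cpp
#include <iostream>
#include <xtensor/xarray.hpp>
#include <xtensor/xio.hpp>
#include <xtensor-blas/xlinalg.hpp>

int main()
{
    // Placeholder shapes, for illustration only.
    xt::xarray<double> a = {{1.0, 2.0}, {3.0, 4.0}};
    xt::xarray<double> b = {{5.0, 6.0}, {7.0, 8.0}};

    auto c = xt::linalg::dot(a, b);  // 2D x 2D dispatches to a gemm call
    std::cout << c << '\n';
    return 0;
}
```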

wolfv commented 4 years ago

Hi @conjam,

can you give me some more context on the slowdown, and especially your matrix / vector sizes? If your matrices are small, it's very possible that hand-written code outperforms BLAS (e.g. for a 3x3 matrix-matrix or matrix-vector product).

You can get some speedup by using xtensor_fixed as a container; however, the BLAS implementation is still "dynamic" and doesn't statically know the size of your matrices.

If you want to achieve the best performance for dot products for small matrices, I would encourage you to write them by hand and use the xtensor_fixed container.
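
A minimal sketch of what I mean, with a hand-rolled 3x3 product on xtensor_fixed (compile-time shape, no allocation, no BLAS dispatch):

```cpp
#include <cstddef>
#include <xtensor/xfixed.hpp>

using mat3 = xt::xtensor_fixed<double, xt::xshape<3, 3>>;

// Naive triple loop; for shapes this small it typically beats a BLAS call
// because there is no dispatch or threading overhead.
mat3 dot3x3(const mat3& a, const mat3& b)
{
    mat3 c;
    for (std::size_t i = 0; i < 3; ++i)
    {
        for (std::size_t j = 0; j < 3; ++j)
        {
            double s = 0.0;
            for (std::size_t k = 0; k < 3; ++k)
            {
                s += a(i, k) * b(k, j);
            }
            c(i, j) = s;
        }
    }
    return c;
}
```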

If you have a problem with large matrices, I would appreciate it if you could give me more context so I can check what the problem might be. E.g. sizes of the matrices, some code snippets, your hand-written implementation etc.

pdumon commented 4 years ago

xt::linalg::tensordot seems to execute very slowly here; I'm not sure if this is related. It may be due to the preparatory mathematical and view operations I'm doing, since I can influence the timing by using xt::eval. Nevertheless, I have two identical algorithms in Python/NumPy and in C++ (using xtensor-blas), and the C++ version is 50-100x slower than the NumPy version, even though the results of the calculation are identical.
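
For illustration, a stripped-down sketch of the pattern (shapes and names are placeholders, not my actual code):

```cpp
#include <xtensor/xarray.hpp>
#include <xtensor/xbuilder.hpp>
#include <xtensor/xeval.hpp>
#include <xtensor/xview.hpp>
#include <xtensor-blas/xlinalg.hpp>

int main()
{
    // Placeholder data; the real computation is more involved.
    xt::xarray<double> a = xt::ones<double>({8, 8, 8});
    xt::xarray<double> b = xt::ones<double>({8, 8});

    auto v = xt::view(a, 0);  // lazy 8x8 slice (an expression, not a container)

    // Passing the lazy view straight in vs. materializing it first; the
    // second form evaluates the view into a contiguous temporary before
    // the BLAS call, which noticeably changes the timing for me.
    auto r1 = xt::linalg::tensordot(v, b, 1);
    auto r2 = xt::linalg::tensordot(xt::eval(v), b, 1);
    return 0;
}
```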