xtensor-stack / xsimd

C++ wrappers for SIMD intrinsics and parallelized, optimized mathematical functions (SSE, AVX, AVX512, NEON, SVE)
https://xsimd.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

`fma()` uses the generic implementation despite AVX2 and FMA being enabled #1041

Closed mkatliar closed 1 month ago

mkatliar commented 1 month ago

Test program:

#include <xsimd/xsimd.hpp>

int main(int, char **)
{
    static_assert(XSIMD_WITH_AVX2);
    static_assert(XSIMD_WITH_FMA3_AVX);
    static_assert(XSIMD_WITH_FMA3_AVX2);

    xsimd::batch<float, xsimd::avx2> a {10.}, b {20.}, c {30.};
    xsimd::batch<float, xsimd::avx2> d = fma(a, b, c);
}

Compile with

clang++ -g test.cpp -mavx2 -mfma

Step into the fma() function and see that the generic version is called:

#0  xsimd::kernel::fma<xsimd::avx2, float> (x=..., y=..., z=...) at /usr/local/include/xsimd/types/../arch/././generic/xsimd_generic_arithmetic.hpp:74
#1  xsimd::fma<float, xsimd::avx2> (x=..., y=..., z=...) at /usr/local/include/xsimd/types/xsimd_api.hpp:881
#2  main () at test.cpp:10

The xsimd_fma3_avx.hpp file, where the correct fma() function is defined, does get included:

$ clang++ -H -g test.cpp -mavx2 -mfma 2>&1 | grep xsimd_fma
..... /usr/local/include/xsimd/memory/../config/../types/xsimd_fma3_sse_register.hpp
..... /usr/local/include/xsimd/memory/../config/../types/xsimd_fma4_register.hpp
..... /usr/local/include/xsimd/memory/../config/../types/xsimd_fma3_avx2_register.hpp
..... /usr/local/include/xsimd/memory/../config/../types/xsimd_fma3_avx_register.hpp
.... /usr/local/include/xsimd/types/../arch/./xsimd_fma3_sse.hpp
.... /usr/local/include/xsimd/types/../arch/./xsimd_fma3_avx.hpp
.... /usr/local/include/xsimd/types/../arch/./xsimd_fma3_avx2.hpp
..... /usr/local/include/xsimd/types/../arch/./xsimd_fma3_avx.hpp

mkatliar commented 1 month ago

My bad: I did not realize that xsimd::fma3<xsimd::avx2> must be specified as the architecture; with plain xsimd::avx2, fma() falls back to the generic implementation. The correct code is:

xsimd::batch<float, xsimd::fma3<xsimd::avx2>> a {10.}, b {20.}, c {30.};
xsimd::batch<float, xsimd::fma3<xsimd::avx2>> d = fma(a, b, c);