xtensor-stack / xsimd

C++ wrappers for SIMD intrinsics and parallelized, optimized mathematical functions (SSE, AVX, AVX512, NEON, SVE)
https://xsimd.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Build failure on ARM v8 with SVE (`neoverse_v1`) architecture #1005

Closed by casparvl 8 months ago

casparvl commented 9 months ago

Environment

Error

I'm running into a build issue when compiling code that uses xsimd:

/tmp/bot/easybuild/build/DP3/6.0/foss-2023b/DP3/antennaflagger/Flagger.cc:116:66:   required from here
/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_v1/software/IDG/1.2.0-foss-2023b/include/xsimd/arch/xsimd_neon.hpp:943:36: error: could not convert dispatcher.xsimd::kernel::detail::neon_dispatcher_base<xsimd::kernel::detail::comp_return_type, __Uint8x16_t, __Int8x16_t, __Uint16x8_t, __Int16x8_t, __Uint32x4_t, __Int32x4_t, __Float32x4_t>::binary::apply<__Float32x4_t>((& lhs)->xsimd::batch<float, xsimd::i8mm<xsimd::neon64> >::<anonymous>.xsimd::types::simd_register<float, xsimd::i8mm<xsimd::neon64> >::<anonymous>.xsimd::types::simd_register<float, xsimd::neon64>::<anonymous>.xsimd::types::simd_register<float, xsimd::neon>::operator register_type(), (& rhs)->xsimd::batch<float, xsimd::i8mm<xsimd::neon64> >::<anonymous>.xsimd::types::simd_register<float, xsimd::i8mm<xsimd::neon64> >::<anonymous>.xsimd::types::simd_register<float, xsimd::neon64>::<anonymous>.xsimd::types::simd_register<float, xsimd::neon>::operator register_type()) from xsimd::kernel::detail::comp_return_type<__Float32x4_t> {aka uint32x4_t} to xsimd::batch_bool<float, xsimd::i8mm<xsimd::neon64> >
  943 |             return dispatcher.apply(register_type(lhs), register_type(rhs));
      |                    ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                    |
      |                                    xsimd::kernel::detail::comp_return_type<__Float32x4_t> {aka uint32x4_t}

The same code, with the same compiler flags and compiler version, builds fine on Neoverse N1 (and on zen2, zen3, haswell and skylake, by the way).

I've tried to dig into the xsimd code a bit, but with all the types flying around in the above error I'm in a bit over my head :) I'm hoping that someone with more expertise in xsimd can spot where this goes wrong. My bet is that Neoverse V1 changed something in terms of datatypes, intrinsics, or similar that xsimd does not account for (yet), which makes this fail there but not on e.g. Neoverse N1.

Not sure if this is useful, but here is an overview of the supported instructions on N1 vs V1. On Neoverse N1:

$ lscpu | grep Flags
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

And for Neoverse V1:

$ lscpu | grep Flags
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs dcpodp svei8mm svebf16 i8mm bf16 dgh rng
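
The two flag lists above are easier to compare as a set difference. The following is a minimal sketch in standard C++ (the helper names `parse_flags` and `only_in_second` are made up for illustration) that lists the flags reported on V1 but not on N1. It only compares the `lscpu` output quoted above and does not by itself show which feature causes the build failure, although `i8mm`, which appears in the `xsimd::i8mm<xsimd::neon64>` type in the error, is one of the V1-only flags.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Flag lists copied from the lscpu output quoted in this issue.
const std::string kN1Flags =
    "fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid "
    "asimdrdm lrcpc dcpop asimddp ssbs";
const std::string kV1Flags =
    "fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid "
    "asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm "
    "dit uscat ilrcpc flagm ssbs dcpodp svei8mm svebf16 i8mm bf16 dgh rng";

// Split a whitespace-separated flag list into a sorted set.
std::set<std::string> parse_flags(const std::string& line) {
    std::istringstream in(line);
    std::set<std::string> flags;
    for (std::string flag; in >> flag;) {
        flags.insert(flag);
    }
    return flags;
}

// Flags present in `second` but missing from `first`, in sorted order.
std::vector<std::string> only_in_second(const std::set<std::string>& first,
                                        const std::set<std::string>& second) {
    std::vector<std::string> out;
    std::set_difference(second.begin(), second.end(),
                        first.begin(), first.end(),
                        std::back_inserter(out));
    return out;
}
```

Calling `only_in_second(parse_flags(kN1Flags), parse_flags(kV1Flags))` gives the V1-only flags, including `sve`, `i8mm`, `bf16`, `svei8mm` and `svebf16`; swapping the arguments gives an empty list, since every N1 flag above is also reported on V1.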

N.B. None of this code is mine: I'm just the guy having the pleasure of trying to build it on different hardware architectures :)

serge-sans-paille commented 9 months ago

Thanks for the bug report! Can you share a minimal C++ input that fails? I'd also be interested in the output of g++ -v [extra flags] failing_input.cpp

casparvl commented 9 months ago

Thanks for the fast response. I'd love to provide a minimal C++ input; maybe you can help me create one, since I'm unfamiliar with xsimd... The call where it fails is this one. Can you give me a minimal piece of code that would trigger this dispatcher.apply(...) call? I found https://xsimd.readthedocs.io/en/latest/api/dispatching.html but the code snippet there is not a self-contained example. I think the only thing it lacks is some actual data object to pass in before calling float res = dispatched(data, 17), but I wasn't sure what that data object should be exactly... If you could help me turn that into a self-contained example, I can try running it, and I'm giving it a 9/10 chance that it will trigger the bug.

Regarding the -v run: let's try that minimal example first, see if we can get rid of the complexity of all the related code. The -v for that will probably be much cleaner.

serge-sans-paille commented 8 months ago

This looks like the right reproducer: https://godbolt.org/z/E4KxKqcMP. I'll investigate some more.
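
The contents of the linked reproducer are not quoted in this thread. As a rough sketch of the kind of code involved, and not a copy of that reproducer: the error above is raised while converting a comparison result (`comp_return_type`) into a `batch_bool<float, ...>`, so a plain lane-wise comparison of two `xsimd::batch<float>` values is a plausible candidate for reaching the same kernel. That this exact comparison hits line 943 of `xsimd_neon.hpp` is an assumption. The helper names and the `REPRO_HAVE_XSIMD` macro are made up for illustration, and the scalar fallback exists only so the file compiles where xsimd is not installed.

```cpp
// xsimd is optional here so the file still compiles without it;
// the scalar fallback does NOT exercise the failing code path.
#if defined(__has_include)
#  if __has_include(<xsimd/xsimd.hpp>)
#    include <xsimd/xsimd.hpp>
#    define REPRO_HAVE_XSIMD 1
#  endif
#endif

#ifndef REPRO_HAVE_XSIMD
#  define REPRO_HAVE_XSIMD 0
#endif

// Hypothetical helper: true when lo < hi holds in every lane.
bool all_lanes_less(float lo, float hi) {
#if REPRO_HAVE_XSIMD
    // Broadcast each scalar into a batch for the default architecture.
    xsimd::batch<float> a(lo);
    xsimd::batch<float> b(hi);
    // The comparison yields a batch_bool; on NEON targets this goes through
    // the comparison kernels in xsimd_neon.hpp.
    xsimd::batch_bool<float> mask = a < b;
    return xsimd::all(mask);
#else
    return lo < hi;
#endif
}

// Reports which path was compiled in, to avoid mistaking the fallback
// for a successful xsimd build.
const char* repro_backend() {
    return REPRO_HAVE_XSIMD ? "xsimd" : "scalar fallback";
}
```

If this does reach the same kernel, the failure would show up at compile time rather than at run time, when building with xsimd on the include path and targeting Neoverse V1 (for example with GCC's `-mcpu=neoverse-v1`; the flags actually used by the failing build are not stated in this issue).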

casparvl commented 8 months ago

Good thinking on using godbolt for this, that was much quicker than going back and forth with me trying out a compilation natively :) Thanks for investigating this and the quick fix!