xtensor-stack / xsimd

C++ wrappers for SIMD intrinsics and parallelized, optimized mathematical functions (SSE, AVX, AVX512, NEON, SVE))
https://xsimd.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
2.2k stars 258 forks source link

Compile-time dispatch instead of switch-case #779

Closed wermos closed 2 years ago

wermos commented 2 years ago

As I pointed out in one of my messages in #775, there are many cases like this, where a switch-case is employed to make the decision of which intrinsic to call, even though this decision could be done at compile-time.

As part of our program to improve debug build performance, as @marzer suggested, and for improving performance overall, this should be changed to using some std::enable_if type thing.

JohanMabille commented 2 years ago

I think we should run benchmarks and see how / whether this hits performancs before making the implementation more complex

marzer commented 2 years ago

One other thing that can be done if you do not wish to move away from switches is to add an 'unreachable' marker to the default case on compilers that support such a construct, e.g MSVC's __assume(0) or gcc/clang's __builtin_unreachable().

should be changed to using some std::enable_if type thing

@wermos in this particular situation I suspect they'd be better off using some internal templated helper type and specializing it based on the register type, since they're already using std::enable_if to select for integral inputs and those conditionals would necessarily have to end up much more complex to filter the overload set. Something like:


namespace detail
{
    template <typename, std::size_t>
    struct simd_add
    {
        static_assert(false, "unsupported arch/op combination");
    }

    template <>
    struct simd_add<__m128i, 1>
    {
        static constexpr auto& func = ::_mm_add_epi8;
    }

    template <>
    struct simd_add<__m128i, 2>
    {
        static constexpr auto& func = ::_mm_add_epi16;
    }

    template <>
    struct simd_add<__m128i, 4>
    {
        static constexpr auto& func = ::_mm_add_epi32;
    }

    template <>
    struct simd_add<__m128i, 8>
    {
        static constexpr auto& func = ::_mm_add_epi64;
    }
}

template <class A, class T, class = typename std::enable_if<std::is_integral<T>::value, void>::type>
inline batch<T, A> add(batch<T, A> const& self, batch<T, A> const& other, requires_arch<sse2>) noexcept
{
    return detail::simd_add<batch<T, A>::register_type, T>::func(self, other);
}
serge-sans-paille commented 2 years ago

We already have a similar system for neon intrinsic, and from a maintenance perspective, a single switch is much easier to maintain / read... I'm personally not in favor of trading ease of maintenance with debug build speed.

wermos commented 2 years ago

One thing we could do, which I think would be very nice, is use if constexpr if __cplusplus >= 201703L, and otherwise, stick to the switch-case statement.

if constexpr would solve this problem at almost zero cost to ease-of-maintenance.

serge-sans-paille commented 2 years ago

I don't mind that, provided we can write a nice compat macro that performs either a switch or a cascade of constexpr if.

wermos commented 2 years ago

@serge-sans-paille what are your thoughts on replacing the existing switch statements with an if-clause? Granted, the readability of the code will be hurt a little, but doing so would make writing a macro for switching between if and if constexpr very easy.

Moreover, from what I could tell, it doesn't look like there is really a big difference performance-wise between switch and a cascade of ifs for such a small number of cases.

marzer commented 2 years ago

is use if constexpr if __cplusplus >= 201703L

@wermos Testing the whole C++ version isn't necessary; compilers should define __cpp_if_constexpr with a value >= 201606 when if constexpr is supported.

wermos commented 2 years ago

is use if constexpr if __cplusplus >= 201703L

@wermos Testing the whole C++ version isn't necessary; compilers should define __cpp_if_constexpr with a value >= 201606 when if constexpr is supported.

The problem with these feature test macros is that they were added in C++20, so it is not neccesarily defined if, for example, a compiler is fully C++17 compliant (meaning they have if constexpr) but have no C++20 support yet (meaning these feature test macros don't exist).

marzer commented 2 years ago

The problem with these feature test macros is that they were added in C++20

Are you certain? I've been using the compiler ones for a long time. You're conflating a few separate things, I believe.

More of both were ratified in C++20, of course, but I don't believe it's somehow retroactive. They've always been there.

marzer commented 2 years ago

GCC7, Clang 6, Visual studio 2017: https://godbolt.org/z/e8E59eK6h

wermos commented 2 years ago

The problem with these feature test macros is that they were added in C++20

Are you certain? I've been using the compiler ones for a long time. You're conflating a few separate things, I believe.

Oh, this is news to me. I always thought that feature test macros were a feature which was added in C++20 but wasn't there before that.

Do you have any links for the documentation of the feature test macros that GCC and Clang support? I looked around a bit but all I could find was GCC's documentation of the feature test macros in their C Standard Library.

marzer commented 2 years ago

Do you have any links [...]

I don't; the cppreference.com page has always been my go-to. It has been there since 2015, though, so I suspect what else may have actually changed in C++20 is "oh hey these are all actually official now, whereas they've been pseudo-standard up until this point", a bit like how #pragma once is 'non-standard' but supported everywhere that matters.

The feature test macros page in 2015: https://en.cppreference.com/mwiki/index.php?title=cpp/feature_test&oldid=77408

serge-sans-paille commented 2 years ago

I like the idea of cascading if. Has the test is trivially constant, compilers should have no difficulties folding it so the runtime cost at -O2 should be null.