xtensor-stack / xsimd

C++ wrappers for SIMD intrinsics and parallelized, optimized mathematical functions (SSE, AVX, AVX512, NEON, SVE)
https://xsimd.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Why do these two pieces of code differ so much in performance? #832

Closed VincentXWD closed 2 years ago

VincentXWD commented 2 years ago

I'm trying to compare xsimd against a naive implementation of the demo described in "Usage", and I noticed that these two ways of using SIMD registers differ wildly in performance. Here are the two implementations:

void mean_xsimd_static(const std::vector<int64_t>& a, const std::vector<int64_t>& b, std::vector<int64_t>& res) {
    using b_type = xsimd::batch<int64_t, xsimd::avx2>;
    std::size_t inc = b_type::size;
    std::size_t size = res.size();
    // size for which the vectorization is possible
    std::size_t vec_size = size - size % inc;

    for (std::size_t i = 0; i < vec_size; i += inc) {
        static b_type avec = b_type::load_unaligned(&a[i]);
        static b_type bvec = b_type::load_unaligned(&b[i]);
        static b_type rvec = (avec + bvec) / 2;
        rvec.store_unaligned(&res[i]);
    }
    // Remaining part that cannot be vectorized
    for (std::size_t i = vec_size; i < size; ++i) {
        res[i] = (a[i] + b[i]) / 2;
    }
}
void mean_xsimd_nonstatic(const std::vector<int64_t>& a, const std::vector<int64_t>& b, std::vector<int64_t>& res) {
    using b_type = xsimd::batch<int64_t, xsimd::avx2>;
    std::size_t inc = b_type::size;
    std::size_t size = res.size();
    // size for which the vectorization is possible
    std::size_t vec_size = size - size % inc;
    b_type avec;
    b_type bvec;
    b_type rvec;
    for (std::size_t i = 0; i < vec_size; i += inc) {
        avec = b_type::load_unaligned(&a[i]);
        bvec = b_type::load_unaligned(&b[i]);
        rvec = (avec + bvec) / 2;
        rvec.store_unaligned(&res[i]);
    }
    // Remaining part that cannot be vectorized
    for (std::size_t i = vec_size; i < size; ++i) {
        res[i] = (a[i] + b[i]) / 2;
    }
}

I noticed that marking these registers static inside the loop makes that function much faster than the other one. I tried comparing the generated assembly but couldn't find any useful clue. Could anyone explain this? Thanks a lot.

VincentXWD commented 2 years ago

I'd like to paste my full demo code here in case anyone is interested. You can try it with g++ main.cc -std=c++17 -march=core-avx2 -O3

#include <bits/stdc++.h>

#include <xsimd/xsimd.hpp>

namespace timer {

template <class Unit>
Unit run_with_timer(std::function<void()> fn) {
    auto start_time = std::chrono::steady_clock::now();
    fn();
    auto end_time = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::micro> us_double = end_time - start_time;
    std::cout << us_double.count() << std::endl;

    return std::chrono::duration_cast<Unit>(end_time - start_time);
}

} // namespace timer

namespace random_number_generator {

template <typename It>
struct xorshift32_state {
    It a;
};

/* The state word must be initialized to non-zero */
template <typename It>
It xorshift32(xorshift32_state<It>& state) {
    /* Algorithm "xor" from p. 4 of Marsaglia, "Xorshift RNGs" */
    uint64_t x = (uint64_t)state.a;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return state.a = x;
}

template <typename It>
void generate_fake_random_numbers(It seed, size_t size, std::vector<It>& vec) {
    xorshift32_state<It> state;
    state.a = seed;
    vec.resize(size);
    for (size_t i = 0; i < size; i++) {
        vec[i] = state.a;
        state.a = xorshift32(state);
    }
}

} // namespace random_number_generator

void mean(const std::vector<int64_t>& a, const std::vector<int64_t>& b, std::vector<int64_t>& res) {
    std::size_t size = res.size();
    for (std::size_t i = 0; i < size; ++i) {
        res[i] = (a[i] + b[i]) / 2;
    }
}

void mean_xsimd_nonstatic(const std::vector<int64_t>& a, const std::vector<int64_t>& b, std::vector<int64_t>& res) {
    using b_type = xsimd::batch<int64_t, xsimd::avx2>;
    std::size_t inc = b_type::size;
    std::size_t size = res.size();
    // size for which the vectorization is possible
    std::size_t vec_size = size - size % inc;
    b_type avec;
    b_type bvec;
    b_type rvec;
    for (std::size_t i = 0; i < vec_size; i += inc) {
        avec = b_type::load_unaligned(&a[i]);
        bvec = b_type::load_unaligned(&b[i]);
        rvec = (avec + bvec) / 2;
        rvec.store_unaligned(&res[i]);
    }
    // Remaining part that cannot be vectorized
    for (std::size_t i = vec_size; i < size; ++i) {
        res[i] = (a[i] + b[i]) / 2;
    }
}

void mean_xsimd_static(const std::vector<int64_t>& a, const std::vector<int64_t>& b, std::vector<int64_t>& res) {
    using b_type = xsimd::batch<int64_t, xsimd::avx2>;
    std::size_t inc = b_type::size;
    std::size_t size = res.size();
    // size for which the vectorization is possible
    std::size_t vec_size = size - size % inc;
    for (std::size_t i = 0; i < vec_size; i += inc) {
        static b_type avec = b_type::load_unaligned(&a[i]);
        static b_type bvec = b_type::load_unaligned(&b[i]);
        static b_type rvec = (avec + bvec) / 2;
        rvec.store_unaligned(&res[i]);
    }
    // Remaining part that cannot be vectorized
    for (std::size_t i = vec_size; i < size; ++i) {
        res[i] = (a[i] + b[i]) / 2;
    }
}

int main(int argc, char* argv[]) {
    size_t size = 200000;
    {
        std::vector<int64_t> a, b, c;
        random_number_generator::generate_fake_random_numbers((int64_t)114514, size, a);
        random_number_generator::generate_fake_random_numbers((int64_t)1919810, size, b);
        c.resize(size);
        auto mean_time = timer::run_with_timer<std::chrono::microseconds>([&]() { mean(a, b, c); });
    }
    {
        std::vector<int64_t> a, b, c;
        random_number_generator::generate_fake_random_numbers((int64_t)114514, size, a);
        random_number_generator::generate_fake_random_numbers((int64_t)1919810, size, b);
        c.resize(size);
        auto mean_simd_time = timer::run_with_timer<std::chrono::microseconds>([&]() { mean_xsimd_nonstatic(a, b, c); });
    }
    {
        std::vector<int64_t> a, b, c;
        random_number_generator::generate_fake_random_numbers((int64_t)114514, size, a);
        random_number_generator::generate_fake_random_numbers((int64_t)1919810, size, b);
        c.resize(size);
        auto mean_simd_time = timer::run_with_timer<std::chrono::microseconds>([&]() { mean_xsimd_static(a, b, c); });
    }
    return 0;
}
JohanMabille commented 2 years ago

static means the variable is initialized only the first time the statement is executed; after that, the variable keeps its value for the whole execution of the program (regardless of how many times the enclosing function is called). This means this function:

int test(const std::vector<int>& v)
{
    int res = 0;
    for (size_t i = 0; i < v.size(); ++i)
    {
        static int a = v[i];
        res += a;
    }
    return res;
}

is equivalent to the following:

int test(const std::vector<int>& v)
{
    int res = 0;
    int a = v[0];
    for (size_t i = 0; i < v.size(); ++i)
    {
        res += a;
    }
    return res;
}

The compiler is able to see that the value of a does not change and can optimize the loop away.

VincentXWD commented 2 years ago

Thanks for your kind reply, it helps a lot. I double-checked the results of the demo and noticed that the sum of the res vector from the static-decorated version differs from the others. I modified the main function like this:

int main(int argc, char* argv[]) {
    size_t size = 200000;
    {
        std::vector<int64_t> a, b, c;
        random_number_generator::generate_fake_random_numbers((int64_t)114514, size, a);
        random_number_generator::generate_fake_random_numbers((int64_t)1919810, size, b);
        c.resize(size);
        auto mean_time = timer::run_with_timer<std::chrono::microseconds>([&]() { mean(a, b, c); });
        int64_t res = 0;
        std::for_each(c.begin(), c.end(), [&](int64_t v) { res += v; });
        std::cout << "A: " << res << std::endl;
    }
    {
        std::vector<int64_t> a, b, c;
        random_number_generator::generate_fake_random_numbers((int64_t)114514, size, a);
        random_number_generator::generate_fake_random_numbers((int64_t)1919810, size, b);
        c.resize(size);
        auto mean_simd_time
                = timer::run_with_timer<std::chrono::microseconds>([&]() { mean_xsimd_nonstatic(a, b, c); });
        int64_t res = 0;
        std::for_each(c.begin(), c.end(), [&](int64_t v) { res += v; });
        std::cout << "B: " << res << std::endl;
    }
    {
        std::vector<int64_t> a, b, c;
        random_number_generator::generate_fake_random_numbers((int64_t)114514, size, a);
        random_number_generator::generate_fake_random_numbers((int64_t)1919810, size, b);
        c.resize(size);
        auto mean_simd_time = timer::run_with_timer<std::chrono::microseconds>([&]() { mean_xsimd_static(a, b, c); });
        int64_t res = 0;
        std::for_each(c.begin(), c.end(), [&](int64_t v) { res += v; });
        std::cout << "C: " << res << std::endl;
    }
    return 0;
}

And the output was:

2160.93
A: 4664051133466021320
13545.2
B: 4664051133466021320
778.351
C: 7917281090032531952

It seems that the three SIMD registers in the "static-in-loop" method are somehow reused by the instructions across iterations of the loop. Could you explain this? Thank you very much, it will help me learn xsimd better. @JohanMabille

JohanMabille commented 2 years ago

This is unrelated to xsimd; it is just the way static works, as said in my previous message:

static means the variable will be initialized the first time the statement is executed only, then the variable keeps its value for the whole execution of the program

Therefore, the SIMD registers are initialized in the first iteration of the loop and then keep the same values in all subsequent iterations.

If you run the two functions I've posted, you will notice they produce exactly the same result, illustrating the way static works.

VincentXWD commented 2 years ago

Oh, I see it now. That was a pretty naive question. Thank you for your help and patience. As for the performance gap between the SIMD and non-SIMD implementations, maybe the compiler has already optimized the scalar loop, whether through cache-friendly code or auto-vectorization, since this example is really simple. Thank you again for your explanation!

JohanMabille commented 2 years ago

As for the performance gap between the SIMD and non-SIMD implementations, maybe the compiler has already optimized the scalar loop, whether through cache-friendly code or auto-vectorization, since this example is really simple.

Indeed, in simple cases the compiler can optimize the code and emit the same assembly instructions as those generated by the intrinsic functions.