h-2 opened this issue 6 years ago (status: Open)
The function `end()` is at fault here: https://github.com/xxsds/sdsl-lite/blob/9930944f14965c4180e40f7acd5f368fd82a3329/include/sdsl/int_vector.hpp#L787

`m_size / m_width` is a relatively slow operation, since the compiler is not smart enough to infer that `m_width` never changes, which would allow it to optimize the division into a right shift.
Here is a benchmark using the slow `end()` function:
```cpp
#include <iostream>
#include <chrono>
#include <sdsl/bit_vectors.hpp>
#include <vector>

using namespace std;
using namespace sdsl;

void run_benchmark()
{
    size_t num_runs = 1000 * 1000;
    int_vector<64> values;
    values.reserve(num_runs);
    chrono::steady_clock::time_point begin = chrono::steady_clock::now();
    for (size_t i = 0; i < num_runs; i++)
    {
        values.amortized_resize(i + 1);
        // Uncomment to make it go fast.
        // values.m_width = 64;
        auto end = int_vector_trait<64>::end(&values, values.m_data, (values.m_size / values.m_width));
        *(end - 1) = i;
    }
    chrono::steady_clock::time_point end = chrono::steady_clock::now();
    double mean = chrono::duration_cast<chrono::nanoseconds>(end - begin).count() / (double)num_runs;
    cout << "On average, it took " << mean << " nanoseconds to push_back one value." << endl;
}

int main()
{
    // Run the benchmark a few times just to be sure.
    for (size_t i = 0; i < 9; i++)
    {
        cout << "Run " << i + 1 << ": ";
        run_benchmark();
    }
    return 0;
}
```
And here are the results on my computer:

```
Run 9: On average, it took 8.11565 nanoseconds to push_back one value.
```

And after uncommenting the line `values.m_width = 64;`, the compiler is smart enough to make the optimization:

```
Run 9: On average, it took 1.71234 nanoseconds to push_back one value.
```
I do not understand why `m_width` exists. It seems like it is initialized with `t_width` at the beginning and then never written to again. It might be a good idea to simply delete `m_width` and replace all occurrences with `t_width` instead.
I haven't double-checked this, but it seems like an easy fix.
@eseiler what do you think?
I just checked: the width can be dynamic (`t_width == 0`), so removing `m_width` is not an option.
However, replacing https://github.com/xxsds/sdsl-lite/blob/9930944f14965c4180e40f7acd5f368fd82a3329/include/sdsl/int_vector.hpp#L787 with the following should work:
```cpp
iterator end() noexcept
{
    if constexpr (t_width == 0) // dynamic width
        return int_vector_trait<t_width>::end(this, m_data, (m_size / m_width));
    else
        return int_vector_trait<t_width>::end(this, m_data, (m_size / t_width));
}
```
Sorry for the late reply, I was sick :(

Yes, we need both, as the bitvector may be dynamic. I'll try to get around to implementing the `if constexpr` and adding a benchmark (or running one from seqan3).
https://github.com/xxsds/sdsl-lite/pull/102

I managed to go from 9.2 ns to 2.8 ns.
@99991 Are you able to reproduce your times with my PR?
@eseiler I get around 3.5 ns with the linked benchmark, and 2 ns after replacing `sdsl::int_vector<64>` with `std::vector<int64_t>`.
Thanks for the feedback, @99991!

I tweaked it a bit more. The benchmark now uses random values, and I got rid of some expensive math functions in `amortized_resize`, which now also has one version for `t_width` and one for `m_width`.
I think I'll go ahead and merge #102, as it already improves the performance quite a lot. Having said that, I would be very curious whether you also see the improvements I saw from changing `amortized_resize`.
Another thing to test would be the `growth_factor`; there is a comment about it in the benchmark. Setting it to `2` often gets me close to 2 ns, while leaving it at the default (`1.5`) is never much faster than 2.5 ns.
One thing to note is that the benchmark is a microbenchmark, and even with a moderately sized vector (in the benchmark it's 64 million elements, so it should avoid most cache effects), the numbers fluctuate quite a bit.
Even if the entire amount of storage is pre-reserved (so no memory allocations happen), calling push_back on the sdsl vector is substantially slower than doing the same on a std::vector (up to 10x slower).

(The fourth column is the number of iterations performed in a fixed amount of time.)

I would have thought that, at least for the built-in integer types, this shouldn't be so noticeable?