h-2 opened this issue 6 years ago (status: Open)
The function `end()` is at fault here: https://github.com/xxsds/sdsl-lite/blob/9930944f14965c4180e40f7acd5f368fd82a3329/include/sdsl/int_vector.hpp#L787

`m_size / m_width` is a relatively slow operation, since the compiler is not smart enough to infer that `m_width` never changes, which would allow it to optimize the division into a right shift.
Here is a benchmark using the slow `end()` function:
```cpp
#include <iostream>
#include <chrono>
#include <sdsl/bit_vectors.hpp>
#include <vector>

using namespace std;
using namespace sdsl;

void run_benchmark()
{
    size_t num_runs = 1000 * 1000;
    int_vector<64> values;
    values.reserve(num_runs);
    chrono::steady_clock::time_point begin = chrono::steady_clock::now();
    for (size_t i = 0; i < num_runs; i++)
    {
        values.amortized_resize(i + 1);
        // Uncomment to make it go fast.
        // values.m_width = 64;
        auto end = int_vector_trait<64>::end(&values, values.m_data, (values.m_size / values.m_width));
        *(end - 1) = i;
    }
    chrono::steady_clock::time_point end = chrono::steady_clock::now();
    double mean = chrono::duration_cast<chrono::nanoseconds>(end - begin).count() / (double)num_runs;
    cout << "On average, it took " << mean << " nanoseconds to push_back one value." << endl;
}

int main()
{
    // Run the benchmark a few times just to be sure.
    for (size_t i = 0; i < 9; i++)
    {
        cout << "Run " << i + 1 << ": ";
        run_benchmark();
    }
    return 0;
}
```
And here are the results on my computer:

```
Run 9: On average, it took 8.11565 nanoseconds to push_back one value.
```

And after uncommenting the line `values.m_width = 64;`, the compiler is smart enough to make the optimization:

```
Run 9: On average, it took 1.71234 nanoseconds to push_back one value.
```
I do not understand why `m_width` exists. It seems like it is initialized with `t_width` at the beginning and then never written to again. It might be a good idea to simply delete `m_width` and replace all occurrences with `t_width` instead.
I haven't double-checked this, but it seems like an easy fix.
@eseiler what do you think?
I just checked: the width can be dynamic (`t_width == 0`), so removing `m_width` is not an option.
However, replacing https://github.com/xxsds/sdsl-lite/blob/9930944f14965c4180e40f7acd5f368fd82a3329/include/sdsl/int_vector.hpp#L787 with the following should work:
```cpp
iterator end() noexcept
{
    if constexpr (t_width == 0) // dynamic width
        return int_vector_trait<t_width>::end(this, m_data, (m_size / m_width));
    else
        return int_vector_trait<t_width>::end(this, m_data, (m_size / t_width));
}
```
Sorry for the late reply, I was sick :(

Yes, we need both, as the bitvector may be dynamic. I'll try to get around to implementing the `if constexpr` and adding a benchmark (or running one from seqan3).
https://github.com/xxsds/sdsl-lite/pull/102

I managed to go from 9.2 ns to 2.8 ns.
@99991 Are you able to reproduce your times with my PR?
@eseiler I get around 3.5 ns with the linked benchmark, and 2 ns after replacing `sdsl::int_vector<64>` with `std::vector<int64_t>`.
Thanks for the feedback, @99991!

I tweaked it a bit more. The benchmark now uses random values, and I got rid of some expensive math functions in `amortized_resize`, which now also has one version for `t_width` and one for `m_width`.
I think I'll go ahead and merge #102, as it already improves the performance quite a lot. Having said that, I would be very curious whether you also see the improvements I saw from changing `amortized_resize`.
Another thing to test would be the `growth_factor`; there is a comment about it in the benchmark. Setting it to `2` often gets me close to 2 ns, while leaving it at the default (`1.5`) is never much faster than 2.5 ns.
One thing to note is that the benchmark is a microbenchmark, and even with a moderately sized vector (in the benchmark it's 64 million elements, so it should avoid most cache effects), the numbers fluctuate quite a bit.
Even if the entire amount of storage is pre-reserved (so no memory allocations happen), calling push_back on the sdsl vector is substantially slower than doing the same on a std::vector (up to 10x slower).

(The fourth column is the number of iterations performed in a fixed amount of time.)

I would have thought that, at least for the built-in integer types, this shouldn't be so noticeable?