xxsds / DYNAMIC

Dynamic succinct/compressed data structures
MIT License
111 stars 20 forks source link

Batch construction of wavelet trees #13

Closed chrisbarber closed 5 years ago

chrisbarber commented 5 years ago

Orthogonal to #5, it would be nice to have (and probably easier to implement) batch construction of wavelet trees which is agnostic to the underlying bitvector implementation. While still making serial inserts (push back) to the individual bitvectors, acceleration is possible as follows:

Given a string of symbols to batch-create a new wavelet tree, first scan the string and encode all the symbols using the chosen alphabet encoder. Build an "empty" wavelet tree with nodes for each of the encoded symbols. For each node, precompute the symbols which belong to its partition, as well as whether those symbols belong to its left or right child. Now, the bitvectors of each node can be populated in parallel by iterating symbols in the string, and:

Since the empty tree is pre-constructed, and the bitvectors are all independent, there is no synchronization to worry about, and all threads can read from one copy of the string in memory. Run one thread per node up to a max number of concurrent threads.

Conceptually it is that simple but there are a few things to optimize, e.g. you can wait to create a node and initiate its thread until coming across the first symbol belonging to it, so you don't actually have to pre-scan the string first.

This would also work for push_back_many or even insert_many, where you have threads per node that scan the whole vector of symbols to push back, or the pairs of (symbol, position) to insert, and only do the push back or insert if the symbol belongs to its partition.

nicolaprezza commented 5 years ago

Thanks for the suggestion! yes, it would surely work nicely. Multi-threading is feature that would be very nice to add to the library. I don't have much time to sped on it though, so if you do go ahead, I'll happily merge.

chrisbarber commented 5 years ago

This does offer a pretty good speedup, but I suspect there might be more to be gained by implementing batch construction into the lower-levels e.g. spsi or packed_vector. Could you perhaps advise on a strategy for doing that? Could you also point to a reference for what spsi actually implements? (I see a couple papers that look right, but it would take some time for me to be sure as I am unfamiliar with the general domain.)

nicolaprezza commented 5 years ago

Awesome, thanks! spsi implements a B-tree where the leaves are packed vectors. The best reference I can think of is the library's paper (https://core.ac.uk/download/pdf/84869324.pdf).

Packed_vector is simply a vector where all integers are packed in a fixed number of bits each. This should be easy to parallelize: instead of calling many push backs, I think you could push an entire word containing several pre-packed integers.

nicolaprezza commented 5 years ago

Whoops, I've been too eager to merge: it doesn't compile :-/ does this work for you? can you fix it?

from DYNAMIC's main folder:

mkdir build cd build cmake .. make

err.txt