xxsds / DYNAMIC

Dynamic succinct/compressed data structures
MIT License
111 stars 20 forks source link

Efficient batch construction/insert for rle_str #5

Open adamnovak opened 8 years ago

adamnovak commented 8 years ago

It would be useful (I think) to have a way to make a rle_str from a vector<int>, or to insert a whole vector<int> into an rle_str at some position.

Would breaking those operations out as batch operations make them any more efficient than just inserting each value one at a time? I don't know much about the internals, but I suspect a workload of inserts all at the very end, or all one after the other, might produce tree balancing problems.

nicolaprezza commented 8 years ago

Thanks for the suggestion Adam!

In bitvectors batch inserts would take just O(log n + k) time, where k is the number of bits inserted. Tree balancing is not a problem because we would need to rebalance the tree (log n time) only every log n inserted bits (i.e. leaf size), so this cost would be amortized over the k insertions. In wavelet-tree strings this would be multiplied by log sigma. In rle_str this could probably also be done efficiently, but it's more tricky because we need to do multiple batch inserts on all the gap-encoded bitvectors corresponding to characters in the inserted vector.. I'm not sure about the complexity of doing this. I'll think about it.

While writing the library I was thinking about doing this, but since my main goal was to implement dynamic RLE FM-indexes I abandoned the idea (because in this case inserts are done at random positions and not in batches). I'll add it to the TODO list.

ekg commented 8 years ago

@nicolaprezza are you still working on RLE FM-indexes? That sounds really cool!

nicolaprezza commented 8 years ago

Thanks Erik! yes, the RLE FM-index in this library is fully operative and I have already used it to implement a couple of algorithms to compute LZ77 in compressed space. If you are interested in RLE indexes, take a look also to my github.com/nicolaprezza/lz-rlbwt repository: this one is a RLE FM-index combined with LZ77 to achieve compression also of the suffix-array sampling.