xxsds / DYNAMIC

Dynamic succinct/compressed data structures
MIT License
111 stars 20 forks source link

dynamic wavelet matrix #21

Closed ekg closed 4 years ago

ekg commented 4 years ago

This pulls in @MitI-7's DynamicWaveletMatrix, rebases it on DYNAMIC's succinct bitvector, and partially matches the interface of wt_string.

It provides a fixed-width alphabet encoding requiring log(sigma) * N bits to encode N integers. I attempted to use it as a basis for the dynamic FM-index, but ran into problems that I assume relate to the fact that I'm not using an alphabet encoder like that in wt_string. However, I do pass the tests that @MitI-7 developed for the original implementation (these are in wm_string.cpp).

How can we improve this?

ekg commented 4 years ago

It's faster for access and most operations (except insert) than the original implementation:

build/wm_string | head -7

original:

SPEED
access:1357ms
rank:1551ms
select:2014ms
insert:1407ms
erase:2448ms
update:4574ms

This PR:

SPEED         
access:261ms  
rank:341ms    
select:1205ms 
insert:1534ms 
erase:1945ms  
update:4035ms 
nicolaprezza commented 4 years ago

thanks!! the next natural step would be to implement 2D range counting and range reporting to turn this into a dynamic 2D range search data structure.

ekg commented 4 years ago

Out of curiosity, I'd also like to make it interchangeable with the wt_string. The alphabet encoding issue is a curiosity. As would be getting the matrix to be Huffman shaped.

On Sat, May 30, 2020, 19:43 Nicola Prezza notifications@github.com wrote:

Merged #21 https://github.com/xxsds/DYNAMIC/pull/21 into master.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/xxsds/DYNAMIC/pull/21#event-3389925203, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQENLOVM2YIJFYTLLSZLRUFASVANCNFSM4NOYR5SQ .

nicolaprezza commented 4 years ago

Using other encodings with wavelet matrices is possible but trickier than wavelet trees: not any prefix-free code is good because in the matrix layout you don't want 'holes' to appear inside the bitvectors. What you need is to move the holes on the right as follows: if you want to assign k codes of length L (the lengths are those of Huffman, for example), then you sort all available binary codes of length L (i.e. those not having as prefix a shorter code) in decreasing co-lexicographic order, and pick the first k. This is described in chapter 6.2.5, paragraph "Compression" of Gonzalo's book "Compact Data Structures: A Practical Approach".

Unfortunately, Elias gamma and delta do not satisfy the above constraint, therefore those codes are not portable to a wavelet matrix. However, one could modify the Huffman alphabet encoder to return the codes described above, so that wm_string and wm_matrix are interchangeable at least on Huffman and fixed-length.