xxsds / DYNAMIC

Dynamic succinct/compressed data structures
MIT License
111 stars 20 forks source link

Various minor improvements #8

Closed ekg closed 5 years ago

ekg commented 5 years ago

I've implemented a kind of locally compressed integer vector using spsi.hpp as a template. The only difference is that I removed the partial sums, as these are irrelevant for use in a pure integer vector context with only access/insert/remove functions.

I found that using packed_vector to store the partial sums and totals in the spsi structure resulted in a very substantial slowdown, and so I have reverted to std::vector<uint64_t> for these. This change did decrease memory usage of the wt_str when working with large alphabets, but at a very great cost.

The optimal solution for handling large alphabets would be a dynamic wavelet matrix, as we have discussed. I had begun to write up an implementation, but have not completed the design (it'd be wm_string.hpp). Assistance is welcome, as I don't know when I'll get to work on this. A dynamic wavelet matrix has many applications and it's a shame that there isn't a publicly available implementation of one.

nicolaprezza commented 5 years ago

Hi Erik, thanks! yes, I expect that using vector in spsi improves speed, but my concern is space usage on small alphabets: does it increase a lot? I remember I was also initially using vector, but the tree's space was too large. That is why I switched to packed_vector. Probably the impact of the tree isn't too big on large alphabets, but what about small alphabets? do you have benchmarks?

ekg commented 5 years ago

Benchmarks with dankgraph suggests that this makes things (virtually all ops) 2x faster without any significant change in memory usage.

That said, I'm only using a small alphabet wavelet tree now. I found an encoding of the graph path space that tends to have a small alphabet as long as the graph and node IDs are partially ordered, which is often the case for large regions of variation graphs built from real data.

On Sat, Feb 2, 2019, 09:00 Nicola Prezza <notifications@github.com wrote:

Hi Erik, thanks! yes, I expect that using vector in spsi improves speed, but my concern is space usage on small alphabets: does it increase a lot? I remember I was also initially using vector, but the tree's space was too large. That is why I switched to packed_vector. Probably the impact of the tree isn't too big on large alphabets, but what about small alphabets? do you have benchmarks?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/xxsds/DYNAMIC/pull/8#issuecomment-459948612, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EZKqv_d0dLNWe3G-u7x7vZT5pP5rks5vJVOSgaJpZM4aeqpd .

nicolaprezza commented 5 years ago

Wow, if speed doubles without increasing space then the change is most welcome :-) I'll also run a few tests with the compression algorithms included in the library on various alphabets. If space stays under control then I'll happily merge!

nicolaprezza commented 5 years ago

Hi Erik,

great: I did some benchmarks on DNA and code (C++) alphabets. Speed improves by a factor of 1.3 and working space essentially remains the same :-) I wonder why I didn't use vectors in the first place. Thanks! merging.