Correct minor errors with const-ification

ekg commented 5 years ago

This resolves some mistakes I made when const-ifying the library.

The main problem was in the alphabet encoder. To check if the character exists, we need to query the encoding map using the appropriate query, or else we will segfault.

@nicolaprezza I do have some questions about the encoder class. The use of a map to hold the encodings seems problematic for large alphabets where we are only using each character a handful of times. This will lead to the alphabet encoding itself taking a lot of space.

nicolaprezza commented 5 years ago

Thanks Erik. The library was designed mainly for small alphabets (where it makes sense to compress using entropy/rle): this is the reason why I didn't optimize the alphabet encoder for large alphabets. Your fix is welcome, I'm gonna merge it now. However, this does not solve entirely the problem on large alphabets since wavelet trees are implemented using pointers (it is possible to avoid pointers by using a single bitvector). This choice reduces the number of rank/select queries needed to navigate the WT topology, but introduces a O(sigma*log n)-bit overhead in the space usage. Have you tested the library on large alphabets after your modifications? if yes, what is the space usage?

ekg commented 5 years ago

I did test the library on an alphabet of around 1M and found that the space usage decreased by around 50%. This corresponded to what I observed--- that the std::vector<uint64_t>'s in the spsi tree were consuming 2/3 of the memory.

I did achieve further improvements in my application by reducing the alphabet size through a relativistic encoding, so I appreciate that the compression offered by these kinds of wavelet trees is driven by alphabet size. The application is a succinct dynamic version of a variation graph: https://github.com/vgteam/dankgraph.

However, this does not solve entirely the problem on large alphabets since wavelet trees are implemented using pointers (it is possible to avoid pointers by using a single bitvector). This choice reduces the number of rank/select queries needed to navigate the WT topology, but introduces a O(sigma*log n)-bit overhead in the space usage.

I would be very interested in exploring this change. In some cases, I can control the alphabet size. However, in the generic case it will be hard to do this without many hacks. Do you think this would be simple to try? It would be a more significant fork. Perhaps I should do so by copying the wt_string class and working there?

nicolaprezza commented 5 years ago

I did test the library on an alphabet of around 1M and found that the space usage decreased by around 50%. This corresponded to what I observed--- that the std::vector<uint64_t>'s in the spsi tree were consuming 2/3 of the memory.

very nice! I never tested the library on alphabets larger than ASCII. Thanks!

I did achieve further improvements in my application by reducing the alphabet size through a relativistic encoding, so I appreciate that the compression offered by these kinds of wavelet trees is driven by alphabet size. The application is a succinct dynamic version of a variation graph: https://github.com/vgteam/dankgraph.

However, this does not solve entirely the problem on large alphabets since wavelet trees are implemented using pointers (it is possible to avoid pointers by using a single bitvector). This choice reduces the number of rank/select queries needed to navigate the WT topology, but introduces a O(sigma*log n)-bit overhead in the space usage.

I would be very interested in exploring this change. In some cases, I can control the alphabet size. However, in the generic case it will be hard to do this without many hacks. Do you think this would be simple to try? It would be a more significant fork. Perhaps I should do so by copying the wt_string class and working there?

Yes, It would be a very nice improvement to the library! Maybe it is not too much work: you just need to represent each node as an interval on the bitvector, and replace navigation to parent/children nodes with operations on those intervals. It could be probably implemented by abstracting the node type (so you don't need to touch the WT-algorithms themselves). I would like to keep the existing wt_string class, because I think the new modification will slightly slow-down operations (due to more rank/selects). If you decide to work on it, could you create a new class? many thanks!

ekg commented 5 years ago

I may be misunderstanding something about the encoding you're describing. We would still need a way to encode the topology of the wavelet tree itself. And we would want a bitvector to delimit ranges in the backing bitvector. So the tree would be packed into two bitvectors (one backing and one delimiting) and an integer vector (encoding the topology mapping into the delimiting vector). Does this make sense? Were you considering a different encoding?

nicolaprezza commented 5 years ago

Your solution does work, but it uses two times the optimal space. It is actually possible to use just one bitvector using "wavelet matrices". These are essentially a way of concatenating the wavelet tree's bitvectors so that the topology can be inferred during navigation using no extra space. You can find an excellent description of wavelet matrices in Navarro's book "Compact Data Structures", Section 6.2.5. Using Huffman compression with wavelet matrices is slightly trickier, but still possible (it's described at the end of the section). If you don't have the book, these are some useful references (from the book):

Claude et. al. The wavelet matrix: an efficient wavelet tree for large alphabets, Information Systems, 2015.
Makinen V. and Navarro G. Rank and select revisited and extended, TCS, 2007
Gagie et al. Efficient and compact representation of prefix codes, IEEE TIT, 2015

xxsds / DYNAMIC

Correct minor errors with const-ification #7