C++ implementation: riblet

Hi Lei,

In case you are interested, I coded up a C++ implementation of RIBLT, packaged as a command-line app: https://github.com/hoytech/riblet

Rather than aiming to use it for online syncing, I'm trying to apply it to offline diffing/syncing for bulk data sets. It seems to work pretty well, and I'm excited for the potential!

My implementation is totally from scratch, so I did a bunch of things differently from your reference implementation. One feature I think is important is the ability to not have to use the priority queue when the number of symbols to generate is fixed/known in advance (suppose you know an upper bound on the number of differences to expect). This lets you construct the coded symbols without having to consume memory proportional to the number of input records.

I have a lot of ideas and comments, but I do have one question top-of-mind that I imagine you'd have some insight on:

When we're generating a stream of coded symbols, we'd expect to be able to peel this stream (and therefore recover the input) after about 1.35x the number of input symbols. But it could be more or could be less. Is there any way to know, as you're generating the symbols, when you've done just enough so that the coded symbol stream is "peelable" (without of course just trying to peel it)?

Thanks!

Doug

yangl1996 / riblt

C++ implementation: riblet #4