wiedi / node-bloem

Bloom Filter for node.js using the FNV hash function
MIT License

stringify method for persistence? #5

Open loveencounterflow opened 9 years ago

loveencounterflow commented 9 years ago

i'm using bloem to quickly test whether a given key may already have been inserted into a database. For this to work properly, i need to persist the state of the bloom filter; right now i'm doing essentially this:

BSON        = ( require 'bson' ).BSONPure.BSON
bloom_bfr   = BSON.serialize old_bloom
# ... write to storage ...
# ... later, read from storage ...
bloom_data  = BSON.deserialize bloom_bfr
# now we have to repair the deserialized data:
for filter in bloom_data[ 'filters' ]
  bitfield              = filter[ 'filter' ][ 'bitfield' ]
  bitfield[ 'buffer' ]  = bitfield[ 'buffer' ][ 'buffer' ]
new_bloom   = BLOEM.ScalingBloem.destringify bloom_data

While i'm taking advantage of bson's ability to efficiently serialize buffers, the solution suffers from the strange property of bson that it insists on deserializing into a slightly different format from what you gave it (IOW you don't get round-trip invariance as soon as a buffer is involved; i have no idea what that could be good for).

This seems to work, but it leaves open the question of what the recommended way of persisting a node-bloem filter is. Also, one might add that the destringify method has a confusing name, since it does not accept a string but a suitably prepared JS object.

wiedi commented 9 years ago

would functions that serialize/deserialize a filter object to/from Buffer be useful?

loveencounterflow commented 9 years ago

That is exactly my question (and sorry to be sort of late here). Ideally it would be as simple as using JSON, e.g. bloom_bfr = BLOEM.stringify old_bloom and new_bloom = BLOEM.parse bloom_bfr (modulo another choice of names for those methods, and/or attaching the stringify method to instances, not to the library).
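To make the idea concrete, here is a minimal sketch of what such a stringify/parse pair could look like on top of JSON. The function names and the shape of the stand-in filter object are assumptions for illustration, not bloem's API or internal layout. It relies only on the fact that Node serializes a Buffer as `{ type: 'Buffer', data: [...] }`, which a `JSON.parse` reviver can turn back into a real Buffer.

```javascript
// Hypothetical helpers: these names are NOT part of bloem's API.
function stringify(filter) {
  // Buffer.prototype.toJSON makes JSON.stringify emit
  // { type: 'Buffer', data: [...] } for every Buffer it meets.
  return JSON.stringify(filter);
}

function parse(text) {
  // The reviver turns those tagged objects back into real Buffers.
  return JSON.parse(text, (key, value) =>
    value && value.type === 'Buffer' && Array.isArray(value.data)
      ? Buffer.from(value.data)
      : value
  );
}

// Stand-in for a filter's state; bloem's real layout may differ.
const original = {
  size: 16,
  slices: 2,
  bitfield: { buffer: Buffer.from([0x81, 0x04]) },
};
const restored = parse(stringify(original));

console.log(Buffer.isBuffer(restored.bitfield.buffer));
console.log(restored.bitfield.buffer.equals(original.bitfield.buffer));
```

With a reviver like this, the Buffer comes back as a Buffer, so no manual repair step is needed after deserializing.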

The primary use case of this is of course to allow using a given filter over an existing collection across locations and across process lifetimes, which IMHO is actually the reason to use a Bloom filter at all. If you can't store and re-instantiate a Bloom filter you're pretty much limited to whatever you can do within the lifetime of a single process.

wiedi commented 9 years ago

So currently you can serialize a filter to a JSON string with:

var f = new bloem.Bloem(8, 2)
var persist_this = JSON.stringify(f)

To deserialize use:

var f = bloem.Bloem.destringify(JSON.parse(persisted_thing))
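The round trip above can be wrapped into a pair of small helpers that write the JSON string to disk and read it back, which is roughly what persisting across process lifetimes needs. This is only a sketch: `saveFilter`, `loadFilter`, and the temp-file name are hypothetical, and the demo uses a plain object in place of a real filter so that it runs without bloem installed.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helpers, not part of bloem.
function saveFilter(file, filter) {
  fs.writeFileSync(file, JSON.stringify(filter), 'utf8');
}

// `revive` turns the parsed object back into a filter; with bloem this
// would be the destringify call quoted above.
function loadFilter(file, revive) {
  return revive(JSON.parse(fs.readFileSync(file, 'utf8')));
}

// Demo with a plain object standing in for a filter.
const file = path.join(os.tmpdir(), 'bloem-demo-' + process.pid + '.json');
const standIn = { size: 8, slices: 2 };
saveFilter(file, standIn);
const loaded = loadFilter(file, (obj) => obj);
fs.unlinkSync(file); // clean up the demo file
console.log(loaded);
```

With a real filter, the call would presumably be `loadFilter(file, bloem.Bloem.destringify)`, passing the same function used in the snippet above.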

I agree that the destringify name is confusing. I am open to better name suggestions.

I also have ideas about a binary format (so serialize to Buffer), but if this is not what you need (because you're happy with JSON), I will hold off on implementing that until I need it.

loveencounterflow commented 9 years ago

So I tested your suggestions and they seem to work. That said, i believe that still leaves open some questions:

And yes, the method names are confusing; i'd suggest either BLOEM.stringify and BLOEM.parse (as JSON does) or BLOEM.serialize and BLOEM.deserialize (the more logical choice).

As for the BSON part and the question of 'going binary', i've since thrown out that part already upon learning that JSON suffices. It's just another dependency in the end with some annoying properties and an undocumented API change that made me lose time.

Whether a truly binary format is needed would appear to hinge on whether it could be faster and/or smaller than new Buffer JSON.stringify bloom_filter plus whatever optimization (like Gzip or LevelDB's compression) can offer.
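One way to get a feel for that trade-off is to measure it. The sketch below compares the size of a raw bit array with its JSON form, with and without gzip, using only Node's built-in zlib. The 8 KiB buffer with a sparse bit pattern is an arbitrary stand-in, not a real bloem filter, so the numbers only illustrate the method; it uses Buffer.from since new Buffer(...) is deprecated in current Node versions.

```javascript
const zlib = require('zlib');

// Stand-in for a filter's bit array: 8 KiB with a few bits set.
const bits = Buffer.alloc(8192);
for (let i = 0; i < 200; i++) {
  bits[(i * 37) % bits.length] |= 1 << (i % 8);
}

// JSON form, as bytes (the modern spelling of `new Buffer JSON.stringify ...`).
const asJson = Buffer.from(JSON.stringify({ bitfield: bits }));

const sizes = {
  raw: bits.length,                        // a plain binary dump
  json: asJson.length,                     // JSON text
  jsonGzip: zlib.gzipSync(asJson).length,  // JSON text, gzipped
  rawGzip: zlib.gzipSync(bits).length,     // binary dump, gzipped
};
console.log(sizes);
```

The JSON text is necessarily larger than the raw bytes, since every byte becomes at least one digit plus a comma; how much of that gap compression recovers depends on how full the filter is, so it is worth running this against real data.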