yf0994 / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0
0 stars 0 forks source link

Provide method to efficiently combine Bloom filters #1134

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
My application creates Bloom filters to represent its state at various points 
in time.  I want to combine these to create a view of all points in time.  
Presently I keep filters in memory and combine them with Predicates.or.  This 
approach wastes memory because I duplicate many elements between filters.  I 
want to add BloomFilter.combine which bitwise-ORs the underlying BitArray.  I 
can combine most filters without saturation with knowledge of the number of 
total elements and care with choosing expectedInsertions.  I pushed a fix to 
GitHub:

https://github.com/andrewgaul/guava-libraries/tree/combine-bloom-filter

Original issue reported on code.google.com by arg...@gmail.com on 5 Sep 2012 at 1:12

GoogleCodeExporter commented 9 years ago
Just FYI, we *always* discuss the merit of the API change extensively before 
even looking at code.

Original comment by wasserman.louis on 5 Sep 2012 at 3:25

GoogleCodeExporter commented 9 years ago
Regarding the change itself...I like the idea of being able to union Bloom 
filters, but I'm not sure how the method should behave when it receives Bloom 
filters of different sizes.  Should we only expect the same size in bits, or 
should we force both the expectedInsertions and fpp to be the same?  My reflex 
is to throw an IAE when trying an invalid union, but is that the right response?

Original comment by wasserman.louis on 5 Sep 2012 at 2:22

GoogleCodeExporter commented 9 years ago
We must throw IAE for filters with mismatched sizes, fpp, or funnels.  We might 
need to give an additional accessor for bits.size().  The referenced patch has 
a canCombine predicate although this is not as useful as the accessor.

Original comment by arg...@gmail.com on 5 Sep 2012 at 4:24

GoogleCodeExporter commented 9 years ago
My thinking is that if you're testing canCombine at runtime, your code is 
already buggy?  You should only be trying to combine Bloom filters for the same 
set with the same parameters in any event, and you should know when you're 
writing the code which Bloom filters are compatible.

Original comment by wasserman.louis on 5 Sep 2012 at 4:29

GoogleCodeExporter commented 9 years ago
I want to support an increasing total number of insertions over time.  I plan 
to use powers of two for expectedInsertions, such that filters will fall into a 
few known sizes.  I prefer an accessor for filter size over a canCombine 
predicate, although the latter is sufficient for my application.

Original comment by arg...@gmail.com on 5 Sep 2012 at 4:39

GoogleCodeExporter commented 9 years ago
I pushed a new patch which exposes BloomFilter.size and hides canCombine:

https://github.com/andrewgaul/guava-libraries/pull/3

I would very much like Guava to accept this since otherwise I need to package 
my own version of Guava.  Any hints on whether this is likely to land upstream?

Original comment by arg...@gmail.com on 28 Sep 2012 at 12:44

GoogleCodeExporter commented 9 years ago
I like this idea too -- we've got some ongoing work with the BF API, so I'll 
grab this and see how we can make this work.

Original comment by kak@google.com on 1 Oct 2012 at 11:40

GoogleCodeExporter commented 9 years ago
I'd suggest to rethink the naming.

In case you provide `size`, it should be clear what units get used (something 
like "sizeInLong")?

The name "combine" says nothing about the operation. OK, it's union, as nothing 
else works, but why not be explicit? Moreover, it could create a new filter 
instead of modifying the first operand.

So I'd suggest to call it `addAll` like all collections do. But there's "put" 
instead of "add", so it must be actually `putAll`. But then I don't know how to 
call "canCombine".

So maybe the current names are the best, I just wanted to bring this up.

Original comment by Maaarti...@gmail.com on 5 Oct 2012 at 9:14

GoogleCodeExporter commented 9 years ago
@kak: this was accepted back in October. Any updates? We'd really like to see 
this merged for Guava 14.0. The author of the patch in #6 is a colleague and 
we've signed the contributor agreement paperwork already, if that's an issue.

Original comment by diwa...@maginatics.com on 12 Feb 2013 at 7:37

GoogleCodeExporter commented 9 years ago
Hi Diwaker,

We haven't made these changes yet, so it won't make it into Guava 14.  We can 
try to prioritize it for Guava 15 though.

-kak

Original comment by kak@google.com on 13 Feb 2013 at 12:17

GoogleCodeExporter commented 9 years ago
Thanks, Kak. It's unfortunate that this can't make it into Guava 14 -- Guava 
releases are infrequent enough that waiting for Guava 15 might not be an option 
for us. Nevertheless, if you can go through the existing patch and give us 
feedback, we can have it ready for inclusion whenever Guava devs are ready to 
consume it. Thanks!

Original comment by diwa...@maginatics.com on 13 Feb 2013 at 12:40

GoogleCodeExporter commented 9 years ago
Hi Kak,

> We haven't made these changes yet, so it won't make it into Guava 14.  We can 
try to prioritize it for Guava 15 though.

Now that Guava 14 is out, we'd like to see this merged for Guava 15. We're more 
than happy to help with testing and updating the patch if there are any 
comments or suggestions.

Original comment by diwa...@maginatics.com on 28 Feb 2013 at 6:38

GoogleCodeExporter commented 9 years ago
Hi Diwaker,

Thanks again for your patience. I'm working on this today, but like Martin, I 
have some hesitation about the proposed instance method naming.

You've proposed the verb "combine'. In our C++ version, we're using the verb 
"merge". I'm not sure which is better (or if something like union is even 
better?)

Since this is going to be an instance method on BloomFilter which mutates the 
current instance (instead of a static method which returns a new instance), I'm 
not sure combine is the right method name.

Maybe combineWith? mergeWith?

Anyone have any thoughts here?

Original comment by k...@kloover.com on 28 Feb 2013 at 9:38

GoogleCodeExporter commented 9 years ago
Both combineWith and mergeWith sound better. I have a slight preference for the 
latter -- mergeWith is clearer in indicating that we're mutating this instance 
IMO.

Original comment by diwa...@maginatics.com on 28 Feb 2013 at 9:56

GoogleCodeExporter commented 9 years ago
I do not have an opinion on naming this method.  addAll, combineWith, 
mergeWith, unionWith, or whatever I originally submitted work for me.  The 
important bit is to include the functionality upstream so Maginatics can stop 
using an internal fork.

Original comment by arg...@gmail.com on 28 Feb 2013 at 10:06

GoogleCodeExporter commented 9 years ago
OK, sounds good...this is out for code review now and will definitely make 
Guava 15 (should be available at HEAD within a few days).  Only thing standing 
in it's way now is the naming issue (which we're discussing internally now).

Original comment by kak@google.com on 28 Feb 2013 at 10:41

GoogleCodeExporter commented 9 years ago
Just about to submit this change.  We're gonna go with BloomFilter#mergeWith 
for now, but we might change it before Guava 15.0 is released (so if anyone 
wants to make a strong argument for another name, please speak up!)

Original comment by kak@google.com on 1 Mar 2013 at 2:12

GoogleCodeExporter commented 9 years ago
Great, thanks Kurt!

Original comment by arg...@gmail.com on 1 Mar 2013 at 2:16

GoogleCodeExporter commented 9 years ago
Excellent. Thanks Kak for the quick turnaround!

Original comment by diwa...@maginatics.com on 1 Mar 2013 at 2:16

GoogleCodeExporter commented 9 years ago
https://code.google.com/p/guava-libraries/source/detail?r=a1858510da0d99dfc3e362
2741d2134fb22e18c1

Original comment by kak@google.com on 1 Mar 2013 at 5:50

GoogleCodeExporter commented 9 years ago
Kurt, can you also expose a BloomFilter.size method which calls BitArray.size?  
Callers want to know whether they should expect mergeWith to succeed.  In the 
Maginatics use case, we have BloomFilters of various power of two sizes and we 
merge only compatible sizes.  This was present in the pull request in comment 
#6.

Original comment by arg...@gmail.com on 1 Mar 2013 at 6:16

GoogleCodeExporter commented 9 years ago
@argaul: Sure - adding BloomFilter#size right now!  Should be pushed out next 
week.

Original comment by kak@google.com on 1 Mar 2013 at 7:38

GoogleCodeExporter commented 9 years ago
Would it make sense to have a "canMerge(BloomFilter)" method (in addition to 
size() perhaps), since there are quite a few things besides just the size that 
have to be the same in order to merge?

Original comment by cgdec...@gmail.com on 1 Mar 2013 at 8:19

GoogleCodeExporter commented 9 years ago
Colin: I fully agree that there should be such a method, especially because of 
the internals of the implementation being rather hidden. Given the name 
"mergeWith" I'd suggest to call it "canMergeWith".

Actually, combining BF of different sizes can make some sense, too. Because of 
the only requirement on BF being that it returns no false negatives, it's 
clearly possible. When the size ratio is something like 1:2, it even works 
fine. With "ugly" ratios, the result is completely unusable. Sticking with 
powers of two (which is IMHO a good idea anyway) would enforce "nice" ratios. 
The implementation is quite straightforward, but I doubt it's worth it - I just 
wanted to bring it up in case somebody could need it one day.

Original comment by Maaarti...@gmail.com on 1 Mar 2013 at 10:52

GoogleCodeExporter commented 9 years ago
Added BF#canMergeWith(BF)...should be migrated externally within a few days.

Original comment by kak@google.com on 4 Mar 2013 at 7:12

GoogleCodeExporter commented 9 years ago
Can you make BloomFilter.size public instead of package private?  I want to 
call this method to separate BloomFilters into groups which I later call 
mergeWith on.

Original comment by arg...@gmail.com on 19 Mar 2013 at 6:40

GoogleCodeExporter commented 9 years ago
@argaul: Hmmm, is there a reason BF#isCompatible(BF) won't work for you?

Original comment by kak@google.com on 19 Mar 2013 at 6:42

GoogleCodeExporter commented 9 years ago
I have many BloomFilters of varying sizes so O(n**2) isCompatible calls is not 
ideal.  I guess if callers choose only a few sizes, e.g., power of 2, the worst 
case has a smaller bound.

Original comment by arg...@gmail.com on 19 Mar 2013 at 6:55

GoogleCodeExporter commented 9 years ago
@argaul: but currently we only allow for merging of BFs with the same exact 
size...so if you are trying to separate the BFs into groups, the merge will 
fail later if they are not isCompatible() == true.

Original comment by kak@google.com on 19 Mar 2013 at 7:11

GoogleCodeExporter commented 9 years ago
This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:13

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:08