zcwease / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0
0 stars 0 forks source link

BloomFilter<E> #12

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
A BloomFilter is a useful data structure.

  http://en.wikipedia.org/wiki/Bloom_filter

Proposed straw-man API follows.  It shares only four methods in common with
Collection, and adds two of its own.

/**
 * A probabilistic "shadow" of a set of elements, useful
 * when the set itself would be too expensive to maintain in
 * memory and query directly. A Bloom filter can give false
 * positivites, but never false negatives. That is, adding
 * an element to the filter guarantees that {@link
 * #mightContain} will return {@code true}, but {@link
 * #mightContain} returning {@code true} does not guarantee
 * that this element was ever actually added to the filter.
 */
public final class BloomFilter<E> implements Serializable {
  /**
   * Returns {@code true} if it is <i>possible</i>
   * (probability nonzero) that {@code element} is contained
   * in the set represented by this Bloom filter.  If this
   * method returns {@code false}, this element is
   * <i>definitely</i> not present. If it {@code true}, the
   * probability that this element has <i>not</i> actually
   * been added is given by {@link
   * #getFalsePositiveProbability()}.
   */
  public boolean mightContain(@Nullable E element) { ... }

  /**
   * Returns the probability that {@link #mightContain} will
   * return {@code true} for an element not actually
   * contained in this set.
   */
  public double getFalsePositiveProbability() { ... }

  /**
   * Adds {@code newElement} to this Bloom filter, so that
   * {@code mightContain(newElement)} is now guaranteed to
   * return {@code true}.
   *
   * @return true if the Bloom filter changed as a result of
   * this call
   */
  public boolean add(@Nullable E newElement) { ... }

  // self-explanatory methods:
  public boolean addAll(Iterable<? extends E> elements) { ... }
  public boolean isEmpty() { ... }
  public void clear() { ... }
}

Original issue reported on code.google.com by kevin...@gmail.com on 23 Oct 2007 at 4:02

GoogleCodeExporter commented 9 years ago
Hypothetical use cases:

1. Wikipedia uses an example of a spell-checker; if you put legal words in a 
Bloom
filter, you will be sure not to call something an error that isn't, but you 
could
miss flagging certain misspellings on occasion.

2. A "screen" so that you can optimize away an expensive rpc/db query in case 
the
Bloom filter gives you a definitive "no".

3. Suppose you want to display a special message to customers that are in your
special set of "most valued".  But you don't mind if that message also gets 
shown to
some other folks as well.  You could put all your valued customers into a
BloomFilter<Customer>, and whenever this filter "mightContain" the logged-in
customer, show the message.

There may be practical reasons why a Bloom filter is or isn't a good choice in 
any of
these situations, but hopefully these are useful as illustrations.

Note: a Bloom filter requires good hashing to work properly, not just a crummy 
32-bit
hash code from Object.hashCode().  So this enhancement request depends on us
providing some better support for hashing.

Original comment by kevin...@gmail.com on 30 Oct 2007 at 11:06

GoogleCodeExporter commented 9 years ago

Original comment by kevin...@gmail.com on 3 Nov 2007 at 5:33

GoogleCodeExporter commented 9 years ago

Original comment by kevin...@gmail.com on 2 Jun 2008 at 5:48

GoogleCodeExporter commented 9 years ago

Original comment by kevin...@gmail.com on 17 Sep 2009 at 6:02

GoogleCodeExporter commented 9 years ago
I've just taken a look at the interface and thought I'd respond with some 
initial 
thoughts.

There is another method of the Collection interface that could be usefully 
adopted:

boolean mightContainAll(Collection<?> c) //equivalent of containsAll()

Implementing size() is a possibility, by counting the number of adds that 
mutate the 
filter, but probably not worthwhile; it should be possible (better actually) to 
compute the falsePositiveProbability without this count and to use the number 
of bits 
set within the filter instead.

Extending the Collection interface directly is of little benefit because the 
iterator() method cannot be supported; such a Collection implementation would 
interact very poorly with any of the standard collection classes. Where it 
would 
provide a small benefit is that the BloomFilter interface could avoid 
overloading 
'batch' operations with versions that take another bloom filter:

boolean addAll(BloomFilter<E> filter)

boolean mightContainAll(BloomFilter<E> filter) //containsAll would be a more 
accurate 
in this case

These last two operations are clearly only defined over compatible filters, but 
are 
important because they are very efficient. But the conditions under which two 
BloomFilters are compatible is not a straightforward issue since the filters 
must 
match in capacity (number of bits) and in their hashing. Accessors over these 
would 
enable two different BloomFilter implementations to determine if they are 
compatible 
to each other. For example:

int getCapacity()
Hash getHash() //serializablity requires hash generators be serializable also

And this is not sufficient because access to the actual bitset would also be 
necessary, it might be necessary to replace the capacity accessor with a method 
that 
exposes the filter's state through a specific bitset object:

Bitset getBits()

Whether this Bitset would be an interface or concrete I find difficult to 
determine. 
What are the benefits of all this?

 * The BloomFilter 'union' operation (addAll) and 'subset' test (containsAll) become 
possible.
 * Exposure of the filter bits allows an application to persist the filter outside of 
standard java serialization.
 * Equality of BloomFilters becomes defined (as equality of bitset and hash).
 * BloomFilter.hashCode() can be specified too.

Whether this complexity is worthwhile is an open question, but my view is that, 
if 
you are looking to introduce BloomFilter as an interface rather than a concrete 
class, it's necessary to consider the conditions under which the instances of 
two 
different BloomFilter implementations are equal (as the java.util package 
attempts to 
do for collections). I also know that the first two capabilities in this list 
have 
been vital in one of my applications.

Original comment by tomgibara on 17 Mar 2010 at 10:34

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 30 Jul 2010 at 3:53

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 30 Jul 2010 at 3:56

GoogleCodeExporter commented 9 years ago
Support for set operations (determining the union, intersection or difference 
without exposing bitsets) over bloom filters would be useful.

Original comment by reachb...@gmail.com on 11 Aug 2010 at 11:50

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 27 Jan 2011 at 1:54

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 27 Jan 2011 at 1:59

GoogleCodeExporter commented 9 years ago
A very useful data structure indeed.

Original comment by jon.h.clark on 7 Feb 2011 at 5:31

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 13 Jul 2011 at 6:18

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 16 Jul 2011 at 8:37

GoogleCodeExporter commented 9 years ago
There has been a lot of progress on this and we may see it in release 12, to 
put a guess out there.

Original comment by kevinb@google.com on 19 Jul 2011 at 2:24

GoogleCodeExporter commented 9 years ago
Any chance we could get this into Git, if not exposed, so I can play with it?

Original comment by wasserman.louis on 6 Dec 2011 at 4:09

GoogleCodeExporter commented 9 years ago
Since recently it became more or less feature complete (i.e. serialization), I 
wouldn't mind exposing it as-is in @Beta. Kevin?

Original comment by jim.andreou on 6 Dec 2011 at 5:10

GoogleCodeExporter commented 9 years ago
Seems like there is no roadblock to making this public (@Beta of course), along 
with the new hashing abstractions. Just a matter of someone actually doing it

Original comment by jim.andreou on 7 Dec 2011 at 2:01

GoogleCodeExporter commented 9 years ago
I get the impression folks are really busy, though, so I'd settle for 
open-sourcing it without exposing the API yet.

Original comment by wasserman.louis on 7 Dec 2011 at 2:37

GoogleCodeExporter commented 9 years ago

Original comment by fry@google.com on 10 Dec 2011 at 3:12

GoogleCodeExporter commented 9 years ago

Original comment by wasserman.louis on 31 Dec 2011 at 5:17

GoogleCodeExporter commented 9 years ago
Didn't want to create a separate ticket for this question.

'put' has Key-Value association semantics by convention and 'add' has 
Set-element-addition semantics, (both in Java at least).

I am wondering why the BloomFilter implementation uses 'put' instead of 'add' 
even though the original API draft at the top of this page says 'add' a 
possible 'addAll'.

Original comment by vaya...@gmail.com on 7 Apr 2012 at 12:05

GoogleCodeExporter commented 9 years ago
The proposed addAll(BloomFilter<T>) method (alternately called "union") is 
essential for my needs, which involve populating several compatible filters on 
separate computers and then combining the (deserialized) filters back into one. 
At present, there's no way build up one filter from one or more other filters 
(despite the copy() method being provided).

Original comment by sehar...@gmail.com on 14 Jun 2012 at 2:37

GoogleCodeExporter commented 9 years ago
It's not clear to me that such a method could be defined, depending on the 
definition of "compatible" filters.

Original comment by wasserman.louis on 14 Jun 2012 at 2:39

GoogleCodeExporter commented 9 years ago
Comment #5 above touches on this challenge of deducing that two filters are 
compatible. I suspect that you won't buy this stance, but I'd be happy enough 
with trusting the caller. If the caller doesn't understand well enough the 
invalid consequences of pouring one bag of bits into another, there's not much 
we can do for him. I'd rather see it become /possible/ to perform this 
operation than continue to ban it, just because it's vulnerable to misuse.

Original comment by sehar...@gmail.com on 14 Jun 2012 at 4:53

GoogleCodeExporter commented 9 years ago
Hmmm.  How would you feel about throwing an exception if the target filter 
wasn't initialized with the same parameters?

Original comment by wasserman.louis on 14 Jun 2012 at 4:58

GoogleCodeExporter commented 9 years ago
Throwing an exception (IllegalArgumentException) is a reasonable response to 
trying to merge in a filter constructed with different parameters. The addAll 
(or merge() or union()) method would then involve first verifying equivalent 
configuration of the incoming filter and, upon confirming compatibility, ORin 
in its underlying bit vector, right?

Original comment by sehar...@gmail.com on 14 Jun 2012 at 5:44

GoogleCodeExporter commented 9 years ago
Yep, that's what I'd expect.

Original comment by wasserman.louis on 14 Jun 2012 at 5:46

GoogleCodeExporter commented 9 years ago
This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:16

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:10