yf0994 / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0

Performance of ImmutableSet.contains #1155

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
When an `ImmutableSet` is filled with objects whose hashCodes fill a 
contiguous range, the performance of an unsuccessful lookup degrades severely 
(milliseconds for a single lookup). While every hash-based algorithm 
degenerates for some input, here it happens far too easily.

http://stackoverflow.com/questions/12573463/performance-of-guavas-immutableset-contains

The problem is caused by (1) a large contiguous filled area of the table and 
(2) linear reprobing. Any lookup starting near the beginning of such a 
contiguous area takes a very long time, as it has to traverse all the way to 
its end.
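
For reference, a minimal reproduction along the lines of the linked Stack Overflow question -- the set size and the probed value below are arbitrary illustrative choices, and the slowdown applies to the pre-fix `Hashing.smear`:

    import com.google.common.collect.ImmutableSet;

    public class ImmutableSetWorstCase {
        public static void main(String[] args) {
            // Integers hash to themselves, so consecutive values yield a
            // contiguous range of hash codes.
            ImmutableSet.Builder<Integer> builder = ImmutableSet.builder();
            for (int i = 0; i < (1 << 20); i++) {
                builder.add(i);
            }
            ImmutableSet<Integer> set = builder.build();

            // An unsuccessful lookup whose probe starts inside the contiguous
            // filled area has to walk it to the end under linear probing.
            long start = System.nanoTime();
            boolean found = set.contains(-1);
            System.out.println("contains(-1) = " + found
                + " took " + (System.nanoTime() - start) + " ns");
        }
    }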

#Preventing a contiguous filled area

The hash code is processed in `Hashing.smear` by a function that propagates 
bits to the right only. This helps prevent collisions when the higher bits get 
masked away.

But there are no collisions here. What happens in the extreme case (of the set 
containing N consecutive integers where N is a power of two)
is that all entries land in the same half of the table, thus resulting in an 
area of N slots without any hole to terminate the search.

With a smearing which propagates bits also to the left this could be avoided. A 
modification like

    static int smear(int hashCode) {
        hashCode ^= (hashCode >>> 20) ^ (hashCode >>> 12);
        hashCode ^= (hashCode >>> 7) ^ (hashCode >>> 4);
        hashCode *= MULTIPLIER;  // MULTIPLIER: some odd constant, discussed below
        return hashCode;
    }

helps here a lot. Funnily enough, MULTIPLIER=2 seems to work perfectly, as it 
makes every other slot empty. Obviously, an even multiplier is wrong, as it 
increases the number of collisions in other cases.

For any given set size some "optimal" MULTIPLIER can be found, e.g. 0x80001 
works nicely for set size = 500000, but not for the other two sizes. Randomly 
chosen odd values work reasonably well on the average. Instead of the 
multiplication, some xors and/or additions with left shifts could be used, but 
the multiplication seems to be faster.

Obviously, the number of collisions for the expression `hashCode & 
(table.length-1)` doesn't change with the final multiplication, as 
multiplication by an odd number is a bijective operation, even when restricted 
to some number of least significant bits. So it doesn't undo the improvement 
achieved by the original `smear`.
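
A quick way to convince yourself of that bijectivity argument (the multiplier and table size here are arbitrary illustrative values):

    import java.util.BitSet;

    public class OddMultiplierBijection {
        public static void main(String[] args) {
            int multiplier = 0x80001;   // any odd constant; this one is mentioned above
            int tableLength = 1 << 16;  // power-of-two table, as in RegularImmutableSet
            int mask = tableLength - 1;

            BitSet seen = new BitSet(tableLength);
            for (int h = 0; h < tableLength; h++) {
                seen.set((h * multiplier) & mask);
            }
            // An odd multiplier permutes the residues modulo 2^k, so every
            // masked value is produced exactly once.
            System.out.println(seen.cardinality() + " distinct indices out of " + tableLength);
        }
    }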

#Better reprobing

IMHO, linear reprobing will always amplify any problems caused by hash 
collisions or a bad distribution. I'd suggest combining it with double hashing: 
after some number of failed attempts to find a free slot, jump to some other 
part of the table. This has absolutely no impact when everything works fine 
and can be a huge win when there are problems. It's quite simple (maybe 50 
lines).

Original issue reported on code.google.com by Maaarti...@gmail.com on 27 Sep 2012 at 10:48

GoogleCodeExporter commented 9 years ago
This is legit, and while I know a lot of work went into ImmutableSet, I do 
think this demonstrates that it's still too easy to trigger the worst case of 
linear probing.

I'm currently experimenting with the following smearing function:

    hashCode ^= Integer.rotateRight(hashCode, 20) ^ Integer.rotateRight(hashCode, 12);
    hashCode ^= Integer.rotateRight(hashCode, 7) ^ Integer.rotateRight(hashCode, 4);
    hashCode *= 0x5BD1E995;
    return hashCode;

(The multiplier constant is borrowed from murmur hash, which I suspected would 
probably yield good results.)

I'm experimenting with quadratic probing at the moment with this smear 
function, just because I want to avoid additional branches, and quadratic 
probing retains pretty good locality.
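
For context, the triangular-number flavour of quadratic probing that works with power-of-two tables looks roughly like this -- a sketch of the technique, not the actual `RegularImmutableSet` code:

    // Probe offsets 0, 1, 3, 6, 10, ...: the i-th step moves i slots further.
    // With a power-of-two table this sequence visits every slot, so the scan
    // still terminates at the first empty slot it meets.
    static boolean containsQuadratic(Object target, Object[] table, int smearedHash) {
        int mask = table.length - 1;
        int index = smearedHash & mask;
        for (int step = 1; ; step++) {
            Object candidate = table[index];
            if (candidate == null) {
                return false;               // empty slot: target is absent
            }
            if (target.equals(candidate)) {
                return true;
            }
            index = (index + step) & mask;  // triangular-number increment
        }
    }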

Original comment by lowas...@google.com on 27 Sep 2012 at 10:58

GoogleCodeExporter commented 9 years ago
What's the latest rehash function being used in ConcurrentHashMapV8.java and so 
forth?  Doug changes it often and we might improve just by trying whatever he's 
doing nowadays.

I also think double hashing sounds very worth trying -- all those bits sitting 
there in the upper reaches of the "smeared" hash code, unused.  (Maybe double 
hashing isn't the right term here, not sure, perhaps this is cuckoo hashing 
with N=2.)

Original comment by kevinb@google.com on 27 Sep 2012 at 11:30

GoogleCodeExporter commented 9 years ago
From 
http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166e/ConcurrentHashMapV8.java?revision=1.62&view=markup :

    /**
     * Spreads higher bits to lower, and also forces top 2 bits to 0.
     * Because the table uses power-of-two masking, sets of hashes
     * that vary only in bits above the current mask will always
     * collide. (Among known examples are sets of Float keys holding
     * consecutive whole numbers in small tables.)  To counter this,
     * we apply a transform that spreads the impact of higher bits
     * downward. There is a tradeoff between speed, utility, and
     * quality of bit-spreading. Because many common sets of hashes
     * are already reasonably distributed across bits (so don't benefit
     * from spreading), and because we use trees to handle large sets
     * of collisions in bins, we don't need excessively high quality.
     */
    private static final int spread(int h) {
        h ^= (h >>> 18) ^ (h >>> 12);
        return (h ^ (h >>> 10)) & HASH_BITS;
    }

But honestly, part of the issue is that linear or quadratic open addressing is 
vulnerable not just to exact hash collisions, but to hashes that differ only in 
relatively low bits.  The smearing methods don't do anything at all to help 
there, and that's the issue that needs addressing.

(I'm pretty strongly inclined to keep linear or quadratic probing for the first 
couple of probes at least, just to take advantage of memory locality.)

The experimental smearing method I proposed above is doing pretty well in 
benchmarks at the moment, which is nice -- and it's doing well with both linear 
and quadratic probing.  Will keep y'all updated.

Original comment by lowas...@google.com on 27 Sep 2012 at 11:42

GoogleCodeExporter commented 9 years ago
@Louis: Note that unlike the right shifts, the rotations give you no guarantee 
that "hashCodes that differ only by constant multiples at each bit position 
have a bounded number of collisions". Or maybe they do, but I can't see how.

> and quadratic probing retains pretty good locality.

I don't think so. It has good locality in the first few steps (which is good as 
you quite often stop there), but on the average it's quite bad. Assuming the 
probe sequence
0, 1, 3, 6, 10, 15
the distances are
1, 2, 3, 4, ...
which rarely hits the same cache line on the next step.

The classical double hashing has distances
N, N, N, ...
(where N is derived from the hash code and must be odd for power of two table 
sizes)
which for most values of N never hits the same cache line on the next step.

What I'd propose is using distances
1, 1, 1, ..., 1, N, 1, 1, 1, ..., 1, N, ....
which most of the time hits the same cache line on the next step. Moreover, in 
the beginning it's as efficient as the linear reprobing.

In order for the probe sequence to cover the whole table it's sufficient that

- the block length K is a power of two
- (K-1) + N is an odd multiple of K

Explanation: A block consists of K-1 ones and a single N, so the distance 
travelled after one block is (K-1)*1 + 1*N. Moving by an odd multiple of the 
block length guarantees that all blocks get visited (as an odd number is 
co-prime to the number of blocks which is a power of two).
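
A sketch of that step sequence; the block length K and the way N is derived from the hash are illustrative choices, not a tested patch:

    // K - 1 linear steps of size 1, then one jump of size N, repeated.
    // K is a power of two and N is chosen so that (K - 1) + N is an odd
    // multiple of K, which makes the sequence cover the whole table.
    static final int K = 8;  // block length (illustrative)

    static int nextIndex(int index, int step, int jump, int mask) {
        int distance = (step % K == K - 1) ? jump : 1;
        return (index + distance) & mask;
    }

    // One way to derive a suitable jump from the smeared hash: forcing
    // jump % (2 * K) == 1 makes (K - 1) + jump an odd multiple of K.
    static int jumpFor(int smearedHash) {
        return ((smearedHash >>> 16) & ~(2 * K - 1)) | 1;
    }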

@Kevin:
> I also think double hashing sounds very worth trying -- all those bits 
sitting there in the upper reaches of the "smeared" hash code, unused.
> (Maybe double hashing isn't the right term here, not sure, perhaps this is 
cuckoo hashing with N=2.)

Agreed, while most hashing algorithms use only `log2(table.length)` bits, both 
double and cuckoo hashing can use nearly twice as many.
This means that for tables bigger than 2**16, all bits get used -- somehow.

I think you've meant double hashing, as cuckoo works quite differently.

Actually, I've just got a different idea, very simple and maybe optimal for the 
issue I described. I'll try it out and report later.

Original comment by Maaarti...@gmail.com on 29 Sep 2012 at 7:24

GoogleCodeExporter commented 9 years ago

Here is the idea I've already announced: during `construct`, record the 
maximum number of probes needed and store it in a field `maxProbes`. As there 
are no collisions in the case I reported, `maxProbes` equals 1, and thus 
`RegularImmutableSet.contains` can return false after having inspected a 
single slot, so it works perfectly.
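
Roughly, the idea looks like this -- a sketch assuming the existing linear probing, not the attached patch; the names are made up:

    // During construction, count the slots inspected while placing each element
    // and keep the maximum in maxProbes. At lookup time, any present element
    // must sit within maxProbes slots of its home index, so the scan can give
    // up after that many slots instead of walking to the next empty one.
    static boolean containsWithMaxProbes(
            Object target, Object[] table, int smearedHash, int maxProbes) {
        int mask = table.length - 1;
        int index = smearedHash & mask;
        for (int i = 0; i < maxProbes; i++) {
            Object candidate = table[index];
            if (candidate == null) {
                return false;
            }
            if (target.equals(candidate)) {
                return true;
            }
            index = (index + 1) & mask;  // same linear probing as construction
        }
        return false;  // nothing was ever placed further than maxProbes slots away
    }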

However, there's a similar problem involving a lot of collisions, and then it 
can't help at all (see the attachment). The benchmark is here:

https://dl.dropbox.com/u/4971686/published/maaartin/guava/ImmutableSetBenchmark1.java

Note that with `collide==true` there are exactly 2 items per slot, but 
`maxProbes` gets big due to linear reprobing. You may want to make `absent` 
shorter in case it aborts due to the caliper time limit.

I think I've got a nice solution for all these problems (soon to come).

Original comment by Maaarti...@gmail.com on 2 Oct 2012 at 1:41

Attachments:

GoogleCodeExporter commented 9 years ago
> Here is the idea I've already announced: during `construct`, record the 
maximum number of probes needed and store it in a field `maxProbes`. As there 
are no collisions in the case I reported, `maxProbes` equals 1, and thus 
`RegularImmutableSet.contains` can return false after having inspected a 
single slot, so it works perfectly.

I'm not seeing how this would be doable without significantly complicating the 
construction method -- we don't even remember hash codes at all, at the moment.

> Note that unlike the right shifts, the rotations give you no guarantee that 
"hashCodes that differ only by constant multiples at each bit position have a 
bounded number of collisions". Or maybe they do, but I can't see how.

Hrrrrm.  I know you're quoting the HashMap source, but I don't really 
understand what  "differ only by constant multiples at each bit position" could 
possibly mean.  Do the bits differ by only a constant multiple?  Of what?

I'm going to poke around Google for advice, maybe.

Original comment by lowas...@google.com on 2 Oct 2012 at 11:35

GoogleCodeExporter commented 9 years ago
I observe that the JDK's IdentityHashMap uses open addressing, and its 
"smearing" function looks like

        int h = System.identityHashCode(x);
        // Multiply by -127, and left-shift to use least bit as part of hash
        return ((h << 1) - (h << 8)) & (length - 1);

Worth exploring, perhaps?

Original comment by lowas...@google.com on 2 Oct 2012 at 11:49

GoogleCodeExporter commented 9 years ago
> I'm not seeing how this would be a doable thing without significant increased 
complication in the construction method...

It's trivial and very cheap, just counting the iterations in the inner loop, 
see the patch.

But in general, there's a tradeoff with unknown parameters: storing the hashes 
would speed up reprobing, and so would spending more time in `smear` and/or in 
the construction. Is there any chance of finding out, e.g., how many `contains` 
calls per `ImmutableSet` there are on average in your real-world programs?

> > > Note that unlike the right shifts, the rotations...

> Hrrrrm.  I know you're quoting the HashMap source, but I don't really 
understand what  "differ only by constant multiples at each bit position" could 
possibly mean.  Do the bits differ by only a constant multiple?  Of what?

I can only guess what they mean, too. But whatever property it is, it can't be 
destroyed by a final multiplication, as there's no additional flow from the 
higher bits to the lower ones. An initial multiplication (i.e. before the right 
shifts) is a different case, as the two together create such a flow. It's hard 
to explain what I mean, as it's based on my fuzzy understanding of their stated 
goal.

> I'm going to poke around Google for advice, maybe.

You mean personally? I couldn't google out anything.

> I observe that the JDK's IdentityHashMap... Worth exploring, perhaps?

I don't think so. The multiplication puts nearby hashes far apart, which is a 
good thing because of the linear reprobing, but that's about all. The hash 
distribution is different and the cost of reprobing much lower.

Original comment by Maaarti...@gmail.com on 3 Oct 2012 at 1:08

Attachments:

GoogleCodeExporter commented 9 years ago
> I'm going to poke around Google for advice, maybe.

Perhaps a more accurate statement would be

> I'm going to poke around Googlers for advice, maybe.

In any event, I'm getting the impression that we agree that an initial 
multiplication strictly helps -- it's still possible to cause the worst-case, 
of course, but it requires being more deliberate about it.  Should I go ahead 
and try to push ahead on that change while we continue the discussion on the 
other changes?

Original comment by wasserman.louis on 3 Oct 2012 at 8:49

GoogleCodeExporter commented 9 years ago
> In any event, I'm getting the impression that we agree that an initial 
multiplication strictly helps...

Yes. Currently, I know of no better way to find the cases where the 
multiplication is detrimental than brute-forcing it.

It's very cheap on modern CPUs:

    StandardSmearer    1.71
    MultiplyingSmearer 1.97

As it is very cheap and improves the distribution, it might even speed up the 
search in commonly occurring cases, too.

> Should I go ahead and try to push ahead on that change while we continue the 
discussion on the other changes?

It depends on the probability that someone gets hit by this problem in the 
meantime. That's something I can't tell.

Original comment by Maaarti...@gmail.com on 4 Oct 2012 at 9:57

GoogleCodeExporter commented 9 years ago
This small program illustrates how I understand what "differ only by constant 
multiples at each bit position" means:
https://dl.dropbox.com/u/4971686/published/maaartin/guava/SmearerDemo.java

According to it:
- final multiplication is good
- multiplication before the shifts is bad
- replacing the second two shifts by rotations is good
- using rotations only is bad
- the simplified shifting from ConcurrentHashMapV8 is bad

But the program only counts collisions assuming closed hashing. Secondary 
collisions due to rehashing are a very different beast.
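
The linked demo isn't reproduced here, but the kind of measurement it describes is roughly the following sketch; the smear variants and the key set are illustrative, and only primary collisions under the mask are counted:

    import java.util.BitSet;
    import java.util.function.IntUnaryOperator;

    public class CollisionCountSketch {
        // Right-shift-only smearing (assumed to match the original form).
        static int smearShifts(int h) {
            h ^= (h >>> 20) ^ (h >>> 12);
            return h ^ (h >>> 7) ^ (h >>> 4);
        }

        // The same shifts followed by an odd multiplier, as proposed above.
        static int smearShiftsThenMultiply(int h) {
            return smearShifts(h) * 0x5BD1E995;
        }

        // Count keys whose masked index is already taken ("closed hashing" view:
        // reprobing effects are deliberately not modeled).
        static int collisions(int[] keys, int tableLength, IntUnaryOperator smear) {
            int mask = tableLength - 1;
            BitSet used = new BitSet(tableLength);
            int collisions = 0;
            for (int key : keys) {
                int index = smear.applyAsInt(key) & mask;
                if (used.get(index)) {
                    collisions++;
                } else {
                    used.set(index);
                }
            }
            return collisions;
        }

        public static void main(String[] args) {
            int tableLength = 1 << 16;
            int[] keys = new int[1 << 15];
            for (int i = 0; i < keys.length; i++) {
                keys[i] = i << 16;  // keys that vary only in bits above the mask
            }
            System.out.println("shifts only:       "
                + collisions(keys, tableLength, CollisionCountSketch::smearShifts));
            System.out.println("shifts + multiply: "
                + collisions(keys, tableLength, CollisionCountSketch::smearShiftsThenMultiply));
        }
    }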

Original comment by Maaarti...@gmail.com on 11 Oct 2012 at 9:55

GoogleCodeExporter commented 9 years ago
I'm putting this down as fixed by 
https://code.google.com/p/guava-libraries/source/detail?r=69ad96b719d7cd3d872a948d7454f17b816a21c2 .

Even if the new smearing function can be defeated, it's at least much more 
difficult to make it blow up accidentally.  (Incidentally, it also beats the 
old Hashing.smear in benchmarks.)

Original comment by lowas...@google.com on 30 Oct 2012 at 7:20

GoogleCodeExporter commented 9 years ago

Original comment by lowas...@google.com on 30 Oct 2012 at 7:21

GoogleCodeExporter commented 9 years ago
This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:13

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:08