daurnimator opened this issue 4 years ago (status: Open)
perfect hashing algorithm to be in the standard library
Based on https://andrewkelley.me/post/string-matching-comptime-perfect-hashing-zig.html
I think it would be an awesome thing to have built into the language, e.g.

```zig
switch (target) : (hashFn) { // or some way to set which hash fn to use at comptime
    "one" => ...,
    "two" => ...,
    // etc...
}
```
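Concretely, a switch like this could lower, at comptime, to a search for a hash seed that maps every case string to a distinct slot. Here is a sketch of that seed search in Python for brevity (the function names are mine; this is the brute-force idea from the linked blog post, not a production algorithm like CHD or BBHash):

```python
def fnv1a(seed: int, s: str) -> int:
    """FNV-1a with a variable seed; the seed is what we search over."""
    h = seed
    for byte in s.encode():
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h

def find_perfect_seed(keys):
    """Try seeds until every key lands in a distinct slot in [0, len(keys))."""
    n = len(keys)
    seed = 1
    while True:
        slots = {fnv1a(seed, k) % n for k in keys}
        if len(slots) == n:  # no collisions: this seed is perfect for the key set
            return seed
        seed += 1

keys = ["one", "two", "three", "four"]
seed = find_perfect_seed(keys)
# slot -> key table; a compiler would then emit a jump table over the slots
table = {fnv1a(seed, k) % len(keys): k for k in keys}
```

Because there are exactly as many slots as keys, the result is a minimal perfect hash over this fixed set; the search cost is paid once (at comptime, in the proposal above), while each lookup is a single hash plus a table index.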
Some useful links:
BBHash, go-bbhash, and rust-boomphf, based on this paper: "Fast and scalable minimal perfect hashing for massive key sets".
And rust-phf uses the CHD algorithm.
Or constructing a compile time trie...
Attempt №2
My goals: autoPerfectHash and autoPerfectHashMap.
```zig
const ok = for (cases) |c2| {
    if (std.meta.eql(c, c2)) break true;
} else false;
```
This looks like the expensive comparison that perfect hashing is meant to avoid?
All questions to the creator. :smile:
I hope this loop is executed at comptime only.
In Andrew's blog post, that loop is wrapped in `if (std.debug.runtime_safety)`:

```zig
if (std.debug.runtime_safety) {
    const ok = for (strs) |str| {
        if (std.mem.eql(u8, str, s))
            break true;
    } else false;
    if (!ok) {
        std.debug.panic("attempt to perfect hash {} which was not declared", s);
    }
}
```
i.e. it's not included in ReleaseFast/ReleaseSmall.
For `stringToEnum` you wouldn't want that loop in there either: `return null` when the element is not in the perfect hash.
> `return null` when the element is not in the perfect hash.
Is this feasible? How could the perfect hash know when something is not one of the original set of values?
As an example, if it turns out that `in` and `not_in` both hash to the same value with the chosen perfect hashing algorithm, wouldn't it just treat `not_in` the same as `in`? How could it know to return `null` for `not_in` instead?
> Is this feasible? How could the perfect hash know when something is not one of the original set of values?
Once you hash to a given element, you then verify that the input matches that member.
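That hash-then-verify pattern can be sketched as follows (Python for brevity; function names are mine). The key point is the final comparison: without it, any string that collides into an occupied slot would be misidentified as a member of the set.

```python
def fnv1a(seed, s):
    """FNV-1a with a variable seed."""
    h = seed
    for b in s.encode():
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h

def build(keys):
    """Find a seed giving every key a distinct slot, then fill the table."""
    n = len(keys)
    seed = 1
    while len({fnv1a(seed, k) % n for k in keys}) != n:
        seed += 1
    table = [None] * n
    for k in keys:
        table[fnv1a(seed, k) % n] = k
    return seed, table

def lookup(table, seed, s):
    """Hash to a slot, then verify the input equals the key stored there.
    A mismatch means s was never in the original key set."""
    slot = fnv1a(seed, s) % len(table)
    return slot if table[slot] == s else None  # the "return null" case

seed, table = build(["in", "out", "over"])
```

So the perfect hash only guarantees no collisions *among the original keys*; one equality check against the single candidate in the slot handles everything outside that set, which is still far cheaper than comparing against every key.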
Ah, I see; sounds similar to what the Zig tokenizer does now. Hashes for each keyword are computed at compile time and the lookup function checks the hash first before checking `mem.eql`. Perfect hashing could remove the need for the loop in `getKeyword`, though.
Yep! Seems like you found another place that would be able to immediately make use of the new machinery :)
One thing to remember is to perf test. @hejsil did some experiments with this earlier and determined that, at least in release-fast mode, the optimizer given if-else chains was able to outperform a perfect hash implementation.
Dumb question: what does "perfect" hash mean? Got a quick answer, thanks haze :)
As discovered in https://github.com/Vexu/arocc/pull/524 (https://github.com/Vexu/arocc/pull/524#issuecomment-1762854941), `stringToEnum` could have considerably better codegen for large enums if the fields were sorted by length before the `inline for`.
Here's a benchmark focusing on just different possible sorting of enum fields (this is with 3948 fields in the enum, shortest field length is 3 and longest is 43):
```
-------------------- unsorted ---------------------
always found: 3718ns per lookup
not found (random bytes): 6638ns per lookup
not found (1 char diff): 3819ns per lookup
----------- sorted by length (desc) ---------------
always found: 1176ns per lookup
not found (random bytes): 68ns per lookup
not found (1 char diff): 1173ns per lookup
----------- sorted by length (asc) ----------------
always found: 1054ns per lookup
not found (random bytes): 67ns per lookup
not found (1 char diff): 1053ns per lookup
-------- sorted lexicographically (asc) -----------
always found: 2764ns per lookup
not found (random bytes): 4615ns per lookup
not found (1 char diff): 2750ns per lookup
```
This would ultimately be a trade-off between compile time and runtime performance. I haven't tested how much of an impact the `comptime` sorting of the fields would have on compile time. We might end up hitting https://github.com/ziglang/zig/issues/4055, in which case this optimization might need to wait a bit.
Note: Sorting would also make it easy to create a fast path that checks that the `str.len` is within the bounds of the longest/shortest enum field.
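One way to read the benchmark numbers: sorting by length effectively lets the comparison skip every candidate of the wrong length, which is why the "random bytes" case gets so fast. A Python sketch of that idea, using length bucketing rather than literal sorting (helper names are mine, purely illustrative):

```python
def build_by_length(names):
    """Bucket enum field names by length and record length bounds."""
    by_len = {}
    for i, name in enumerate(names):
        by_len.setdefault(len(name), []).append((name, i))
    return by_len, min(len(n) for n in names), max(len(n) for n in names)

def string_to_index(by_len, lo, hi, s):
    """Length-bounds fast path, then compare only same-length candidates."""
    if not (lo <= len(s) <= hi):
        return None  # cheap rejection without touching any field name
    for name, i in by_len.get(len(s), []):
        if name == s:
            return i
    return None

by_len, lo, hi = build_by_length(["foo", "quux", "grault"])
```

Random-bytes inputs usually fail the length-bounds check or hit an empty bucket, so they return almost immediately; a "1 char diff" input still has to compare against the (small) bucket of same-length names, matching the shape of the numbers above.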
The current implementation of `stringToEnum` could be much more efficient if a perfect hash were used, which is one motivation for having a perfect hashing algorithm in the standard library.
FWIW my current usecase for this is converting known HTTP field names to an enum.