ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

standard library hash function performance should be better than every other programming language #15916

Open · andrewrk opened this issue 1 year ago

andrewrk commented 1 year ago

Zig hash functions have no excuse to not be the best in the industry. What are we doing with sub-par hash function implementations? Let's get serious.

For starters, the "small key" APIs are missing a branch on small input lengths, such as this one:

https://github.com/ziglang/zig/blob/76aa1fffb7a06f0be0d803cb3379f3102c0b2590/lib/std/hash/xxhash.zig#L129-L133
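To illustrate the point, here is a minimal C sketch (hypothetical code, not the actual `std.hash` or xxhash implementation) of what branching on small lengths looks like: tiny keys, the common hash-map case, get a dedicated fast path and never enter the general striping loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical finalizer; stands in for a real avalanche step. Each stage
 * (xorshift, multiply by an odd constant) is bijective on 64 bits. */
static uint64_t mix(uint64_t x) {
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    return x;
}

/* Sketch: branch on small lengths up front so 1-8 byte keys are handled
 * with one load and one mix instead of the general loop. */
static uint64_t hash_small_aware(const uint8_t *p, size_t len, uint64_t seed) {
    if (len <= 8) {
        uint64_t v = 0;
        memcpy(&v, p, len); /* compilers lower this to a masked/partial load */
        return mix(seed ^ v ^ len);
    }
    /* general path for longer inputs (byte-at-a-time only for brevity) */
    uint64_t h = seed;
    for (size_t i = 0; i < len; i++)
        h = mix(h ^ p[i]);
    return h;
}
```

In a real implementation the short path would itself split further (e.g. 1-3, 4-8, 9-16 bytes), as the fast C and C++ hashes referenced in this thread do.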

A good implementation will branch on small lengths, like these examples:

In order to close this issue, the following things must be done:

This task may be time-consuming, but it is not difficult. If you have an example program that outperforms the Zig code, you can simply look at the machine code, notice why it is faster, and change the Zig code to match it; or better yet, take those lessons, iterate, and beat it.
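Such an "example program" can start as a crude throughput harness like the C sketch below (the hash and the measurement method are placeholders; a serious comparison would pin the CPU, use performance counters, and sweep many input lengths):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Toy FNV-1a-style hash standing in for the function under test. */
static uint64_t toy_hash(const uint8_t *p, size_t len) {
    uint64_t h = 0x9e3779b97f4a7c15ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Rough throughput in bytes per second for inputs of `len` bytes. */
static double bench_throughput(size_t len, int iters) {
    static uint8_t buf[1 << 16]; /* len must be <= sizeof buf */
    static volatile uint64_t g_sink; /* keeps the loop from being optimized away */
    memset(buf, 0xab, sizeof buf);
    uint64_t sink = 0;
    clock_t t0 = clock();
    for (int i = 0; i < iters; i++)
        sink ^= toy_hash(buf, len);
    clock_t t1 = clock();
    g_sink = sink;
    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return secs > 0 ? (double)len * iters / secs : 0.0;
}
```

Once two such harnesses disagree, `objdump -d` (or `zig build-obj -O ReleaseFast` plus a disassembler) on each shows where the cycles go.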

Note that one strategy to make progress on this issue is to delete less useful hash functions from the standard library.

Related: Perhaps the hash API should change so that optimizations such as this can be done directly with the standard library: https://github.com/tigerbeetledb/tigerbeetle/pull/796

Validark commented 1 year ago

A good candidate to consider is clhash. Here's table 5 from the 2015 paper by Daniel Lemire & Owen Kaser:

| scheme | 64 B input | 4 kB input |
|:-:|:-:|:-:|
| VHASH | 1.0 | 0.26 |
| CLHASH | **0.45** | **0.16** |
| CityHash | 0.48 | 0.23 |
| SipHash | 3.1 | 2.1 |
| GHASH | 2.3 | 0.93 |

_Table 5: A comparison of estimated CPU cycles per byte on a Haswell Intel processor using 4 kB inputs. All schemes generate 64-bit hash values, except that GHASH generates 128-bit hash values._

Note that clhash was made for x86_64. I am not sure how well it translates to Arm or other architectures, but according to the paper, these are the operations and their respective performance characteristics that led to the clhash results above:

| intrinsic | instruction | description | latency | rec. thr. |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| `_mm_clmulepi64_si128` | `pclmulqdq` | 64-bit carry-less multiplication | 7 | 2 |
| `_mm_or_si128` | `por` | bitwise OR | 1 | 0.33 |
| `_mm_xor_si128` | `pxor` | bitwise XOR | 1 | 0.33 |
| `_mm_slli_epi64` | `psllq` | shift left two 64-bit integers | 1 | 1 |
| `_mm_srli_si128` | `psrldq` | shift right by x bytes | 1 | 0.5 |
| `_mm_shuffle_epi8` | `pshufb` | shuffle 16 bytes | 1 | 0.5 |
| `_mm_cvtsi64_si128` | `movq` | 64-bit integer as 128-bit reg. | 1 | – |
| `_mm_cvtsi128_si64` | `movq` | 64-bit integer from 128-bit reg. | 2 | – |
| `_mm_load_si128` | `movdqa` | load a 128-bit reg. from memory (aligned) | 1 | 0.5 |
| `_mm_lddqu_si128` | `lddqu` | load a 128-bit reg. from memory (unaligned) | 1 | 0.5 |
| `_mm_setr_epi8` | – | construct 128-bit reg. from 16 bytes | – | – |
| `_mm_set_epi64x` | – | construct 128-bit reg. from two 64-bit integers | – | – |

_Table 1: Relevant SIMD intrinsics and instructions on Haswell Intel processors, with latency and reciprocal throughput in CPU cycles per instruction [[16](https://www.agner.org/optimize/instruction_tables.pdf), [21](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html)]._

It seems to do exceptionally well on large inputs in particular!

_Fig. 1: Performance comparison for various input lengths. For large inputs, CLHASH is faster, followed in order of decreasing speed by CityHash, VHASH, GHASH and SipHash._

I am not sure what the performance landscape would look like today, but `pclmulqdq` is faster on some of the newer hardware (see: https://www.agner.org/optimize/instruction_tables.pdf).
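For readers unfamiliar with the instruction, the following portable C sketch shows what `pclmulqdq` computes: a 64×64 → 128-bit carry-less (GF(2) polynomial) multiplication. The hardware instruction (via the `_mm_clmulepi64_si128` intrinsic) does this in one operation; this bit-by-bit loop is only illustrative.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;

/* Carry-less multiply: like schoolbook multiplication, but partial products
 * are combined with XOR instead of ADD, so no carries propagate. */
static u128 clmul64(uint64_t a, uint64_t b) {
    u128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i != 0)
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}
```

For example, `clmul64(3, 3)` is (x+1)·(x+1) = x²+1 over GF(2), i.e. 5, whereas ordinary multiplication gives 9; the absence of carry chains is what makes the hardware version so cheap.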

jedisct1 commented 1 year ago

Anything depending on hardware-accelerated carryless multiplication will be super slow on baseline or any platform without acceleration.

jedisct1 commented 1 year ago

For something more recent, komihash is ridiculously fast, while not requiring specific CPU features.

https://github.com/avaneev/komihash
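Much of komihash's speed comes from a primitive nearly every 64-bit CPU has natively: a full 64×64 → 128-bit integer multiply whose halves are then folded together. The C sketch below shows that building block in simplified form; it is not the actual komihash round function, just an illustration of the core operation (GCC/Clang `unsigned __int128` assumed).

```c
#include <assert.h>
#include <stdint.h>

/* Simplified multiply-fold primitive in the spirit of komihash: one
 * full-width multiply, then XOR the high and low 64-bit halves together.
 * NOT the real komihash algorithm -- just its central operation. */
static uint64_t mul_fold(uint64_t a, uint64_t b) {
    unsigned __int128 p = (unsigned __int128)a * b; /* 64x64 -> 128 */
    return (uint64_t)p ^ (uint64_t)(p >> 64);       /* fold the halves */
}
```

Because the multiply is an ordinary integer instruction (`mul` on x86_64, `mul`/`umulh` on AArch64), this mixes well without any of the CPU-feature requirements that carry-less multiplication brings.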

Validark commented 1 year ago

> Anything depending on hardware-accelerated carryless multiplication will be super slow on baseline or any platform without acceleration.

Exactly, clhash should only be considered for platforms that have a fast carryless multiply builtin.

SeriousBusiness100 commented 1 year ago

UMash might be an improvement on the CLHash front.

tensorush commented 1 year ago

> For something more recent, komihash is ridiculously fast, while not requiring specific CPU features.
>
> https://github.com/avaneev/komihash

Just in case there's any further interest in it: I've ported komihash to Zig.