ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

standard library hash function performance should be better than every other programming language #15916

Open · andrewrk opened this issue 1 year ago

andrewrk commented 1 year ago

Zig hash functions have no excuse to not be the best in the industry. What are we doing with sub-par hash function implementations? Let's get serious.

For starters, the "small key" APIs are missing a branch on small input lengths, such as this one:

https://github.com/ziglang/zig/blob/76aa1fffb7a06f0be0d803cb3379f3102c0b2590/lib/std/hash/xxhash.zig#L129-L133
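To illustrate the point, here is a minimal C sketch (hypothetical code, not the actual `std.hash` or xxhash implementation) of what branching on small lengths looks like: tiny keys, the common hash-map case, get a dedicated fast path and never enter the general striping loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical finalizer; stands in for a real avalanche step. Each stage
 * (xorshift, multiply by an odd constant) is bijective on 64 bits. */
static uint64_t mix(uint64_t x) {
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    return x;
}

/* Sketch: branch on small lengths up front so 1-8 byte keys are handled
 * with one load and one mix instead of the general loop. */
static uint64_t hash_small_aware(const uint8_t *p, size_t len, uint64_t seed) {
    if (len <= 8) {
        uint64_t v = 0;
        memcpy(&v, p, len); /* compilers lower this to a masked/partial load */
        return mix(seed ^ v ^ len);
    }
    /* general path for longer inputs (byte-at-a-time only for brevity) */
    uint64_t h = seed;
    for (size_t i = 0; i < len; i++)
        h = mix(h ^ p[i]);
    return h;
}
```

In a real implementation the short path would itself split further (e.g. 1-3, 4-8, 9-16 bytes), as the fast C and C++ hashes referenced in this thread do.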

A good implementation will branch on small lengths, like these examples:

In order to close this issue, the following things must be done:

This task may be time-consuming, but it is not difficult. If you have an example program that outperforms the Zig code, you can simply look at the machine code, notice why it is faster, and change the Zig code to match it; or better yet, take those lessons, iterate, and beat it.
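Such an "example program" can start as a crude throughput harness like the C sketch below (the hash and the measurement method are placeholders; a serious comparison would pin the CPU, use performance counters, and sweep many input lengths):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Toy FNV-1a-style hash standing in for the function under test. */
static uint64_t toy_hash(const uint8_t *p, size_t len) {
    uint64_t h = 0x9e3779b97f4a7c15ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Rough throughput in bytes per second for inputs of `len` bytes. */
static double bench_throughput(size_t len, int iters) {
    static uint8_t buf[1 << 16]; /* len must be <= sizeof buf */
    static volatile uint64_t g_sink; /* keeps the loop from being optimized away */
    memset(buf, 0xab, sizeof buf);
    uint64_t sink = 0;
    clock_t t0 = clock();
    for (int i = 0; i < iters; i++)
        sink ^= toy_hash(buf, len);
    clock_t t1 = clock();
    g_sink = sink;
    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return secs > 0 ? (double)len * iters / secs : 0.0;
}
```

Once two such harnesses disagree, `objdump -d` (or `zig build-obj -O ReleaseFast` plus a disassembler) on each shows where the cycles go.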

Note that one strategy to make progress on this issue is to delete less useful hash functions from the standard library.

Related: Perhaps the hash API should change so that optimizations such as this can be done directly with the standard library: https://github.com/tigerbeetledb/tigerbeetle/pull/796

Validark commented 1 year ago

A good candidate to consider is clhash. Here's table 5 from the 2015 paper by Daniel Lemire & Owen Kaser:

| scheme | 64 B input | 4 kB input |
|:-:|:-:|:-:|
| VHASH | 1.0 | 0.26 |
| CLHASH | **0.45** | **0.16** |
| CityHash | 0.48 | 0.23 |
| SipHash | 3.1 | 2.1 |
| GHASH | 2.3 | 0.93 |

_Table 5: A comparison of estimated CPU cycles per byte on a Haswell Intel processor using 4 kB inputs. All schemes generate 64-bit hash values, except that GHASH generates 128-bit hash values._

Note that clhash was made for x86_64. I am not sure how well it translates to Arm or other architectures, but according to the paper, these are the operations and their respective performance characteristics that led to the clhash results above:

| intrinsic | instruction | description | latency | rec. thr. |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| `_mm_clmulepi64_si128` | `pclmulqdq` | 64-bit carry-less multiplication | 7 | 2 |
| `_mm_or_si128` | `por` | bitwise OR | 1 | 0.33 |
| `_mm_xor_si128` | `pxor` | bitwise XOR | 1 | 0.33 |
| `_mm_slli_epi64` | `psllq` | shift left two 64-bit integers | 1 | 1 |
| `_mm_srli_si128` | `psrldq` | shift right by x bytes | 1 | 0.5 |
| `_mm_shuffle_epi8` | `pshufb` | shuffle 16 bytes | 1 | 0.5 |
| `_mm_cvtsi64_si128` | `movq` | 64-bit integer as 128-bit reg. | 1 | – |
| `_mm_cvtsi128_si64` | `movq` | 64-bit integer from 128-bit reg. | 2 | – |
| `_mm_load_si128` | `movdqa` | load a 128-bit reg. from memory (aligned) | 1 | 0.5 |
| `_mm_lddqu_si128` | `lddqu` | load a 128-bit reg. from memory (unaligned) | 1 | 0.5 |
| `_mm_setr_epi8` | – | construct 128-bit reg. from 16 bytes | – | – |
| `_mm_set_epi64x` | – | construct 128-bit reg. from two 64-bit integers | – | – |

_Table 1: Relevant SIMD intrinsics and instructions on Haswell Intel processors, with latency and reciprocal throughput in CPU cycles per instruction [[16](https://www.agner.org/optimize/instruction_tables.pdf), [21](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html)]._

It seems to do exceptionally well on large inputs in particular!

_Fig. 1: Performance comparison for various input lengths. For large inputs, CLHASH is faster, followed in order of decreasing speed by CityHash, VHASH, GHASH and SipHash._

I am not sure what the performance landscape would look like today, but `pclmulqdq` is faster on some of the newer hardware (see: https://www.agner.org/optimize/instruction_tables.pdf).
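For readers unfamiliar with the instruction, the following portable C sketch shows what `pclmulqdq` computes: a 64×64 → 128-bit carry-less (GF(2) polynomial) multiplication. The hardware instruction (via the `_mm_clmulepi64_si128` intrinsic) does this in one operation; this bit-by-bit loop is only illustrative.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;

/* Carry-less multiply: like schoolbook multiplication, but partial products
 * are combined with XOR instead of ADD, so no carries propagate. */
static u128 clmul64(uint64_t a, uint64_t b) {
    u128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i != 0)
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}
```

For example, `clmul64(3, 3)` is (x+1)·(x+1) = x²+1 over GF(2), i.e. 5, whereas ordinary multiplication gives 9; the absence of carry chains is what makes the hardware version so cheap.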

jedisct1 commented 1 year ago

Anything depending on hardware-accelerated carryless multiplication will be super slow on baseline or any platform without acceleration.

jedisct1 commented 1 year ago

For something more recent, komihash is ridiculously fast, while not requiring specific CPU features.

https://github.com/avaneev/komihash
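Much of komihash's speed comes from a primitive nearly every 64-bit CPU has natively: a full 64×64 → 128-bit integer multiply whose halves are then folded together. The C sketch below shows that building block in simplified form; it is not the actual komihash round function, just an illustration of the core operation (GCC/Clang `unsigned __int128` assumed).

```c
#include <assert.h>
#include <stdint.h>

/* Simplified multiply-fold primitive in the spirit of komihash: one
 * full-width multiply, then XOR the high and low 64-bit halves together.
 * NOT the real komihash algorithm -- just its central operation. */
static uint64_t mul_fold(uint64_t a, uint64_t b) {
    unsigned __int128 p = (unsigned __int128)a * b; /* 64x64 -> 128 */
    return (uint64_t)p ^ (uint64_t)(p >> 64);       /* fold the halves */
}
```

Because the multiply is an ordinary integer instruction (`mul` on x86_64, `mul`/`umulh` on AArch64), this mixes well without any of the CPU-feature requirements that carry-less multiplication brings.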

Validark commented 1 year ago

> Anything depending on hardware-accelerated carryless multiplication will be super slow on baseline or any platform without acceleration.

Exactly, clhash should only be considered for platforms that have a fast carryless multiply builtin.

SeriousBusiness100 commented 1 year ago

UMash might be an improvement on the CLHash front.

tensorush commented 1 year ago

> For something more recent, komihash is ridiculously fast, while not requiring specific CPU features.
>
> https://github.com/avaneev/komihash

Just in case there's any further interest in it: I've ported komihash to Zig.