zeebo / xxh3

XXH3 algorithm in Go
BSD 2-Clause "Simplified" License
406 stars, 20 forks

Small speedup in AVX2/AVX512 #22

klauspost closed 1 year ago

klauspost commented 1 year ago

Benchmarks are a bit noisy from run to run, probably due to allocation alignment, but the overall trend appears positive.

name                                old time/op    new time/op    delta
Fixed128/1024-AVX2/default-32         28.9ns ± 1%    29.2ns ± 1%   +1.04%         (p=0.006 n=10+9)
Fixed128/1024-AVX2/seed-32            42.9ns ± 1%    42.8ns ± 1%     ~            (p=0.138 n=9+10)
Fixed128/8192-AVX2/default-32          137ns ± 1%     134ns ± 1%   -2.12%         (p=0.000 n=10+9)
Fixed128/8192-AVX2/seed-32             160ns ± 1%     160ns ± 1%     ~            (p=0.674 n=10+9)
Fixed128/102400-AVX2/default-32       1.56µs ± 1%    1.51µs ± 1%   -3.09%        (p=0.000 n=10+10)
Fixed128/102400-AVX2/seed-32          1.58µs ± 1%    1.53µs ± 0%   -3.12%        (p=0.000 n=10+10)
Fixed128/1024000-AVX2/default-32      15.4µs ± 1%    16.1µs ± 1%   +4.84%        (p=0.000 n=10+10)
Fixed128/1024000-AVX2/seed-32         15.6µs ± 1%    16.2µs ± 1%   +3.98%        (p=0.000 n=10+10)
Fixed128/10240000-AVX2/default-32      164µs ± 1%     157µs ± 0%   -4.23%        (p=0.000 n=10+10)
Fixed128/10240000-AVX2/seed-32         168µs ± 2%     158µs ± 1%   -6.16%        (p=0.000 n=10+10)
Fixed128/102400000-AVX2/default-32    3.91ms ± 1%    3.79ms ± 1%   -2.86%         (p=0.000 n=9+10)
Fixed128/102400000-AVX2/seed-32       3.92ms ± 1%    3.79ms ± 0%   -3.22%        (p=0.000 n=10+10)
Fixed64/1024-AVX2/default-32          26.0ns ± 1%    26.4ns ± 1%   +1.47%        (p=0.001 n=10+10)
Fixed64/1024-AVX2/seed-32             38.0ns ± 1%    38.2ns ± 1%   +0.48%        (p=0.037 n=10+10)
Fixed64/8192-AVX2/default-32           134ns ± 1%     133ns ± 0%   -1.11%         (p=0.000 n=10+8)
Fixed64/8192-AVX2/seed-32              146ns ± 1%     143ns ± 0%   -1.91%        (p=0.000 n=10+10)
Fixed64/102400-AVX2/default-32        1.56µs ± 1%    1.51µs ± 0%   -3.03%        (p=0.000 n=10+10)
Fixed64/102400-AVX2/seed-32           1.56µs ± 1%    1.52µs ± 1%   -2.85%        (p=0.000 n=10+10)
Fixed64/1024000-AVX2/default-32       16.0µs ± 1%    15.6µs ± 1%   -2.89%        (p=0.000 n=10+10)
Fixed64/1024000-AVX2/seed-32          16.1µs ± 1%    15.6µs ± 1%   -3.28%        (p=0.000 n=10+10)
Fixed64/10240000-AVX2/default-32       165µs ± 1%     157µs ± 0%   -4.60%        (p=0.000 n=10+10)
Fixed64/10240000-AVX2/seed-32          165µs ± 0%     157µs ± 0%   -4.71%         (p=0.000 n=9+10)
Fixed64/102400000-AVX2/default-32     3.91ms ± 1%    3.80ms ± 1%   -2.95%        (p=0.000 n=10+10)
Fixed64/102400000-AVX2/seed-32        3.90ms ± 1%    3.80ms ± 0%   -2.57%         (p=0.000 n=9+10)
Hasher64/16/avx2/plain-32             10.6ns ± 1%    10.6ns ± 1%   +0.58%        (p=0.012 n=10+10)
Hasher64/16/avx2/seed-32              10.6ns ± 1%    10.7ns ± 1%     ~           (p=0.069 n=10+10)
Hasher64/64/avx2/plain-32             13.3ns ± 1%    13.3ns ± 0%     ~           (p=0.867 n=10+10)
Hasher64/64/avx2/seed-32              13.4ns ± 0%    13.4ns ± 0%     ~             (p=0.093 n=8+9)
Hasher64/256/avx2/plain-32            25.2ns ± 1%    25.3ns ± 1%   +0.63%         (p=0.010 n=9+10)
Hasher64/256/avx2/seed-32             41.1ns ± 5%    37.8ns ± 1%   -8.08%        (p=0.000 n=10+10)
Hasher64/1024/avx2/plain-32           41.2ns ± 1%    41.4ns ± 2%     ~           (p=0.305 n=10+10)
Hasher64/1024/avx2/seed-32            53.8ns ± 1%    54.1ns ± 1%   +0.48%         (p=0.045 n=9+10)
Hasher64/4096/avx2/plain-32           92.7ns ± 1%    91.3ns ± 0%   -1.49%          (p=0.000 n=8+7)
Hasher64/4096/avx2/seed-32             106ns ± 2%      92ns ± 1%  -13.00%        (p=0.000 n=10+10)
Hasher64/16384/avx2/plain-32           300ns ± 1%     293ns ± 1%   -2.29%        (p=0.000 n=10+10)
Hasher64/16384/avx2/seed-32            316ns ± 1%     291ns ± 1%   -7.79%        (p=0.000 n=10+10)
Hasher64/65536/avx2/plain-32          1.16µs ± 0%    1.16µs ± 1%   -0.38%         (p=0.007 n=9+10)
Hasher64/65536/avx2/seed-32           1.17µs ± 1%    1.15µs ± 1%   -2.01%        (p=0.000 n=10+10)
Hasher64/262144/avx2/plain-32         4.58µs ± 1%    4.54µs ± 0%   -0.92%        (p=0.000 n=10+10)
Hasher64/262144/avx2/seed-32          4.52µs ± 1%    4.49µs ± 0%   -0.62%         (p=0.003 n=10+9)
Hasher64/1048576/avx2/plain-32        19.2µs ± 1%    19.1µs ± 1%   -0.57%        (p=0.030 n=10+10)
Hasher64/1048576/avx2/seed-32         19.2µs ± 1%    19.2µs ± 0%     ~           (p=0.481 n=10+10)
Hasher64/4194304/avx2/plain-32        76.8µs ± 1%    76.4µs ± 0%   -0.57%        (p=0.003 n=10+10)
Hasher64/4194304/avx2/seed-32         77.1µs ± 1%    76.7µs ± 0%     ~           (p=0.280 n=10+10)
Hasher64/16777216/avx2/plain-32        687µs ±19%     552µs ± 3%  -19.61%        (p=0.001 n=10+10)
Hasher64/16777216/avx2/seed-32         519µs ±11%     543µs ± 3%   +4.60%         (p=0.028 n=10+9)
Hasher64/67108864/avx2/plain-32       3.15ms ± 1%    3.11ms ± 0%   -1.40%         (p=0.000 n=10+8)
Hasher64/67108864/avx2/seed-32        3.12ms ± 1%    3.10ms ± 1%   -0.68%        (p=0.002 n=10+10)
Hasher64/268435456/avx2/plain-32      12.6ms ± 1%    12.5ms ± 1%   -0.71%         (p=0.004 n=9+10)
Hasher64/268435456/avx2/seed-32       12.5ms ± 0%    12.4ms ± 1%   -1.10%          (p=0.024 n=3+6)

name                                old speed      new speed      delta
Fixed128/1024-AVX2/default-32       35.5GB/s ± 1%  35.1GB/s ± 1%   -1.04%         (p=0.006 n=10+9)
Fixed128/1024-AVX2/seed-32          23.8GB/s ± 0%  23.9GB/s ± 1%   +0.48%         (p=0.043 n=8+10)
Fixed128/8192-AVX2/default-32       59.9GB/s ± 1%  61.2GB/s ± 1%   +2.17%         (p=0.000 n=10+9)
Fixed128/8192-AVX2/seed-32          51.1GB/s ± 1%  51.1GB/s ± 1%     ~            (p=0.661 n=10+9)
Fixed128/102400-AVX2/default-32     65.7GB/s ± 1%  67.8GB/s ± 1%   +3.20%        (p=0.000 n=10+10)
Fixed128/102400-AVX2/seed-32        64.8GB/s ± 1%  66.9GB/s ± 0%   +3.23%        (p=0.000 n=10+10)
Fixed128/1024000-AVX2/default-32    66.6GB/s ± 1%  63.6GB/s ± 1%   -4.61%        (p=0.000 n=10+10)
Fixed128/1024000-AVX2/seed-32       65.8GB/s ± 1%  63.3GB/s ± 1%   -3.82%        (p=0.000 n=10+10)
Fixed128/10240000-AVX2/default-32   62.3GB/s ± 1%  65.0GB/s ± 0%   +4.42%        (p=0.000 n=10+10)
Fixed128/10240000-AVX2/seed-32      60.9GB/s ± 2%  64.9GB/s ± 1%   +6.55%        (p=0.000 n=10+10)
Fixed128/102400000-AVX2/default-32  26.2GB/s ± 1%  27.0GB/s ± 1%   +3.11%        (p=0.000 n=10+10)
Fixed128/102400000-AVX2/seed-32     26.1GB/s ± 1%  27.0GB/s ± 0%   +3.33%        (p=0.000 n=10+10)
Fixed64/1024-AVX2/default-32        39.4GB/s ± 1%  38.8GB/s ± 1%   -1.46%        (p=0.002 n=10+10)
Fixed64/1024-AVX2/seed-32           26.9GB/s ± 1%  26.8GB/s ± 1%   -0.48%        (p=0.043 n=10+10)
Fixed64/8192-AVX2/default-32        61.1GB/s ± 1%  61.8GB/s ± 0%   +1.10%         (p=0.000 n=10+8)
Fixed64/8192-AVX2/seed-32           56.1GB/s ± 1%  57.2GB/s ± 0%   +1.95%        (p=0.000 n=10+10)
Fixed64/102400-AVX2/default-32      65.8GB/s ± 1%  67.9GB/s ± 0%   +3.12%        (p=0.000 n=10+10)
Fixed64/102400-AVX2/seed-32         65.5GB/s ± 1%  67.5GB/s ± 1%   +2.94%        (p=0.000 n=10+10)
Fixed64/1024000-AVX2/default-32     63.8GB/s ± 1%  65.7GB/s ± 1%   +2.97%        (p=0.000 n=10+10)
Fixed64/1024000-AVX2/seed-32        63.5GB/s ± 1%  65.7GB/s ± 1%   +3.38%        (p=0.000 n=10+10)
Fixed64/10240000-AVX2/default-32    62.1GB/s ± 1%  65.1GB/s ± 0%   +4.81%        (p=0.000 n=10+10)
Fixed64/10240000-AVX2/seed-32       62.1GB/s ± 0%  65.1GB/s ± 0%   +4.94%         (p=0.000 n=9+10)
Fixed64/102400000-AVX2/default-32   26.2GB/s ± 1%  27.0GB/s ± 1%   +3.04%        (p=0.000 n=10+10)
Fixed64/102400000-AVX2/seed-32      26.3GB/s ± 1%  26.9GB/s ± 0%   +2.64%         (p=0.000 n=9+10)
Hasher64/16/avx2/plain-32           1.52GB/s ± 1%  1.51GB/s ± 1%   -0.56%        (p=0.019 n=10+10)
Hasher64/16/avx2/seed-32            1.51GB/s ± 1%  1.50GB/s ± 1%     ~           (p=0.060 n=10+10)
Hasher64/64/avx2/plain-32           4.81GB/s ± 1%  4.81GB/s ± 0%     ~           (p=0.912 n=10+10)
Hasher64/64/avx2/seed-32            4.77GB/s ± 1%  4.78GB/s ± 0%     ~            (p=0.075 n=10+9)
Hasher64/256/avx2/plain-32          10.2GB/s ± 1%  10.1GB/s ± 1%   -0.62%         (p=0.013 n=9+10)
Hasher64/256/avx2/seed-32           6.24GB/s ± 5%  6.78GB/s ± 1%   +8.67%        (p=0.000 n=10+10)
Hasher64/1024/avx2/plain-32         24.8GB/s ± 1%  24.7GB/s ± 2%     ~           (p=0.315 n=10+10)
Hasher64/1024/avx2/seed-32          19.0GB/s ± 1%  18.9GB/s ± 1%     ~            (p=0.053 n=9+10)
Hasher64/4096/avx2/plain-32         44.2GB/s ± 1%  44.9GB/s ± 0%   +1.59%          (p=0.001 n=8+6)
Hasher64/4096/avx2/seed-32          38.6GB/s ± 3%  44.4GB/s ± 1%  +14.91%        (p=0.000 n=10+10)
Hasher64/16384/avx2/plain-32        54.7GB/s ± 1%  55.9GB/s ± 1%   +2.34%        (p=0.000 n=10+10)
Hasher64/16384/avx2/seed-32         51.8GB/s ± 1%  56.2GB/s ± 1%   +8.44%        (p=0.000 n=10+10)
Hasher64/65536/avx2/plain-32        56.3GB/s ± 0%  56.5GB/s ± 1%   +0.38%         (p=0.008 n=9+10)
Hasher64/65536/avx2/seed-32         56.0GB/s ± 1%  57.1GB/s ± 1%   +2.03%        (p=0.000 n=10+10)
Hasher64/262144/avx2/plain-32       57.3GB/s ± 1%  57.8GB/s ± 0%   +0.93%        (p=0.000 n=10+10)
Hasher64/262144/avx2/seed-32        58.0GB/s ± 1%  58.4GB/s ± 0%   +0.62%         (p=0.003 n=10+9)
Hasher64/1048576/avx2/plain-32      54.5GB/s ± 1%  54.8GB/s ± 1%   +0.57%        (p=0.029 n=10+10)
Hasher64/1048576/avx2/seed-32       54.6GB/s ± 1%  54.6GB/s ± 0%     ~           (p=0.481 n=10+10)
Hasher64/4194304/avx2/plain-32      54.6GB/s ± 1%  54.9GB/s ± 0%   +0.57%        (p=0.003 n=10+10)
Hasher64/4194304/avx2/seed-32       54.4GB/s ± 1%  54.7GB/s ± 0%     ~           (p=0.280 n=10+10)
Hasher64/16777216/avx2/plain-32     24.9GB/s ±21%  30.4GB/s ± 3%  +22.34%        (p=0.001 n=10+10)
Hasher64/16777216/avx2/seed-32      32.4GB/s ±10%  30.9GB/s ± 3%   -4.65%         (p=0.028 n=10+9)
Hasher64/67108864/avx2/plain-32     21.3GB/s ± 1%  21.6GB/s ± 0%   +1.42%         (p=0.000 n=10+8)
Hasher64/67108864/avx2/seed-32      21.5GB/s ± 1%  21.6GB/s ± 1%   +0.69%        (p=0.002 n=10+10)
Hasher64/268435456/avx2/plain-32    21.4GB/s ± 1%  21.5GB/s ± 1%   +0.71%         (p=0.004 n=9+10)
Hasher64/268435456/avx2/seed-32     21.4GB/s ± 0%  21.7GB/s ± 1%   +1.11%          (p=0.024 n=3+6)

My Zen 2 has memory-to-register aliasing, which is not present on Intel, so the preloaded keys should help other platforms more.
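In scalar terms, "preloading keys" means decoding the per-stripe key words once instead of rereading them from the secret for every stripe of every block. The Go sketch below is not the PR's AVX2/AVX512 assembly; it assumes XXH3's long-input layout with a 192-byte secret (64-byte stripes, the key window advancing 8 bytes per stripe, so 16 stripes per block reuse the same 16 keys in every block), uses a made-up secret rather than XXH3's real default one, and omits the per-block scramble step. The helper names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	stripeLen       = 64 // bytes consumed per stripe
	secretConsume   = 8  // key window advance per stripe
	stripesPerBlock = 16 // (192-byte secret - 64) / 8
)

// accumulate512 is the XXH3 scalar round for one 64-byte stripe.
func accumulate512(acc *[8]uint64, stripe []byte, key *[8]uint64) {
	for i := 0; i < 8; i++ {
		dataVal := binary.LittleEndian.Uint64(stripe[8*i:])
		dataKey := dataVal ^ key[i]
		acc[i^1] += dataVal
		acc[i] += uint64(uint32(dataKey)) * (dataKey >> 32)
	}
}

// loadKey decodes the eight key words used by stripe n of a block.
func loadKey(secret []byte, n int) (k [8]uint64) {
	for i := range k {
		k[i] = binary.LittleEndian.Uint64(secret[n*secretConsume+8*i:])
	}
	return k
}

// accumulateReload decodes the key from the secret for every stripe of every block.
func accumulateReload(acc *[8]uint64, data, secret []byte) {
	blockLen := stripeLen * stripesPerBlock
	for b := 0; b+blockLen <= len(data); b += blockLen {
		for n := 0; n < stripesPerBlock; n++ {
			k := loadKey(secret, n)
			accumulate512(acc, data[b+n*stripeLen:], &k)
		}
		// XXH3 scrambles acc here; omitted in this sketch.
	}
}

// accumulatePreloaded decodes all 16 per-stripe keys once, then reuses them
// for every block.
func accumulatePreloaded(acc *[8]uint64, data, secret []byte) {
	var keys [stripesPerBlock][8]uint64
	for n := range keys {
		keys[n] = loadKey(secret, n)
	}
	blockLen := stripeLen * stripesPerBlock
	for b := 0; b+blockLen <= len(data); b += blockLen {
		for n := 0; n < stripesPerBlock; n++ {
			accumulate512(acc, data[b+n*stripeLen:], &keys[n])
		}
		// XXH3 scrambles acc here; omitted in this sketch.
	}
}

func main() {
	// Hypothetical secret bytes; NOT XXH3's real default secret.
	secret := make([]byte, 192)
	for i := range secret {
		secret[i] = byte(i*31 + 7)
	}
	data := make([]byte, 4*stripeLen*stripesPerBlock) // four full blocks
	for i := range data {
		data[i] = byte(i * 13)
	}
	var a, b [8]uint64
	accumulateReload(&a, data, secret)
	accumulatePreloaded(&b, data, secret)
	fmt.Println("same accumulators:", a == b)
}
```

Both variants compute identical accumulators; the preloaded one only changes where the key words come from, which is the kind of load that memory-to-register aliasing can already make cheap on some CPUs.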

I still need to run tests on AVX512. If CI doesn't pick it up, I will report back on whether they pass.

Edit: It was picked up by CI:

=== RUN   TestVectorCompat
    compat_vector_test.go:30: avx512: true
    compat_vector_test.go:31: avx2: true
    compat_vector_test.go:32: sse2: true

klauspost commented 1 year ago

It seems AVX512 likes it a lot, until it hits the memory bandwidth limit.

minio@minio-k8s17:~/apps/xxh3$ go test -bench=Fixed64
go: downloading github.com/zeebo/assert v1.3.0
go: downloading github.com/klauspost/cpuid/v2 v2.0.9
goos: linux
goarch: amd64
pkg: github.com/zeebo/xxh3
cpu: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz

BenchmarkFixed64/1024-AVX512/default-128                18415684                64.75 ns/op     15815.42 MB/s
BenchmarkFixed64/1024-AVX512/seed-128                   11996121                86.43 ns/op     11847.06 MB/s
BenchmarkFixed64/1024-AVX2/default-128                  20169603                60.48 ns/op     16930.09 MB/s
BenchmarkFixed64/1024-AVX2/seed-128                     13220286                80.50 ns/op     12720.39 MB/s
BenchmarkFixed64/1024-SSE2/default-128                  13571812                90.09 ns/op     11366.81 MB/s
BenchmarkFixed64/1024-SSE2/seed-128                      9564871               116.9 ns/op      8757.15 MB/s
BenchmarkFixed64/1024/default-128                        6226238               181.4 ns/op      5645.35 MB/s
BenchmarkFixed64/1024/seed-128                           5396823               214.6 ns/op      4771.76 MB/s
BenchmarkFixed64/8192-AVX512/default-128                 3590719               325.5 ns/op      25169.43 MB/s
BenchmarkFixed64/8192-AVX512/seed-128                    3464196               341.5 ns/op      23991.18 MB/s
BenchmarkFixed64/8192-AVX2/default-128                   3679377               329.0 ns/op      24899.53 MB/s
BenchmarkFixed64/8192-AVX2/seed-128                      3295234               377.5 ns/op      21698.57 MB/s
BenchmarkFixed64/8192-SSE2/default-128                   1938860               614.4 ns/op      13333.91 MB/s
BenchmarkFixed64/8192-SSE2/seed-128                      1857984               635.3 ns/op      12894.95 MB/s
BenchmarkFixed64/8192/default-128                         843528              1398 ns/op        5861.34 MB/s
BenchmarkFixed64/8192/seed-128                            816256              1438 ns/op        5696.92 MB/s
BenchmarkFixed64/102400-AVX512/default-128                322002              3722 ns/op        27514.07 MB/s
BenchmarkFixed64/102400-AVX512/seed-128                   320775              3739 ns/op        27385.60 MB/s
BenchmarkFixed64/102400-AVX2/default-128                  300304              3909 ns/op        26197.70 MB/s
BenchmarkFixed64/102400-AVX2/seed-128                     297676              3915 ns/op        26156.07 MB/s
BenchmarkFixed64/102400-SSE2/default-128                  138745              8115 ns/op        12618.39 MB/s
BenchmarkFixed64/102400-SSE2/seed-128                     138241              7643 ns/op        13398.59 MB/s
BenchmarkFixed64/102400/default-128                        64280             17548 ns/op        5835.37 MB/s
BenchmarkFixed64/102400/seed-128                           64177             19197 ns/op        5334.07 MB/s
BenchmarkFixed64/1024000-AVX512/default-128                31641             36940 ns/op        27720.90 MB/s
BenchmarkFixed64/1024000-AVX512/seed-128                   31774             36948 ns/op        27714.79 MB/s
BenchmarkFixed64/1024000-AVX2/default-128                  30637             41318 ns/op        24783.33 MB/s
BenchmarkFixed64/1024000-AVX2/seed-128                     30248             40277 ns/op        25423.82 MB/s
BenchmarkFixed64/1024000-SSE2/default-128                  15472             74479 ns/op        13748.92 MB/s
BenchmarkFixed64/1024000-SSE2/seed-128                     23439             75607 ns/op        13543.75 MB/s
BenchmarkFixed64/1024000/default-128                       10538            175230 ns/op        5843.75 MB/s
BenchmarkFixed64/1024000/seed-128                           6722            177354 ns/op        5773.76 MB/s
BenchmarkFixed64/10240000-AVX512/default-128                2950            367137 ns/op        27891.50 MB/s
BenchmarkFixed64/10240000-AVX512/seed-128                   3084            367614 ns/op        27855.30 MB/s
BenchmarkFixed64/10240000-AVX2/default-128                  2866            408285 ns/op        25080.50 MB/s
BenchmarkFixed64/10240000-AVX2/seed-128                     2972            392823 ns/op        26067.70 MB/s
BenchmarkFixed64/10240000-SSE2/default-128                  1356            769873 ns/op        13300.90 MB/s
BenchmarkFixed64/10240000-SSE2/seed-128                     1364            750373 ns/op        13646.54 MB/s
BenchmarkFixed64/10240000/default-128                        650           1739687 ns/op        5886.12 MB/s
BenchmarkFixed64/10240000/seed-128                           640           1768122 ns/op        5791.46 MB/s
BenchmarkFixed64/102400000-AVX512/default-128                150           7976826 ns/op        12837.19 MB/s
BenchmarkFixed64/102400000-AVX512/seed-128                   148           7874286 ns/op        13004.35 MB/s
BenchmarkFixed64/102400000-AVX2/default-128                  132          10930617 ns/op        9368.18 MB/s
BenchmarkFixed64/102400000-AVX2/seed-128                     129           9892323 ns/op        10351.46 MB/s
BenchmarkFixed64/102400000-SSE2/default-128                  112          10339681 ns/op        9903.59 MB/s
BenchmarkFixed64/102400000-SSE2/seed-128                     109          13384707 ns/op        7650.52 MB/s
BenchmarkFixed64/102400000/default-128                        58          20316244 ns/op        5040.30 MB/s
BenchmarkFixed64/102400000/seed-128                           52          23992037 ns/op        4268.08 MB/s
PASS
[...]
minio@minio-k8s17:~/apps/xxh3/klaus/xxh3$ git checkout improve-avx2
Branch 'improve-avx2' set up to track remote branch 'improve-avx2' from 'origin'.
Switched to a new branch 'improve-avx2'
minio@minio-k8s17:~/apps/xxh3/klaus/xxh3$ go test -bench=Fixed64
goos: linux
goarch: amd64
pkg: github.com/zeebo/xxh3
cpu: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
BenchmarkFixed64/1024-AVX512/default-128                15388110                66.43 ns/op     15414.37 MB/s
BenchmarkFixed64/1024-AVX512/seed-128                   12324342                83.69 ns/op     12236.23 MB/s
BenchmarkFixed64/1024-AVX2/default-128                  20245687                58.49 ns/op     17508.25 MB/s
BenchmarkFixed64/1024-AVX2/seed-128                     12746275                79.86 ns/op     12822.67 MB/s
BenchmarkFixed64/1024-SSE2/default-128                  11872988                89.05 ns/op     11499.53 MB/s
BenchmarkFixed64/1024-SSE2/seed-128                      9660879               112.2 ns/op      9123.16 MB/s
BenchmarkFixed64/1024/default-128                        6191212               181.4 ns/op      5646.46 MB/s
BenchmarkFixed64/1024/seed-128                           5435251               212.9 ns/op      4810.73 MB/s
BenchmarkFixed64/8192-AVX512/default-128                 4728980               242.5 ns/op      33780.83 MB/s
BenchmarkFixed64/8192-AVX512/seed-128                    4174062               267.9 ns/op      30574.35 MB/s
BenchmarkFixed64/8192-AVX2/default-128                   3687218               314.8 ns/op      26021.09 MB/s
BenchmarkFixed64/8192-AVX2/seed-128                      3442676               343.6 ns/op      23841.16 MB/s
BenchmarkFixed64/8192-SSE2/default-128                   1998548               604.6 ns/op      13549.05 MB/s
BenchmarkFixed64/8192-SSE2/seed-128                      1872238               624.1 ns/op      13125.19 MB/s
BenchmarkFixed64/8192/default-128                         845390              1447 ns/op        5661.09 MB/s
BenchmarkFixed64/8192/seed-128                            850798              1451 ns/op        5644.00 MB/s
BenchmarkFixed64/102400-AVX512/default-128                435110              2752 ns/op        37212.48 MB/s
BenchmarkFixed64/102400-AVX512/seed-128                   430124              2783 ns/op        36792.95 MB/s
BenchmarkFixed64/102400-AVX2/default-128                  322018              3751 ns/op        27299.33 MB/s
BenchmarkFixed64/102400-AVX2/seed-128                     308383              3764 ns/op        27202.02 MB/s
BenchmarkFixed64/102400-SSE2/default-128                  136687              7435 ns/op        13772.94 MB/s
BenchmarkFixed64/102400-SSE2/seed-128                     161362              7441 ns/op        13761.87 MB/s
BenchmarkFixed64/102400/default-128                        63981             17674 ns/op        5793.77 MB/s
BenchmarkFixed64/102400/seed-128                           64335             17471 ns/op        5861.18 MB/s
BenchmarkFixed64/1024000-AVX512/default-128                42333             27085 ns/op        37807.05 MB/s
BenchmarkFixed64/1024000-AVX512/seed-128                   42436             27119 ns/op        37759.09 MB/s
BenchmarkFixed64/1024000-AVX2/default-128                  31738             36937 ns/op        27722.55 MB/s
BenchmarkFixed64/1024000-AVX2/seed-128                     31447             36923 ns/op        27733.75 MB/s
BenchmarkFixed64/1024000-SSE2/default-128                  15320             74755 ns/op        13698.03 MB/s
BenchmarkFixed64/1024000-SSE2/seed-128                     15184             73787 ns/op        13877.81 MB/s
BenchmarkFixed64/1024000/default-128                        6748            178480 ns/op        5737.33 MB/s
BenchmarkFixed64/1024000/seed-128                           6993            174973 ns/op        5852.33 MB/s
BenchmarkFixed64/10240000-AVX512/default-128                2907            387592 ns/op        26419.53 MB/s
BenchmarkFixed64/10240000-AVX512/seed-128                   3070            383256 ns/op        26718.46 MB/s
BenchmarkFixed64/10240000-AVX2/default-128                  2971            392930 ns/op        26060.62 MB/s
BenchmarkFixed64/10240000-AVX2/seed-128                     2980            397626 ns/op        25752.85 MB/s
BenchmarkFixed64/10240000-SSE2/default-128                  1347            749615 ns/op        13660.34 MB/s
BenchmarkFixed64/10240000-SSE2/seed-128                     1377            753282 ns/op        13593.85 MB/s
BenchmarkFixed64/10240000/default-128                        650           1750421 ns/op        5850.02 MB/s
BenchmarkFixed64/10240000/seed-128                           645           1756879 ns/op        5828.52 MB/s
BenchmarkFixed64/102400000-AVX512/default-128                152          10019955 ns/op        10219.61 MB/s
BenchmarkFixed64/102400000-AVX512/seed-128                   123           9603347 ns/op        10662.95 MB/s
BenchmarkFixed64/102400000-AVX2/default-128                  126           8767211 ns/op        11679.88 MB/s
BenchmarkFixed64/102400000-AVX2/seed-128                     130           8960707 ns/op        11427.67 MB/s
BenchmarkFixed64/102400000-SSE2/default-128                   73          14166372 ns/op        7228.39 MB/s
BenchmarkFixed64/102400000-SSE2/seed-128                      90          13080033 ns/op        7828.73 MB/s
BenchmarkFixed64/102400000/default-128                        49          21931573 ns/op        4669.07 MB/s
BenchmarkFixed64/102400000/seed-128                           63          23193454 ns/op        4415.04 MB/s
PASS
ok      github.com/zeebo/xxh3   172.529s
minio@minio-k8s17:~/apps/xxh3/klaus/xxh3$

(There is some load on the system that I cannot turn off, but at least it doesn't seem to be a regression.)

zeebo commented 1 year ago

To get less noisy benchmarks, I ran this on my machine with the benchmark program nice'd and taskset'd to a single CPU, with hyperthreading and CPU frequency scaling disabled. I compared the runs with the newer benchstat tool, which also shows the geomean, and got these results:

AMD Ryzen 9 5950X 16-Core Processor

```
goos: linux
goarch: amd64
pkg: github.com/zeebo/xxh3
cpu: AMD Ryzen 9 5950X 16-Core Processor
                                │   old.txt   │              pr22.txt              │
                                │   sec/op    │   sec/op     vs base               │
Fixed64/241-AVX2/default          19.55n ± 0%   20.08n ± 0%   +2.71% (p=0.000 n=8)
Fixed64/512-AVX2/default          22.25n ± 0%   22.29n ± 0%   +0.18% (p=0.024 n=8)
Fixed64/1024-AVX2/default         32.08n ± 0%   32.48n ± 0%   +1.23% (p=0.000 n=8)
Fixed64/8192-AVX2/default         155.5n ± 0%   154.4n ± 0%   -0.71% (p=0.000 n=8)
Fixed64/102400-AVX2/default       1.809µ ± 0%   1.788µ ± 0%   -1.13% (p=0.000 n=8)
Fixed64/1024000-AVX2/default      18.74µ ± 0%   18.90µ ± 0%   +0.87% (p=0.000 n=8)
Fixed64/10240000-AVX2/default     189.0µ ± 0%   186.5µ ± 0%   -1.27% (p=0.000 n=8)
Fixed64/102400000-AVX2/default    3.095m ± 1%   3.125m ± 1%   +0.98% (p=0.001 n=8)
geomean                           1.600µ        1.606µ        +0.35%

                                │   old.txt   │              pr22.txt              │
                                │   sec/op    │   sec/op     vs base               │
Fixed64/241-AVX2/seed             30.82n ± 0%   30.79n ± 0%   -0.10% (p=0.003 n=8)
Fixed64/512-AVX2/seed             34.30n ± 0%   34.30n ± 0%        ~ (p=0.816 n=8)
Fixed64/1024-AVX2/seed            43.44n ± 0%   43.53n ± 0%   +0.21% (p=0.036 n=8)
Fixed64/8192-AVX2/seed            167.5n ± 0%   165.4n ± 0%   -1.28% (p=0.000 n=8)
Fixed64/102400-AVX2/seed          1.809µ ± 0%   1.778µ ± 0%   -1.66% (p=0.000 n=8)
Fixed64/1024000-AVX2/seed         18.74µ ± 0%   18.90µ ± 0%   +0.90% (p=0.000 n=8)
Fixed64/10240000-AVX2/seed        188.2µ ± 0%   186.1µ ± 0%   -1.10% (p=0.000 n=8)
Fixed64/102400000-AVX2/seed       3.090m ± 1%   3.106m ± 1%   +0.50% (p=0.028 n=8)
geomean                           1.873µ        1.867µ        -0.32%

                                │   old.txt   │              pr22.txt              │
                                │   sec/op    │   sec/op     vs base               │
Fixed128/241-AVX2/default         22.86n ± 0%   23.01n ± 0%   +0.66% (p=0.000 n=8)
Fixed128/512-AVX2/default         25.48n ± 0%   25.56n ± 0%   +0.31% (p=0.000 n=8)
Fixed128/1024-AVX2/default        35.52n ± 0%   35.86n ± 0%   +0.94% (p=0.000 n=8)
Fixed128/8192-AVX2/default        174.0n ± 0%   172.4n ± 0%   -0.92% (p=0.000 n=8)
Fixed128/102400-AVX2/default      1.834µ ± 0%   1.795µ ± 0%   -2.13% (p=0.000 n=8)
Fixed128/1024000-AVX2/default     18.65µ ± 0%   18.37µ ± 0%   -1.53% (p=0.000 n=8)
Fixed128/10240000-AVX2/default    188.6µ ± 0%   186.2µ ± 0%   -1.24% (p=0.000 n=8)
Fixed128/102400000-AVX2/default   3.084m ± 0%   3.102m ± 0%   +0.60% (p=0.000 n=8)
geomean                           1.706µ        1.698µ        -0.42%

                                │   old.txt   │              pr22.txt              │
                                │   sec/op    │   sec/op     vs base               │
Fixed128/241-AVX2/seed            36.83n ± 0%   36.95n ± 0%   +0.31% (p=0.000 n=8)
Fixed128/512-AVX2/seed            41.47n ± 0%   41.14n ± 0%   -0.81% (p=0.000 n=8)
Fixed128/1024-AVX2/seed           48.48n ± 0%   49.92n ± 0%   +2.95% (p=0.000 n=8)
Fixed128/8192-AVX2/seed           173.3n ± 0%   172.1n ± 0%   -0.69% (p=0.000 n=8)
Fixed128/102400-AVX2/seed         1.814µ ± 0%   1.792µ ± 0%   -1.21% (p=0.000 n=8)
Fixed128/1024000-AVX2/seed        18.59µ ± 0%   18.54µ ± 1%        ~ (p=0.520 n=8)
Fixed128/10240000-AVX2/seed       188.6µ ± 0%   186.8µ ± 0%   -0.91% (p=0.000 n=8)
Fixed128/102400000-AVX2/seed      3.098m ± 0%   3.119m ± 1%        ~ (p=0.065 n=8)
geomean                           1.997µ        1.997µ        -0.01%

                                │   old.txt   │              pr22.txt              │
                                │   sec/op    │   sec/op     vs base               │
Hasher64/16/avx2/plain            12.87n ± 0%   12.86n ± 0%        ~ (p=0.305 n=8)
Hasher64/64/avx2/plain            15.74n ± 0%   15.74n ± 0%        ~ (p=0.724 n=8)
Hasher64/256/avx2/plain           27.64n ± 0%   27.47n ± 0%   -0.63% (p=0.000 n=8)
Hasher64/1024/avx2/plain          51.06n ± 0%   51.50n ± 0%   +0.86% (p=0.000 n=8)
Hasher64/4096/avx2/plain          111.7n ± 0%   109.7n ± 0%   -1.84% (p=0.000 n=8)
Hasher64/16384/avx2/plain         356.5n ± 0%   346.2n ± 0%   -2.90% (p=0.000 n=8)
Hasher64/65536/avx2/plain         1.348µ ± 0%   1.317µ ± 0%   -2.34% (p=0.000 n=8)
Hasher64/262144/avx2/plain        5.357µ ± 1%   5.228µ ± 0%   -2.41% (p=0.000 n=8)
Hasher64/1048576/avx2/plain       21.82µ ± 0%   21.78µ ± 1%        ~ (p=0.442 n=8)
Hasher64/4194304/avx2/plain       87.48µ ± 0%   86.08µ ± 0%   -1.60% (p=0.000 n=8)
Hasher64/16777216/avx2/plain      359.4µ ± 2%   342.9µ ± 0%   -4.58% (p=0.000 n=8)
Hasher64/67108864/avx2/plain      2.401m ± 1%   2.344m ± 0%   -2.37% (p=0.000 n=8)
Hasher64/268435456/avx2/plain     9.559m ± 0%   9.270m ± 0%   -3.02% (p=0.000 n=8)
geomean                           2.952µ        2.904µ        -1.63%

                                │   old.txt   │              pr22.txt              │
                                │   sec/op    │   sec/op     vs base               │
Hasher64/16/avx2/seed             14.76n ± 2%   14.76n ± 0%        ~ (p=0.980 n=8)
Hasher64/64/avx2/seed             17.84n ± 0%   17.80n ± 0%   -0.20% (p=0.005 n=8)
Hasher64/256/avx2/seed            43.94n ± 0%   43.97n ± 0%        ~ (p=0.145 n=8)
Hasher64/1024/avx2/seed           64.68n ± 0%   64.91n ± 0%   +0.36% (p=0.007 n=8)
Hasher64/4096/avx2/seed           114.2n ± 0%   112.5n ± 0%   -1.49% (p=0.000 n=8)
Hasher64/16384/avx2/seed          361.1n ± 0%   349.5n ± 0%   -3.20% (p=0.000 n=8)
Hasher64/65536/avx2/seed          1.369µ ± 0%   1.320µ ± 0%   -3.54% (p=0.000 n=8)
Hasher64/262144/avx2/seed         5.376µ ± 1%   5.213µ ± 0%   -3.03% (p=0.000 n=8)
Hasher64/1048576/avx2/seed        21.99µ ± 0%   21.49µ ± 0%   -2.25% (p=0.000 n=8)
Hasher64/4194304/avx2/seed        87.98µ ± 0%   86.07µ ± 0%   -2.18% (p=0.000 n=8)
Hasher64/16777216/avx2/seed       361.1µ ± 0%   340.8µ ± 0%   -5.62% (p=0.000 n=8)
Hasher64/67108864/avx2/seed       2.393m ± 0%   2.340m ± 0%   -2.21% (p=0.000 n=8)
Hasher64/268435456/avx2/seed      9.513m ± 0%   9.282m ± 1%   -2.43% (p=0.000 n=8)
geomean                           3.195µ        3.131µ        -1.99%

                                │   old.txt   │              pr22.txt              │
                                │   sec/op    │   sec/op     vs base               │
Hasher128/16/avx2/plain           13.97n ± 0%   13.92n ± 0%   -0.32% (p=0.000 n=8)
Hasher128/64/avx2/plain           18.56n ± 0%   18.16n ± 0%   -2.16% (p=0.000 n=8)
Hasher128/256/avx2/plain          32.20n ± 0%   32.53n ± 0%   +1.02% (p=0.000 n=8)
Hasher128/1024/avx2/plain         58.72n ± 0%   58.91n ± 0%   +0.33% (p=0.000 n=8)
Hasher128/4096/avx2/plain         115.0n ± 0%   112.9n ± 0%   -1.83% (p=0.000 n=8)
Hasher128/16384/avx2/plain        360.2n ± 0%   349.9n ± 0%   -2.86% (p=0.000 n=8)
Hasher128/65536/avx2/plain        1.360µ ± 0%   1.323µ ± 0%   -2.76% (p=0.000 n=8)
Hasher128/262144/avx2/plain       5.306µ ± 0%   5.263µ ± 0%   -0.81% (p=0.000 n=8)
Hasher128/1048576/avx2/plain      22.31µ ± 0%   21.82µ ± 2%   -2.18% (p=0.000 n=8)
Hasher128/4194304/avx2/plain      87.29µ ± 0%   84.80µ ± 0%   -2.85% (p=0.000 n=8)
Hasher128/16777216/avx2/plain     354.3µ ± 1%   342.5µ ± 0%   -3.33% (p=0.000 n=8)
Hasher128/67108864/avx2/plain     2.404m ± 0%   2.349m ± 0%   -2.31% (p=0.000 n=8)
Hasher128/268435456/avx2/plain    9.586m ± 0%   9.275m ± 0%   -3.24% (p=0.000 n=8)
geomean                           3.089µ        3.033µ        -1.80%

                                │   old.txt   │              pr22.txt              │
                                │   sec/op    │   sec/op     vs base               │
Hasher128/16/avx2/seed            15.62n ± 2%   15.61n ± 0%        ~ (p=0.331 n=8)
Hasher128/64/avx2/seed            20.62n ± 1%   20.57n ± 0%        ~ (p=0.411 n=8)
Hasher128/256/avx2/seed           51.10n ± 0%   51.07n ± 0%        ~ (p=0.422 n=8)
Hasher128/1024/avx2/seed          72.28n ± 0%   73.20n ± 0%   +1.29% (p=0.000 n=8)
Hasher128/4096/avx2/seed          119.1n ± 0%   122.8n ± 5%        ~ (p=1.000 n=8)
Hasher128/16384/avx2/seed         366.0n ± 0%   353.9n ± 0%   -3.31% (p=0.000 n=8)
Hasher128/65536/avx2/seed         1.374µ ± 0%   1.328µ ± 0%   -3.35% (p=0.000 n=8)
Hasher128/262144/avx2/seed        5.325µ ± 0%   5.226µ ± 0%   -1.84% (p=0.000 n=8)
Hasher128/1048576/avx2/seed       22.36µ ± 2%   21.87µ ± 0%   -2.20% (p=0.000 n=8)
Hasher128/4194304/avx2/seed       87.53µ ± 0%   84.76µ ± 0%   -3.17% (p=0.000 n=8)
Hasher128/16777216/avx2/seed      358.5µ ± 1%   341.7µ ± 0%   -4.71% (p=0.000 n=8)
Hasher128/67108864/avx2/seed      2.398m ± 0%   2.344m ± 0%   -2.26% (p=0.000 n=8)
Hasher128/268435456/avx2/seed     9.540m ± 1%   9.268m ± 0%   -2.84% (p=0.000 n=8)
geomean                           3.326µ        3.275µ        -1.53%
```
Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

```
goos: linux
goarch: amd64
pkg: github.com/zeebo/xxh3
cpu: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
                                │ intel-old.txt │           intel-pr22.txt           │
                                │    sec/op     │   sec/op     vs base               │
Fixed64/241-AVX2/default          19.57n ± 1%     20.20n ± 0%   +3.22% (p=0.000 n=8)
Fixed64/512-AVX2/default          24.13n ± 0%     24.27n ± 0%   +0.56% (p=0.000 n=8)
Fixed64/1024-AVX2/default         33.79n ± 0%     33.76n ± 0%        ~ (p=0.197 n=8)
Fixed64/8192-AVX2/default         175.2n ± 0%     168.1n ± 0%   -4.05% (p=0.000 n=8)
Fixed64/102400-AVX2/default       2.028µ ± 0%     1.937µ ± 0%   -4.49% (p=0.000 n=8)
Fixed64/1024000-AVX2/default      20.25µ ± 0%     19.22µ ± 0%   -5.06% (p=0.000 n=8)
Fixed64/10240000-AVX2/default     318.9µ ± 1%     279.1µ ± 0%  -12.46% (p=0.000 n=8)
Fixed64/102400000-AVX2/default    3.919m ± 1%     3.708m ± 0%   -5.37% (p=0.000 n=8)
geomean                           1.860µ          1.794µ        -3.57%

                                │ intel-old.txt │           intel-pr22.txt           │
                                │    sec/op     │   sec/op     vs base               │
Fixed64/241-AVX2/seed             40.91n ± 0%     41.49n ± 0%   +1.43% (p=0.000 n=8)
Fixed64/512-AVX2/seed             44.81n ± 0%     45.05n ± 1%   +0.54% (p=0.010 n=8)
Fixed64/1024-AVX2/seed            54.84n ± 0%     54.80n ± 1%        ~ (p=0.184 n=8)
Fixed64/8192-AVX2/seed            195.8n ± 0%     190.3n ± 0%   -2.81% (p=0.000 n=8)
Fixed64/102400-AVX2/seed          2.058µ ± 0%     1.960µ ± 0%   -4.76% (p=0.000 n=8)
Fixed64/1024000-AVX2/seed         20.39µ ± 0%     19.23µ ± 0%   -5.68% (p=0.000 n=8)
Fixed64/10240000-AVX2/seed        319.5µ ± 0%     278.8µ ± 0%  -12.73% (p=0.000 n=8)
Fixed64/102400000-AVX2/seed       3.927m ± 0%     3.721m ± 0%   -5.25% (p=0.000 n=8)
geomean                           2.382µ          2.292µ        -3.77%

                                │ intel-old.txt │           intel-pr22.txt           │
                                │    sec/op     │   sec/op     vs base               │
Fixed128/241-AVX2/default         25.79n ± 1%     26.07n ± 0%   +1.07% (p=0.000 n=8)
Fixed128/512-AVX2/default         30.62n ± 0%     31.15n ± 0%   +1.71% (p=0.000 n=8)
Fixed128/1024-AVX2/default        39.77n ± 0%     39.57n ± 0%   -0.52% (p=0.000 n=8)
Fixed128/8192-AVX2/default        179.8n ± 0%     176.0n ± 0%   -2.09% (p=0.000 n=8)
Fixed128/102400-AVX2/default      2.006µ ± 0%     1.943µ ± 0%   -3.14% (p=0.000 n=8)
Fixed128/1024000-AVX2/default     19.98µ ± 0%     19.21µ ± 0%   -3.84% (p=0.000 n=8)
Fixed128/10240000-AVX2/default    319.6µ ± 0%     279.2µ ± 0%  -12.62% (p=0.000 n=8)
Fixed128/102400000-AVX2/default   4.102m ± 0%     3.928m ± 0%   -4.24% (p=0.000 n=8)
geomean                           2.037µ          1.975µ        -3.05%

                                │ intel-old.txt │           intel-pr22.txt           │
                                │    sec/op     │   sec/op     vs base               │
Fixed128/241-AVX2/seed            47.72n ± 0%     48.02n ± 0%   +0.62% (p=0.000 n=8)
Fixed128/512-AVX2/seed            52.46n ± 0%     52.46n ± 0%        ~ (p=0.979 n=8)
Fixed128/1024-AVX2/seed           61.72n ± 0%     61.52n ± 0%   -0.32% (p=0.000 n=8)
Fixed128/8192-AVX2/seed           201.5n ± 0%     197.9n ± 0%   -1.81% (p=0.000 n=8)
Fixed128/102400-AVX2/seed         2.031µ ± 0%     1.980µ ± 0%   -2.51% (p=0.000 n=8)
Fixed128/1024000-AVX2/seed        20.03µ ± 0%     19.45µ ± 0%   -2.90% (p=0.000 n=8)
Fixed128/10240000-AVX2/seed       317.9µ ± 0%     280.2µ ± 0%  -11.86% (p=0.000 n=8)
Fixed128/102400000-AVX2/seed      4.103m ± 0%     3.933m ± 0%   -4.13% (p=0.000 n=8)
geomean                           2.525µ          2.451µ        -2.94%

                                │ intel-old.txt │           intel-pr22.txt           │
                                │    sec/op     │   sec/op     vs base               │
Hasher64/16/avx2/plain            24.54n ± 0%     24.41n ± 1%        ~ (p=0.124 n=8)
Hasher64/64/avx2/plain            29.24n ± 0%     29.10n ± 0%   -0.48% (p=0.000 n=8)
Hasher64/256/avx2/plain           40.87n ± 0%     40.60n ± 1%   -0.66% (p=0.000 n=8)
Hasher64/1024/avx2/plain          60.36n ± 0%     59.94n ± 0%   -0.70% (p=0.000 n=8)
Hasher64/4096/avx2/plain          128.2n ± 0%     128.7n ± 0%   +0.35% (p=0.000 n=8)
Hasher64/16384/avx2/plain         420.6n ± 0%     426.6n ± 0%   +1.43% (p=0.000 n=8)
Hasher64/65536/avx2/plain         1.799µ ± 0%     1.781µ ± 0%   -1.00% (p=0.000 n=8)
Hasher64/262144/avx2/plain        7.664µ ± 1%     7.510µ ± 2%        ~ (p=0.152 n=8)
Hasher64/1048576/avx2/plain       33.05µ ± 0%     32.85µ ± 0%   -0.59% (p=0.000 n=8)
Hasher64/4194304/avx2/plain       132.0µ ± 1%     130.4µ ± 0%   -1.18% (p=0.000 n=8)
Hasher64/16777216/avx2/plain      825.1µ ± 0%     794.5µ ± 1%   -3.71% (p=0.000 n=8)
Hasher64/67108864/avx2/plain      3.490m ± 0%     3.484m ± 0%   -0.18% (p=0.050 n=8)
Hasher64/268435456/avx2/plain     14.45m ± 0%     14.16m ± 0%   -2.01% (p=0.000 n=8)
geomean                           4.410µ          4.372µ        -0.88%

                                │ intel-old.txt │           intel-pr22.txt           │
                                │    sec/op     │   sec/op     vs base               │
Hasher64/16/avx2/seed             28.56n ± 0%     28.51n ± 0%        ~ (p=0.123 n=8)
Hasher64/64/avx2/seed             33.73n ± 1%     33.52n ± 0%   -0.62% (p=0.009 n=8)
Hasher64/256/avx2/seed            66.88n ± 0%     66.98n ± 0%        ~ (p=0.053 n=8)
Hasher64/1024/avx2/seed           86.52n ± 0%     85.50n ± 0%   -1.18% (p=0.000 n=8)
Hasher64/4096/avx2/seed           133.6n ± 0%     133.4n ± 0%   -0.15% (p=0.004 n=8)
Hasher64/16384/avx2/seed          443.6n ± 0%     440.3n ± 0%   -0.76% (p=0.000 n=8)
Hasher64/65536/avx2/seed          1.851µ ± 0%     1.836µ ± 0%   -0.81% (p=0.000 n=8)
Hasher64/262144/avx2/seed         7.736µ ± 0%     7.679µ ± 0%   -0.73% (p=0.000 n=8)
Hasher64/1048576/avx2/seed        33.52µ ± 0%     33.29µ ± 0%   -0.66% (p=0.000 n=8)
Hasher64/4194304/avx2/seed        132.9µ ± 0%     132.3µ ± 0%   -0.45% (p=0.000 n=8)
Hasher64/16777216/avx2/seed       828.3µ ± 0%     797.2µ ± 0%   -3.76% (p=0.000 n=8)
Hasher64/67108864/avx2/seed       3.490m ± 0%     3.490m ± 0%        ~ (p=0.959 n=8)
Hasher64/268435456/avx2/seed      14.49m ± 0%     14.18m ± 0%   -2.17% (p=0.000 n=8)
geomean                           4.877µ          4.834µ        -0.88%

                                │ intel-old.txt │           intel-pr22.txt           │
                                │    sec/op     │   sec/op     vs base               │
Hasher128/16/avx2/plain           26.90n ± 1%     26.75n ± 1%   -0.58% (p=0.014 n=8)
Hasher128/64/avx2/plain           32.59n ± 0%     32.57n ± 0%        ~ (p=0.642 n=8)
Hasher128/256/avx2/plain          47.16n ± 0%     47.15n ± 0%        ~ (p=0.368 n=8)
Hasher128/1024/avx2/plain         67.01n ± 0%     66.54n ± 0%   -0.69% (p=0.000 n=8)
Hasher128/4096/avx2/plain         134.2n ± 0%     134.1n ± 0%   -0.11% (p=0.033 n=8)
Hasher128/16384/avx2/plain        428.0n ± 0%     431.6n ± 0%   +0.85% (p=0.000 n=8)
Hasher128/65536/avx2/plain        1.855µ ± 0%     1.786µ ± 0%   -3.69% (p=0.000 n=8)
Hasher128/262144/avx2/plain       7.913µ ± 0%     7.645µ ± 0%   -3.39% (p=0.000 n=8)
Hasher128/1048576/avx2/plain      33.86µ ± 0%     32.78µ ± 0%   -3.20% (p=0.000 n=8)
Hasher128/4194304/avx2/plain      134.7µ ± 0%     130.1µ ± 0%   -3.38% (p=0.000 n=8)
Hasher128/16777216/avx2/plain     834.3µ ± 0%     797.2µ ± 0%   -4.45% (p=0.000 n=8)
Hasher128/67108864/avx2/plain     3.543m ± 0%     3.484m ± 0%   -1.67% (p=0.000 n=8)
Hasher128/268435456/avx2/plain    14.35m ± 0%     14.08m ± 0%   -1.90% (p=0.007 n=8)
geomean                           4.632µ          4.552µ        -1.73%

                                │ intel-old.txt │           intel-pr22.txt           │
                                │    sec/op     │   sec/op     vs base               │
Hasher128/16/avx2/seed            31.06n ± 0%     31.08n ± 0%        ~ (p=0.863 n=8)
Hasher128/64/avx2/seed            37.65n ± 0%     37.59n ± 0%        ~ (p=0.123 n=8)
Hasher128/256/avx2/seed           72.87n ± 0%     73.27n ± 0%   +0.56% (p=0.000 n=8)
Hasher128/1024/avx2/seed          92.25n ± 0%     91.10n ± 0%   -1.24% (p=0.000 n=8)
Hasher128/4096/avx2/seed          139.6n ± 0%     138.4n ± 0%   -0.86% (p=0.000 n=8)
Hasher128/16384/avx2/seed         454.8n ± 0%     446.2n ± 0%   -1.89% (p=0.000 n=8)
Hasher128/65536/avx2/seed         1.909µ ± 0%     1.842µ ± 0%   -3.51% (p=0.000 n=8)
Hasher128/262144/avx2/seed        8.076µ ± 0%     7.837µ ± 2%   -2.97% (p=0.000 n=8)
Hasher128/1048576/avx2/seed       34.29µ ± 0%     33.20µ ± 0%   -3.16% (p=0.000 n=8)
Hasher128/4194304/avx2/seed       136.5µ ± 0%     132.1µ ± 0%   -3.24% (p=0.000 n=8)
Hasher128/16777216/avx2/seed      836.5µ ± 0%     800.9µ ± 0%   -4.26% (p=0.000 n=8)
Hasher128/67108864/avx2/seed      3.548m ± 0%     3.489m ± 0%   -1.66% (p=0.000 n=8)
Hasher128/268435456/avx2/seed     14.38m ± 0%     14.07m ± 0%   -2.10% (p=0.000 n=8)
geomean                           5.089µ          4.993µ        -1.89%
```

So it seems like ~0% for amd+fixed, ~-1.5% for amd+hasher, ~-3.5% for intel+fixed, and ~-1% for intel+hasher, with most of the gains at large sizes. Looks good!

klauspost commented 1 year ago

Thanks for confirming. At least I didn't oversell it in the title :D