Closed klauspost closed 1 year ago
Seems like AVX512 likes it a lot until it hits the memory bandwidth limit.
minio@minio-k8s17:~/apps/xxh3$ go test -bench=Fixed64
go: downloading github.com/zeebo/assert v1.3.0
go: downloading github.com/klauspost/cpuid/v2 v2.0.9
goos: linux
goarch: amd64
pkg: github.com/zeebo/xxh3
cpu: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
BenchmarkFixed64/1024-AVX512/default-128 18415684 64.75 ns/op 15815.42 MB/s
BenchmarkFixed64/1024-AVX512/seed-128 11996121 86.43 ns/op 11847.06 MB/s
BenchmarkFixed64/1024-AVX2/default-128 20169603 60.48 ns/op 16930.09 MB/s
BenchmarkFixed64/1024-AVX2/seed-128 13220286 80.50 ns/op 12720.39 MB/s
BenchmarkFixed64/1024-SSE2/default-128 13571812 90.09 ns/op 11366.81 MB/s
BenchmarkFixed64/1024-SSE2/seed-128 9564871 116.9 ns/op 8757.15 MB/s
BenchmarkFixed64/1024/default-128 6226238 181.4 ns/op 5645.35 MB/s
BenchmarkFixed64/1024/seed-128 5396823 214.6 ns/op 4771.76 MB/s
BenchmarkFixed64/8192-AVX512/default-128 3590719 325.5 ns/op 25169.43 MB/s
BenchmarkFixed64/8192-AVX512/seed-128 3464196 341.5 ns/op 23991.18 MB/s
BenchmarkFixed64/8192-AVX2/default-128 3679377 329.0 ns/op 24899.53 MB/s
BenchmarkFixed64/8192-AVX2/seed-128 3295234 377.5 ns/op 21698.57 MB/s
BenchmarkFixed64/8192-SSE2/default-128 1938860 614.4 ns/op 13333.91 MB/s
BenchmarkFixed64/8192-SSE2/seed-128 1857984 635.3 ns/op 12894.95 MB/s
BenchmarkFixed64/8192/default-128 843528 1398 ns/op 5861.34 MB/s
BenchmarkFixed64/8192/seed-128 816256 1438 ns/op 5696.92 MB/s
BenchmarkFixed64/102400-AVX512/default-128 322002 3722 ns/op 27514.07 MB/s
BenchmarkFixed64/102400-AVX512/seed-128 320775 3739 ns/op 27385.60 MB/s
BenchmarkFixed64/102400-AVX2/default-128 300304 3909 ns/op 26197.70 MB/s
BenchmarkFixed64/102400-AVX2/seed-128 297676 3915 ns/op 26156.07 MB/s
BenchmarkFixed64/102400-SSE2/default-128 138745 8115 ns/op 12618.39 MB/s
BenchmarkFixed64/102400-SSE2/seed-128 138241 7643 ns/op 13398.59 MB/s
BenchmarkFixed64/102400/default-128 64280 17548 ns/op 5835.37 MB/s
BenchmarkFixed64/102400/seed-128 64177 19197 ns/op 5334.07 MB/s
BenchmarkFixed64/1024000-AVX512/default-128 31641 36940 ns/op 27720.90 MB/s
BenchmarkFixed64/1024000-AVX512/seed-128 31774 36948 ns/op 27714.79 MB/s
BenchmarkFixed64/1024000-AVX2/default-128 30637 41318 ns/op 24783.33 MB/s
BenchmarkFixed64/1024000-AVX2/seed-128 30248 40277 ns/op 25423.82 MB/s
BenchmarkFixed64/1024000-SSE2/default-128 15472 74479 ns/op 13748.92 MB/s
BenchmarkFixed64/1024000-SSE2/seed-128 23439 75607 ns/op 13543.75 MB/s
BenchmarkFixed64/1024000/default-128 10538 175230 ns/op 5843.75 MB/s
BenchmarkFixed64/1024000/seed-128 6722 177354 ns/op 5773.76 MB/s
BenchmarkFixed64/10240000-AVX512/default-128 2950 367137 ns/op 27891.50 MB/s
BenchmarkFixed64/10240000-AVX512/seed-128 3084 367614 ns/op 27855.30 MB/s
BenchmarkFixed64/10240000-AVX2/default-128 2866 408285 ns/op 25080.50 MB/s
BenchmarkFixed64/10240000-AVX2/seed-128 2972 392823 ns/op 26067.70 MB/s
BenchmarkFixed64/10240000-SSE2/default-128 1356 769873 ns/op 13300.90 MB/s
BenchmarkFixed64/10240000-SSE2/seed-128 1364 750373 ns/op 13646.54 MB/s
BenchmarkFixed64/10240000/default-128 650 1739687 ns/op 5886.12 MB/s
BenchmarkFixed64/10240000/seed-128 640 1768122 ns/op 5791.46 MB/s
BenchmarkFixed64/102400000-AVX512/default-128 150 7976826 ns/op 12837.19 MB/s
BenchmarkFixed64/102400000-AVX512/seed-128 148 7874286 ns/op 13004.35 MB/s
BenchmarkFixed64/102400000-AVX2/default-128 132 10930617 ns/op 9368.18 MB/s
BenchmarkFixed64/102400000-AVX2/seed-128 129 9892323 ns/op 10351.46 MB/s
BenchmarkFixed64/102400000-SSE2/default-128 112 10339681 ns/op 9903.59 MB/s
BenchmarkFixed64/102400000-SSE2/seed-128 109 13384707 ns/op 7650.52 MB/s
BenchmarkFixed64/102400000/default-128 58 20316244 ns/op 5040.30 MB/s
BenchmarkFixed64/102400000/seed-128 52 23992037 ns/op 4268.08 MB/s
PASS
[...]
minio@minio-k8s17:~/apps/xxh3/klaus/xxh3$ git checkout improve-avx2
Branch 'improve-avx2' set up to track remote branch 'improve-avx2' from 'origin'.
Switched to a new branch 'improve-avx2'
minio@minio-k8s17:~/apps/xxh3/klaus/xxh3$ go test -bench=Fixed64
goos: linux
goarch: amd64
pkg: github.com/zeebo/xxh3
cpu: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
BenchmarkFixed64/1024-AVX512/default-128 15388110 66.43 ns/op 15414.37 MB/s
BenchmarkFixed64/1024-AVX512/seed-128 12324342 83.69 ns/op 12236.23 MB/s
BenchmarkFixed64/1024-AVX2/default-128 20245687 58.49 ns/op 17508.25 MB/s
BenchmarkFixed64/1024-AVX2/seed-128 12746275 79.86 ns/op 12822.67 MB/s
BenchmarkFixed64/1024-SSE2/default-128 11872988 89.05 ns/op 11499.53 MB/s
BenchmarkFixed64/1024-SSE2/seed-128 9660879 112.2 ns/op 9123.16 MB/s
BenchmarkFixed64/1024/default-128 6191212 181.4 ns/op 5646.46 MB/s
BenchmarkFixed64/1024/seed-128 5435251 212.9 ns/op 4810.73 MB/s
BenchmarkFixed64/8192-AVX512/default-128 4728980 242.5 ns/op 33780.83 MB/s
BenchmarkFixed64/8192-AVX512/seed-128 4174062 267.9 ns/op 30574.35 MB/s
BenchmarkFixed64/8192-AVX2/default-128 3687218 314.8 ns/op 26021.09 MB/s
BenchmarkFixed64/8192-AVX2/seed-128 3442676 343.6 ns/op 23841.16 MB/s
BenchmarkFixed64/8192-SSE2/default-128 1998548 604.6 ns/op 13549.05 MB/s
BenchmarkFixed64/8192-SSE2/seed-128 1872238 624.1 ns/op 13125.19 MB/s
BenchmarkFixed64/8192/default-128 845390 1447 ns/op 5661.09 MB/s
BenchmarkFixed64/8192/seed-128 850798 1451 ns/op 5644.00 MB/s
BenchmarkFixed64/102400-AVX512/default-128 435110 2752 ns/op 37212.48 MB/s
BenchmarkFixed64/102400-AVX512/seed-128 430124 2783 ns/op 36792.95 MB/s
BenchmarkFixed64/102400-AVX2/default-128 322018 3751 ns/op 27299.33 MB/s
BenchmarkFixed64/102400-AVX2/seed-128 308383 3764 ns/op 27202.02 MB/s
BenchmarkFixed64/102400-SSE2/default-128 136687 7435 ns/op 13772.94 MB/s
BenchmarkFixed64/102400-SSE2/seed-128 161362 7441 ns/op 13761.87 MB/s
BenchmarkFixed64/102400/default-128 63981 17674 ns/op 5793.77 MB/s
BenchmarkFixed64/102400/seed-128 64335 17471 ns/op 5861.18 MB/s
BenchmarkFixed64/1024000-AVX512/default-128 42333 27085 ns/op 37807.05 MB/s
BenchmarkFixed64/1024000-AVX512/seed-128 42436 27119 ns/op 37759.09 MB/s
BenchmarkFixed64/1024000-AVX2/default-128 31738 36937 ns/op 27722.55 MB/s
BenchmarkFixed64/1024000-AVX2/seed-128 31447 36923 ns/op 27733.75 MB/s
BenchmarkFixed64/1024000-SSE2/default-128 15320 74755 ns/op 13698.03 MB/s
BenchmarkFixed64/1024000-SSE2/seed-128 15184 73787 ns/op 13877.81 MB/s
BenchmarkFixed64/1024000/default-128 6748 178480 ns/op 5737.33 MB/s
BenchmarkFixed64/1024000/seed-128 6993 174973 ns/op 5852.33 MB/s
BenchmarkFixed64/10240000-AVX512/default-128 2907 387592 ns/op 26419.53 MB/s
BenchmarkFixed64/10240000-AVX512/seed-128 3070 383256 ns/op 26718.46 MB/s
BenchmarkFixed64/10240000-AVX2/default-128 2971 392930 ns/op 26060.62 MB/s
BenchmarkFixed64/10240000-AVX2/seed-128 2980 397626 ns/op 25752.85 MB/s
BenchmarkFixed64/10240000-SSE2/default-128 1347 749615 ns/op 13660.34 MB/s
BenchmarkFixed64/10240000-SSE2/seed-128 1377 753282 ns/op 13593.85 MB/s
BenchmarkFixed64/10240000/default-128 650 1750421 ns/op 5850.02 MB/s
BenchmarkFixed64/10240000/seed-128 645 1756879 ns/op 5828.52 MB/s
BenchmarkFixed64/102400000-AVX512/default-128 152 10019955 ns/op 10219.61 MB/s
BenchmarkFixed64/102400000-AVX512/seed-128 123 9603347 ns/op 10662.95 MB/s
BenchmarkFixed64/102400000-AVX2/default-128 126 8767211 ns/op 11679.88 MB/s
BenchmarkFixed64/102400000-AVX2/seed-128 130 8960707 ns/op 11427.67 MB/s
BenchmarkFixed64/102400000-SSE2/default-128 73 14166372 ns/op 7228.39 MB/s
BenchmarkFixed64/102400000-SSE2/seed-128 90 13080033 ns/op 7828.73 MB/s
BenchmarkFixed64/102400000/default-128 49 21931573 ns/op 4669.07 MB/s
BenchmarkFixed64/102400000/seed-128 63 23193454 ns/op 4415.04 MB/s
PASS
ok github.com/zeebo/xxh3 172.529s
minio@minio-k8s17:~/apps/xxh3/klaus/xxh3$
(there is some load on the system I cannot turn off, but at least it doesn't seem to be a regression)
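For anyone comparing the two runs by hand, here is a minimal Go sketch of the arithmetic. It assumes the MB/s column is bytes per op divided by seconds per op, in units of 1e6 bytes, which matches the numbers in the logs above. The helper names `throughputMBps` and `percentChange` are made up for this example and are not part of xxh3 or the Go testing package. The sample values are the 8192-AVX512/default rows from the first run and from the improve-avx2 run.

```go
package main

import "fmt"

// throughputMBps converts a benchmark result to MB/s (1e6 bytes per second).
// bytes/ns equals GB/s, so multiplying by 1e3 gives MB/s.
func throughputMBps(bytesPerOp int, nsPerOp float64) float64 {
	return float64(bytesPerOp) / nsPerOp * 1e3
}

// percentChange returns the relative change in time per op from oldNs to
// newNs, in percent. Negative values mean the new run is faster.
func percentChange(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

func main() {
	const size = 8192 // input size in bytes for this benchmark row

	oldNs := 325.5 // 8192-AVX512/default, first run
	newNs := 242.5 // 8192-AVX512/default, improve-avx2 run

	fmt.Printf("old: %.2f MB/s\n", throughputMBps(size, oldNs))
	fmt.Printf("new: %.2f MB/s\n", throughputMBps(size, newNs))
	fmt.Printf("time/op change: %+.1f%%\n", percentChange(oldNs, newNs))
}
```

The computed MB/s differs slightly from the logged column because the logged ns/op values are rounded. A single pair like this says nothing about noise, so it is only a quick sanity check, not a replacement for benchstat over repeated runs.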
I ran this on my machine with the benchmark program nice'd, taskset'd to a single CPU, and with hyperthreading and CPU frequency scaling disabled to get less noisy benchmarks. I used the newer benchstat tool, which shows the geomean, and got these results:
So it seems like ~0% for amd+fixed, ~-1.5% for amd+hasher, ~-3.5% for intel+fixed and ~-1% for intel+hasher, with most of the gains at large sizes. Looks good!
Thanks for confirming. At least I didn't oversell it in the title :D
Benchmarks are a bit noisy from run to run, probably due to allocation alignment, but the overall trend appears positive.
My Zen 2 has memory -> register aliasing. This is not present on Intel, so the preloaded keys should help other platforms more.
I will have to run the tests on AVX512. If CI doesn't pick it up, I will report back on whether they pass.
Edit: It was picked up by CI:
=== RUN   TestVectorCompat
    compat_vector_test.go:30: avx512: true
    compat_vector_test.go:31: avx2: true
    compat_vector_test.go:32: sse2: true