zeebo / xxh3

XXH3 algorithm in Go
BSD 2-Clause "Simplified" License

Improve AVX2 hash speed. #9

Closed · klauspost closed this 3 years ago

klauspost commented 3 years ago

Toyed around a bit, and it seems like the biggest change came from loading the key value directly into the XOR.

Maybe it's an AMD-specific optimization.
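
Roughly, the change folds the key load into the XOR as a memory operand instead of going through a separate register load. A sketch in avo (the register names and offsets here are illustrative, not the exact generated code):

// Before: load the key into a register, then XOR register-to-register.
VMOVDQU(key.Offset(koff), y01)
VPXOR(y01, y00, y01)

// After: use the key location as a memory operand of the XOR directly.
VPXOR(key.Offset(koff), y00, y01)
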

benchmark                               old ns/op     new ns/op     delta
BenchmarkFixed128/241-AVX2-32           87.3          20.1          -76.98%
BenchmarkFixed128/512-AVX2-32           94.8          22.9          -75.82%
BenchmarkFixed128/1024-AVX2-32          104           30.1          -71.11%
BenchmarkFixed128/8192-AVX2-32          219           138           -36.97%
BenchmarkFixed128/102400-AVX2-32        1833          1562          -14.78%
BenchmarkFixed128/1024000-AVX2-32       18288         16166         -11.60%
BenchmarkFixed128/10240000-AVX2-32      191503        165534        -13.56%
BenchmarkFixed128/102400000-AVX2-32     5007156       4204041       -16.04%

benchmark                               old MB/s     new MB/s     speedup
BenchmarkFixed128/241-AVX2-32           2760.94      11996.27     4.34x
BenchmarkFixed128/512-AVX2-32           5403.02      22350.38     4.14x
BenchmarkFixed128/1024-AVX2-32          9844.49      34081.01     3.46x
BenchmarkFixed128/8192-AVX2-32          37382.01     59318.68     1.59x
BenchmarkFixed128/102400-AVX2-32        55875.41     65556.58     1.17x
BenchmarkFixed128/1024000-AVX2-32       55991.65     63343.83     1.13x
BenchmarkFixed128/10240000-AVX2-32      53471.70     61860.32     1.16x
BenchmarkFixed128/102400000-AVX2-32     20450.73     24357.52     1.19x

cpu: AMD Ryzen 9 3950X 16-Core Processor

Moved the code generation into the avo folder, since I couldn't get the other setup to work.

zeebo commented 3 years ago

I'm sorry that I totally ignored this PR for this long. I'll work on getting these improvements in.

zeebo commented 3 years ago

Well, I've hit a bit of an issue. First, I locally changed the AVX2 accum code to look like this:

// Two 32-byte lanes, three YMM temporaries each.
y00, y01, y02 := YMM(), YMM(), YMM()
y10, y11, y12 := YMM(), YMM(), YMM()

// Load 32 bytes of input per lane.
VMOVDQU(data.Offset(doff+0x00), y00)
VMOVDQU(data.Offset(doff+0x20), y10)
// XOR with the key directly from memory, skipping a separate key load.
VPXOR(key.Offset(koff+0x00), y00, y01)
VPXOR(key.Offset(koff+0x20), y10, y11)
// Bring the high 32-bit half of each 64-bit element into the low position (49 = 0b00110001).
VPSHUFD(Imm(49), y01, y02)
VPSHUFD(Imm(49), y11, y12)
// 32x32 -> 64-bit multiply of the low and high halves of each 64-bit element.
VPMULUDQ(y01, y02, y01)
VPMULUDQ(y11, y12, y11)
// Swap the 64-bit halves of the original data (78 = 0b01001110).
VPSHUFD(Imm(78), y00, y00)
VPSHUFD(Imm(78), y10, y10)
// Add both the product and the swapped data into the accumulators.
VPADDQ(a[0], y01, a[0])
VPADDQ(a[0], y00, a[0])
VPADDQ(a[1], y11, a[1])
VPADDQ(a[1], y10, a[1])

similar to the PR. When I run the benchmarks, though, I get these results:

  name                  old time/op    new time/op    delta
  Fixed128/241-AVX2       23.7ns ± 0%    23.4ns ± 0%  -1.34%  (p=0.000 n=9+9)
  Fixed128/512-AVX2       28.8ns ± 0%    28.0ns ± 0%  -2.79%  (p=0.000 n=10+10)
  Fixed128/1024-AVX2      39.4ns ± 0%    37.7ns ± 0%  -4.11%  (p=0.000 n=8+10)
  Fixed128/8192-AVX2       183ns ± 0%     178ns ± 0%  -2.82%  (p=0.000 n=9+9)
> Fixed128/102400-AVX2    2.30µs ± 0%    2.39µs ± 0%  +3.79%  (p=0.000 n=10+9)
  Fixed/241-AVX2          18.7ns ± 0%    18.4ns ± 0%  -1.58%  (p=0.000 n=7+10)
  Fixed/512-AVX2          23.8ns ± 0%    22.9ns ± 0%  -3.78%  (p=0.000 n=9+10)
  Fixed/1024-AVX2         34.4ns ± 0%    32.7ns ± 0%  -4.94%  (p=0.000 n=10+7)
  Fixed/8192-AVX2          178ns ± 0%     172ns ± 0%  -3.42%  (p=0.000 n=9+9)
> Fixed/102400-AVX2       2.30µs ± 0%    2.39µs ± 0%  +4.10%  (p=0.000 n=8+10)

  name                  old speed      new speed      delta
  Fixed128/241-AVX2     10.2GB/s ± 0%  10.3GB/s ± 0%  +1.36%  (p=0.000 n=8+9)
  Fixed128/512-AVX2     17.8GB/s ± 0%  18.3GB/s ± 0%  +2.87%  (p=0.000 n=10+10)
  Fixed128/1024-AVX2    26.0GB/s ± 0%  27.1GB/s ± 0%  +4.27%  (p=0.000 n=9+10)
  Fixed128/8192-AVX2    44.7GB/s ± 0%  45.9GB/s ± 0%  +2.88%  (p=0.000 n=8+10)
> Fixed128/102400-AVX2  44.5GB/s ± 0%  42.9GB/s ± 0%  -3.66%  (p=0.000 n=10+10)
  Fixed/241-AVX2        12.9GB/s ± 0%  13.1GB/s ± 0%  +1.58%  (p=0.000 n=9+10)
  Fixed/512-AVX2        21.5GB/s ± 0%  22.4GB/s ± 0%  +3.93%  (p=0.000 n=9+10)
  Fixed/1024-AVX2       29.8GB/s ± 0%  31.3GB/s ± 0%  +5.19%  (p=0.000 n=8+9)
  Fixed/8192-AVX2       45.9GB/s ± 0%  47.5GB/s ± 0%  +3.55%  (p=0.000 n=9+9)
> Fixed/102400-AVX2     44.6GB/s ± 0%  42.8GB/s ± 0%  -3.95%  (p=0.000 n=8+10)

It's slower for the very large sizes for some reason. The accum code operates on 64 bytes at a time and is unrolled 16 times for blocks, which I think is why the speed starts to regress for sizes larger than 16 * 64 = 1024 bytes.
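
For reference, a Go-level sketch of that structure (accumulate64 and scrambleAcc are hypothetical stand-ins for the real stripe and scramble steps, not functions in this package):

// Sketch only: acc is the accumulator state, p the input, key the secret.
for len(p) >= 1024 {
    for i := 0; i < 16; i++ { // the assembly unrolls these 16 stripes
        accumulate64(&acc, p[i*64:], key[i*8:]) // hypothetical 64-byte stripe step
    }
    scrambleAcc(&acc, key) // hypothetical end-of-block scramble
    p = p[1024:]
}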

Plugging one of the current 64-byte loop bodies into https://uica.uops.info (a great tool) shows a predicted throughput of ~4.38 (https://bit.ly/3gGOg3B), but the new loop only has a throughput of ~4.09 (https://bit.ly/3kqaqbs).

edit: apparently, throughput is better when it's lower ("Throughput (in cycles per iteration)"). I'm extra confused now.
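
(Sanity check, assuming a nominal 4 GHz clock just for scale: 64 bytes / 4.38 cycles ≈ 14.6 bytes/cycle ≈ 58 GB/s, while 64 / 4.09 ≈ 15.6 bytes/cycle ≈ 63 GB/s, so the lower cycles-per-iteration figure is the faster loop.)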

I don't know why the loop version runs faster when the unrolled version gets slower, though. I'll play around with some different orderings.

klauspost commented 3 years ago

> apparently, throughput is better when it's lower

Yeah, that's how I read it as well.

> I don't know why the loop version runs faster when the unrolled version gets slower, though. I'll play around with some different orderings.

This is the first size that moves beyond L1, so maybe the prefetching isn't optimal on the Intel part.

I had expected that on any superscalar CPU the rolled and unrolled versions would perform the same, and was surprised when they didn't.

klauspost commented 3 years ago

@zeebo Tweaked some more things after rebasing and experimented a bit. Found that, for once, prefetching was faster.
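
Roughly what that looks like in the avo code (the prefetch distance here is illustrative, not the exact value used in the PR):

// Prefetch input a bit ahead of the current offset so later loads hit cache.
PREFETCHT0(data.Offset(doff + 512))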

VZEROUPPER was missing. Adding it gives a huge improvement on small payloads.
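
For reference, the fix is to issue VZEROUPPER before returning from the AVX2 routine (placement sketched below, not the exact generated code):

// Clear the upper halves of the YMM registers before returning so that
// subsequent SSE code doesn't pay AVX-SSE transition penalties.
VZEROUPPER()
RET()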

Added longer benchmarks for different cache scenarios. Top benchmark updated.

zeebo commented 3 years ago

I saw your comment in the Gophers Slack reminding me to add VZEROUPPER, so I went and did that without having seen this PR first. :smile:

So I rebased and fixed the conflict. This is great! I also saw improvements on the order of ~12% for the 100K payloads. Thanks!

klauspost commented 3 years ago

@zeebo I saw your <3 there, so I assumed you had already seen the PR :D

Great that it improves Intel as well 👍🏼