Closed klauspost closed 3 years ago
I'm sorry that I totally ignored this PR for this long. I'll work on getting these improvements in.
Well, I've hit a bit of an issue. First, I locally changed the avx2 accum code to look like
```go
y00, y01, y02 := YMM(), YMM(), YMM()
y10, y11, y12 := YMM(), YMM(), YMM()
VMOVDQU(data.Offset(doff+0x00), y00)
VMOVDQU(data.Offset(doff+0x20), y10)
VPXOR(key.Offset(koff+0x00), y00, y01)
VPXOR(key.Offset(koff+0x20), y10, y11)
VPSHUFD(Imm(49), y01, y02)
VPSHUFD(Imm(49), y11, y12)
VPMULUDQ(y01, y02, y01)
VPMULUDQ(y11, y12, y11)
VPSHUFD(Imm(78), y00, y00)
VPSHUFD(Imm(78), y10, y10)
VPADDQ(a[0], y01, a[0])
VPADDQ(a[0], y00, a[0])
VPADDQ(a[1], y11, a[1])
VPADDQ(a[1], y10, a[1])
```
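For reference, here is a scalar sketch of what one 64-byte round of the AVX2 code above computes (the function name and layout are my own, not from the PR): each 64-bit lane XORs data with key, multiplies the two 32-bit halves of the result (the `VPSHUFD(Imm(49))` + `VPMULUDQ` pair), adds the product to the accumulator, and adds the raw data to the neighboring lane (the `VPSHUFD(Imm(78))` swap + `VPADDQ`).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// accumulate512 is a scalar sketch of one 64-byte accumulate round.
func accumulate512(acc *[8]uint64, data, key []byte) {
	for i := 0; i < 8; i++ {
		d := binary.LittleEndian.Uint64(data[8*i:])
		k := binary.LittleEndian.Uint64(key[8*i:])
		dk := d ^ k
		// VPSHUFD(Imm(49)) pairs the high dword with the low dword,
		// then VPMULUDQ multiplies them as 32x32 -> 64 bits.
		acc[i] += uint64(uint32(dk)) * (dk >> 32)
		// VPSHUFD(Imm(78)) swaps adjacent 64-bit lanes before the add.
		acc[i^1] += d
	}
}

func main() {
	var acc [8]uint64
	data := make([]byte, 64)
	key := make([]byte, 64)
	for i := range data {
		data[i] = byte(i)
		key[i] = byte(i * 7)
	}
	accumulate512(&acc, data, key)
	fmt.Println(acc)
}
```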
similar to the PR. When I run the benchmarks, though, I get these results:
```
name                    old time/op    new time/op    delta
Fixed128/241-AVX2         23.7ns ± 0%    23.4ns ± 0%  -1.34%  (p=0.000 n=9+9)
Fixed128/512-AVX2         28.8ns ± 0%    28.0ns ± 0%  -2.79%  (p=0.000 n=10+10)
Fixed128/1024-AVX2        39.4ns ± 0%    37.7ns ± 0%  -4.11%  (p=0.000 n=8+10)
Fixed128/8192-AVX2         183ns ± 0%     178ns ± 0%  -2.82%  (p=0.000 n=9+9)
> Fixed128/102400-AVX2    2.30µs ± 0%    2.39µs ± 0%  +3.79%  (p=0.000 n=10+9)
Fixed/241-AVX2            18.7ns ± 0%    18.4ns ± 0%  -1.58%  (p=0.000 n=7+10)
Fixed/512-AVX2            23.8ns ± 0%    22.9ns ± 0%  -3.78%  (p=0.000 n=9+10)
Fixed/1024-AVX2           34.4ns ± 0%    32.7ns ± 0%  -4.94%  (p=0.000 n=10+7)
Fixed/8192-AVX2            178ns ± 0%     172ns ± 0%  -3.42%  (p=0.000 n=9+9)
> Fixed/102400-AVX2       2.30µs ± 0%    2.39µs ± 0%  +4.10%  (p=0.000 n=8+10)

name                    old speed      new speed      delta
Fixed128/241-AVX2       10.2GB/s ± 0%  10.3GB/s ± 0%  +1.36%  (p=0.000 n=8+9)
Fixed128/512-AVX2       17.8GB/s ± 0%  18.3GB/s ± 0%  +2.87%  (p=0.000 n=10+10)
Fixed128/1024-AVX2      26.0GB/s ± 0%  27.1GB/s ± 0%  +4.27%  (p=0.000 n=9+10)
Fixed128/8192-AVX2      44.7GB/s ± 0%  45.9GB/s ± 0%  +2.88%  (p=0.000 n=8+10)
> Fixed128/102400-AVX2  44.5GB/s ± 0%  42.9GB/s ± 0%  -3.66%  (p=0.000 n=10+10)
Fixed/241-AVX2          12.9GB/s ± 0%  13.1GB/s ± 0%  +1.58%  (p=0.000 n=9+10)
Fixed/512-AVX2          21.5GB/s ± 0%  22.4GB/s ± 0%  +3.93%  (p=0.000 n=9+10)
Fixed/1024-AVX2         29.8GB/s ± 0%  31.3GB/s ± 0%  +5.19%  (p=0.000 n=8+9)
Fixed/8192-AVX2         45.9GB/s ± 0%  47.5GB/s ± 0%  +3.55%  (p=0.000 n=9+9)
> Fixed/102400-AVX2     44.6GB/s ± 0%  42.8GB/s ± 0%  -3.95%  (p=0.000 n=8+10)
```
It's slower for the very large sizes for some reason. The accum code operates on 64 bytes at a time and is unrolled 16 times per block, which I think is why the speed starts to regress for sizes larger than 16 * 64 = 1024 bytes.
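The shape being described can be sketched in Go like this (the names, the callback, and the remainder handling are my own; the generated assembly obviously inlines all of it):

```go
package main

import "fmt"

const (
	roundSize = 64                  // bytes consumed per accumulate round
	unroll    = 16                  // rounds emitted back-to-back
	blockSize = unroll * roundSize  // 1024 bytes handled by the unrolled body
)

// process drives a per-64-byte round over p: whole 1024-byte blocks go
// through the fully unrolled path, then trailing rounds run one at a time.
func process(p []byte, round func([]byte)) {
	for len(p) >= blockSize {
		for i := 0; i < unroll; i++ { // unrolled in the generated assembly
			round(p[i*roundSize : (i+1)*roundSize])
		}
		p = p[blockSize:]
	}
	for len(p) >= roundSize {
		round(p[:roundSize])
		p = p[roundSize:]
	}
}

func main() {
	calls := 0
	process(make([]byte, 2*blockSize+3*roundSize), func([]byte) { calls++ })
	fmt.Println(calls) // 2*16 unrolled rounds + 3 trailing rounds
}
```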
Plugging one of the current 64-byte loop bodies into https://uica.uops.info (a great tool) shows a predicted throughput of ~4.38 (https://bit.ly/3gGOg3B), but the new loop only has a throughput of ~4.09 (https://bit.ly/3kqaqbs).
edit: apparently, throughput is better when it's lower ("Throughput (in cycles per iteration)"). I'm extra confused now.
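A quick sanity check on the units: uiCA reports cycles per loop iteration, and each iteration here handles 64 bytes, so lower really is faster. At an assumed clock speed (4 GHz is my number, not measured on either machine):

```go
package main

import "fmt"

func main() {
	const (
		bytesPerIter = 64.0 // one loop iteration consumes 64 bytes
		ghz          = 4.0  // assumed clock speed; substitute your own
	)
	for _, cycles := range []float64{4.38, 4.09} {
		// (bytes/iter) * (Gcycles/s) / (cycles/iter) = GB/s
		gbps := bytesPerIter * ghz / cycles
		fmt.Printf("%.2f cycles/iter -> %.1f GB/s\n", cycles, gbps)
	}
}
```

That works out to roughly 58 GB/s vs 63 GB/s of predicted peak, so the 4.09 loop should indeed be the faster one in a pure throughput sense.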
I don't know why the loop version runs faster when the unrolled version gets slower, though. I'll play around with some different orderings.
> apparently, throughput is better when it's lower

Yeah, that's how I read it as well.

> I don't know why the loop version runs faster when the unrolled version gets slower, though. I'll play around with some different orderings.

This is the first size that moves beyond L1, so maybe the prefetching isn't that optimal on the Intel chip.
I had suspected that on any superscalar CPU the rolled and unrolled versions would perform the same, and was surprised when they didn't.
@zeebo Tweaked some more stuff after rebasing. Experimented a bit; found that, for once, prefetching was faster.

`VZEROUPPER` was missing. Huge improvement on small payloads.
Added longer benchmarks for different cache scenarios. Top benchmark updated.
I saw your comment in the gophers slack to remember to add `VZEROUPPER`, so I went and did that without first seeing this PR. :smile:
So I rebased and fixed the conflict. This is great! I also saw improvements on the order of ~12% for the 100K payloads. Thanks!
@zeebo I saw you <3 there, so I assumed you had seen the PR already :D
Great it improves Intel as well 👍🏼
Toyed around a bit, and it seems like the biggest change came from loading the key value directly into the XOR as a memory operand. Maybe an AMD-specific optimization.
cpu: AMD Ryzen 9 3950X 16-Core Processor
Moved the generation into the avo folder, since I couldn't make the other setup work.