zeebo / xxh3

XXH3 algorithm in Go
BSD 2-Clause "Simplified" License

Add avx512 #13

Closed klauspost closed 3 years ago

klauspost commented 3 years ago

I set up a workflow on my own repo. The CI runners there seem to support avx512:

https://github.com/klauspost/xxh3/pull/2

Tests appear to pass now.
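
For readers unfamiliar with the benchmark suffixes (`-AVX512`, `-AVX2`, `-SSE2`, and no suffix for the pure Go path), here is a minimal sketch of how a widest-first selection between such code paths might look. The `cpuFeatures` struct and `pickKernel` function are hypothetical and are not taken from this package. Real detection could come from something like `golang.org/x/sys/cpu` (`cpu.X86.HasAVX512F`, `cpu.X86.HasAVX2`), but which exact flags the AVX-512 path needs is an assumption left open here.

```go
package main

import "fmt"

// cpuFeatures is a hypothetical stand-in for real CPU feature detection.
// It is NOT part of xxh3; the field names are chosen for illustration only.
type cpuFeatures struct {
	AVX512 bool
	AVX2   bool
	SSE2   bool
}

// pickKernel returns the name of the widest vector code path the CPU
// supports, falling back to the portable Go implementation.
func pickKernel(f cpuFeatures) string {
	switch {
	case f.AVX512:
		return "AVX512"
	case f.AVX2:
		return "AVX2"
	case f.SSE2:
		return "SSE2"
	default:
		return "generic"
	}
}

func main() {
	// A machine with AVX2 but without AVX-512 selects the AVX2 path.
	fmt.Println(pickKernel(cpuFeatures{AVX2: true, SSE2: true}))
}
```

Checking the widest extension first matches the ordering in the benchmark output, where the wider paths are at least as fast as the narrower ones at most sizes.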

Benchmark, when running on CI:

cpu: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
BenchmarkFixed128/1024-AVX512-2             25477058            46.75 ns/op 21902.82 MB/s
BenchmarkFixed128/1024-AVX2-2               23998638            49.33 ns/op 20759.84 MB/s
BenchmarkFixed128/1024-SSE2-2               14191345            83.37 ns/op 12283.06 MB/s
BenchmarkFixed128/1024-2                     6647786           180.5 ns/op  5672.83 MB/s
BenchmarkFixed128/8192-AVX512-2              5789636           207.5 ns/op  39485.04 MB/s
BenchmarkFixed128/8192-AVX2-2                5057226           237.4 ns/op  34510.84 MB/s
BenchmarkFixed128/8192-SSE2-2                2478674           484.0 ns/op  16926.95 MB/s
BenchmarkFixed128/8192-2                      819734          1375 ns/op    5956.05 MB/s
BenchmarkFixed128/102400-AVX512-2             500712          2377 ns/op    43072.96 MB/s
BenchmarkFixed128/102400-AVX2-2               436971          2745 ns/op    37298.82 MB/s
BenchmarkFixed128/102400-SSE2-2               206989          5797 ns/op    17664.60 MB/s
BenchmarkFixed128/102400-2                     71216         16778 ns/op    6103.12 MB/s
BenchmarkFixed128/1024000-AVX512-2             48441         24730 ns/op    41407.80 MB/s
BenchmarkFixed128/1024000-AVX2-2               41143         29175 ns/op    35097.99 MB/s
BenchmarkFixed128/1024000-SSE2-2               19495         61862 ns/op    16553.06 MB/s
BenchmarkFixed128/1024000-2                     6568        176122 ns/op    5814.14 MB/s
BenchmarkFixed128/10240000-AVX512-2             2852        412248 ns/op    24839.41 MB/s
BenchmarkFixed128/10240000-AVX2-2               2764        419606 ns/op    24403.84 MB/s
BenchmarkFixed128/10240000-SSE2-2               1500        767759 ns/op    13337.52 MB/s
BenchmarkFixed128/10240000-2                     688       1712014 ns/op    5981.26 MB/s
BenchmarkFixed128/102400000-AVX512-2             133       9240533 ns/op    11081.61 MB/s
BenchmarkFixed128/102400000-AVX2-2               100      10172756 ns/op    10066.10 MB/s
BenchmarkFixed128/102400000-SSE2-2                66      16844407 ns/op    6079.17 MB/s
BenchmarkFixed128/102400000-2                     50      22082671 ns/op    4637.12 MB/s
[...]
BenchmarkFixed/1024-AVX512-2                28326884            42.31 ns/op 24201.51 MB/s
BenchmarkFixed/1024-AVX2-2                  26693994            43.70 ns/op 23429.97 MB/s
BenchmarkFixed/1024-SSE2-2                  15549420            77.23 ns/op 13258.25 MB/s
BenchmarkFixed/1024-2                        6886405           174.9 ns/op  5856.29 MB/s
BenchmarkFixed/8192-AVX512-2                 5863425           204.6 ns/op  40041.69 MB/s
BenchmarkFixed/8192-AVX2-2                   5157264           231.8 ns/op  35347.49 MB/s
BenchmarkFixed/8192-SSE2-2                   2507733           478.3 ns/op  17125.57 MB/s
BenchmarkFixed/8192-2                         854848          1365 ns/op    6002.96 MB/s
BenchmarkFixed/102400-AVX512-2                480313          2364 ns/op    43319.13 MB/s
BenchmarkFixed/102400-AVX2-2                  429700          2732 ns/op    37481.48 MB/s
BenchmarkFixed/102400-SSE2-2                  207128          5796 ns/op    17666.72 MB/s
BenchmarkFixed/102400-2                        71164         16805 ns/op    6093.45 MB/s
BenchmarkFixed/1024000-AVX512-2                48628         24529 ns/op    41747.27 MB/s
BenchmarkFixed/1024000-AVX2-2                  41851         28633 ns/op    35762.61 MB/s
BenchmarkFixed/1024000-SSE2-2                  18650         63908 ns/op    16022.94 MB/s
BenchmarkFixed/1024000-2                        6819        170248 ns/op    6014.75 MB/s
BenchmarkFixed/10240000-AVX512-2                2847        408223 ns/op    25084.30 MB/s
BenchmarkFixed/10240000-AVX2-2                  2857        407519 ns/op    25127.66 MB/s
BenchmarkFixed/10240000-SSE2-2                  1575        737495 ns/op    13884.85 MB/s
BenchmarkFixed/10240000-2                        700       1708315 ns/op    5994.21 MB/s
BenchmarkFixed/102400000-AVX512-2                139       8488345 ns/op    12063.60 MB/s
BenchmarkFixed/102400000-AVX2-2                  129       9197751 ns/op    11133.16 MB/s
BenchmarkFixed/102400000-SSE2-2                   69      16553970 ns/op    6185.83 MB/s
BenchmarkFixed/102400000-2                        54      22114815 ns/op    4630.38 MB/s
PASS
ok      github.com/zeebo/xxh3   137.840s

klauspost commented 3 years ago

@zeebo - ~It seems the avo PR is based on an earlier master that doesn't have the BP register fix, so you probably shouldn't merge.~

Reverted and kept the generated avx512 code.

klauspost commented 3 years ago

Needs https://github.com/mmcloughlin/avo/pull/163 to generate.

klauspost commented 3 years ago

I cleaned it up and removed the replace. The avx512 code cannot be regenerated without replacing avo, but the generated file is added statically, so it should be fine.

If there is anything else I should do, let me know.