
10x Performance with SIMD Vectorized Code in C#/.NET #12

Open xoofx opened 1 year ago

xoofx commented 1 year ago

Comments about 10x Performance with SIMD Vectorized Code in C#/.NET

manofstick commented 12 months ago

In Using Ref you comment that, due to some patterns, bounds checking can't be removed, which is true. But here we should be doing something pretty simple, so I was thinking this shouldn't be an issue...

So something like:

        var pv = MemoryMarshal.Cast<int, Vector256<int>>(data);

        for (var vi = 0; vi < pv.Length - 4 + 1; vi += 4)
        {
            var r1 = Vector256.Equals(pv[vi + 0], compareValue);
            var r2 = Vector256.Equals(pv[vi + 1], compareValue);
            var r3 = Vector256.Equals(pv[vi + 2], compareValue);
            var r4 = Vector256.Equals(pv[vi + 3], compareValue);

shouldn't have an issue, yet it runs significantly slower. So I was intrigued, and I had a look with Disasmo, which you had mentioned...

       458BD9               mov      r11d, r9d
       49C1E305             shl      r11, 5
       C4A17D760C18         vpcmpeqd ymm1, ymm0, ymmword ptr[rax+r11]
       458D5901             lea      r11d, [r9+01H]
       49C1E305             shl      r11, 5
       C4A17D761418         vpcmpeqd ymm2, ymm0, ymmword ptr[rax+r11]
       458D5902             lea      r11d, [r9+02H]
       49C1E305             shl      r11, 5
       C4A17D761C18         vpcmpeqd ymm3, ymm0, ymmword ptr[rax+r11]
       458D5903             lea      r11d, [r9+03H]
       49C1E305             shl      r11, 5
       C4A17D762418         vpcmpeqd ymm4, ymm0, ymmword ptr[rax+r11]

And I noticed that it wasn't using a simple base-plus-offset address to load each element (which I think it should be able to do... maybe I'll give that a crack if I get excited...), so I thought I'd give it a helping hand...

    [StructLayout(LayoutKind.Sequential)]
    struct QuadVector256<T>
        where T : struct
    {
        public Vector256<T> _1;
        public Vector256<T> _2;
        public Vector256<T> _3;
        public Vector256<T> _4;
        ...
    }

...

        var qv = MemoryMarshal.Cast<int, QuadVector256<int>>(data);

        for (var qi = 0; qi < qv.Length; ++qi)
        {
            var r1 = Vector256.Equals(qv[qi]._1, compareValue);
            var r2 = Vector256.Equals(qv[qi]._2, compareValue);
            var r3 = Vector256.Equals(qv[qi]._3, compareValue);
            var r4 = Vector256.Equals(qv[qi]._4, compareValue);

And now the assembly looks much nicer

       418BD6               mov      edx, r14d
       48C1E207             shl      rdx, 7
       C5CD763C17           vpcmpeqd ymm7, ymm6, ymmword ptr[rdi+rdx]
       4803D7               add      rdx, rdi
       C54D764220           vpcmpeqd ymm8, ymm6, ymmword ptr[rdx+20H]
       C54D764A40           vpcmpeqd ymm9, ymm6, ymmword ptr[rdx+40H]
       C54D765260           vpcmpeqd ymm10, ymm6, ymmword ptr[rdx+60H]

And so now with this change we have the niceness of the "normal" ReadOnlySpan<> without extra nastiness, and we get basically the same performance as your version.

Hoorah!

Full code:


    [StructLayout(LayoutKind.Sequential)]
    struct QuadVector256<T>
        where T : struct
    {
        public Vector256<T> _1;
        public Vector256<T> _2;
        public Vector256<T> _3;
        public Vector256<T> _4;

        public static int VectorCount = 4;
        public static int Count = Vector256<T>.Count * VectorCount;
    }

    /// <summary>
    /// Find a value from a span of integers.
    /// </summary>
    /// <returns>Index of the value, or -1 if not found.</returns>
    public static int QuadVector(ReadOnlySpan<int> data, int value)
    {
        var qv = MemoryMarshal.Cast<int, QuadVector256<int>>(data);

        var compareValue = Vector256.Create(value);

        for (var qi = 0; qi < qv.Length; ++qi)
        {
            var r1 = Vector256.Equals(qv[qi]._1, compareValue);
            var r2 = Vector256.Equals(qv[qi]._2, compareValue);
            var r3 = Vector256.Equals(qv[qi]._3, compareValue);
            var r4 = Vector256.Equals(qv[qi]._4, compareValue);

            var r5 = r1 | r2 | r3 | r4;
            if (r5 != Vector256<int>.Zero)
            {
                // r12 = pack 32 to 16 of r1/r2
                // r34 = pack 32 to 16 of r3/r4
                // but it's working on 128 bit lanes, so we need to reorder them
                var r12 = Avx2.PackSignedSaturate(r1, r2).AsInt16();
                var r34 = Avx2.PackSignedSaturate(r3, r4).AsInt16();

                // Reorder r12 & r34 correctly
                r12 = Avx2.Permute4x64(r12.AsInt64(), 0b_11_01_10_00).AsInt16();
                r34 = Avx2.Permute4x64(r34.AsInt64(), 0b_11_01_10_00).AsInt16();

                // pack 16 to 8 of r12/r34
                Vector256<sbyte> r = Avx2.PackSignedSaturate(r12, r34);

                // Reorder r correctly
                r = Avx2.Permute4x64(r.AsInt64(), 0b_11_01_10_00).AsSByte();

                // Get the mask from <8 x byte>
                var mask = Avx2.MoveMask(r);
                var offset = BitOperations.TrailingZeroCount(mask);
                return qi * QuadVector256<int>.Count + offset;
            }
        }

        var pv = MemoryMarshal.Cast<int, Vector256<int>>(data);
        // Process 8 int per loop (32 bytes)
        for (var vi = qv.Length * QuadVector256<int>.VectorCount; vi < pv.Length; ++vi)
        {
            var r1 = Vector256.Equals(pv[vi], compareValue);
            if (r1 != Vector256<int>.Zero)
            {
                // Get the mask from <8 x int> to byte
                var mask = Avx.MoveMask(r1.AsSingle());
                // And get the local index
                var offset = BitOperations.TrailingZeroCount((uint)mask);
                return vi * Vector256<int>.Count + offset;
            }
        }

        // Process remaining
        for (var i = pv.Length * Vector256<int>.Count; i < data.Length; i++)
        {
            if (data[i] == value)
                return (int)i;
        }

        return -1;
    }

    public static int Vector(ReadOnlySpan<int> data, int value)
    {
        var pv = MemoryMarshal.Cast<int, Vector256<int>>(data);

        var compareValue = Vector256.Create(value);

        var vi = 0;
        for (; vi < pv.Length - 4 + 1; vi += 4)
        {
            var r1 = Vector256.Equals(pv[vi + 0], compareValue);
            var r2 = Vector256.Equals(pv[vi + 1], compareValue);
            var r3 = Vector256.Equals(pv[vi + 2], compareValue);
            var r4 = Vector256.Equals(pv[vi + 3], compareValue);

            var r5 = r1 | r2 | r3 | r4;
            if (r5 != Vector256<int>.Zero)
            {
                // r12 = pack 32 to 16 of r1/r2
                // r34 = pack 32 to 16 of r3/r4
                // but it's working on 128 bit lanes, so we need to reorder them
                var r12 = Avx2.PackSignedSaturate(r1, r2).AsInt16();
                var r34 = Avx2.PackSignedSaturate(r3, r4).AsInt16();

                // Reorder r12 & r34 correctly
                r12 = Avx2.Permute4x64(r12.AsInt64(), 0b_11_01_10_00).AsInt16();
                r34 = Avx2.Permute4x64(r34.AsInt64(), 0b_11_01_10_00).AsInt16();

                // pack 16 to 8 of r12/r34
                Vector256<sbyte> r = Avx2.PackSignedSaturate(r12, r34);

                // Reorder r correctly
                r = Avx2.Permute4x64(r.AsInt64(), 0b_11_01_10_00).AsSByte();

                // Get the mask from <8 x byte>
                var mask = Avx2.MoveMask(r);
                var offset = BitOperations.TrailingZeroCount(mask);
                return vi * Vector256<int>.Count + offset;
            }
        }

        // Process 8 int per loop (32 bytes)
        for (; vi < pv.Length; ++vi)
        {
            var r1 = Vector256.Equals(pv[vi], compareValue);
            if (r1 != Vector256<int>.Zero)
            {
                // Get the mask from <8 x int> to byte
                var mask = Avx.MoveMask(r1.AsSingle());
                // And get the local index
                var offset = BitOperations.TrailingZeroCount((uint)mask);
                return vi * Vector256<int>.Count + offset;
            }
        }

        // Process remaining
        for (var i = pv.Length * Vector256<int>.Count; i < data.Length; i++)
        {
            if (data[i] == value)
                return i;
        }

        return -1;
    }

There is some extra overhead that I haven't investigated, but the loop appears good (i.e. for the large sizes we're on par with your version)


BenchmarkDotNet v0.13.6, Windows 10 (10.0.19045.3208/22H2/2022Update)
Intel Core i7-6770HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET SDK 7.0.304
  [Host]   : .NET 7.0.7 (7.0.723.27404), X64 RyuJIT AVX2
  .NET 7.0 : .NET 7.0.7 (7.0.723.27404), X64 RyuJIT AVX2

Job=.NET 7.0 Runtime=.NET 7.0

| Method                 | N    | Mean       | Error     | StdDev    | Ratio | RatioSD |
|------------------------|-----:|-----------:|----------:|----------:|------:|--------:|
| Find_AVX_256_Optimized | 32   | 4.119 ns   | 0.0323 ns | 0.0302 ns | 1.00  | 0.00    |
| Vector                 | 32   | 4.667 ns   | 0.0185 ns | 0.0173 ns | 1.13  | 0.01    |
| QuadVector             | 32   | 4.867 ns   | 0.0150 ns | 0.0140 ns | 1.18  | 0.01    |
| Find_AVX_256_Optimized | 64   | 6.022 ns   | 0.0302 ns | 0.0282 ns | 1.00  | 0.00    |
| Vector                 | 64   | 6.760 ns   | 0.0875 ns | 0.0818 ns | 1.12  | 0.01    |
| QuadVector             | 64   | 7.072 ns   | 0.0840 ns | 0.0786 ns | 1.17  | 0.02    |
| Find_AVX_256_Optimized | 128  | 10.391 ns  | 0.0876 ns | 0.0820 ns | 1.00  | 0.00    |
| Vector                 | 128  | 10.321 ns  | 0.0467 ns | 0.0414 ns | 0.99  | 0.01    |
| QuadVector             | 128  | 10.630 ns  | 0.1063 ns | 0.0994 ns | 1.02  | 0.01    |
| Find_AVX_256_Optimized | 256  | 15.513 ns  | 0.1746 ns | 0.1548 ns | 1.00  | 0.00    |
| Vector                 | 256  | 17.958 ns  | 0.1897 ns | 0.1775 ns | 1.16  | 0.02    |
| QuadVector             | 256  | 16.422 ns  | 0.1258 ns | 0.1177 ns | 1.06  | 0.01    |
| Find_AVX_256_Optimized | 512  | 27.678 ns  | 0.3519 ns | 0.2938 ns | 1.00  | 0.00    |
| Vector                 | 512  | 33.298 ns  | 0.2605 ns | 0.2436 ns | 1.20  | 0.01    |
| QuadVector             | 512  | 28.848 ns  | 0.3439 ns | 0.3217 ns | 1.04  | 0.01    |
| Find_AVX_256_Optimized | 1024 | 51.817 ns  | 0.1934 ns | 0.1510 ns | 1.00  | 0.00    |
| Vector                 | 1024 | 64.295 ns  | 0.7848 ns | 0.7341 ns | 1.24  | 0.02    |
| QuadVector             | 1024 | 53.982 ns  | 0.4364 ns | 0.4082 ns | 1.04  | 0.01    |
| Find_AVX_256_Optimized | 2048 | 103.174 ns | 0.6088 ns | 0.5397 ns | 1.00  | 0.00    |
| Vector                 | 2048 | 127.910 ns | 1.7113 ns | 1.6008 ns | 1.24  | 0.02    |
| QuadVector             | 2048 | 104.534 ns | 0.7179 ns | 0.6364 ns | 1.01  | 0.01    |
| Find_AVX_256_Optimized | 3072 | 164.511 ns | 1.0379 ns | 0.8667 ns | 1.00  | 0.00    |
| Vector                 | 3072 | 198.704 ns | 1.4620 ns | 1.3675 ns | 1.21  | 0.01    |
| QuadVector             | 3072 | 163.195 ns | 2.1183 ns | 1.9814 ns | 0.99  | 0.02    |
| Find_AVX_256_Optimized | 4096 | 213.900 ns | 1.6582 ns | 1.5511 ns | 1.00  | 0.00    |
| Vector                 | 4096 | 262.153 ns | 2.6469 ns | 2.4759 ns | 1.23  | 0.02    |
| QuadVector             | 4096 | 212.232 ns | 1.5244 ns | 1.3513 ns | 0.99  | 0.01    |
| Find_AVX_256_Optimized | 5120 | 262.888 ns | 3.2031 ns | 2.9962 ns | 1.00  | 0.00    |
| Vector                 | 5120 | 318.298 ns | 3.7469 ns | 3.5048 ns | 1.21  | 0.02    |
| QuadVector             | 5120 | 264.532 ns | 2.3371 ns | 2.1861 ns | 1.01  | 0.01    |
| Find_AVX_256_Optimized | 6144 | 312.455 ns | 3.5792 ns | 3.3480 ns | 1.00  | 0.00    |
| Vector                 | 6144 | 379.527 ns | 2.0646 ns | 1.8302 ns | 1.22  | 0.01    |
| QuadVector             | 6144 | 312.909 ns | 3.1891 ns | 2.9831 ns | 1.00  | 0.01    |
| Find_AVX_256_Optimized | 7168 | 359.087 ns | 2.5173 ns | 2.3547 ns | 1.00  | 0.00    |
| Vector                 | 7168 | 440.072 ns | 2.7371 ns | 2.4263 ns | 1.23  | 0.01    |
| QuadVector             | 7168 | 363.538 ns | 3.3341 ns | 3.1188 ns | 1.01  | 0.01    |
| Find_AVX_256_Optimized | 8192 | 429.126 ns | 1.4646 ns | 1.2230 ns | 1.00  | 0.00    |
| Vector                 | 8192 | 535.986 ns | 4.9168 ns | 4.5992 ns | 1.25  | 0.01    |
| QuadVector             | 8192 | 446.624 ns | 3.3010 ns | 2.9263 ns | 1.04  | 0.01    |
xoofx commented 11 months ago

> And so now with this change we have the niceness of the "normal" ReadOnlySpan<> without extra nastiness, and we get basically the same performance as your version.

Indeed, it's a nice trick!

But also, as you noticed, this introduces a slight regression for small workloads. I believe it's because the use of the QuadVector struct causes 136 bytes of stack usage (while the AVX 256 version does not use the stack at all, since it doesn't overuse SIMD registers):

G_M000_IG01:                ;; offset=0000H
       4157                 push     r15
       4156                 push     r14
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4881EC88000000       sub      rsp, 136
       C5F877               vzeroupper 
       C5F829742470         vmovaps  qword ptr [rsp+70H], xmm6
       C5F8297C2460         vmovaps  qword ptr [rsp+60H], xmm7
       C57829442450         vmovaps  qword ptr [rsp+50H], xmm8
       C578294C2440         vmovaps  qword ptr [rsp+40H], xmm9
       C57829542430         vmovaps  qword ptr [rsp+30H], xmm10
       C578295C2420         vmovaps  qword ptr [rsp+20H], xmm11
       8BF2                 mov      esi, edx

For such methods that are safeguarded - since you control the code, there is still no way for a user to go out of bounds - I'm OK with trading Span for Unsafe ref if it gives the extra perf bits. This is what is often done in the .NET BCL.

Also, the initial code could be slightly improved in terms of readability by replacing the ref int with a small wrapper ref struct and an indexer. The code would then be similar to your QuadVector version.

    ref struct Vector256Ref
    {
        private readonly ref int _pv;
        public Vector256Ref(ref int pv) => _pv = ref pv;

        public Vector256<int> this[nint index]
        {
            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            get => Unsafe.As<int, Vector256<int>>(ref Unsafe.Add(ref _pv, index));
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public int GetElement(nint index) => Unsafe.Add(ref _pv, index);
    }
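
To make the wrapper concrete, here is a minimal usage sketch (mine, not from the post; `FindWithVector256Ref` is a hypothetical name). It is a simplified single-vector loop rather than the four-wide one above, it indexes in int elements so the step per iteration is `Vector256<int>.Count`, and it assumes AVX for the `MoveMask` call:

    // Assumes: using System.Numerics; using System.Runtime.InteropServices;
    //          using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
    public static int FindWithVector256Ref(ReadOnlySpan<int> data, int value)
    {
        var pv = new Vector256Ref(ref MemoryMarshal.GetReference(data));
        var compareValue = Vector256.Create(value);

        nint i = 0;
        nint lastVectorStart = data.Length - Vector256<int>.Count;
        for (; i <= lastVectorStart; i += Vector256<int>.Count)
        {
            var r = Vector256.Equals(pv[i], compareValue);
            if (r != Vector256<int>.Zero)
            {
                // Mask of matching lanes, then the index of the first match
                var mask = Avx.MoveMask(r.AsSingle());
                return (int)i + BitOperations.TrailingZeroCount((uint)mask);
            }
        }

        // Scalar tail for the last few elements
        for (; i < data.Length; i++)
        {
            if (pv.GetElement(i) == value)
                return (int)i;
        }

        return -1;
    }
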
manofstick commented 11 months ago

Hmmm. It shouldn't even need the QuadVector on the stack at all, as it's just referenced directly through the Span pointer...

So I'm not certain that the following is the reason, but I think it's plausible, and it fits the empirical evidence...

Reading the CLR ABI guide, it says a stack frame is set up based on various heuristics in a call to Compiler::rpMustCreateEBPFrame(), so I wondered whether, although the property for the QuadVector size was being inlined, it maybe wasn't known as a constant at that point.

So I manually inlined the QuadVector256 statics Count/VectorCount (i.e. replacing QuadVector256<int>.Count with Vector256<int>.Count * 4 and QuadVector256<int>.VectorCount with 4), checked again, and it indeed removed the creation of the stack frame that you had noted...
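
Concretely, the substitution amounts to something like this (my reconstruction with hypothetical helper names, not code from the thread); the rewritten multipliers are built only from `Vector256<int>.Count` and the literal 4, both of which the JIT treats as constants:

    // Assumes: using System.Runtime.Intrinsics;
    static class InlinedCounts
    {
        // was: qi * QuadVector256<int>.Count + offset
        public static int ToElementIndex(int qi, int offset)
            => qi * (Vector256<int>.Count * 4) + offset;

        // was: qv.Length * QuadVector256<int>.VectorCount
        public static int TailStartVectorIndex(int quadVectorCount)
            => quadVectorCount * 4;
    }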

Hooray!

...BUT... it didn't actually improve times - on my CPU anyway... Maybe it's still caused by the initial bounds check on the Span, which is probably still a good and desirable thing anyway...

Anyway, I think both of the things I found could be fixed with better JIT code... i.e. per the first post, it should be able to offset directly from the span pointer without the additional register (and hence no need for the QuadVector struct), and with smarter heuristics it should be able to determine that it doesn't need to build a stack frame...

xoofx commented 11 months ago

> Anyway, I think both of the things I found could be fixed with better JIT code...

Right, but there is also a stark reality: these will probably never get fixed, because some patterns are too niche to justify the cost of new pattern matching in the JIT optimizer. Even though tiered compilation is a thing, the JIT team still pays attention to compiler throughput, so you need a really compelling case to bring in a particular optimization. That's the difference between a JIT and a full C++ compiler: the C++ compiler will accumulate, over time, a huge number of specialized optimizations, at the expense of compilation times.

If you look at simple things like string.IndexOf you will see that they make extensive use of ref pointers and Unsafe.Add, so it is an approach that is effectively used a lot in the BCL these days; that's why I mentioned it in this blog post, as it is quite a common optimization that many folks are not aware of...

My recommendation for optimizing: start with a Span implementation, benchmark, and look at the generated code. If it doesn't look perfect or you are not satisfied with it, and you still want to reach the most optimized version, then go after more low-level stuff like ref/Unsafe.Add. It's also fine to stay with a Span implementation. I just want to highlight that, no matter what, we will sometimes need to work around some weaknesses of the .NET JIT compiler to bring such optimizations. 😉
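
For readers who have not seen that style, here is a minimal sketch of the ref/Unsafe.Add approach being discussed (my sketch, not the blog post's code; `FindRef` is a hypothetical name). It reads vectors with `Vector256.LoadUnsafe` and extracts the match mask with the cross-platform `Vector256.ExtractMostSignificantBits`, both available in .NET 7:

    // Assumes: using System.Numerics; using System.Runtime.CompilerServices;
    //          using System.Runtime.InteropServices; using System.Runtime.Intrinsics;
    public static int FindRef(ReadOnlySpan<int> data, int value)
    {
        ref int start = ref MemoryMarshal.GetReference(data);
        var compareValue = Vector256.Create(value);

        nuint i = 0;
        nuint length = (nuint)data.Length;
        if (length >= (nuint)Vector256<int>.Count)
        {
            nuint lastVectorStart = length - (nuint)Vector256<int>.Count;
            for (; i <= lastVectorStart; i += (nuint)Vector256<int>.Count)
            {
                var r = Vector256.Equals(Vector256.LoadUnsafe(ref start, i), compareValue);
                if (r != Vector256<int>.Zero)
                {
                    var mask = Vector256.ExtractMostSignificantBits(r);
                    return (int)i + BitOperations.TrailingZeroCount(mask);
                }
            }
        }

        // Scalar tail
        for (; i < length; i++)
        {
            if (Unsafe.Add(ref start, i) == value)
                return (int)i;
        }

        return -1;
    }

The only point being illustrated is that the index math happens on a ref obtained once from the span, so no per-access bounds checks are emitted; the loop bounds alone have to guarantee safety.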

TJHeuvel commented 6 months ago

> Using a ref here allows the GC to update this pointer if necessary (if the pointed data is being moved by the GC) while a fixed/pinned statement would block the GC from moving this data around and might cause GC fragmentation.

Given the optimisations we're doing here, we can assume this is a hot path, perhaps the hottest in our application. Wouldn't it make sense from a caching perspective to not allow the GC to move this data around?

Presumably such a move would cause a cache miss, which we really want to avoid here. By pinning we could avoid that completely, no?
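
For reference, a sketch of the pinned variant being asked about (mine, with the hypothetical name `FindPinned`; requires an unsafe context): `fixed` yields a stable pointer for the duration of the call, at the cost of telling the GC it cannot move the buffer while the pointer is held.

    // Assumes: using System.Numerics; using System.Runtime.Intrinsics;
    public static unsafe int FindPinned(ReadOnlySpan<int> data, int value)
    {
        var compareValue = Vector256.Create(value);
        fixed (int* p = data)
        {
            int i = 0;
            for (; i <= data.Length - Vector256<int>.Count; i += Vector256<int>.Count)
            {
                var r = Vector256.Equals(Vector256.Load(p + i), compareValue);
                if (r != Vector256<int>.Zero)
                {
                    var mask = Vector256.ExtractMostSignificantBits(r);
                    return i + BitOperations.TrailingZeroCount(mask);
                }
            }

            // Scalar tail
            for (; i < data.Length; i++)
            {
                if (p[i] == value)
                    return i;
            }
        }

        return -1;
    }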