xoofx opened 1 year ago
In Using Ref you comment that, with some patterns, bounds checking can't be removed, which is true. But here we should be doing something pretty simple, so I was thinking this shouldn't be an issue...
So something like:
```csharp
var pv = MemoryMarshal.Cast<int, Vector256<int>>(data);
for (var vi = 0; vi < pv.Length - 4 + 1; vi += 4)
{
    var r1 = Vector256.Equals(pv[vi + 0], compareValue);
    var r2 = Vector256.Equals(pv[vi + 1], compareValue);
    var r3 = Vector256.Equals(pv[vi + 2], compareValue);
    var r4 = Vector256.Equals(pv[vi + 3], compareValue);
```
shouldn't have an issue, yet it runs significantly slower. So I was intrigued, and had a look with Disasmo, which you had mentioned...
```asm
458BD9        mov r11d, r9d
49C1E305      shl r11, 5
C4A17D760C18  vpcmpeqd ymm1, ymm0, ymmword ptr [rax+r11]
458D5901      lea r11d, [r9+01H]
49C1E305      shl r11, 5
C4A17D761418  vpcmpeqd ymm2, ymm0, ymmword ptr [rax+r11]
458D5902      lea r11d, [r9+02H]
49C1E305      shl r11, 5
C4A17D761C18  vpcmpeqd ymm3, ymm0, ymmword ptr [rax+r11]
458D5903      lea r11d, [r9+03H]
49C1E305      shl r11, 5
C4A17D762418  vpcmpeqd ymm4, ymm0, ymmword ptr [rax+r11]
```
And I noticed that it wasn't using a simple scaled offset to calculate the address of each element (which I think it should be able to do... maybe I'll give that a crack if I get excited...), so I thought I'd give it a helping hand...
```csharp
[StructLayout(LayoutKind.Sequential)]
struct QuadVector256<T>
    where T : struct
{
    public Vector256<T> _1;
    public Vector256<T> _2;
    public Vector256<T> _3;
    public Vector256<T> _4;
    // ...
}

// ...
var qv = MemoryMarshal.Cast<int, QuadVector256<int>>(data);
for (var qi = 0; qi < qv.Length; ++qi)
{
    var r1 = Vector256.Equals(qv[qi]._1, compareValue);
    var r2 = Vector256.Equals(qv[qi]._2, compareValue);
    var r3 = Vector256.Equals(qv[qi]._3, compareValue);
    var r4 = Vector256.Equals(qv[qi]._4, compareValue);
```
And now the assembly looks much nicer:

```asm
418BD6      mov edx, r14d
48C1E207    shl rdx, 7
C5CD763C17  vpcmpeqd ymm7, ymm6, ymmword ptr [rdi+rdx]
4803D7      add rdx, rdi
C54D764220  vpcmpeqd ymm8, ymm6, ymmword ptr [rdx+20H]
C54D764A40  vpcmpeqd ymm9, ymm6, ymmword ptr [rdx+40H]
C54D765260  vpcmpeqd ymm10, ymm6, ymmword ptr [rdx+60H]
```
And so now, with this change, we get the niceness of the "normal" ReadOnlySpan<> without extra nastiness, and basically the same performance as your version. Hoorah!
Full code:
```csharp
[StructLayout(LayoutKind.Sequential)]
struct QuadVector256<T>
    where T : struct
{
    public Vector256<T> _1;
    public Vector256<T> _2;
    public Vector256<T> _3;
    public Vector256<T> _4;

    public static int VectorCount = 4;
    public static int Count = Vector256<T>.Count * VectorCount;
}
```
```csharp
/// <summary>
/// Find a value in a span of integers.
/// </summary>
/// <returns>Index of the value, or -1 if not found.</returns>
public static int QuadVector(ReadOnlySpan<int> data, int value)
{
    var qv = MemoryMarshal.Cast<int, QuadVector256<int>>(data);
    var compareValue = Vector256.Create(value);
    for (var qi = 0; qi < qv.Length; ++qi)
    {
        var r1 = Vector256.Equals(qv[qi]._1, compareValue);
        var r2 = Vector256.Equals(qv[qi]._2, compareValue);
        var r3 = Vector256.Equals(qv[qi]._3, compareValue);
        var r4 = Vector256.Equals(qv[qi]._4, compareValue);
        var r5 = r1 | r2 | r3 | r4;
        if (r5 != Vector256<int>.Zero)
        {
            // r12 = pack 32 to 16 of r1/r2
            // r34 = pack 32 to 16 of r3/r4
            // but it's working on 128-bit lanes, so we need to reorder them
            var r12 = Avx2.PackSignedSaturate(r1, r2).AsInt16();
            var r34 = Avx2.PackSignedSaturate(r3, r4).AsInt16();
            // Reorder r12 & r34 correctly
            r12 = Avx2.Permute4x64(r12.AsInt64(), 0b_11_01_10_00).AsInt16();
            r34 = Avx2.Permute4x64(r34.AsInt64(), 0b_11_01_10_00).AsInt16();
            // pack 16 to 8 of r12/r34
            Vector256<sbyte> r = Avx2.PackSignedSaturate(r12, r34);
            // Reorder r correctly
            r = Avx2.Permute4x64(r.AsInt64(), 0b_11_01_10_00).AsSByte();
            // Get the mask from <32 x sbyte>
            var mask = Avx2.MoveMask(r);
            var offset = BitOperations.TrailingZeroCount(mask);
            return qi * QuadVector256<int>.Count + offset;
        }
    }

    var pv = MemoryMarshal.Cast<int, Vector256<int>>(data);
    // Process 8 ints per loop (32 bytes)
    for (var vi = qv.Length * QuadVector256<int>.VectorCount; vi < pv.Length; ++vi)
    {
        var r1 = Vector256.Equals(pv[vi], compareValue);
        if (r1 != Vector256<int>.Zero)
        {
            // Get the mask from <8 x int>
            var mask = Avx.MoveMask(r1.AsSingle());
            // And get the local index
            var offset = BitOperations.TrailingZeroCount((uint)mask);
            return vi * Vector256<int>.Count + offset;
        }
    }

    // Process remaining elements
    for (var i = pv.Length * Vector256<int>.Count; i < data.Length; i++)
    {
        if (data[i] == value)
            return i;
    }
    return -1;
}
```
```csharp
public static int Vector(ReadOnlySpan<int> data, int value)
{
    var pv = MemoryMarshal.Cast<int, Vector256<int>>(data);
    var compareValue = Vector256.Create(value);
    var vi = 0;
    for (; vi < pv.Length - 4 + 1; vi += 4)
    {
        var r1 = Vector256.Equals(pv[vi + 0], compareValue);
        var r2 = Vector256.Equals(pv[vi + 1], compareValue);
        var r3 = Vector256.Equals(pv[vi + 2], compareValue);
        var r4 = Vector256.Equals(pv[vi + 3], compareValue);
        var r5 = r1 | r2 | r3 | r4;
        if (r5 != Vector256<int>.Zero)
        {
            // r12 = pack 32 to 16 of r1/r2
            // r34 = pack 32 to 16 of r3/r4
            // but it's working on 128-bit lanes, so we need to reorder them
            var r12 = Avx2.PackSignedSaturate(r1, r2).AsInt16();
            var r34 = Avx2.PackSignedSaturate(r3, r4).AsInt16();
            // Reorder r12 & r34 correctly
            r12 = Avx2.Permute4x64(r12.AsInt64(), 0b_11_01_10_00).AsInt16();
            r34 = Avx2.Permute4x64(r34.AsInt64(), 0b_11_01_10_00).AsInt16();
            // pack 16 to 8 of r12/r34
            Vector256<sbyte> r = Avx2.PackSignedSaturate(r12, r34);
            // Reorder r correctly
            r = Avx2.Permute4x64(r.AsInt64(), 0b_11_01_10_00).AsSByte();
            // Get the mask from <32 x sbyte>
            var mask = Avx2.MoveMask(r);
            var offset = BitOperations.TrailingZeroCount(mask);
            return vi * Vector256<int>.Count + offset;
        }
    }

    // Process 8 ints per loop (32 bytes)
    for (; vi < pv.Length; ++vi)
    {
        var r1 = Vector256.Equals(pv[vi], compareValue);
        if (r1 != Vector256<int>.Zero)
        {
            // Get the mask from <8 x int>
            var mask = Avx.MoveMask(r1.AsSingle());
            // And get the local index
            var offset = BitOperations.TrailingZeroCount((uint)mask);
            return vi * Vector256<int>.Count + offset;
        }
    }

    // Process remaining elements
    for (var i = pv.Length * Vector256<int>.Count; i < data.Length; i++)
    {
        if (data[i] == value)
            return i;
    }
    return -1;
}
```
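As a sanity check on the lane gymnastics in the comments above, here is a small standalone sketch (my own harness, not from the post; assumes AVX2 hardware) that runs the same pack + Permute4x64 sequence on a single known match and confirms the final movemask bit lands at the element index:

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// One match at element 19 spread across four Vector256<int> compare results.
int[] data = new int[32];
data[19] = 42;
var cmp = Vector256.Create(42);
var pv = MemoryMarshal.Cast<int, Vector256<int>>(data);

var r1 = Vector256.Equals(pv[0], cmp);
var r2 = Vector256.Equals(pv[1], cmp);
var r3 = Vector256.Equals(pv[2], cmp);
var r4 = Vector256.Equals(pv[3], cmp);

// Pack 32->16 and fix the 128-bit lane interleave, as in QuadVector.
var r12 = Avx2.Permute4x64(Avx2.PackSignedSaturate(r1, r2).AsInt64(), 0b_11_01_10_00).AsInt16();
var r34 = Avx2.Permute4x64(Avx2.PackSignedSaturate(r3, r4).AsInt64(), 0b_11_01_10_00).AsInt16();
// Pack 16->8 and fix lanes again: byte i of r now mirrors element i.
var r = Avx2.Permute4x64(Avx2.PackSignedSaturate(r12, r34).AsInt64(), 0b_11_01_10_00).AsSByte();

var mask = Avx2.MoveMask(r);
Console.WriteLine(BitOperations.TrailingZeroCount(mask)); // 19
```

The trailing-zero count here is exactly the offset that gets added to qi * 32 in the method above.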
There is some extra overhead that I haven't investigated, but the loop appears good (i.e. for large sizes we're on par with your version).
BenchmarkDotNet v0.13.6, Windows 10 (10.0.19045.3208/22H2/2022Update)
Intel Core i7-6770HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET SDK 7.0.304
  [Host]   : .NET 7.0.7 (7.0.723.27404), X64 RyuJIT AVX2
  .NET 7.0 : .NET 7.0.7 (7.0.723.27404), X64 RyuJIT AVX2
Job=.NET 7.0 Runtime=.NET 7.0
Method | N | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|
Find_AVX_256_Optimized | 32 | 4.119 ns | 0.0323 ns | 0.0302 ns | 1.00 | 0.00 |
Vector | 32 | 4.667 ns | 0.0185 ns | 0.0173 ns | 1.13 | 0.01 |
QuadVector | 32 | 4.867 ns | 0.0150 ns | 0.0140 ns | 1.18 | 0.01 |
Find_AVX_256_Optimized | 64 | 6.022 ns | 0.0302 ns | 0.0282 ns | 1.00 | 0.00 |
Vector | 64 | 6.760 ns | 0.0875 ns | 0.0818 ns | 1.12 | 0.01 |
QuadVector | 64 | 7.072 ns | 0.0840 ns | 0.0786 ns | 1.17 | 0.02 |
Find_AVX_256_Optimized | 128 | 10.391 ns | 0.0876 ns | 0.0820 ns | 1.00 | 0.00 |
Vector | 128 | 10.321 ns | 0.0467 ns | 0.0414 ns | 0.99 | 0.01 |
QuadVector | 128 | 10.630 ns | 0.1063 ns | 0.0994 ns | 1.02 | 0.01 |
Find_AVX_256_Optimized | 256 | 15.513 ns | 0.1746 ns | 0.1548 ns | 1.00 | 0.00 |
Vector | 256 | 17.958 ns | 0.1897 ns | 0.1775 ns | 1.16 | 0.02 |
QuadVector | 256 | 16.422 ns | 0.1258 ns | 0.1177 ns | 1.06 | 0.01 |
Find_AVX_256_Optimized | 512 | 27.678 ns | 0.3519 ns | 0.2938 ns | 1.00 | 0.00 |
Vector | 512 | 33.298 ns | 0.2605 ns | 0.2436 ns | 1.20 | 0.01 |
QuadVector | 512 | 28.848 ns | 0.3439 ns | 0.3217 ns | 1.04 | 0.01 |
Find_AVX_256_Optimized | 1024 | 51.817 ns | 0.1934 ns | 0.1510 ns | 1.00 | 0.00 |
Vector | 1024 | 64.295 ns | 0.7848 ns | 0.7341 ns | 1.24 | 0.02 |
QuadVector | 1024 | 53.982 ns | 0.4364 ns | 0.4082 ns | 1.04 | 0.01 |
Find_AVX_256_Optimized | 2048 | 103.174 ns | 0.6088 ns | 0.5397 ns | 1.00 | 0.00 |
Vector | 2048 | 127.910 ns | 1.7113 ns | 1.6008 ns | 1.24 | 0.02 |
QuadVector | 2048 | 104.534 ns | 0.7179 ns | 0.6364 ns | 1.01 | 0.01 |
Find_AVX_256_Optimized | 3072 | 164.511 ns | 1.0379 ns | 0.8667 ns | 1.00 | 0.00 |
Vector | 3072 | 198.704 ns | 1.4620 ns | 1.3675 ns | 1.21 | 0.01 |
QuadVector | 3072 | 163.195 ns | 2.1183 ns | 1.9814 ns | 0.99 | 0.02 |
Find_AVX_256_Optimized | 4096 | 213.900 ns | 1.6582 ns | 1.5511 ns | 1.00 | 0.00 |
Vector | 4096 | 262.153 ns | 2.6469 ns | 2.4759 ns | 1.23 | 0.02 |
QuadVector | 4096 | 212.232 ns | 1.5244 ns | 1.3513 ns | 0.99 | 0.01 |
Find_AVX_256_Optimized | 5120 | 262.888 ns | 3.2031 ns | 2.9962 ns | 1.00 | 0.00 |
Vector | 5120 | 318.298 ns | 3.7469 ns | 3.5048 ns | 1.21 | 0.02 |
QuadVector | 5120 | 264.532 ns | 2.3371 ns | 2.1861 ns | 1.01 | 0.01 |
Find_AVX_256_Optimized | 6144 | 312.455 ns | 3.5792 ns | 3.3480 ns | 1.00 | 0.00 |
Vector | 6144 | 379.527 ns | 2.0646 ns | 1.8302 ns | 1.22 | 0.01 |
QuadVector | 6144 | 312.909 ns | 3.1891 ns | 2.9831 ns | 1.00 | 0.01 |
Find_AVX_256_Optimized | 7168 | 359.087 ns | 2.5173 ns | 2.3547 ns | 1.00 | 0.00 |
Vector | 7168 | 440.072 ns | 2.7371 ns | 2.4263 ns | 1.23 | 0.01 |
QuadVector | 7168 | 363.538 ns | 3.3341 ns | 3.1188 ns | 1.01 | 0.01 |
Find_AVX_256_Optimized | 8192 | 429.126 ns | 1.4646 ns | 1.2230 ns | 1.00 | 0.00 |
Vector | 8192 | 535.986 ns | 4.9168 ns | 4.5992 ns | 1.25 | 0.01 |
QuadVector | 8192 | 446.624 ns | 3.3010 ns | 2.9263 ns | 1.04 | 0.01 |
> And so now, with this change, we get the niceness of the "normal" ReadOnlySpan<> without extra nastiness, and basically the same performance as your version.
Indeed, it's a nice trick!
But also, as you noticed, this introduces a slight regression for small workloads. I believe that's because the usage of the QuadVector struct causes 136 bytes of stack utilization, while the AVX 256 version does not use the stack at all (it doesn't overuse SIMD registers):
```asm
G_M000_IG01:   ;; offset=0000H
4157           push r15
4156           push r14
57             push rdi
56             push rsi
55             push rbp
53             push rbx
4881EC88000000 sub rsp, 136
C5F877         vzeroupper
C5F829742470   vmovaps qword ptr [rsp+70H], xmm6
C5F8297C2460   vmovaps qword ptr [rsp+60H], xmm7
C57829442450   vmovaps qword ptr [rsp+50H], xmm8
C578294C2440   vmovaps qword ptr [rsp+40H], xmm9
C57829542430   vmovaps qword ptr [rsp+30H], xmm10
C578295C2420   vmovaps qword ptr [rsp+20H], xmm11
8BF2           mov esi, edx
```
For methods like this that are safeguarded (since you control the code, there is still no way for a user to go out of bounds), I'm OK trading Span for Unsafe ref if it gives the extra perf bits. This is often done in the .NET BCL.
Also, the initial code could be made slightly more readable by replacing the `ref int` with a small wrapper ref struct and an indexer. The code would then be similar to your QuadVector version.
```csharp
ref struct Vector256Ref
{
    private readonly ref int _pv;

    public Vector256Ref(ref int pv) => _pv = ref pv;

    public Vector256<int> this[nint index]
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => Unsafe.As<int, Vector256<int>>(ref Unsafe.Add(ref _pv, index));
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public int GetElement(nint index) => Unsafe.Add(ref _pv, index);
}
```
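One detail worth spelling out: since Unsafe.Add on a ref int advances one int at a time, the indexer takes an element index, not a vector index, so a loop has to step by Vector256<int>.Count. A self-contained sketch of how such a wrapper might be used (my own code and names, not from the comment; the vector path assumes AVX):

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

int[] data = new int[100];
data[37] = 7;
Console.WriteLine(IndexOfVectorized(data, 7));  // 37 on AVX hardware

static int IndexOfVectorized(ReadOnlySpan<int> data, int value)
{
    ref int p = ref MemoryMarshal.GetReference(data);
    var vr = new Vector256Ref(ref p);
    var compareValue = Vector256.Create(value);
    nint i = 0;
    // Element index, not vector index: advance by Vector256<int>.Count.
    for (; i <= data.Length - Vector256<int>.Count; i += Vector256<int>.Count)
    {
        var r = Vector256.Equals(vr[i], compareValue);
        if (r != Vector256<int>.Zero)
            return (int)i + BitOperations.TrailingZeroCount((uint)Avx.MoveMask(r.AsSingle()));
    }
    // Scalar tail through the same wrapper
    for (; i < data.Length; i++)
        if (vr.GetElement(i) == value)
            return (int)i;
    return -1;
}

ref struct Vector256Ref
{
    private readonly ref int _pv;

    public Vector256Ref(ref int pv) => _pv = ref pv;

    public Vector256<int> this[nint index]
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => Unsafe.As<int, Vector256<int>>(ref Unsafe.Add(ref _pv, index));
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public int GetElement(nint index) => Unsafe.Add(ref _pv, index);
}
```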
Hmmm. It shouldn't even need a QuadVector on the stack, as it's just referenced directly from the Span pointer...
So I'm not certain that the following is the reason, but I think it's plausible, and it follows the empirical evidence...
Reading the CLR ABI guide, it says a stack frame is set up based on various heuristics in a call to `Compiler::rpMustCreateEBPFrame()`, so I pondered whether, although the property for the QuadVector size was being inlined, maybe it wasn't known at that point. So I manually inlined the QuadVector256 properties `Count`/`VectorCount` (i.e. replacing `QuadVector256<int>.Count` with `Vector256<int>.Count * 4`, and `QuadVector256<int>.VectorCount` with `4`) and then checked, and it indeed removed the creation of the stack frame that you had noted...
Hooray!
...BUT... it didn't actually improve times, on my CPU anyway... Maybe it's still caused by the initial bounds check on the Span, which is probably still a good and desirable thing anyway...
Anyway, I think both of the things I found could be fixed with better JIT code: per the first post, it should be able to offset directly from the span pointer without the additional register (and hence no need for the QuadVector struct), and with smarter heuristics it should be able to determine that it doesn't need to build a stack frame...
Right, but there is also a stark reality: these might never get fixed, for the reason that some patterns are too niche to pay the cost of new pattern matching in the JIT optimizer. Even though tiered compilation is a thing, the JIT team still pays attention to compiler throughput, so you need a really compelling case to bring in a particular optimization. That's the difference between a JIT and a full C++ compiler, where the C++ compiler will accumulate, over time, a huge number of specialized optimizations, at the expense of compilation times.
If you look at simple things like `string.IndexOf`, you will see that they use `ref` pointers and `Unsafe.Add` extensively, so it is an approach that is used a lot in the BCL these days. That's why I mentioned it in this blog post: it is quite a common optimization that many folks are not aware of...
My recommendation for optimizing: start with a Span implementation, benchmark, and look at the generated code. If it doesn't look perfect or you are not satisfied with it, and you still want to hit the most optimized version, then go after more low-level constructs like `ref`/`Unsafe.Add`. It's also fine to stay with a Span implementation. I just want to highlight that, no matter what, we will sometimes need to work around weaknesses of the .NET JIT compiler to bring in such optimizations. 😉
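The two stages of that recommendation can be sketched like this (my own code, not the post's; the second form mirrors the `ref`/`Unsafe.Add` style used in the BCL):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

Console.WriteLine(IndexOfSpan(new[] { 5, 9, 2 }, 9));  // 1
Console.WriteLine(IndexOfRef(new[] { 5, 9, 2 }, 2));   // 2

// Stage 1: plain Span. The canonical i < data.Length loop lets the JIT
// eliminate bounds checks on data[i].
static int IndexOfSpan(ReadOnlySpan<int> data, int value)
{
    for (int i = 0; i < data.Length; i++)
        if (data[i] == value)
            return i;
    return -1;
}

// Stage 2: if the generated code still isn't good enough, drop to a raw
// ref plus Unsafe.Add, skipping the span indexer in the hot loop.
static int IndexOfRef(ReadOnlySpan<int> data, int value)
{
    ref int p = ref MemoryMarshal.GetReference(data);
    for (nint i = 0; i < data.Length; i++)
        if (Unsafe.Add(ref p, i) == value)
            return (int)i;
    return -1;
}
```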
Using a ref here allows the GC to update this pointer if necessary (if the pointed-to data is moved by the GC), while a fixed/pinned statement would block the GC from moving this data around and might cause GC heap fragmentation.
Given the optimizations we're doing here, we can assume this is a hot path, perhaps the hottest in our application. Wouldn't it make sense, from a caching perspective, to not allow the GC to move this data around? Presumably a move would cause a cache miss, which we really want to avoid here. By pinning, we can avoid that completely, no?
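For reference, the two approaches being compared look roughly like this (a sketch of mine, not from the thread; the pinned variant needs `<AllowUnsafeBlocks>` enabled):

```csharp
using System;
using System.Runtime.InteropServices;

int[] data = new int[1024];

// Option 1 (what the blog post does): a GC-tracked ref. The GC remains free
// to relocate the array during compaction and will fix this ref up.
ref int r = ref MemoryMarshal.GetReference(data.AsSpan());
Console.WriteLine(r); // 0

// Option 2 (what the comment proposes): pinning. The array cannot move for
// the lifetime of the fixed block.
unsafe
{
    fixed (int* p = data)
    {
        // p is stable here: no relocation, so no relocation-induced cache
        // miss, at the cost of constraining the compacting GC.
        Console.WriteLine(*p); // 0
    }
}
```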
Comments about 10x Performance with SIMD Vectorized Code in C#/.NET