[Closed] pawelsopensource closed this issue 1 year ago
void CRC32C_Update_SSE42(const byte *s, size_t n, word32& c)
{
    word32 u = c;  // local state: no need to update cache memory in each loop cycle
    for (; !IsAligned<uint64_t>(s) && n > 0; s++, n--)
        u = _mm_crc32_u8(u, *s);

    uint64_t q = u;  // the crc32q result is stored in 64 bits (zero-extended)
#if defined(__GNUC__) || defined(__clang__)
# pragma GCC unroll 8
#endif
    for (; n >= 8; s += 8, n -= 8)
        q = _mm_crc32_u64(q, *reinterpret_cast<const uint64_t*>(s));
    u = static_cast<word32>(q);

    for (; n > 0; s++, n--)
        u = _mm_crc32_u8(u, *s);
    c = u;
}
Thanks @pawelsopensource .
Looking at the source at https://github.com/weidai11/cryptopp/blob/master/crc_simd.cpp#L151, I'm wondering why we prefer the 32-bit version.
It was probably related to porting the C++ version to intrinsics. But it may have been due to 32-bit Core 2s: they had SSE4.2, but they were 32-bit.
I wrote the code, but I don't remember that far back. The curse of old age...
for (; n >= 8; s += 8, n -= 8)
    q = _mm_crc32_u64(q, *reinterpret_cast<const uint64_t*>(s));
You are probably going to get better code generation by chaining the _mm_crc32_u64 calls together. Also see Commit c40b43346b66.
Prior to c40b43346b66, the generated code used 3 or 4 intermediate stores, if I recall correctly.
Chaining looks like a workaround for the `word32& c` updates (stores). My version uses a local variable (register) for the current CRC state and writes the result back only at the end. clang-16/gcc-11 loop unrolling seems to produce quite optimal code.
Thanks @pawelsopensource.
clang-16/gcc-11 loops unrolling seem to produce quite optimal code.
Yeah, so we have other compilers to think about, too.
@pawelsopensource,
I'm getting good results with the following patch. It enjoys the 64-bit speedup, it is safe on 32-bit platforms, and it does not penalize some compilers like MSVC.
What do you think?
$ git diff
diff --git a/crc_simd.cpp b/crc_simd.cpp
index c1a0725f..0014a2d5 100644
--- a/crc_simd.cpp
+++ b/crc_simd.cpp
@@ -33,6 +33,7 @@
#endif
#define CONST_WORD32_CAST(x) ((const word32 *)(void*)(x))
+#define CONST_WORD64_CAST(x) ((const word64 *)(void*)(x))
// Squash MS LNK4221 and libtool warnings
extern const char CRC_SIMD_FNAME[] = __FILE__;
@@ -151,21 +152,41 @@ void CRC32C_Update_ARMV8(const byte *s, size_t n, word32& c)
#if (CRYPTOPP_SSE42_AVAILABLE)
void CRC32C_Update_SSE42(const byte *s, size_t n, word32& c)
{
+ // Temporary due to https://github.com/weidai11/cryptopp/issues/1202
+ word32 v = c;
+
+ // 64-bit code path due to https://github.com/weidai11/cryptopp/issues/1202
+#if CRYPTOPP_BOOL_X64
+ for(; !IsAligned<word64>(s) && n > 0; s++, n--)
+ v = _mm_crc32_u8(v, *s);
+#else
for(; !IsAligned<word32>(s) && n > 0; s++, n--)
- c = _mm_crc32_u8(c, *s);
+ v = _mm_crc32_u8(v, *s);
+#endif
+
+#if CRYPTOPP_BOOL_X64
+ for(; n >= 32; s+=32, n-=32)
+ {
+ v = _mm_crc32_u64(_mm_crc32_u64(_mm_crc32_u64(_mm_crc32_u64(v,
+ *CONST_WORD64_CAST(s+ 0)), *CONST_WORD64_CAST(s+ 8)),
+ *CONST_WORD64_CAST(s+16)), *CONST_WORD64_CAST(s+24));
+ }
+#endif
for(; n >= 16; s+=16, n-=16)
{
- c = _mm_crc32_u32(_mm_crc32_u32(_mm_crc32_u32(_mm_crc32_u32(c,
+ v = _mm_crc32_u32(_mm_crc32_u32(_mm_crc32_u32(_mm_crc32_u32(v,
*CONST_WORD32_CAST(s+ 0)), *CONST_WORD32_CAST(s+ 4)),
*CONST_WORD32_CAST(s+ 8)), *CONST_WORD32_CAST(s+12));
}
for(; n >= 4; s+=4, n-=4)
- c = _mm_crc32_u32(c, *CONST_WORD32_CAST(s));
+ v = _mm_crc32_u32(v, *CONST_WORD32_CAST(s));
for(; n > 0; s++, n--)
- c = _mm_crc32_u8(c, *s);
+ v = _mm_crc32_u8(v, *s);
+
+ c = v;
}
#endif
The patch looks fine. The 64-bit path with a manual 4x loop unroll improves CRC32C_Update_SSE42 from 3635 MiB/s to 7072 MiB/s on my machine (according to ./cryptest.exe b 5 3). A manual 8x loop unroll gives 7663 MiB/s; perhaps better utilization of the 64-byte L1 data cache lines and fewer loop iterations?
Thanks @pawelsopensource.
This should be cleared at Commit ee247f86a289.
I did not add the GCC extension for loop unrolling. Would you mind carrying that patch locally?
One more minor change: the _mm_crc32_u64 result needs a cast to fix an MSVC warning: crc_simd.cpp(170): warning C4244: '=': conversion from 'unsigned __int64' to 'CryptoPP::word32', possible loss of data
Thanks again @pawelsopensource.
The Intel Intrinsics Guide says the value is supposed to be 64-bit, not 32-bit. I think we should probably change the declaration and make a cast.
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_crc32_u64
The crc32 opcode can store its result in a 64-bit register, but only the lower 32 bits are used because it is still a CRC-32 algorithm, so the cast is safe. https://www.felixcloutier.com/x86/crc32
OK, let's see if we can get away with it.
Cleared at Commit e1c149f9a435.
The problem was on line 170: v = _mm_crc32_u64(...
Thanks again @pawelsopensource.
This should be cleared under MSVC with Commit 358d0cfecd59.
Someone could build with https://clang.llvm.org/docs/DiagnosticsReference.html#wshorten-64-to-32 and -Werror, so please just do the correct cast: v = (word32)_mm_crc32_u64(...
Hi, x86-64 provides the crc32q instruction (aka _mm_crc32_u64), which consumes 8 bytes at once. With this opcode, CRC32C_Update_SSE42() runs ~2x faster on my machine (AMD Ryzen 7 1700, 3 GHz).