[Closed] pawelsopensource closed this issue 1 year ago
void CRC32C_Update_SSE42(const byte *s, size_t n, word32& c)
{
    word32 u = c;  // local state: no need to update cache memory in each loop cycle
    for (; !IsAligned<uint64_t>(s) && n > 0; s++, n--)
        u = _mm_crc32_u8(u, *s);

    uint64_t q = u;  // the crc32q result is stored in 64 bits (zero-extended)
#if defined(__GNUC__) || defined(__clang__)
# pragma GCC unroll 8
#endif
    for (; n >= 8; s += 8, n -= 8)
        q = _mm_crc32_u64(q, *reinterpret_cast<const uint64_t*>(s));
    u = static_cast<word32>(q);

    for (; n > 0; s++, n--)
        u = _mm_crc32_u8(u, *s);
    c = u;
}
Thanks @pawelsopensource .
Looking at the source at https://github.com/weidai11/cryptopp/blob/master/crc_simd.cpp#L151, I'm wondering why we prefer the 32-bit version.
It was probably related to porting the C++ version to intrinsics. But it may have been due to 32-bit Core 2s: they had SSE4.2, but they were 32-bit.
I wrote the code, but I don't remember that far back. The curse of old age...
for (; n >= 8; s += 8, n -= 8)
    q = _mm_crc32_u64(q, *reinterpret_cast<const uint64_t*>(s));
You are probably going to get better code generation by chaining the _mm_crc32_u64 calls together. Also see Commit c40b43346b66.
Prior to c40b43346b66, the generated code used 3 or 4 intermediate stores, if I recall correctly.
Chaining looks like a workaround for the `word32& c` updates (stores). My version uses a local variable (register) for the current CRC state and writes the result back only at the end. clang-16/gcc-11 loop unrolling seems to produce quite optimal code.
Thanks @pawelsopensource.
clang-16/gcc-11 loops unrolling seem to produce quite optimal code.
Yeah, so we have other compilers to think about, too.
@pawelsopensource,
I'm getting good results with the following patch. It enjoys the 64-bit speedup, it is safe on 32-bit platforms, and it does not penalize some compilers like MSVC.
What do you think?
$ git diff
diff --git a/crc_simd.cpp b/crc_simd.cpp
index c1a0725f..0014a2d5 100644
--- a/crc_simd.cpp
+++ b/crc_simd.cpp
@@ -33,6 +33,7 @@
#endif
#define CONST_WORD32_CAST(x) ((const word32 *)(void*)(x))
+#define CONST_WORD64_CAST(x) ((const word64 *)(void*)(x))
// Squash MS LNK4221 and libtool warnings
extern const char CRC_SIMD_FNAME[] = __FILE__;
@@ -151,21 +152,41 @@ void CRC32C_Update_ARMV8(const byte *s, size_t n, word32& c)
#if (CRYPTOPP_SSE42_AVAILABLE)
void CRC32C_Update_SSE42(const byte *s, size_t n, word32& c)
{
+ // Temporary due to https://github.com/weidai11/cryptopp/issues/1202
+ word32 v = c;
+
+ // 64-bit code path due to https://github.com/weidai11/cryptopp/issues/1202
+#if CRYPTOPP_BOOL_X64
+ for(; !IsAligned<word64>(s) && n > 0; s++, n--)
+ v = _mm_crc32_u8(v, *s);
+#else
for(; !IsAligned<word32>(s) && n > 0; s++, n--)
- c = _mm_crc32_u8(c, *s);
+ v = _mm_crc32_u8(v, *s);
+#endif
+
+#if CRYPTOPP_BOOL_X64
+ for(; n >= 32; s+=32, n-=32)
+ {
+ v = _mm_crc32_u64(_mm_crc32_u64(_mm_crc32_u64(_mm_crc32_u64(v,
+ *CONST_WORD64_CAST(s+ 0)), *CONST_WORD64_CAST(s+ 8)),
+ *CONST_WORD64_CAST(s+16)), *CONST_WORD64_CAST(s+24));
+ }
+#endif
for(; n >= 16; s+=16, n-=16)
{
- c = _mm_crc32_u32(_mm_crc32_u32(_mm_crc32_u32(_mm_crc32_u32(c,
+ v = _mm_crc32_u32(_mm_crc32_u32(_mm_crc32_u32(_mm_crc32_u32(v,
*CONST_WORD32_CAST(s+ 0)), *CONST_WORD32_CAST(s+ 4)),
*CONST_WORD32_CAST(s+ 8)), *CONST_WORD32_CAST(s+12));
}
for(; n >= 4; s+=4, n-=4)
- c = _mm_crc32_u32(c, *CONST_WORD32_CAST(s));
+ v = _mm_crc32_u32(v, *CONST_WORD32_CAST(s));
for(; n > 0; s++, n--)
- c = _mm_crc32_u8(c, *s);
+ v = _mm_crc32_u8(v, *s);
+
+ c = v;
}
#endif
The patch looks fine. The 64-bit path with a manual 4x loop unroll improves CRC32C_Update_SSE42 from 3635 MiB/s to 7072 MiB/s on my machine (according to ./cryptest.exe b 5 3). A manual 8x loop unroll gives 7663 MiB/s; perhaps better utilization of the 64-byte L1 data cache lines and fewer loop iterations?
Thanks @pawelsopensource.
This should be cleared at Commit ee247f86a289.
I did not add the GCC extension for loop unrolling. Would you mind carrying that patch locally?
One more minor change: the _mm_crc32_u64 result needs a cast to fix an MSVC warning: crc_simd.cpp(170): warning C4244: '=': conversion from 'unsigned __int64' to 'CryptoPP::word32', possible loss of data
Thanks again @pawelsopensource.
The Intel Intrinsics Guide says the value is supposed to be 64-bit, not 32-bit. I think we should probably change the declaration and make a cast.
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_crc32_u64
The crc32 opcode can store its result in a 64-bit register, but only the lower 32 bits are used because it is still a CRC-32 algorithm, so the cast is safe. https://www.felixcloutier.com/x86/crc32
OK, let's see if we can get away with it.
Cleared at Commit e1c149f9a435.
The problem was on line 170: v = _mm_crc32_u64(...
Thanks again @pawelsopensource.
This should be cleared under MSVC with Commit 358d0cfecd59.
Someone could build with https://clang.llvm.org/docs/DiagnosticsReference.html#wshorten-64-to-32 and -Werror, so please just do the correct cast: v = (word32)_mm_crc32_u64(...
Hi, x86-64 provides the crc32q instruction (aka _mm_crc32_u64), which consumes 8 bytes at once. With this opcode, CRC32C_Update_SSE42() runs ~2x faster on my machine (AMD Ryzen 7 1700, 3 GHz).