weidai11 / cryptopp

free C++ class library of cryptographic schemes
https://cryptopp.com

SHA512_HashBlock_SSE2 inefficient (and potentially buggy) asm for x32. MMX is unnecessary. #686

Closed: pcordes closed this issue 6 years ago

pcordes commented 6 years ago

https://github.com/weidai11/cryptopp/blob/28e20d6e5f920e48688c71456f9392a4a5f78135/sha.cpp#L963

    #if CRYPTOPP_SSE2_ASM_AVAILABLE && (CRYPTOPP_BOOL_X86 || CRYPTOPP_BOOL_X32)
    CRYPTOPP_NAKED void CRYPTOPP_FASTCALL SHA512_HashBlock_SSE2(word64 *state, const word64 *data)

This looks like it's mostly meant for 32-bit mode, but it's also used for the x86-64 x32 ABI.

push rbx is not safe in 64-bit mode: you clobber the red zone without telling the compiler, potentially breaking things if this inlines into a function with any locals, or even without optimization if this function spills its args. (https://stackoverflow.com/questions/34520013/using-base-pointer-register-in-c-inline-asm)

Better to simply ask for the input you want in "b" instead of "a" (or declare a clobber on ebx), avoiding a useless mov and letting the compiler save and restore EBX itself. (gcc 4.9 and earlier don't support EBX as an operand with -m32 -fPIC, but gcc 5 has no problem: https://godbolt.org/g/Ddzp1o).
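
For illustration, here's a minimal sketch of the "b"-constraint idea (assuming gcc 5+ or non-PIC so EBX is allowed as an operand; the table and helper names are made up, not the library's actual operand list):

    #include <cstdint>

    // Stand-in for the library's SHA512_K table, just to keep the sketch self-contained.
    static const std::uint64_t kTable[2] = { 0x428a2f98d728ae22ULL, 0x7137449123ef65cdULL };

    // A "b" input puts the pointer in EBX/RBX for us, so the asm template never
    // needs push/pop rbx or a mov from another register.
    inline std::uint32_t first_k_low32()
    {
        std::uint32_t lo;
        __asm__ ("movl (%1), %0" : "=r" (lo) : "b" (kTable));
        return lo;
    }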


    AS2( lea ebx, SHA512_K)

This line only runs in MSVC 32-bit mode, so it can't use RIP-relative addressing. The LEA is pointless; use mov ebx, OFFSET SHA512_K to save a byte.
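
In the library's Intel-syntax macro style that's a one-liner; the byte counts assume a plain 32-bit absolute address (mov r32, imm32 encodes in 5 bytes, lea r32, [disp32] in 6):

    AS2(    mov      ebx, OFFSET SHA512_K)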

Inside #if CRYPTOPP_BOOL_X32, you should use RSP instead of ESP. RSP is always correctly zero-extended, so you can save the address-size prefix. You could probably hack around that with a #define for ESP, and maybe another like EDI_PTR that expands to EDI in 32-bit mode and RDI in 64-bit mode. That said, x32 is unpopular enough that you can justify emitting slightly worse machine code for it just to keep the source simple, since it's only one extra byte in a few places with no performance penalty beyond code size.
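
A rough sketch of that #define idea; the macro names below are invented, and if the AS1/AS2 stringizing macros don't expand their arguments before stringizing, an extra level of macro indirection would be needed:

    #if CRYPTOPP_BOOL_X32
    # define REG_SP rsp
    # define REG_SI rsi
    # define REG_DI rdi
    #else
    # define REG_SP esp
    # define REG_SI esi
    # define REG_DI edi
    #endif
    // e.g. AS2( mov [REG_SI+8], eax ) would then emit the full-width register under x32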

More importantly, MMX looks pointless here. x32 has 64-bit integer registers, and plenty of them. It also has 16 XMM registers.

MMX paddq will hurt a lot on Bulldozer-family vs. integer add (because each pair of cores shares one SIMD unit), and maybe hurt slightly on most Intel where add runs on more ports. The extra movq2dq instructions don't help either.

    AS2(    movdq2q  mm0, xmm0)
    AS2(    movhlps  xmm1, xmm0)
    AS2(    paddq    mm0, [ebx+eax*8])
    AS2(    movlps   [esi], xmm0)
    AS2(    movlps   [esi+8], xmm1)
    AS2(    movlps   [esi+8*16], xmm0)
    AS2(    movlps   [esi+8*17], xmm1)

This is weird; is it optimized to avoid 16-byte unaligned stores or something? On modern hardware, movups will be better. On Nehalem and newer, movhps [esi+8], xmm0 is a single uop (but a single movups is probably still better even on a cache-line split, on CPUs where it's a single uop).

movhps-stores are 2 uops (including a shuffle) on Core2, Bulldozer-family, and Ryzen, though. So probably just movups unless you really care about Core2 / K8 where that's slower even if the address is aligned.
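
For reference, here's roughly what the store sequence quoted above could look like with movups. The movhlps stays because xmm1 still feeds the following round, and the two 16-byte stores write exactly the same bytes the four movlps stores did:

    AS2(    movdq2q  mm0, xmm0)
    AS2(    movhlps  xmm1, xmm0)
    AS2(    paddq    mm0, [ebx+eax*8])
    AS2(    movups   [esi], xmm0)
    AS2(    movups   [esi+8*16], xmm0)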


As I commented on the recent commit, don't define a function inside another function. Define it separately, e.g. with an asm() statement at global scope.
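
A minimal sketch of that shape; the label name and empty body are placeholders, not the library's code, and real code would also need to handle symbol-name decoration on targets that prefix an underscore:

    // at file scope, outside any function
    __asm__(
        ".text\n\t"
        ".p2align 4\n\t"
        "SHA512_Round_sketch:\n\t"
        // ... the existing round body, unchanged ...
        "ret\n\t"
    );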

noloader commented 6 years ago

@pcordes,

Thanks for the analysis. If you want to workup a patch to test your proposed changes then I am happy to test it.

I had some time to look at SHA512_Round and untangle what is going on (I think). Here is what is happening at a high level, in pseudocode. Recall SHA-512 uses 80 rounds: rounds 0-15 mix the message and state, and rounds 16-79 mix state alone. The end state is the hash (or additional message blocks can be processed).

void SHA512_HashBlock_SSE2(state, message)
{
    DO_PROLOGUE

    round=0
    jmp PREPARE_RNDS_00_15

    SHA512_Round
        ...  // process state
        return

    PREPARE_RNDS_00_15
        ... // arrange state+msg
        if (++round != 16)
            call SHA512_Round

    PREPARE_RNDS_16_79
        ...  // arrange state alone
        if (++round != 80)
            call SHA512_Round

    DO_EPILOGUE
}

So it looks like Wei is using the call to perform the branch. The call's push of EIP and the matching ret ensure the branch back into the body (either PREPARE_RNDS_00_15 or PREPARE_RNDS_16_79) lands at the right address.

Without the call and return, the library would need to manage the return address itself and branch to it from SHA512_Round. The return address would land in either PREPARE_RNDS_00_15 or PREPARE_RNDS_16_79.

I'm guessing Wei chose this pattern for two or three reasons. First is simplicity: the call and return are easy. Second is register pressure, especially on i386 where registers are scarce. Third is performance on a critical path: SHA512_Round is the heart of the algorithm.

Looking at the routine, I can't say for certain whether register pressure applies in this case, but I think it does on i386: EAX-EDX, ESI and EDI are all used in the routine.

If a freestanding SHA512_Round were added, it would be on the critical path. The function would be called 80 times, the epilogue takes 7 instructions, and the prologue takes 3 instructions. If my calculations are correct, the freestanding version would execute an additional 80 × (7 + 3) = 800 instructions on the critical path.


Now, on to remediations. We can't add a freestanding SHA512_Round because of the performance hit, which means we need to manage the return branch from SHA512_Round ourselves. My question is: what do you recommend for managing the return address? That is, how do we implement GO_BACK efficiently?

void SHA512_HashBlock_SSE2(state, message)
{
    DO_PROLOGUE

    round=0
    jmp PREPARE_RNDS_00_15

    SHA512_Round
        ...  // process state
        branch GO_BACK

    PREPARE_RNDS_00_15
        ... // arrange state+msg
        if (++round != 16)
            branch SHA512_Round

    PREPARE_RNDS_16_79
        ...  // arrange state alone
        if (++round != 80)
            branch SHA512_Round

    DO_EPILOGUE
}

One of the things that concerns me is that the code uses EBX, so we may not be able to count on the PIC register being valid until the asm is complete. I think that means a temporary could run into trouble:

void SHA512_HashBlock_SSE2(state, message)
{
    void* addr;
    __asm__
   (
    SHA512_Round
        ...  // process state
        jmp addr

    PREPARE_RNDS_00_15
        ... // arrange state+msg
        if (++round != 16)
            addr = PREPARE_RNDS_00_15
            jmp SHA512_Round

    PREPARE_RNDS_16_79
        ...  // arrange state alone
        if (++round != 80)
            addr = PREPARE_RNDS_16_79
            jmp SHA512_Round
   );
}
noloader commented 6 years ago

@pcordes,

I took a shot at converting the call into a jmp; the diff is below. The change produces link errors under GCC on my native i686 hardware (x86_64 does not use this code path). I think it can be reproduced with $ CXXFLAGS="-DNDEBUG -g2 -O3 -m32" make -j 5.

g++ -o cryptest.exe -DNDEBUG -g2 -O3 -pthread -pipe adhoc.o test.o bench1.o bench2.o bench3.o datatest.o dlltest.o fipsalgt.o validat0.o validat1.o validat2.o validat3.o validat4.o validat5.o validat6.o validat7.o validat8.o validat9.o validat10.o regtest1.o regtest2.o regtest3.o regtest4.o ./libcryptopp.a
./libcryptopp.a(sha.o): In function `CryptoPP::SHA512_HashBlock_SSE2(unsigned long long*, unsigned long long const*)':
/home/jwalton/cryptopp/sha.cpp:1168: undefined reference to `rnds_00_15'
/home/jwalton/cryptopp/sha.cpp:1168: undefined reference to `rnds_00_15'
/home/jwalton/cryptopp/sha.cpp:1168: undefined reference to `rnds_00_15'
/home/jwalton/cryptopp/sha.cpp:1168: undefined reference to `SHA512_Round'
collect2: error: ld returned 1 exit status
GNUmakefile:1019: recipe for target 'cryptest.exe' failed
make: *** [cryptest.exe] Error 1

Any ideas why GCC is failing to compile and link things?

$ git diff
diff --git a/sha.cpp b/sha.cpp
index 846afd91..cb77cef2 100644
--- a/sha.cpp
+++ b/sha.cpp
@@ -968,6 +968,7 @@ const word64 SHA512_K[80] = {

 CRYPTOPP_NAKED void CRYPTOPP_FASTCALL SHA512_HashBlock_SSE2(word64 *state, const word64 *data)
 {
+    word32 rnds_00_15;^M
 #ifdef __GNUC__
     __asm__ __volatile__
     (
@@ -1042,6 +1043,12 @@ CRYPTOPP_NAKED void CRYPTOPP_FASTCALL SHA512_HashBlock_SSE2(word64 *state, const
     AS2(    pxor     xmm7, xmm6)\
     AS2(    psrlq    r, c-b)\
     AS2(    pxor     r, xmm7)
+^M
+// Make these local labels^M
+#define SHA512_Round   3^M
+#define SHA512_00_15   4^M
+#define SHA512_16_80   5^M
+^M
     ASL(SHA512_Round)

     // k + w is in mm0, a is in mm4, e is in mm5
@@ -1068,7 +1075,11 @@ CRYPTOPP_NAKED void CRYPTOPP_FASTCALL SHA512_HashBlock_SSE2(word64 *state, const
     AS2(    paddq    mm4, mm1)            // a = temp + S0(a)
     AS2(    movq     [edi-8], mm4)
     AS2(    movq     [edi+7*8], mm4)
-    AS1(    ret)
+^M
+    //AS1(    ret)^M
+    AS2(    cmp      DWORD PTR [rnds_00_15], 1)^M
+    ASJ(    je,      SHA512_00_15, f)^M
+    ASJ(    jmp,     SHA512_16_80, f)^M

     // first 16 rounds
     ASL(0)
@@ -1076,7 +1087,11 @@ CRYPTOPP_NAKED void CRYPTOPP_FASTCALL SHA512_HashBlock_SSE2(word64 *state, const
     AS2(    movq     [esi+eax*8], mm0)
     AS2(    movq     [esi+eax*8+16*8], mm0)
     AS2(    paddq    mm0, [ebx+eax*8])
-    ASC(    call,    SHA512_Round)
+^M
+    // ASC(    call,    SHA512_Round)^M
+    AS2(    mov      DWORD PTR [rnds_00_15], 1)^M
+    ASJ(    jmp,     SHA512_Round, b)^M
+    ASL(SHA512_00_15)^M

     AS1(    inc      eax)
     AS2(    sub      edi, 8)
@@ -1088,6 +1103,7 @@ CRYPTOPP_NAKED void CRYPTOPP_FASTCALL SHA512_HashBlock_SSE2(word64 *state, const

     // rest of the rounds
     AS2(    movdqu   xmm0, [esi+(16-2)*8])
+^M
     ASL(1)
     // data expansion, W[i-2] already in xmm0
     AS2(    movdqu   xmm3, [esi])
@@ -1104,8 +1120,13 @@ CRYPTOPP_NAKED void CRYPTOPP_FASTCALL SHA512_HashBlock_SSE2(word64 *state, const
     AS2(    movlps   [esi+8], xmm1)
     AS2(    movlps   [esi+8*16], xmm0)
     AS2(    movlps   [esi+8*17], xmm1)
+^M
     // 2 rounds
-    ASC(    call,    SHA512_Round)
+    // ASC(    call,    SHA512_Round)^M
+    AS2(    mov      DWORD PTR [rnds_00_15], 0)^M
+    ASJ(    jmp,     SHA512_Round, b)^M
+    ASL(SHA512_16_80)^M
+^M
     AS2(    sub      edi, 8)
     AS2(    movdq2q  mm0, xmm1)
     AS2(    paddq    mm0, [ebx+eax*8+8])
@@ -1141,7 +1162,7 @@ CRYPTOPP_NAKED void CRYPTOPP_FASTCALL SHA512_HashBlock_SSE2(word64 *state, const
     AS_POP_IF86(    bx)
     ATT_PREFIX
         :
-        : "a" (SHA512_K), "c" (state), "d" (data)
+        : "a" (SHA512_K), "c" (state), "d" (data), [rnds_00_15] "m" (rnds_00_15)^M
         : "%esi", "%edi", "memory", "cc"
     );
 #else
noloader commented 6 years ago

Give up. I'm not wasting any more time on GCC's broken tools. I'd be embarrassed to create something so fucked up that a simple move can't be accomplished without reading the manual (yet again), hours of research, and questions on Stack Overflow.

@pcordes, I'd be happy to test your suggested changes. Thanks again for taking the time to look at it. I owe you a cold beer.

pcordes commented 6 years ago

> If a freestanding SHA512_Round was added then it would be on the critical path. The function would be called 80 times, the epilogue takes 7 instructions, and the prologue takes 3 instructions.

What the heck are you talking about? It's already a called function; just put the inline asm that defines the label at global scope (or inside a naked function on MSVC). Anywhere but inside the same asm block that calls it, which then has to jump over it outside the loop.

I wasn't talking about changing the body of SHA512_Round at all; you want the same instructions to run, just avoiding duplicating definitions if the containing C function is inlined. (Also you can avoid jumping over it in the containing inline-asm statement.)

You can't make it use a normal calling convention anyway without huge inefficiency, because currently all the vector registers are inputs and outputs, totally unlike any normal calling convention.


IDK if replacing call/ret is actually useful. call is only 2 uops, and ret is 1 (on Haswell). They will predict perfectly, unlike faking it with a conditional branch to choose one of two destinations.

Having two different loops and two different bodies, one for the early rounds and one for later rounds, could make sense.
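
A rough C-level sketch of that structure (not the library's code; the round body and the message schedule are only indicated by comments):

    #include <cstdint>

    void sha512_two_loops_sketch(std::uint64_t state[8], const std::uint64_t msg[16])
    {
        std::uint64_t W[80];
        for (int i = 0; i < 16; ++i) {
            W[i] = msg[i];
            // ...round body, copy #1: mix W[i] + K[i] into state...
        }
        for (int i = 16; i < 80; ++i) {
            // W[i] = sigma1(W[i-2]) + W[i-7] + sigma0(W[i-15]) + W[i-16]
            // ...round body, copy #2: the same instructions, duplicated inline...
        }
        (void)state;   // the real code updates state in both loops
    }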

pcordes commented 6 years ago

Disabling inline asm for x32 is a bit drastic, and presumably cripples performance there. I hope this is a temporary measure? Like I said, you can use a "b" input to avoid push/pop.

You could use an #ifdef to make it conditional if you want to keep compatibility with old gcc that doesn't let you use ebx in 32-bit PIC code. Either for the input itself, or for a mov r8, rbx (and declare a clobber on r8 where it exists), if you want to share the same asm for i386 and x32.
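
Something like this, where the macro name and the version test are only illustrative (the godbolt link above suggests the EBX restriction goes away around gcc 5 for -m32 -fPIC):

    // Pick the constraint for the SHA512_K pointer at compile time (hypothetical macro).
    #if defined(__GNUC__) && (__GNUC__ >= 5)
    # define CRYPTOPP_K_CONSTRAINT "b"   // let the compiler manage EBX/RBX
    #else
    # define CRYPTOPP_K_CONSTRAINT "a"   // old PIC-constrained gcc: keep the current scheme
    #endif
    // ...
    //     : CRYPTOPP_K_CONSTRAINT (SHA512_K), "c" (state), "d" (data)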

(It would make more sense to share the same asm for normal x86-64 and x32, though. Just make sure pointers are zero-extended if necessary on x32. Or maybe ask for them as uint64_t inputs to get the compiler to zero-extend your pointers on x32, and pass them unchanged on x86-64.)
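
A sketch of the uint64_t-input idea; the function and variable names are invented, and the empty asm template just stands in for the shared 64-bit round code:

    #include <cstdint>

    void hash_block_shared_sketch(std::uint64_t* state, const std::uint64_t* data)
    {
        // uintptr_t -> uint64_t zero-extends 32-bit pointers on x32 and is a no-op
        // on normal x86-64, so the asm can treat %0 and %1 as full 64-bit registers.
        std::uint64_t state_q = reinterpret_cast<std::uintptr_t>(state);
        std::uint64_t data_q  = reinterpret_cast<std::uintptr_t>(data);
        __asm__ __volatile__ (
            ""   // shared 64-bit round code would go here
            :
            : "r" (state_q), "r" (data_q)
            : "memory");
    }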