weidai11 / cryptopp

free C++ class library of cryptographic schemes
https://cryptopp.com

XLC and TEA hang during benchmarks at -O3 #503

Closed noloader closed 7 years ago

noloader commented 7 years ago

At -O3 it looks like IBM's XL C/C++ is not generating the code we expect. The benchmarks hang after the RC5 line is printed, which is where the TEA benchmark runs.

# GCC112
$ ./cryptest.exe b2 2 3.2
...
<TR><TH>RC5 (r=16)<TD>59<TD>52.0<TD>1.212<TD>3877
!! Hang Here !!

And:

# GCC119
$ ./cryptest.exe b2 2 4.1
...
<TR><TH>RC5 (r=16)<TD>59<TD>52.0<TD>1.212<TD>3877
!! Hang Here !!

In the backtrace below, Rijndael_Enc_AdvancedProcessBlocks_POWER8 is the AES/OFB random number generator. Don't get distracted by it. The issue lies in TEA::Enc: while (sum != m_limit) is the loop control for TEA::Enc::ProcessAndXorBlock, and stepping in gdb never leaves that line.

<TR><TH>IDEA/CTR (128-bit key)<TD>56<TD>51.0<TD>0.335<TD>1005
<TR><TH>RC5 (r=16)<TD>59<TD>48.7<TD>1.192<TD>3577
^C
Program received signal SIGINT, Interrupt.
0x000000001075a7e8 in CryptoPP::TEA::Enc::ProcessAndXorBlock (
    this=0x3fffffffc250, inBlock=0x4400 <Address 0x4400 out of bounds>,
    xorBlock=0x107b4c18 <CryptoPP::Rijndael_Enc_AdvancedProcessBlocks_POWER8(unsigned int const*, unsigned long, unsigned char const*, unsigned char const*, unsigned char*, unsigned long, unsigned int)+88> "",
    outBlock=0x1 <Address 0x1 out of bounds>, this=0x3fffffffc250,
    inBlock=0x4400 <Address 0x4400 out of bounds>,
    xorBlock=0x107b4c18 <CryptoPP::Rijndael_Enc_AdvancedProcessBlocks_POWER8(unsigned int const*, unsigned long, unsigned char const*, unsigned char const*, unsigned char*, unsigned long, unsigned int)+88> "",
    outBlock=0x1 <Address 0x1 out of bounds>, this=0x3fffffffc250,
    inBlock=0x4400 <Address 0x4400 out of bounds>,
    xorBlock=0x107b4c18 <CryptoPP::Rijndael_Enc_AdvancedProcessBlocks_POWER8(unsigned int const*, unsigned long, unsigned char const*, unsigned char const*, unsigned char*, unsigned long, unsigned int)+88> "",
    outBlock=0x1 <Address 0x1 out of bounds>) at ./secblock.h:534
534                     {return m_ptr;}
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.1.ppc64le libgcc-4.8.5-11.el7.ppc64le libstdc++-4.8.5-11.el7.ppc64le
(gdb) n
26              while (sum != m_limit)
(gdb)
534                     {return m_ptr;}
(gdb)
26              while (sum != m_limit)
(gdb)
534                     {return m_ptr;}
(gdb)
26              while (sum != m_limit)
(gdb)
534                     {return m_ptr;}
...
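
For readers who have not looked at tea.cpp, here is a minimal standalone sketch of the loop shape under discussion. It is not the library source: the names `tea_encrypt`, `tea_decrypt`, `k`, and `limit` are illustrative stand-ins, and it assumes the limit is computed as rounds * DELTA in 32-bit unsigned arithmetic. Because DELTA is odd, `sum` can only equal the limit after exactly `rounds` additions, so a correctly compiled loop always terminates.

```cpp
#include <cstdint>

typedef std::uint32_t word32;            // stand-in for the library typedef
static const word32 DELTA = 0x9e3779b9;  // TEA constant

// Encrypts the block (y, z) in place. Returns the number of loop
// iterations so the termination condition can be checked by a caller.
inline unsigned tea_encrypt(word32& y, word32& z, const word32 k[4], unsigned rounds)
{
    // Assumption: the limit is rounds * DELTA, wrapping modulo 2^32.
    const word32 limit = static_cast<word32>(rounds) * DELTA;
    word32 sum = 0;
    unsigned iterations = 0;
    while (sum != limit)                 // the loop control that hangs at -O3
    {
        sum += DELTA;
        y += ((z << 4) + k[0]) ^ (z + sum) ^ ((z >> 5) + k[1]);
        z += ((y << 4) + k[2]) ^ (y + sum) ^ ((y >> 5) + k[3]);
        ++iterations;
    }
    return iterations;
}

// Inverse of tea_encrypt: counts sum back down from the limit to zero.
inline void tea_decrypt(word32& y, word32& z, const word32 k[4], unsigned rounds)
{
    word32 sum = static_cast<word32>(rounds) * DELTA;
    while (sum != 0)
    {
        z -= ((y << 4) + k[2]) ^ (y + sum) ^ ((y >> 5) + k[3]);
        y -= ((z << 4) + k[0]) ^ (z + sum) ^ ((z >> 5) + k[1]);
        sum -= DELTA;
    }
}
```

With 32 rounds the encrypt loop should run exactly 32 times, and decrypting the result should restore the original block. A hang means the compiled comparison never sees `sum` reach the limit.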
noloader commented 7 years ago

We were not able to get XL C/C++ to generate good code for us at -O3. We reverted Commit aa348abd1532 for the moment.

I would prefer to compile tea.cpp at -O2 through use of a pragma, but I cannot find the pragma in the IBM compiler manual. Now open on Stack Overflow: IBM XL C/C++ equivalent to #pragma GCC optimize.
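
For reference, this is roughly what the GCC facility mentioned above looks like when it is scoped to a few functions. It is a sketch only: the function `rounds_to_limit` is a hypothetical placeholder, the guard macro choice is an assumption, and the XL C/C++ equivalent is deliberately not shown because it is the open question.

```cpp
#include <cstdint>

typedef std::uint32_t word32;            // stand-in for the library typedef
static const word32 DELTA = 0x9e3779b9;  // TEA constant

// GCC only: save the current optimization options, then request O2 for
// the functions defined until the matching pop_options.
#if defined(__GNUC__) && !defined(__clang__)
# pragma GCC push_options
# pragma GCC optimize ("O2")
#endif

// Hypothetical placeholder for code that should not be built at -O3.
// Counts how many DELTA steps it takes for sum to reach limit.
unsigned rounds_to_limit(word32 limit)
{
    word32 sum = 0;
    unsigned n = 0;
    while (sum != limit)
    {
        sum += DELTA;
        ++n;
    }
    return n;
}

#if defined(__GNUC__) && !defined(__clang__)
# pragma GCC pop_options                 // restore the command-line options
#endif
```

Compilers that do not take the guarded branch build the function with whatever optimization level was given on the command line.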

noloader commented 7 years ago

The issue was cleared at Commit fc0867827e55 with the following change. The change was made in four places to TEA and XTEA encryption and decryption.

-   word32 y, z;
+   word32 y, z, sum = 0;
    Block::Get(inBlock)(y)(z);

-   word32 sum = 0;
-   while (sum != m_limit)
+   // http://github.com/weidai11/cryptopp/issues/503
+   while (*const_cast<volatile word32*>(&sum) != m_limit)
    {
        sum += DELTA;
        y += ((z << 4) + m_k[0]) ^ (z + sum) ^ ((z >> 5) + m_k[1]);

Somewhat ironically, changing sum to volatile did not fix the issue. Because of that, we spent about 4 hours trying to rework the loop body when the problem was in the loop control.
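
The idiom in the diff can be isolated into a small standalone sketch. The helper names `volatile_read` and `count_rounds` are hypothetical and not part of the library. The cast asks the compiler to treat each read of `sum` in the loop condition as a volatile access, while `sum` itself stays an ordinary variable in the loop body. This sketch only shows the shape of the workaround; it does not reproduce the XL C/C++ miscompile.

```cpp
#include <cstdint>

typedef std::uint32_t word32;            // stand-in for the library typedef
static const word32 DELTA = 0x9e3779b9;  // TEA constant

// Reads 'value' through a volatile-qualified lvalue. The object itself
// is not volatile; only this one access is.
inline word32 volatile_read(const word32& value)
{
    return *const_cast<const volatile word32*>(&value);
}

// Same loop control as the patched TEA code, with an empty round body.
// Returns the number of iterations needed for sum to reach limit.
inline unsigned count_rounds(word32 limit)
{
    word32 sum = 0;
    unsigned n = 0;
    while (volatile_read(sum) != limit)
    {
        sum += DELTA;   // ordinary, non-volatile update
        ++n;
    }
    return n;
}
```

Keeping the volatile access confined to the comparison leaves the arithmetic in the round body free to be optimized as before.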


Changing the code to the following:

word32 sum = 0;
while (sum != m_limit)
{
    sum += DELTA;
    volatile word32 t1 = ((z << 4) + m_k[0]);
    y += t1 ^ (z + sum) ^ ((z >> 5) + m_k[1]);
    volatile word32 t2 = ((y << 4) + m_k[2]);
    z += t2 ^ (y + sum) ^ ((y >> 5) + m_k[3]);
}

results in a segmentation fault:

$ ./cryptest.exe  tv all
Using seed: 1505560067
...

Testing SymmetricCipher algorithm TEA/ECB.
Segmentation fault
noloader commented 7 years ago

Benchmarks are in for GCC on a modern Skylake I use for testing. Don't ask me how or why, but TEA and XTEA run faster with the volatile accesses. Prior to the change TEA was pushing data at 48.6 cpb. After the change the cost dropped to 41.3 cpb (lower is better), and the MiB/second figure in the first column rose from 62 to 73.

BEFORE

<TR><TH>TEA/CTR (128-bit key)<TD>62<TD>48.6<TD>0.258<TD>811
<TR><TH>XTEA/CTR (128-bit key)<TD>56<TD>54.0<TD>0.260<TD>816

AFTER

<TR><TH>TEA/CTR (128-bit key)<TD>73<TD>41.3<TD>0.200<TD>630
<TR><TH>XTEA/CTR (128-bit key)<TD>63<TD>47.6<TD>0.201<TD>634
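
To put the BEFORE and AFTER rows in proportion, the arithmetic can be sketched as below. The helper names are made up for illustration, and the sketch assumes the first two numeric columns are MiB/second and cycles per byte. The figures are the rounded values printed in the rows, so the results are approximate.

```cpp
// Percentage reduction in cycles per byte between two benchmark runs.
inline double cpb_reduction_percent(double before, double after)
{
    return (before - after) / before * 100.0;
}

// Clock rate in GHz implied by a row: bytes per second times cycles per byte.
// Assumes the throughput column is in MiB (2^20 bytes) per second.
inline double implied_clock_ghz(double mib_per_second, double cycles_per_byte)
{
    return mib_per_second * 1048576.0 * cycles_per_byte / 1e9;
}
```

For TEA, `cpb_reduction_percent(48.6, 41.3)` is about 15 percent, and for XTEA, `cpb_reduction_percent(54.0, 47.6)` is about 12 percent. Both TEA rows imply roughly the same clock rate, which suggests the two runs are directly comparable.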