weidai11 / cryptopp

free C++ class library of cryptographic schemes
https://cryptopp.com

BLAKE2b NEON suffers poor performance on ARMv8/Aarch64 with Cortex-A57 #367

Closed noloader closed 7 years ago

noloader commented 7 years ago

Here's an ugly result for BLAKE2 testing with Crypto++ and Botan on ARMv8/Aarch64 with Cortex-A57. Cortex-A53 is OK: NEON does not slow it down, and it runs at about the same speed for both CXX and NEON.

A57, Crypto++ (3 second benchmark):

A57, Botan (speed test, 3000 ms):

The astute reader will realize those numbers should be inverted :(

noloader commented 7 years ago

From a private email when talking with the BLAKE2 team:

I got a private email about poor performance for BLAKE2. I could not get all the details, but I was able to duplicate it at the GCC compile farm using GCC117. GCC117 is an ARMv8/Aarch64 machine with an 8-core Cortex-A57. ...

The problem is/was, CXX outperforms NEON on a Cortex-A57 (ARM servers, like SoftIron Overdrive 1000). CXX and NEON run the same on a Cortex-A53 (Pine64, HiKey, etc). A7, A8 and A9 perform as expected. It was as if CXX and NEON were inverted under A57. ...

There's a small wildcard still in play. Aarch32 is ARMv8 but in 32-bit mode. NEON is still enabled for Aarch32 because we effectively disable NEON only for 64-bit ARM. We don't know how it's going to perform because we don't have a test device.

From the A57 optimization guide (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.uan0015b/index.html), NEON addition and xor have throughput 2, just like regular instructions. But unlike regular instructions, where rotation is nearly free, NEON needs to waste time with slow-ish (only one per cycle) shift instructions. On balance, NEON loses out, not to mention the extra message permutation stuff. Makes sense. I can't find any detailed information for A53, though. It probably can't do 2 independent instructions per cycle, I guess.
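To make the shift argument above concrete, here is a minimal sketch of a 64-bit rotate-right written both ways. It is not taken from the Crypto++ or BLAKE2 sources; the names rotr64, Lanes2, rotr64x2_model and rotr64x2_neon are made up for illustration. The scalar version is something a compiler can usually turn into a single rotate, while the generic vector version needs two shifts plus an OR per rotation. The portable two-lane model exists only so the sketch builds on non-ARM hosts; the NEON variant is compiled only when the compiler advertises NEON.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
# include <arm_neon.h>
#endif

// Scalar rotate-right. Compilers commonly recognise this idiom and emit a
// single rotate instruction for it.
inline std::uint64_t rotr64(std::uint64_t x, unsigned r)
{
    r &= 63;
    return (x >> r) | (x << ((64 - r) & 63));
}

// Hypothetical two-lane stand-in for a 128-bit vector register holding two
// 64-bit words. It only models the instruction count, not real SIMD.
struct Lanes2 { std::uint64_t v[2]; };

// Generic vector-style rotate: shift right, shift left, then OR.
// That is three vector operations where the scalar path needs one.
template <unsigned R>
inline Lanes2 rotr64x2_model(Lanes2 a)
{
    static_assert(R > 0 && R < 64, "rotation must be in 1..63");
    Lanes2 lo, hi, out;
    for (int i = 0; i < 2; ++i) lo.v[i] = a.v[i] >> R;          // vector shift right
    for (int i = 0; i < 2; ++i) hi.v[i] = a.v[i] << (64 - R);   // vector shift left
    for (int i = 0; i < 2; ++i) out.v[i] = lo.v[i] | hi.v[i];   // vector OR
    return out;
}

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
// The same three-operation pattern with NEON intrinsics.
template <unsigned R>
inline uint64x2_t rotr64x2_neon(uint64x2_t a)
{
    static_assert(R > 0 && R < 64, "rotation must be in 1..63");
    return vorrq_u64(vshrq_n_u64(a, R), vshlq_n_u64(a, 64 - R));
}
#endif

// Small self-check: the two-lane model must agree with the scalar rotate.
inline void check_rotate_model()
{
    const Lanes2 in = {{0x0123456789ABCDEFULL, 1ULL}};
    const Lanes2 out = rotr64x2_model<24>(in);
    assert(out.v[0] == rotr64(in.v[0], 24));
    assert(out.v[1] == rotr64(in.v[1], 24));
}
```

BLAKE2b rotates by 32, 24, 16 and 63 bits in every G call, so whatever a rotation costs is paid many times per block. Some rotation amounts can be done with cheaper byte-permute tricks in NEON; the sketch only shows the generic shift-based form that the comment above is talking about.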

The Aarch32 case is interesting; it doubles all of the necessary instructions for the BLAKE2b general-purpose register case, so I would expect performance to be more evenly matched between NEON and C, but it may still not be worth enabling NEON. On the other hand, it will probably make no difference whatsoever for BLAKE2s, so NEON should remain disabled there.
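As a rough illustration of the "doubles all of the necessary instructions" point, the sketch below emulates BLAKE2b's 64-bit add, xor and rotate using only 32-bit halves, which is roughly what a compiler has to do with 32-bit general-purpose registers. It is an illustrative model, not code from Crypto++ or the BLAKE2 reference; U64Pair, split, join, add64, xor64, rotr64_by32 and rotr64_small are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of one 64-bit word held in two 32-bit registers.
struct U64Pair { std::uint32_t lo, hi; };

inline U64Pair split(std::uint64_t x)
{
    return { static_cast<std::uint32_t>(x), static_cast<std::uint32_t>(x >> 32) };
}

inline std::uint64_t join(U64Pair p)
{
    return (static_cast<std::uint64_t>(p.hi) << 32) | p.lo;
}

// 64-bit add: one add for the low half, one add-with-carry for the high half.
inline U64Pair add64(U64Pair a, U64Pair b)
{
    U64Pair r;
    r.lo = a.lo + b.lo;
    const std::uint32_t carry = (r.lo < a.lo) ? 1u : 0u;  // wrapped => carry out
    r.hi = a.hi + b.hi + carry;
    return r;
}

// 64-bit xor: one xor per half.
inline U64Pair xor64(U64Pair a, U64Pair b)
{
    return { a.lo ^ b.lo, a.hi ^ b.hi };
}

// Rotate by exactly 32 just swaps the halves.
inline U64Pair rotr64_by32(U64Pair a)
{
    return { a.hi, a.lo };
}

// Rotate right by 1..31: each output half mixes bits from both input halves.
template <unsigned R>
inline U64Pair rotr64_small(U64Pair a)
{
    static_assert(R > 0 && R < 32, "this helper handles 1..31 only");
    return { (a.lo >> R) | (a.hi << (32 - R)),
             (a.hi >> R) | (a.lo << (32 - R)) };
}

// Small self-check against native 64-bit arithmetic.
inline void check_pair_model()
{
    const std::uint64_t x = 0x0123456789ABCDEFULL, y = 0xFFFFFFFF00000001ULL;
    assert(join(add64(split(x), split(y))) == x + y);
    assert(join(xor64(split(x), split(y))) == (x ^ y));
}
```

BLAKE2s works on 32-bit words, so none of this splitting applies to it, which fits the expectation above that Aarch32 changes nothing for BLAKE2s.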

noloader commented 7 years ago

We disabled NEON for Cortex-A53 and A57 processors. Also see Commit 9dd2744419181e9c and Commit 6e1a02151174a28e.

The problem surfaced again when we cut in SPECK-128, so we added a define, CRYPTOPP_SLOW_ARMV8_SHIFT, to help isolate the code and control the slower code paths. Also see Commit b08596da4466.
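For readers unfamiliar with this kind of switch, here is a minimal sketch of how a compile-time define can steer code away from a slow vector path. How Crypto++ actually wires up CRYPTOPP_SLOW_ARMV8_SHIFT is not shown in this thread, so the sketch uses a stand-in macro, DEMO_SLOW_ARMV8_SHIFT, and hypothetical helpers (RotatePath, choose_path, selected_rotate_path). The one assumption borrowed from the discussion is that the slow-shift case is tied to 64-bit ARM while Aarch32 keeps NEON.

```cpp
#include <cassert>
#include <cstdint>

// HYPOTHETICAL stand-in for a library define such as CRYPTOPP_SLOW_ARMV8_SHIFT.
// Assumption: vector shifts are treated as slow on every 64-bit ARM target.
#if defined(__aarch64__) || defined(_M_ARM64)
# define DEMO_SLOW_ARMV8_SHIFT 1
#else
# define DEMO_SLOW_ARMV8_SHIFT 0
#endif

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
# define DEMO_HAS_NEON 1
#else
# define DEMO_HAS_NEON 0
#endif

enum class RotatePath { Scalar, Neon };

// Pure policy function: use NEON only if it exists and shifts are not
// flagged as slow. Kept separate from the macros so it can be unit tested.
constexpr RotatePath choose_path(bool has_neon, bool slow_shift)
{
    return (has_neon && !slow_shift) ? RotatePath::Neon : RotatePath::Scalar;
}

// The path this particular build would take.
constexpr RotatePath selected_rotate_path()
{
    return choose_path(DEMO_HAS_NEON != 0, DEMO_SLOW_ARMV8_SHIFT != 0);
}

// Small self-check of the policy table.
inline void check_policy()
{
    assert(choose_path(true,  true)  == RotatePath::Scalar);  // e.g. Aarch64
    assert(choose_path(true,  false) == RotatePath::Neon);    // e.g. Aarch32, ARMv7
    assert(choose_path(false, false) == RotatePath::Scalar);  // no NEON at all
}
```

Keeping the decision in one define means a later algorithm with the same shift-heavy profile can reuse the same switch instead of repeating the platform checks.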