wolfSSL / wolfssl

The wolfSSL library is a small, fast, portable implementation of TLS/SSL for embedded devices to the cloud. wolfSSL supports up to TLS 1.3 and DTLS 1.3!
https://www.wolfssl.com
GNU General Public License v2.0

Kyber: Improve performance #7681

SparkiDev closed this 1 week ago

SparkiDev commented 1 week ago

Description

Unroll loops and use larger types. Allow the benchmark to run each Kyber parameter set separately. Allow the benchmark to accept -ml-dsa, which runs all parameters. Fix the Thumb2 ASM C code so that it has no duplicate includes or ifdef checks, and so that it includes error-crypt.h to avoid an empty translation unit. Check for WOLFSSL_SHA3 before including the Thumb2 SHA-3 assembly code.

Testing

Cross-compiled with -pedantic for armv7m and armv8 hosts, using both --enable-armasm and --enable-armasm=inline.

JacobBarthelmeh commented 1 week ago

Am I testing the performance improvement correctly? It could just be noise on the machine, but many of the performance numbers looked slower after the change when I ran it this way.

Master wolfSSL branch on a Mac M1

wolfssl % ./configure --enable-armasm --enable-kyber --enable-experimental -q
ld: warning: -single_module is obsolete
wolfssl % make &> /dev/null
wolfssl % ./wolfcrypt/benchmark/benchmark -kyber

------------------------------------------------------------------------------
 wolfSSL version 5.7.0
------------------------------------------------------------------------------
Math:   Multi-Precision: Wolf(SP) no-dyn-stack word-size=64 bits=4096 sp_int.c
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
KYBER512    128  key gen    110800 ops took 1.001 sec, avg 0.009 ms, 110743.101 ops/sec
KYBER512    128    encap     84400 ops took 1.000 sec, avg 0.012 ms, 84374.010 ops/sec
KYBER512    128    decap     66900 ops took 1.001 sec, avg 0.015 ms, 66820.026 ops/sec
KYBER768    192  key gen     65300 ops took 1.000 sec, avg 0.015 ms, 65298.614 ops/sec
KYBER768    192    encap     51900 ops took 1.002 sec, avg 0.019 ms, 51807.583 ops/sec
KYBER768    192    decap     41900 ops took 1.001 sec, avg 0.024 ms, 41850.111 ops/sec
KYBER1024   256  key gen     41300 ops took 1.002 sec, avg 0.024 ms, 41206.746 ops/sec
KYBER1024   256    encap     33700 ops took 1.002 sec, avg 0.030 ms, 33628.507 ops/sec
KYBER1024   256    decap     28600 ops took 1.002 sec, avg 0.035 ms, 28530.556 ops/sec
Benchmark complete

Pulling in these changes:

wolfssl % git checkout -b SparkiDev-kyber_improv_1 master
git pull https://github.com/SparkiDev/wolfssl.git kyber_improv_1
From https://github.com/SparkiDev/wolfssl
 * branch                kyber_improv_1 -> FETCH_HEAD
Successfully rebased and updated refs/heads/SparkiDev-kyber_improv_1.
wolfssl % make &> /dev/null
wolfssl % ./wolfcrypt/benchmark/benchmark -kyber
------------------------------------------------------------------------------
 wolfSSL version 5.7.0
------------------------------------------------------------------------------
Math:   Multi-Precision: Wolf(SP) no-dyn-stack word-size=64 bits=4096 sp_int.c
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
KYBER512    128  key gen    108700 ops took 1.000 sec, avg 0.009 ms, 108652.517 ops/sec
KYBER512    128    encap     85300 ops took 1.000 sec, avg 0.012 ms, 85286.112 ops/sec
KYBER512    128    decap     66500 ops took 1.000 sec, avg 0.015 ms, 66479.664 ops/sec
KYBER768    192  key gen     61900 ops took 1.001 sec, avg 0.016 ms, 61833.895 ops/sec
KYBER768    192    encap     50900 ops took 1.001 sec, avg 0.020 ms, 50844.831 ops/sec
KYBER768    192    decap     42400 ops took 1.000 sec, avg 0.024 ms, 42382.751 ops/sec
KYBER1024   256  key gen     40300 ops took 1.001 sec, avg 0.025 ms, 40242.538 ops/sec
KYBER1024   256    encap     33600 ops took 1.002 sec, avg 0.030 ms, 33530.960 ops/sec
KYBER1024   256    decap     28200 ops took 1.000 sec, avg 0.035 ms, 28189.287 ops/sec
Benchmark complete
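
To make the comparison between the two runs above easier to read, here is a minimal Python sketch that parses the pasted benchmark lines and prints the relative change in ops/sec per operation. The helper names (`parse_benchmark`, `percent_change`) and the regular expression are assumptions made for illustration; they are not part of wolfSSL, and the pattern is only fitted to the output format shown in this comment.

```python
import re

# Fitted to lines like:
# KYBER512    128  key gen    110800 ops took 1.001 sec, avg 0.009 ms, 110743.101 ops/sec
LINE_RE = re.compile(
    r"^(?P<alg>\S+)\s+\d+\s+(?P<op>key gen|encap|decap)\s+.*?"
    r"(?P<rate>[\d.]+) ops/sec\s*$"
)


def parse_benchmark(text):
    """Return {(algorithm, operation): ops_per_sec} for each result line."""
    results = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            results[(m["alg"], m["op"])] = float(m["rate"])
    return results


def percent_change(before, after):
    """Relative change of `after` versus `before`, in percent."""
    return (after - before) / before * 100.0


# Result lines copied from the master run above.
MASTER = """\
KYBER512    128  key gen    110800 ops took 1.001 sec, avg 0.009 ms, 110743.101 ops/sec
KYBER512    128    encap     84400 ops took 1.000 sec, avg 0.012 ms, 84374.010 ops/sec
KYBER512    128    decap     66900 ops took 1.001 sec, avg 0.015 ms, 66820.026 ops/sec
KYBER768    192  key gen     65300 ops took 1.000 sec, avg 0.015 ms, 65298.614 ops/sec
KYBER768    192    encap     51900 ops took 1.002 sec, avg 0.019 ms, 51807.583 ops/sec
KYBER768    192    decap     41900 ops took 1.001 sec, avg 0.024 ms, 41850.111 ops/sec
KYBER1024   256  key gen     41300 ops took 1.002 sec, avg 0.024 ms, 41206.746 ops/sec
KYBER1024   256    encap     33700 ops took 1.002 sec, avg 0.030 ms, 33628.507 ops/sec
KYBER1024   256    decap     28600 ops took 1.002 sec, avg 0.035 ms, 28530.556 ops/sec
"""

# Result lines copied from the kyber_improv_1 run above.
BRANCH = """\
KYBER512    128  key gen    108700 ops took 1.000 sec, avg 0.009 ms, 108652.517 ops/sec
KYBER512    128    encap     85300 ops took 1.000 sec, avg 0.012 ms, 85286.112 ops/sec
KYBER512    128    decap     66500 ops took 1.000 sec, avg 0.015 ms, 66479.664 ops/sec
KYBER768    192  key gen     61900 ops took 1.001 sec, avg 0.016 ms, 61833.895 ops/sec
KYBER768    192    encap     50900 ops took 1.001 sec, avg 0.020 ms, 50844.831 ops/sec
KYBER768    192    decap     42400 ops took 1.000 sec, avg 0.024 ms, 42382.751 ops/sec
KYBER1024   256  key gen     40300 ops took 1.001 sec, avg 0.025 ms, 40242.538 ops/sec
KYBER1024   256    encap     33600 ops took 1.002 sec, avg 0.030 ms, 33530.960 ops/sec
KYBER1024   256    decap     28200 ops took 1.000 sec, avg 0.035 ms, 28189.287 ops/sec
"""

if __name__ == "__main__":
    before = parse_benchmark(MASTER)
    after = parse_benchmark(BRANCH)
    for key in before:
        alg, op = key
        delta = percent_change(before[key], after[key])
        print(f"{alg:<10} {op:<8} {delta:+.2f}%")
```

For these two runs, every difference comes out as a single-digit percentage, with changes in both directions, so a single one-second run per operation may not be enough to separate a real change from machine noise.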

SparkiDev commented 1 week ago

Hi Jacob,

The benchmark results for this algorithm are very jittery. My testing was on Intel x64. ARM performance should not be affected much.

Sean
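
One way to cope with jittery results like those described above is to repeat each benchmark several times and compare medians against the observed spread. The Python sketch below is a crude heuristic under that assumption, not a proper statistical test and not anything provided by wolfSSL; the function names and the sample values are hypothetical.

```python
import statistics


def summarize(samples):
    """Return (median, relative spread) for repeated ops/sec measurements.

    Relative spread is (max - min) / median, a rough measure of jitter.
    """
    med = statistics.median(samples)
    spread = (max(samples) - min(samples)) / med
    return med, spread


def change_exceeds_noise(before, after):
    """True if the medians differ by more than the larger relative spread."""
    med_before, spread_before = summarize(before)
    med_after, spread_after = summarize(after)
    rel_change = abs(med_after - med_before) / med_before
    return rel_change > max(spread_before, spread_after)


if __name__ == "__main__":
    # Hypothetical ops/sec values from five repeated runs; not measured data.
    baseline = [100.0, 102.0, 98.0, 101.0, 99.0]
    candidate = [101.0, 103.0, 99.0, 102.0, 100.0]
    print(summarize(baseline))
    print(change_exceeds_noise(baseline, candidate))
```

In practice the sample lists would be filled from repeated invocations of `./wolfcrypt/benchmark/benchmark -kyber` on an otherwise idle machine.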