Biggest thing is trying to reduce how often the `[]` overload is called. It's called SO MUCH. The `Offset` and `Size` property getters also secretly get called ALL THE TIME.
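To show the kind of change this means (a hypothetical `ByteSpan` stand-in in Python, not the actual code): instead of going through the indexer and the `Offset`/`Size` getters on every byte, the hot loops read them once up front and index the underlying buffer directly.

```python
# Hypothetical stand-in for a ByteSpan-like type, just to show the access pattern.
class ByteSpan:
    def __init__(self, buffer, offset, size):
        self._buffer = buffer
        self._offset = offset
        self._size = size

    @property
    def offset(self):          # property getter invoked on every access
        return self._offset

    @property
    def size(self):
        return self._size

    def __getitem__(self, i):  # the "[] overload": an offset lookup plus index math per byte
        return self._buffer[self._offset + i]


def checksum_slow(span):
    # Hot loop: one __getitem__ call (and one hidden offset lookup) per byte.
    total = 0
    for i in range(span.size):
        total = (total + span[i]) & 0xFF
    return total


def checksum_fast(span):
    # Hoist offset/size once, then index the raw buffer directly.
    buf, off, size = span._buffer, span.offset, span.size
    total = 0
    for i in range(off, off + size):
        total = (total + buf[i]) & 0xFF
    return total
```

Both produce the same result; the second version just trades roughly two calls per byte for three calls total.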
Biggest refactor here, though, is `AesGcm.MultiplyGF128Elements`. This method can call `ByteSpan`'s get/set indexer ~450k times per packet, especially with that awful "SecureClear" implementation. I haven't load tested it yet, but this might actually shave milliseconds per packet sent/received.
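For context (this is the textbook algorithm, not the Hazel code): GHASH multiplication in GF(2^128) is naturally expressed over whole 128-bit values, so a version that works on integers instead of doing per-byte span reads is where those indexer calls disappear. A rough Python sketch of the standard shift-and-reduce multiply from NIST SP 800-38D:

```python
# Shift-and-reduce multiply in GF(2^128) as used by AES-GCM's GHASH
# (NIST SP 800-38D, Algorithm 1). Operands are 128-bit ints in GHASH
# bit order, i.e. bit 0 of the field element is the integer's MSB.
R = 0xE1 << 120  # reduction term for x^128 + x^7 + x^2 + x + 1

def gf128_mul(x, y):
    z = 0
    v = y
    for i in range(128):
        if (x >> (127 - i)) & 1:   # bit i of x, MSB-first
            z ^= v
        lsb = v & 1
        v >>= 1
        if lsb:                    # conditional reduction after the shift
            v ^= R
    return z
```

A C# port of this over two `ulong` halves touches the span only when loading and storing the operands, instead of on every bit.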
On the allocations side, it's really just a bunch of temp buffers everywhere. Nothing huge, maybe ~100 bytes per packet sorta stuff. The biggest one is probably `PrfSha256.ExpandSecret`, which gets called fairly often and allocates a bunch of temp buffers. But that's mostly during the handshake, so it's probably not a huge deal.
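The fix pattern is the usual one: let the caller hand in a scratch buffer instead of allocating fresh temporaries on each call. This sketch isn't the Hazel `ExpandSecret` code; it's the standard TLS 1.2 `P_SHA256` expansion (RFC 5246, section 5) in Python, just to show where a reusable output buffer slots in:

```python
import hashlib
import hmac

def p_sha256(secret, seed, length, out=None):
    """TLS 1.2 P_SHA256 (RFC 5246 section 5), writing into a
    caller-supplied buffer so repeated calls can reuse one allocation."""
    if out is None:
        out = bytearray(length)
    a = seed                                                       # A(0)
    pos = 0
    while pos < length:
        a = hmac.new(secret, a, hashlib.sha256).digest()           # A(i)
        block = hmac.new(secret, a + seed, hashlib.sha256).digest()
        n = min(len(block), length - pos)
        out[pos:pos + n] = block[:n]                               # fill in place
        pos += n
    return bytes(out[:length])
```

In the C# version the same idea applies: a preallocated `byte[]` (or stackalloc'd span) for the HMAC inputs and output removes the per-call garbage.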
This probably still isn't everything. It's just the highest-value targets from profiling, plus stuff I noticed along the way.
Might need to edit this later, but early benchmarking suggests that this and the reduce-allocs branch together improve DTLS RTT by 50%.