webserver-llc / angie

Angie - drop-in replacement for Nginx
https://angie.software/en/
BSD 2-Clause "Simplified" License

Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT #51

Open zamazan4ik opened 8 months ago

zamazan4ik commented 8 months ago

Hi!

Recently I checked Profile-Guided Optimization (PGO) improvements on multiple projects. The results are here. E.g. PGO helps with optimizing Envoyproxy, HAProxy (link), httpd (link), and Nginx (link, though more mature tests should still be performed for it). According to multiple tests, PGO can help improve performance in many other cases as well. That's why I think trying to optimize the CPU-heavy parts of Angie with PGO could be a good idea.
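
For reference, a two-pass PGO build for an nginx-style autoconf tree could look roughly like the sketch below (GCC assumed). This is only an illustration, not something tested against Angie: the objs/angie binary name, the conf/angie.conf path, and the wrk load generator are my assumptions, and the profiling workload should match real traffic as closely as possible:

    # Pass 1: build an instrumented binary
    ./configure --with-cc-opt="-fprofile-generate" --with-ld-opt="-fprofile-generate"
    make -j"$(nproc)"

    # Run a representative workload; the workers flush .gcda profile data when they exit on 'quit'
    mkdir -p logs
    ./objs/angie -p "$(pwd)" -c conf/angie.conf     # assumed binary and config paths
    wrk -t4 -c256 -d120s http://127.0.0.1/          # illustrative load generator; use realistic traffic
    ./objs/angie -p "$(pwd)" -s quit

    # Pass 2: rebuild with the collected profile; skipping 'make clean' keeps the
    # .gcda files that were written next to the object files under objs/
    ./configure --with-cc-opt="-fprofile-use -fprofile-correction" --with-ld-opt="-fprofile-use"
    make -j"$(nproc)"

With Clang the flag names are similar, but the raw .profraw files need an llvm-profdata merge step before the -fprofile-use build.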

I can suggest the following action points:

Maybe testing post-link optimization techniques (like LLVM BOLT) would be interesting too (Clang and Rustc already use BOLT in addition to PGO), but I recommend starting with regular PGO.
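
For completeness, the usual BOLT flow on top of an already-built binary is sketched below. This is only an outline under my assumptions (binary path, sampling duration, option set); the binary needs to be linked with relocations preserved (e.g. --with-ld-opt="-Wl,-q"), and the exact option names vary a little between BOLT releases:

    # Sample a running worker process with Linux perf (LBR-capable CPU assumed)
    perf record -e cycles:u -j any,u -p <worker-pid> -o perf.data -- sleep 60

    # Convert the samples into BOLT's profile format and rewrite the binary
    perf2bolt -p perf.data -o angie.fdata ./objs/angie
    llvm-bolt ./objs/angie -o ./objs/angie.bolt -data=angie.fdata \
        -reorder-blocks=ext-tsp -reorder-functions=hfsort \
        -split-functions -split-all-cold -split-eh -dyno-stats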

Here are some examples of how PGO is integrated into other projects:

VBart commented 8 months ago

Hi,

First of all, thanks for sharing these results.

Angie shares the same architecture and workload patterns as nginx. I previously ran various PGO benchmarks with nginx and found that the potential performance benefits simply weren't worth the effort. In real-world scenarios with real configurations, PGO gave less than a 1% difference under peak load (most of the time the gain was statistically indistinguishable). The same is true for attempts to tune compilation flags (such as -march=native and -O3, or even -Ofast), so we use the system default flags for our builds.

The reason you see some performance gain is the testing methodology: you use a very synthetic micro-benchmark and profile the build directly on that same scenario. That is very far from the real use cases of most of our users.

Angie isn't a good candidate for this kind of optimization because it isn't CPU-bound. In real-world scenarios, worker processes spend most of their time waiting on syscalls in the kernel. The only CPU-intensive tasks (such as compression, image processing, and cryptography) are handled by external libraries. The code you are trying to optimize accounts for just a few percent of typical request processing time, so making that few percent a few percent faster yields less than a one-percent benefit overall. You can often gain more by tuning a few configuration directives.
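
To put rough numbers on this argument (the figures below are purely illustrative, not measurements from either side), this is just Amdahl's law: if a fraction f of request time is spent in Angie's own code and PGO makes that code s times faster, the end-to-end speedup is

    S_{\text{overall}} = \frac{1}{(1 - f) + f/s}

so with f = 0.05 and s = 1.10 (Angie's own code is 5% of request time and PGO makes it 10% faster):

    S_{\text{overall}} = \frac{1}{0.95 + 0.05/1.10} \approx 1.0046

i.e. roughly half a percent end to end, consistent with the "less than one percent" observation above.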

Architectural changes or syscall-level optimizations, on the other hand, can yield multi-fold performance gains. Here's an example of an optimization I did in the past: https://www.nginx.com/blog/thread-pools-boost-performance-9x/

Sure, tuning the compilation process is relatively low-hanging fruit here, but in practice the real gains just aren't there.

Seirdy commented 8 months ago

A build manifest of some sort (a CI manifest, a Docker container, etc.) that builds Nginx and all its dependencies from source (perhaps with static linking) would be a useful place to implement PGO.
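
As a sketch of what that could look like, a multi-stage Dockerfile along the lines below would run both PGO passes inside the image build. It is purely illustrative and untested: the base image, the package list, the ab benchmark from apache2-utils, and the Angie paths (conf/angie.conf, objs/angie) are all assumptions, and a real manifest would replay representative traffic rather than hammering the default page:

    # Illustrative only; not an existing manifest
    FROM debian:bookworm AS build
    RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential ca-certificates libpcre2-dev zlib1g-dev libssl-dev apache2-utils
    COPY . /src/angie                 # assumes the Angie source tree is the build context
    WORKDIR /src/angie

    # Pass 1: instrumented build, exercised so the workers dump .gcda profiles on quit
    RUN ./configure --with-cc-opt="-fprofile-generate" --with-ld-opt="-fprofile-generate" && \
        make -j"$(nproc)" && \
        mkdir -p logs && \
        ./objs/angie -p /src/angie -c conf/angie.conf && sleep 1 && \
        ab -n 100000 -c 64 http://127.0.0.1/ && \
        ./objs/angie -p /src/angie -s quit && sleep 2

    # Pass 2: rebuild with the profile; no 'make clean', so the .gcda files under objs/ survive
    RUN ./configure --with-cc-opt="-fprofile-use -fprofile-correction" --with-ld-opt="-fprofile-use" && \
        make -j"$(nproc)"

    # Minimal runtime stage; config files, user setup, EXPOSE, etc. omitted
    FROM debian:bookworm-slim
    COPY --from=build /src/angie/objs/angie /usr/sbin/angie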

IME with BoringSSL at least, cryptography doesn't actually benefit tremendously from PGO. Perhaps that's because the cryptographic primitives are mostly written in generated assembly these days instead of optimizable C?

I'd be interested in whether and how PCRE2, libcrypt, and zlib (zlib-ng?) benefit, though!