awesome-pgo

Various materials about Profile-Guided Optimization (PGO) and similar techniques such as AutoFDO, BOLT, etc.

!!!ARTICLE!!!

There is an (unfinished) article that covers PGO, PLO (Post-Link Optimization), etc. in detail - link. It will most likely answer (almost) all your questions about PGO and PLO.

How to fail with PGO?

The PGO implementer

Theory (a little bit)

What is PGO:
- Wiki
- Microsoft docs

You may also come across PDO (Profile Directed Optimization), FDO (Feedback Driven Optimization), FBO (Feedback Based Optimization), PDF (Profile Directed Feedback), PBO (Profile Based Optimization) - do not worry, these are all just different names for PGO.

I also need to mention Link-Time Optimization (LTO), since PGO is usually applied after LTO: LTO is usually easier to enable, and it brings significant performance and/or binary size improvements on its own. PGO does not replace LTO but complements it. More information about LTO can be found in lto.md.

PGO Showcases

Here I collect links to the articles/benchmarks/etc. with PGO on multiple projects (with numbers!).

Browsers

Compilers and interpreters

Rust (the rustc compiler):
Clang:
- Official documentation
- KDE blog
- Libclang on Windows: Article
- Homebrew benchmarks: one, two
- ScyllaDB benchmarks: GitHub issue
- Clang on Windows: Phoronix post
- Fedora experiments: GitHub repo
GCC:
- ArchLinux bugtracker. Numbers for GCC 3.3 - could be outdated.
- NixOS experiments
- PGO effects on devirtualization in C++
- According to experiments by a person in a local Telegram chat who optimized GCC on Gentoo: +4% to compilation speed with LTO, +10% to compilation speed with PGO
Python:
- Blog
- GitHub PR
Go (go compiler):
- Official blog
- Go compiler performance numbers
D:
- DMD: GitHub issue
- LDC: GitHub comment, this and this articles
Julia: GitHub PR
PHP:
- Alibaba post
- Phoronix benchmarks
Perl (cperl): Blog
Ruby: Ruby Forum (post from 2006 with GCC 4.1)
Lua: Lua interpreter results - Reddit
tfcompile: GitHub comment
SWI-Prolog: GitHub comment
Sage: GitHub issue

Developer tooling

Operating systems

Linux kernel:
- Paper
- Microsoft presentation
- ASOS (Application Specific Operating System)
- TCP Stream perf
- Phoronix post
- Yet another attempt to PGO Linux kernel: http://coolypf.com/kpgo.htm
- Gentoo Wiki
- Optimizing Linux kernel with Clang. An article (in Russian) and results
- From my experience and tests, PGO for the Linux kernel can be tricky to perform and does not bring huge improvements for third-party applications (tested on Redis and PostgreSQL). Further testing is needed. One possible explanation - PGO was not applied correctly with GCC due to some .gcda lookup path issues. The test must be repeated with both GCC and Clang.
Windows: 5-20% improvement according to the presentation

Virtual machines

QEMU: Blog
CrosVM: Intel blog

Databases

PostgreSQL:
- See "postgresql_results.md" file in the repo.
- Mailing list thread
MariaDB:
- Official MariaDB article
- ClearLinux benchmarks
- Blog
MySQL:
- oneAPI report
- A user report
ClickHouse: GitHub issue
MongoDB: See "mongodb.md" file in the repo
Redis: See "redis.md" file in the repo
SQLite:
- See "sqlite.md" file in the repo for the detailed report
- SQLite forum discussion
- sqlite-parquet-vtable PGO results
YDB: GitHub issue
FoundationDB: GitHub issue
DuckDB: GitHub comment
Memcached: GitHub issue
DragonflyDB: GitHub comment
YugabyteDB: GitHub commit
ScyllaDB: GitHub PR
GreptimeDB: GitHub issue
Databend: GitHub issue
Skytable: GitHub issue
Tarantool: GitHub issue
RonDB: GitHub comment
ReDB:
- GitHub comment in the main repo
- GitHub comment in NativeDB repo (has PGO results for ReDB too)
Nebula: Docs
Qdrant:
- Microbenchmarks
OceanBase: GitHub comment
NativeDB: GitHub issue
Dolt: Blog
bbolt-rs: GitHub comment
libmdbx: GitFlic issue (in Russian)
candystore: GitHub comment

Logging

Proxy

Envoy: GitHub comment
HAProxy:
Nginx: see "nginx.md" file in the repo
Rathole: GitHub discussion
httpd: see "httpd.md" file in the repo

Other

Unreal Engine:
- Release notes (search for "PGO" on the page)
- Some notes on GitHub
Suricata: Slides
Handbrake: GitHub issue comment
CP2K: Docs
Bevy: PGO-run (first) vs non-PGO (second) - Pastebin. In these results you need to interpret performance decrease as "Release version is slower than PGOed" and performance increase as "Release version is faster than PGOed".
Wordpress: Bitnami blog
Zstd and LZ4: Blosc blog
Windows terminal: GitHub PR
Drill: GitHub issue
Goose: Article
Chess engines (Stockfish, Cfish, asmFish): Reddit post
Multiple smaller benchmarks by Phoronix:
- GCC 8
- GCC 9
- GCC 10
- More GCC 10
- GCC 11
- GCC 12
Benchmarks from OpenSUSE: Docs
Bunch of LLVM test suite algorithms benchmarks: Blog
ClamAV: Blog
Mesa: Mailing list about OpenGL benchmark. Worth reading the whole thread though.
hck: README note
Typst: GitHub issue
Cemu: GitHub comment
Pydantic-core: GitHub comment
xz: OpenMandriva forum
libspng: Docs
matchit: GitHub issue
QOAudio (Rust version): GitHub issue
JSON libraries (serde_json, rustc_serialize, simd-json): GitHub issue
XML libraries:
- xml-rs: GitHub issue
- quick-xml: GitHub issue
- roxmltree: GitHub issue
tonic: GitHub issue
tantivy: GitHub issue
Lychee: GitHub issue
nushell: GitHub comment
delta: GitHub comment
hurl: GitHub comment
fd: GitHub comment
MRCC: up to 40% performance boost with PGO according to the private benchmarks
Broot: GitHub issue
Geant4 (a CERN project):
- "Testing AutoFDO for Geant4" (slides)
- "Speeding up CMS simulations, reconstruction and HLT code using advanced compiler options" (link)
Youki: GitHub issue
sd: GitHub issue
frawk: GitHub comment
bat: GitHub issue
jql: GitHub issue
htmlq: GitHub issue
ouch: GitHub issue
czkawka: GitHub issue
quilkin: GitHub comment
grcov: GitHub issue
difftastic: GitHub issue
Perspective: GitHub discussion
tquic: GitHub issue
legba: GitHub issue
Slint: GitHub issue
tsv-utils: Study report
wgpu: GitHub discussion
Mesa: Phoronix post
lingua-rs: GitHub discussion
libtre: FreeBSD Bugzilla comment
ion:
- Instrumented to Release: https://gist.github.com/zamazan4ik/200b179278bcad05528eb65340977781
- PGO-optimized to Release: https://gist.github.com/zamazan4ik/a4c5b603c16ec9fe3427c9d26a50e3e5
- Platform: Linux
- ion version: master branch on 60bfb73351f0412c95b8ba2afe75e988514470a6 commit
tokei: GitHub issue
qsv: GitHub discussion
vtracer: GitHub discussion
ripgrep: GitHub comment
lol-html: GitHub issue
tokenizers: GitHub issue
Zen: GitHub discussion
native_model: GitHub issue
pathfinding: GitHub issue
HiGHS: about 2-2.5% improvement on the "highs ../check/instances/greenbea.mps" workload
lace: GitHub issue
minitrace-rust: GitHub issue
needletail: GitHub issue
logos: GitHub issue
llrt: GitHub issue
varpro: GitHub issue
awk: LWN article
gawk: GitHub commit
candy: GitHub discussion
axum: GitHub discussion
rustls: GitHub issue
python-libipld: GitHub PR
sqlparser-rs: GitHub discussion
arrow-datafusion: GitHub discussion
actson-rs: GitHub issue
oha: GitHub PR
rust_serialization_benchmark: GitHub issue
ada-url: GitHub issue
struson: GitHub discussion
ast-grep: GitHub discussion
Symbolicator: GitHub issue
libjxl: GitHub issue
nucleo: GitHub discussion
martin: GitHub discussion
serde-sqlite-jsonb: GitHub discussion
LibreOffice: Blog. The article is from 2014 - keep it in mind.
A lot of insights, history, and great benchmarks for LTO and PGO efficiency in LLVM and GCC in various software (including Firefox and LibreOffice) from Honza Hubička: GCC 4.8, GCC 5, GCC 6 and Clang 3.9, GCC 8 and Clang 6, GCC 9
koto: GitHub discussion
prost: GitHub discussion
angle-grinder: GitHub discussion
zune-image: GitHub discussion
graphql-lint: GitHub issue
nom: GitHub comment
prettyplease: GitHub comment
genson-rs: GitHub comment
resvg: GitHub comment
Cloudflare (internal services): Blog
rustwire: GitHub comment
Bend: GitHub comment
Amber: GitHub discussion
Iggy-rs: GitHub comment
html5ever: GitHub comment
Symbolica: Zulip message
oxc: GitHub comment
libvpx: Chromium issue tracker
lady-deirdre: GitHub comment
musli: GitHub discussion
limbo: GitHub comment
amber: GitHub comment
OpenRadioss: GitHub comment
ieee80211-rs: GitHub comment
jiff, chrono, time: GitHub discussion
pulldown-latex: GitHub comment
vrl: GitHub discussion
Picodrive: Habr comment (in Russian; 2x performance improvement)
harper: GitHub discussion
wildcard: GitHub comment
GQL: GitHub discussion
trie-hard, radix-trie: GitHub comment
pingora: GitHub discussion
tex-fmt: GitHub comment

Projects with PGO already integrated into their build scripts

Below you can find some examples of where and how PGO is integrated into different projects.

Rustc: a CI tool for the multi-stage build
GCC:
- Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
- A part in a "wonderful" configure script.
Clang:
- Docs
- MinGW build script
Python:
- CPython: README
- Pyston: README
Go: Bash script
Swift: CMake script
V8: Bazel flag
ChakraCore: Scripts
Chromium: Script
Firefox: Docs
- Thunderbird has PGO support too
PHP: Makefile command and old Centminmod scripts
MySQL: CMake script
YugabyteDB: GitHub commit
FoundationDB: Script
Zstd: Makefile
Foot: Scripts
Windows Terminal: GitHub PR
Pydantic-core: GitHub PR
file.d: GitHub PR
OceanBase: CMake flag
ISPC: CMake scripts
NodeJS: Configure script
Android Open Source Project (AOSP):
- Official documentation
- Committed PGO profiles: repository
DMD: Custom build rule
LDC: GitHub action
tsv-utils: Makefile
Erlang OTP: Makefile
Clingo (PGO enabled only in Spack): Package recipe
SWI-Prolog:
- Script
- CMake module
hck: Justfile
oha: GitHub PR
Dolt: Blog

Project-specific documentation about PGO

Here we collect projects where PGO is described as an optimization option in the documentation:

ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-performance/profile-guided-optimization
Databend: https://databend.rs/doc/contributing/pgo
Vector: https://vector.dev/docs/administration/tuning/pgo/
Nebula: https://docs.nebula-graph.io/3.5.0/8.service-tuning/enable_autofdo_for_nebulagraph/
GCC: Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
Clang:
- https://llvm.org/docs/HowToBuildWithPGO.html
- https://llvm.org/docs/AdvancedBuilds.html
Rustc: https://rustc-dev-guide.rust-lang.org/building/optimized-build.html#profile-guided-optimization
tsv-utils: https://github.com/eBay/tsv-utils/blob/master/docs/BuildingWithLTO.md

PGO support in programming languages and compilers

C and C++:
- GCC
- Clang
- MSVC
- ICC
- AOCC (supported, but the documentation currently exists only as PDF files)
- Circle (not exactly a C++ compiler): no PGO support
Rust:
- rustc
Fortran:
- GCC
- Flang
- IFC
- IBM
- AOCC (supported, but the documentation currently exists only as PDF files)
C#:
- Gist from EgorBo
- MS Blog
- .Net 7
Java:
- GraalVM (already free to use)
Go:
- Go compiler in Preview since Go 1.20, GA in 1.21
- GoLLVM - not yet
- GCCGO - unknown, but it should be possible to try
Swift: seems to be supported, but I am not sure
Kotlin: seems not
Ada:
- GNAT: should be possible, same as GCC
D: LDC docs
Nim: Nim forum
OCaml: almost no
Zig: no
V: kind of
Red: seems like not
Pascal: No
Haskell:
- GHC: no

Possibly other compilers support PGO too. If you know any, please let me know.

PGO support in build systems

Here we collect and track PGO integrations into build systems:

Cargo: no built-in support, but there is the awesome cargo-pgo
Bazel: Supports (command-line reference)
CMake: No support yet (GitLab issue)
Meson: Supports (b_pgo in the docs).
SCons: No support yet (GitHub discussion)
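
As a rough illustration of the entries above, the sketch below lists a typical instrument/optimize command sequence for each build system that has a PGO entry point. The helper name pgo_commands, the build directory "build", the Bazel target "//app:server" and the "/tmp/pgo" paths are placeholders invented for this example. The cargo-pgo subcommands, Meson's b_pgo option and Bazel's --fdo_instrument/--fdo_optimize flags come from the respective tools, but check them against the versions you actually use.

```python
from typing import Dict, List

Commands = List[List[str]]

# Sketch only: target names and paths are placeholders, not project defaults.
# Between the "instrument" and "optimize" phases you still have to run the
# instrumented binary on a representative workload to produce profiles.
PGO_COMMANDS: Dict[str, Dict[str, Commands]] = {
    "cargo": {  # via the third-party cargo-pgo subcommand
        "instrument": [["cargo", "pgo", "build"]],
        "optimize": [["cargo", "pgo", "optimize"]],
    },
    "meson": {  # built-in b_pgo option: off / generate / use
        "instrument": [
            ["meson", "setup", "build", "-Db_pgo=generate"],
            ["meson", "compile", "-C", "build"],
        ],
        "optimize": [
            ["meson", "configure", "build", "-Db_pgo=use"],
            ["meson", "compile", "-C", "build"],
        ],
    },
    "bazel": {
        "instrument": [
            ["bazel", "build", "-c", "opt", "--fdo_instrument=/tmp/pgo", "//app:server"],
        ],
        "optimize": [
            ["bazel", "build", "-c", "opt", "--fdo_optimize=/tmp/pgo/app.profdata", "//app:server"],
        ],
    },
}


def pgo_commands(build_system: str) -> Dict[str, Commands]:
    """Return the command lines for both PGO phases of a build system."""
    try:
        return PGO_COMMANDS[build_system.lower()]
    except KeyError:
        # CMake and SCons land here: they have no built-in PGO switch yet.
        raise ValueError(f"no built-in PGO recipe known for {build_system!r}") from None
```

Each inner list can be passed as-is to subprocess.run(cmd, check=True). For CMake and SCons you would have to pass the compiler flags manually instead.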

Sampling PGO (AutoFDO) support

Here we collect information about supporting PGO via sampling across different compilers.

C and C++:
- GCC: supports
- Clang: supports
Rust:
- rustc: supports, but marked unstable: commit, unstable book
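
For orientation, here is a minimal sketch of what a sampling PGO (AutoFDO) cycle with Clang usually looks like, written as a Python helper that only assembles the command lines. The function name autofdo_clang_steps and the file names "perf.data" and "app.prof" are assumptions made for this example. The flags themselves follow the Clang user manual and the AutoFDO tooling (create_llvm_prof), but verify them against your toolchain version.

```python
from typing import List


def autofdo_clang_steps(source: str, binary: str, workload_args: List[str],
                        profile: str = "app.prof") -> List[List[str]]:
    """Assemble the four steps of a sampling-PGO cycle (nothing is executed)."""
    return [
        # 1. Regular optimized build with line tables, so samples map to source lines.
        ["clang", "-O2", "-gline-tables-only", source, "-o", binary],
        # 2. Sample the running binary; -b records branch stacks (needs LBR-like support).
        ["perf", "record", "-b", "-o", "perf.data", "--", binary, *workload_args],
        # 3. Convert the perf samples into an LLVM sample profile.
        ["create_llvm_prof", f"--binary={binary}", "--profile=perf.data", f"--out={profile}"],
        # 4. Rebuild, letting the optimizer consume the sample profile.
        ["clang", "-O2", "-gline-tables-only", f"-fprofile-sample-use={profile}",
         source, "-o", binary],
    ]
```

Note that step 2 runs an ordinary, non-instrumented binary, which is why this mode is usable on production machines.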

Are we PGO yet?

See the "are_we_pgo_yet.md" file in the repo for the PGO status of a project.

BOLT showcases

Here I collect results (with numbers) of applying LLVM BOLT to various projects.

Linux kernel:
- Phoronix article
- GitHub docs
Rustc:
CPython: GitHub PR
YDB: GitHub comment
Clang:
LDC: GitHub comment
HHVM, Proxygen and others: Facebook paper
NodeJS: Blog
Chromium: Blog
MySQL, MongoDB, memcached, Verilator: Paper
ast-grep: GitHub issue
Symbolicator: GitHub issue
Pango: Gnome blog
pylyzer: GitHub discussion
prettyplease: GitHub comment
bbolt-rs: GitHub comment
resvg: GitHub comment

Projects with BOLT already integrated into their build scripts

Rustc: GitHub PR
CPython: GitHub PR
Pyston:
- README
- Makefile
Clang: CMake script
Linux kernel:
- LLVM branch for BOLTing the kernel

Are we BOLT yet?

Just a list of BOLT-related issues in different projects, so you can estimate the state of BOLT in your favorite open-source product.

Chromium: Chromium bugtracker
Firefox: Mozilla bugtracker
- The same for Propeller: Mozilla bugtracker
NodeJS: GitHub issue
LDC: GitHub issue
GCC: Bugzilla

LTO, PGO, BOLT, etc. and binaries provided by someone else

It is hard to tell whether a binary you use is already LTO/PGO optimized. It depends on multiple factors such as upstream support for LTO/PGO, the maintainers' willingness to enable these optimizations, etc. Usually, the most straightforward way to find out is to ask the binary's author (the person who built it): "Is the binary LTO/PGO optimized?". That could be your colleague (if you build programs in-house), the build scripts from CI, the maintainers of your favorite OS/repository (if you use binaries provided by its repos), or the software developers (if you use "official" binaries downloaded from a site). Do not hesitate to ask!

PGO adoption across projects

PGO is usually not enabled by upstream developers due to the lack of a sample workload or a lack of resources for the multi-stage build. So please ask maintainers explicitly to add PGO support.

PGO adoption across Linux distros

Even if PGO is supported by a project, it does not mean that your favorite Linux distro builds this project with PGO enabled. There are many reasons for this: maintainer burden (because we are humans (yet)), build machine burden (in general you need to compile twice), reproducibility issues (the profile is an additional input to the build process, and you need to make it reproducible), a maintainer simply not knowing about PGO, etc.

So here I will try to collect information about the PGO status across the Linux distros for the projects that support PGO in the upstream. If you didn't find your distro - don't worry! Just check it somehow (probably in some chats/distros' build systems, etc.) and report it here (e.g. via Issues) - I will add it to the list.

GCC:
- Note: PGO for GCC usually is not enabled for all architectures since it requires too much from the build systems
- Debian: yes
- Ubuntu: same as Debian
- RedHat: Yes. And that is the reason why PGO is enabled for GCC in all RedHat-based distros.
- Fedora: yes
- Rocky Linux: yes
- Alma Linux: yes
- NixOS: no
- OpenSUSE: yes, see line 2414
Clang:
- Binaries from LLVM are already PGO-optimized (according to the note about using "stage2" build - it's PGO optimized build)
- RedHat (CentOS Stream): no
- Fedora: no
- AlmaLinux: no
- Rocky Linux: no
- NixOS: no
- Arch Linux: sent an email to the Clang maintainer in Arch Linux - no response yet
Rustc:
- Fedora: yes
CPython:
- Fedora: yes. Also, check this discussion. I guess builds in other RedHat-based distros are the same for this package (I didn't check all of them, but Rocky Linux is the same).

BOLT adoption across Linux distros

Here we track LLVM BOLT enablement across various projects in various OS-specific build scripts:

Clang:
- Gentoo bugtracker
GCC: TODO
Rustc:
- Fedora: no
- RedHat: no
CPython: TODO
Pyston: TODO

Meta-issues about PGO and LLVM BOLT usage in different OSs and package managers:

Fedora: Bugzilla
RedHat: JIRA
ClearLinux: GitHub issue
CachyOS (Website): according to the search over its GitHub repositories - they are trying to integrate BOLT as much as possible
OpenSUSE: Cannot create an account to create a corresponding issue
Ubuntu: Ubuntu forums
Alpine Linux: Gitlab issue
Mageia: Bugzilla
Void Linux: GitHub issue
Chromebrew: GitHub discussion
Homebrew: GitHub discussion
Spack: GitHub discussion
Vcpkg: GitHub discussion
FreeBSD: FreeBSD forum
Conan: GitHub issue
MacPorts: Ticket
- They said this question should be discussed in mailing lists
LLVM-mingw: GitHub issue
MinGW repo: GitHub issue
CBL-Mariner: GitHub discussion

Other optimization techniques like BOLT

BOLT and similar techniques are certainly not enabled by default anywhere right now. So if you see a performance improvement from them - contact the upstream.

Beyond PGO (could be covered here later as well)

AutoFDO:
- Paper
- GitHub
BOLT:
Propeller:
- Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications

Traps

The biggest problem is "How to collect a good profile?". There are multiple ways to do this:

Prepare a reference workload. It can be quite difficult to create and maintain, since over time it can drift further and further from your actual workload. However, for some applications like compilers the workload is usually predictable (compiling programs), so this approach is good enough. In other cases like databases, the workload can depend heavily on the actual input from your users, and users can change their queries at any time. So be careful.
Collect a profile from your actual production. This can be difficult with the usual PGO since it requires instrumentation, and instrumented binaries can run too slowly. If that is your case, you could try AutoFDO since it has a low overhead thanks to its perf-based sampling. But it also has its own limitations (usually Linux-only, less efficient than the usual PGO, could be more buggy). E.g. Google uses AutoFDO for profiling all their services and has a lot of automation around sampling profiles at their scale, storing them, integrating them into CI pipelines, etc. But all this tooling is closed-source, so you need to implement it from scratch.

In my opinion, you should usually start with simple PGO in Instrumentation mode, especially if you rarely upgrade your binaries. Only if Instrumentation starts to hurt you should you start thinking about AutoFDO.
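
To make the Instrumentation mode concrete, here is a minimal sketch of one PGO cycle with Clang, again as a Python helper that only builds the command lines. The function name clang_pgo_steps, the ".instr" suffix and the "pgo-data" directory are assumptions made for illustration. The -fprofile-generate/-fprofile-use flags and "llvm-profdata merge" are standard Clang/LLVM tooling.

```python
from typing import List


def clang_pgo_steps(sources: List[str], binary: str, workload_args: List[str],
                    profile_dir: str = "pgo-data") -> List[List[str]]:
    """Assemble an instrumented-PGO cycle: build, run, merge, rebuild."""
    merged = f"{profile_dir}/merged.profdata"
    instrumented = f"./{binary}.instr"
    return [
        # 1. Instrumented build; raw profiles (*.profraw) will land in profile_dir.
        ["clang", "-O2", f"-fprofile-generate={profile_dir}", *sources, "-o", instrumented],
        # 2. Run the reference workload; profiles are written when the process exits.
        [instrumented, *workload_args],
        # 3. Merge all raw profiles from the directory into one indexed profile.
        ["llvm-profdata", "merge", f"-output={merged}", profile_dir],
        # 4. Final build that uses the merged profile.
        ["clang", "-O2", f"-fprofile-use={merged}", *sources, "-o", binary],
    ]
```

Each step can be executed with subprocess.run(step, check=True). Step 2 is where the "good profile" question from above is decided: the quality of the final binary depends on how representative that run is.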

Another issue could be reproducibility. Since you are injecting information from runtime (execution counters based on your sample workload), you get more variables that can influence your binary. In this case, you need to store your sample workload somewhere in VCS, probably together with the profiles collected from this workload, etc.
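
One possible way to keep track of these extra inputs is to commit a small manifest next to the profile. The sketch below is only an illustration of that idea, not an established convention: the function names, the manifest layout and the idea of recording the compiler version string are all assumptions of this example.

```python
import hashlib
import json
from typing import Dict, Iterable


def sha256_of(path: str) -> str:
    """Hash a file in chunks so large profiles do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def profile_manifest(paths: Iterable[str], compiler_version: str) -> Dict[str, object]:
    """Describe every extra build input (workload files, profiles) by its hash."""
    return {
        "compiler": compiler_version,
        "inputs": {path: sha256_of(path) for path in sorted(paths)},
    }


def dump_manifest(manifest: Dict[str, object]) -> str:
    # Sorted keys keep the committed file stable between runs.
    return json.dumps(manifest, indent=2, sort_keys=True)
```

If the hashes in the committed manifest differ from the ones computed at build time, the build is using a different workload or profile than the one that was reviewed.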

Other pitfalls include the following things:

PGO
- Requires multiple builds (at least two stages, in Context-Sensitive LLVM PGO (CSPGO) - three stages)
- Instrumented binaries run too slowly, so they can rarely be used in production -> you need to prepare a "sample" workload
- For services, PGO profiles are sometimes not flushed to the disk properly, so you need to do it manually like here
- Reproducibility issues - could be important for some use cases even more than performance
- Bugs. E.g. LLVM issues when PGO is combined with LTO - GitHub issue
AutoFDO
- Huge memory consumption during profile conversion: GitHub issue
- Supports only perf, so it cannot be used with profilers from other platforms like Windows/macOS (support for other profilers could be implemented manually)
- "Support" from Google is at least questionable: no regular releases, compilation issues
BOLT
- Huge memory usage during build: GitHub issue
- For better results, you need hardware/software with LBR/BRS support
- There are a lot of bugs - be careful
Propeller:
- Too Google-oriented - could be hard to use outside of Google
- Relies on the latest compiler developments, also unstable

Useful links

Implementation details of different PGO approaches in Clang: Youtube, slides
Some notes about PGO
A rejected idea to integrate BOLT into cpython build: link
cperl notes on LTO, PGO, BOLT
.profraw internal details: blog
Slides about PGO from C++ Russia 2021 (Pavel Kosov): slides (in Russian), video
Overview of all kinds of PGO in LLVM: link
MSVC insights about PGO (a video from 2012): Microsoft learn

Communities

Here is an incomplete list of communities where you are more likely to find PGO-related advice:

Gentoo (chats, forums)
ClearLinux (chats, forums)

Related projects

Awesome Machine learning in compilers
CompilerGym: https://github.com/facebookresearch/CompilerGym/ (an interesting project about applying ML on compiler optimization flags)
MLGO: A Machine Learning Framework for Compiler Optimization: Google blog
Phoronix Test Suite (PTS) integration with PGO: GitHub
An article about BOLT
Nvidia paper about PGO in gamedev: Publication

Where PGO did not help (according to my tests)

Catboost - I think this is due to its highly math-oriented nature. I tested the fit and calc modes (training and evaluation, respectively) on the epsilon dataset. In the calc mode PGO for some reason made things even worse. Maybe PGO could help in other modes, but I didn't test it (yet).

Contribute

If you have an example where PGO shines (or where it doesn't) - please open an issue and/or PR to the repo. It's important to collect as many PGO showcases as possible!
