ydb-platform / ydb

YDB is an open source Distributed SQL Database that combines high availability and scalability with strong consistency and ACID transactions
https://ydb.tech
Apache License 2.0

Consider using LTO + PGO + Bolt #140

Open: zamazan4ik opened this issue 1 year ago

zamazan4ik commented 1 year ago

Hi!

Right now, YDB does not support building with more advanced optimization techniques such as PGO (Profile-Guided Optimization) and BOLT (Binary Optimization and Layout Tool). These tools are increasingly adopted in the community to further optimize programs, and there is a good chance they bring additional performance "for free".

I suggest at least experimenting with an LTO (Link-Time Optimization) + PGO + BOLT pipeline (or any combination of them) and testing whether it improves the project's performance. If it does, it would be awesome to ship prebuilt binaries with these optimizations out of the box. It would also help users to have this functionality integrated into the build scripts, so they can tune their own binaries for their own workloads.
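To make the proposed pipeline concrete, here is a minimal sketch that only assembles the shell commands of a generic Clang ThinLTO + instrumented PGO + BOLT build; it runs nothing. It is not YDB's build system: `main.cpp` and `run_workload.sh` (a script that takes a binary path and exercises it, e.g. with a KqpLoad run) are hypothetical placeholders. The compiler, `llvm-profdata`, `perf`, `perf2bolt`, and `llvm-bolt` flags follow the Clang and BOLT documentation, but the exact BOLT options vary between LLVM versions.

```python
from typing import List


def pgo_bolt_pipeline(src: str = "main.cpp", name: str = "ydbd",
                      workload: str = "./run_workload.sh") -> List[str]:
    """Return the shell commands of a Clang ThinLTO + PGO + BOLT pipeline.

    Hypothetical inputs: `src` stands in for the real sources and `workload`
    is a script that takes a binary path and runs a benchmark against it.
    """
    profdata = f"{name}.profdata"
    lto = "clang++ -O2 -flto=thin -fuse-ld=lld"
    return [
        # 1. Instrumented build: the binary writes raw profiles when it runs.
        f"{lto} -fprofile-instr-generate -o {name}.instr {src}",
        # 2. Training run; %p puts the process id into each profile file name.
        f"LLVM_PROFILE_FILE={name}-%p.profraw {workload} ./{name}.instr",
        # 3. Merge the raw profiles into one indexed profile.
        f"llvm-profdata merge -output={profdata} {name}-*.profraw",
        # 4. Optimized build using the profile; keep relocations for BOLT.
        f"{lto} -fprofile-instr-use={profdata} -Wl,--emit-relocs "
        f"-o {name}.pgo {src}",
        # 5. Sample the PGO binary with branch records (needs LBR support).
        f"perf record -e cycles:u -j any,u -o perf.data -- {workload} ./{name}.pgo",
        # 6. Convert the perf samples into BOLT's profile format.
        f"perf2bolt -p perf.data -o perf.fdata ./{name}.pgo",
        # 7. Rewrite the code layout of the binary based on the profile.
        f"llvm-bolt ./{name}.pgo -o ./{name}.bolt -data=perf.fdata "
        "-reorder-blocks=ext-tsp -reorder-functions=hfsort -dyno-stats",
    ]


if __name__ == "__main__":
    for step, cmd in enumerate(pgo_bolt_pipeline(), start=1):
        print(f"{step}. {cmd}")
```

Steps 1 to 4 alone give LTO + PGO; steps 5 to 7 add BOLT on top. Step 6 (`perf2bolt`) is the stage that can need a lot of memory on a binary as large as ydbd.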

Also, there are some caveats to consider like:

Links:

zamazan4ik commented 1 year ago

I did some performance experiments on my local machine.

My setup:

For benchmarking and profile generation, I used the KqpLoad actor (https://ydb.tech/en/docs/development/load-actors-kqp), which I ran multiple times for 300 seconds each (all other parameters left at their defaults). The YDB setup is local with RAM storage, as described here: https://ydb.tech/en/docs/getting_started/self_hosted/ydb_local, but with my own ydbd binaries.

I did the following things:

The results are the following:

I also tried to apply BOLT, but perf2bolt consumed more than 32 GiB of RAM on the ydbd binary, so it was OOM-killed :(

An additional note regarding PGO via instrumentation: while generating a profile with the instrumented ydbd binary via KqpLoadActor, I hit a strange error, possibly caused by hardcoded deadlines; see here: https://github.com/ydb-platform/ydb/blob/main/ydb/core/load_test/kqp.cpp#L332 Since instrumented binaries are much slower, some of these deadlines need to be adjusted. For my local benchmarking, I simply commented them out, and the profile was generated successfully. It would probably be better to be able to configure the timeout externally, without modifying the code.
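The externally configurable timeout suggested above could look roughly like the following minimal sketch. It is written in Python purely for illustration (the real code in kqp.cpp is C++), and both the environment variable `KQP_LOAD_DEADLINE_SECONDS` and the 60-second default are hypothetical; YDB does not read them.

```python
import os
from datetime import timedelta
from typing import Mapping

# Hypothetical names: YDB's load test does not define or read these.
DEADLINE_ENV = "KQP_LOAD_DEADLINE_SECONDS"
DEFAULT_DEADLINE = timedelta(seconds=60)


def load_deadline(env: Mapping[str, str] = os.environ) -> timedelta:
    """Return the load-test deadline, overridable without code changes.

    A slow instrumented (PGO-generate) build can simply be run with a
    larger value, e.g. KQP_LOAD_DEADLINE_SECONDS=600.
    """
    raw = env.get(DEADLINE_ENV)
    if raw is None:
        return DEFAULT_DEADLINE
    seconds = float(raw)
    if seconds <= 0:
        raise ValueError(f"{DEADLINE_ENV} must be positive, got {raw!r}")
    return timedelta(seconds=seconds)
```

An environment variable is only one option; a load-actor config parameter would achieve the same without touching the code.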

zamazan4ik commented 1 year ago

Well, I managed to run BOLT with some "magic" options (details are here: https://github.com/llvm/llvm-project/issues/61711).

As expected, BOLT didn't provide a significant performance boost on top of PGO, but I still see measurable improvements:

I think Propeller (an alternative approach similar to BOLT, from Google) could bring roughly the same numbers. I tried to test YDB with Propeller, but Propeller requires the latest Clang compiler from the main branch, and YDB has a bunch of compilation errors with it; right now I lack the motivation to fix them. Maybe one day I will test it too :)

eivanov89 commented 1 year ago

Hi Alexander Zaitsev, thank you very much for sharing this excellent idea and running the initial experiments. One of our engineers has confirmed your results and is working further on the integration details. We will be back soon, once we have collected more data and understood the best possible usage.

zamazan4ik commented 10 months ago

@eivanov89 do you have any updates regarding PGO? If you can confirm the results and find them useful, I suggest adding a note about tuning YDB with PGO to the YDB documentation. Here are examples from other projects of how such documentation can look:

Having this kind of information in the official documentation makes optimization opportunities more visible to the end users and maintainers.

eivanov89 commented 10 months ago

Hi @zamazan4ik, sorry for the delay. We have some issues with our internal tools and build. We hope to solve them soon, but if we fail, we will consider applying this to the GitHub build only.

zamazan4ik commented 10 months ago

But if we fail, we will consider applying this to the GitHub build only.

Understood. If you confirm the results above, I suggest adding a note about PGO to the YDB documentation, so that users who build YDB binaries on their own can estimate the performance benefits of PGO for YDB and optimize their builds too.

eivanov89 commented 10 months ago

So that users who build YDB binaries on their own can estimate the performance benefits of PGO for YDB and optimize their builds too.

The tests we have both used to evaluate PGO are too narrow, imho. We're going to try YCSB and TPC-C to check whether realistic benchmarks benefit in the same way as the microbenchmarks we have used so far.