Closed: alexpyattaev closed this 10 months ago
Thank you for the improvements. Really, so many people have complained, maybe even looked at the code, but nothing ever came from it. So I am really grateful you dug into this!
Your proposed changes got me thinking:
I certainly don't mind isolating the server to 4 cores and wrk to 4 other cores. But for relative comparisons, I'd argue that all subjects under test are influenced by the same test setup. So, I am not sure whether it is really necessary for relative comparisons.
Then, I revisited the 4 threads topic. I recall why the rust code uses only 4 threads: because the zap code also uses 4 threads:
zap.start(.{
    .threads = 4,
    .workers = 4,
});
This got me thinking: why waste 4 processes on 4 threads? The IPC might slow things down. Through experimentation, I found that the ideal number of worker processes for 4 threads seems to be 2:
zap.start(.{
    .threads = 4,
    .workers = 2,
});
With that I got the following results (3 runs):
Requests/sec: 686589.24
Transfer/sec: 104.11MB
Requests/sec: 658573.26
Transfer/sec: 99.86MB
Requests/sec: 681040.33
Transfer/sec: 103.27MB
Which averages out to:
avg. requests/sec: 675400.94
avg. transfer/sec: 102.41MB
compared to 4 workers:
avg. requests/sec: 649692.13
avg. transfer/sec: 98.52MB
Which is significant.
Now, the question to me is: is it fair to give the rust example 128 threads when the zap example gets only 4?
BTW, my tests above were not yet run with the 4 / 4 core split that you suggest.
Update: You've convinced me that the taskset stuff is good because it creates even more impressive numbers :smile:.
However, I just ran the new, updated rust example and still get no performance increase. Here are some exemplary runs on my machine:
./wrk/measure.sh zig
Listening on 0.0.0.0:3000
========================================================================
zig
========================================================================
Running 10s test @ http://127.0.0.1:3000
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 588.74us 423.55us 17.88ms 92.34%
Req/Sec 175.71k 26.96k 216.79k 52.75%
Latency Distribution
50% 487.00us
75% 674.00us
90% 0.93ms
99% 2.12ms
6991199 requests in 10.02s, 1.04GB read
Requests/sec: 698047.77
Transfer/sec: 105.85MB
contrib_zap on master via ↯ v0.11.0 via impure (nix-shell) took 11s
➜
./wrk/measure.sh rust
Finished release [optimized] target(s) in 0.03s
========================================================================
rust
========================================================================
Running 10s test @ http://127.0.0.1:7878
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 624.20us 1.49ms 208.31ms 99.90%
Req/Sec 33.01k 5.21k 39.64k 64.25%
Latency Distribution
50% 579.00us
75% 738.00us
90% 0.93ms
99% 1.38ms
1313715 requests in 10.02s, 68.91MB read
Socket errors: connect 0, read 1313668, write 0, timeout 0
Requests/sec: 131114.37
Transfer/sec: 6.88MB
contrib_zap on master via ↯ v0.11.0 via impure (nix-shell) took 11s
➜ ./wrk/measure.sh go
listening on 0.0.0.0:8090
========================================================================
go
========================================================================
Running 10s test @ http://127.0.0.1:8090/hello
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.18ms 810.81us 13.95ms 75.24%
Req/Sec 89.86k 1.04k 92.66k 73.25%
Latency Distribution
50% 1.09ms
75% 1.54ms
90% 2.12ms
99% 4.18ms
3577111 requests in 10.02s, 457.13MB read
Requests/sec: 356961.73
Transfer/sec: 45.62MB
In fact, it seems to have gotten a bit slower. Here is an exemplary run of the "old" code:
➜ ./wrk/measure.sh rust
Finished release [optimized] target(s) in 0.03s
========================================================================
rust
========================================================================
Running 10s test @ http://127.0.0.1:7878
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.26ms 625.30us 14.34ms 64.06%
Req/Sec 34.14k 2.63k 39.02k 78.75%
Latency Distribution
50% 1.38ms
75% 1.69ms
90% 1.97ms
99% 2.54ms
1359392 requests in 10.01s, 71.30MB read
Socket errors: connect 0, read 1359370, write 0, timeout 0
Requests/sec: 135751.05
Transfer/sec: 7.12MB
I repeated the rust runs a few times and the examples above are representative on my machine.
Speaking of my (testing) machine:
➜ neofetch
▗▄▄▄ ▗▄▄▄▄ ▄▄▄▖ rs@ryzen
▜███▙ ▜███▙ ▟███▛ --------
▜███▙ ▜███▙▟███▛ OS: NixOS 23.05.997.ddf4688dc7a (Stoat) x86_64
▜███▙ ▜██████▛ Host: Micro-Star International Co., Ltd. B550-A PRO (MS-7C56)
▟█████████████████▙ ▜████▛ ▟▙ Kernel: 6.3.7
▟███████████████████▙ ▜███▙ ▟██▙ Uptime: 14 days, 17 hours, 11 mins
▄▄▄▄▖ ▜███▙ ▟███▛ Packages: 2094 (nix-system), 1356 (nix-user), 7 (flatpak)
▟███▛ ▜██▛ ▟███▛ Shell: bash 5.2.15
▟███▛ ▜▛ ▟███▛ Resolution: 3840x2160
▟███████████▛ ▟██████████▙ DE: none+i3
▜██████████▛ ▟███████████▛ WM: i3
▟███▛ ▟▙ ▟███▛ Terminal: tmux
▟███▛ ▟██▙ ▟███▛ CPU: AMD Ryzen 5 5600X (12) @ 3.700GHz
▟███▛ ▜███▙ ▝▀▀▀▀ GPU: AMD ATI Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT
▜██▛ ▜███▙ ▜██████████████████▛ Memory: 4222MiB / 32028MiB
▜▛ ▟████▙ ▜████████████████▛
▟██████▙ ▜███▙
▟███▛▜███▙ ▜███▙
▟███▛ ▜███▙ ▜███▙
▝▀▀▀ ▀▀▀▀▘ ▀▀▀▘
My AMD CPU shows up as 12 cores in htop.
I am willing to merge this, since the measure.sh improvements are great!
But the rust example's performance differences, for some strange reason, are not so great, and on top of that: negative. I will experiment with the thread count for a moment and give an update in a few minutes.
Update: I tried 8, 12, 16, 32, 64 threads and did not get really different numbers out of it: most were ca. 131k req/s.
What stood out was 32 threads, where I got > 131k rather consistently:
➜ ./wrk/measure.sh rust
Finished release [optimized] target(s) in 0.03s
========================================================================
rust
========================================================================
Running 10s test @ http://127.0.0.1:7878
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 0.92ms 1.02ms 203.00ms 96.12%
Req/Sec 33.42k 4.73k 39.67k 54.75%
Latency Distribution
50% 801.00us
75% 1.36ms
90% 1.71ms
99% 2.20ms
1330109 requests in 10.02s, 69.77MB read
Socket errors: connect 0, read 1330101, write 0, timeout 0
Requests/sec: 132749.58
Transfer/sec: 6.96MB
contrib_zap on master [!] via ↯ v0.11.0 via impure (nix-shell) took 11s
➜ ./wrk/measure.sh rust
Finished release [optimized] target(s) in 0.03s
========================================================================
rust
========================================================================
Running 10s test @ http://127.0.0.1:7878
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 0.93ms 1.03ms 203.78ms 96.54%
Req/Sec 33.51k 4.88k 39.83k 51.75%
Latency Distribution
50% 824.00us
75% 1.37ms
90% 1.72ms
99% 2.20ms
1333909 requests in 10.02s, 69.97MB read
Socket errors: connect 0, read 1333871, write 0, timeout 0
Requests/sec: 133136.87
Transfer/sec: 6.98MB
contrib_zap on master [!] via ↯ v0.11.0 via impure (nix-shell) took 11s
contrib_zap on master [!] via ↯ v0.11.0 via impure (nix-shell) took 11s
➜ ./wrk/measure.sh rust
Finished release [optimized] target(s) in 0.03s
========================================================================
rust
========================================================================
Running 10s test @ http://127.0.0.1:7878
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 0.92ms 606.17us 16.93ms 66.11%
Req/Sec 33.28k 4.71k 39.83k 63.25%
Latency Distribution
50% 804.00us
75% 1.37ms
90% 1.72ms
99% 2.24ms
1325189 requests in 10.03s, 69.51MB read
Socket errors: connect 0, read 1325154, write 0, timeout 0
Requests/sec: 132167.69
Transfer/sec: 6.93MB
contrib_zap on master [!] via ↯ v0.11.0 via impure (nix-shell) took 11s
➜
Hang on. There's another difference: when I tested the old code, I also tested with the old measure.sh, which did not limit the rust server to 4 threads.
With the new measure.sh and the old rust code, I get 132k, 136k, 134k, 138k req/s.
I am curious: how does the new rust code compare to the old one on your machine, and what kind of machine is it?
Some more info:
$ rustc --version
rustc 1.69.0 (84c898d65 2023-04-16) (built from a source tarball)
$ cargo version
cargo 1.69.0
Ok, I think the most important bit is that you always have to isolate the load generation from the server to the best of your ability. So taskset should stay =)
As for the number of threads, this just constrains how many things you could run in parallel, not how much parallelism you actually get (the latter is achieved with taskset).
zap (and most other modern web frameworks) uses event-based IO, and as a result it can serve multiple connections with one thread, so it only needs as many threads as you have cores. The silly rust code uses blocking IO, and as a result needs more threads (though only 4 of them will be running at any given point in time).
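To make the blocking-IO point concrete, here is a minimal sketch (illustrative only, not the actual example code from the PR) of a blocking server with a fixed worker count. Each thread is stuck on one connection at a time, so concurrency is hard-capped at the thread count, which is exactly why blocking designs want more threads than cores:

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::sync::Arc;
use std::thread;

// Blocking-IO server sketch: each worker thread serves exactly one
// connection at a time, so concurrency is capped at `n_workers`.
// An evented server (like zap or axum) multiplexes many connections
// per thread instead.
fn serve_blocking(listener: TcpListener, n_workers: usize) {
    let listener = Arc::new(listener);
    for _ in 0..n_workers {
        let listener = Arc::clone(&listener);
        thread::spawn(move || loop {
            // accept() blocks this whole thread until a client arrives
            match listener.accept() {
                Ok((stream, _)) => handle(stream),
                Err(_) => return,
            }
        });
    }
}

fn handle(mut stream: TcpStream) {
    let mut buf = [0u8; 1024];
    // read() blocks too: while this thread waits for request bytes,
    // it cannot make progress on any other connection
    if stream.read(&mut buf).is_ok() {
        let _ = stream.write_all(
            b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nhi",
        );
    }
}
```

With 4 workers, a 5th concurrent client simply waits in the listen backlog until a thread frees up, which is the bottleneck wrk runs into.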
My machine is a Ryzen 5, though relative numbers should translate well across machines. I actually could not compile zap due to some zig errors, so I was comparing with go and axum (which is about as fast as rust can go anyway).
I've also picked the brains of the folks in the rust community about why axum is so fast, and most people conclude it is some sort of dark magic that is not given to mere mortals.
Well put. Thanks for your clarifications and explanations.
Agree, the taskset should stay.
I used to do select() in the nineties, so IMHO old web/networking code is a kind of role model for modern event-based I/O, which often combines events & threads or (async) "coroutines".
I agree that the silly rust code using blocking I/O reaches its performance limits very quickly, as only 4 connections can ever be served concurrently (under taskset constraints).
Interesting that you are on a Ryzen 5, too! 😊
Regarding compiling zap: I have recently pinned zap to the 0.11.0 (= latest) release of zig. So it might not work with zig master anymore.
My question to you now is: should I merge the entire PR?
Can you compare the old rust code with your new rust code using (your adapted) measure.sh?
I am wondering: if the old (by-the-book) variant does unnecessary memory allocations and your clean version does not, why is the old version slightly faster on my machine?
Seems like a darker kind of dark magic to me 😊.
Alternative: I could provide both rust implementations. Then automate the perf tests so it's easy for anyone to replicate and see which one performs best.
Please let me know how you'd like me to proceed with this PR.
Update: I added your rust example as "rust-clean" and renamed the other one to "rust-bythebook".
I created a wrk/measure_all.sh convenience script that runs all perf tests 3 times.
Also, wrk/graph.py generates the performance graphs (req/sec, xfer/sec) based on the mean of the 3 runs each.
The nix flake provides all dependencies. For python, you need matplotlib for generating the graphs (and sanic if you want to test it).
I will run all this on my test machine when I return home. On my work laptop, I get these results, which I don't want to put in the README because they look terribly biased :facepalm: :
Maybe if you can hack measure_all.sh to just run the 2 rust versions, then run python ./wrk/graph.py, you can check which of the two is faster on your machine. I am still curious.
Update: I found a workstation at work (not a laptop) that yields more realistic numbers:
Interesting!
Okay, with some help from the community, here are the reasons for poor perf (in order of importance)
Just checking. Is it ripe for merging?
Awesome contrib @alexpyattaev , :clap:
$> neofetch --stdout
kassane@Catarino
----------------
OS: Arch Linux x86_64
Kernel: 6.4.11-arch2-1
Uptime: 26 mins
Packages: 992 (pacman)
Shell: zsh 5.9
Resolution: 1360x768
DE: Plasma 5.27.7
WM: KWin
Theme: [Plasma], Breeze [GTK2/3]
Icons: [Plasma], breeze-dark [GTK2/3]
Terminal: konsole
CPU: AMD Ryzen 7 5700G with Radeon Graphics (16) @ 3.800GHz
GPU: NVIDIA Geforce RTX 3050
Memory: 2813MiB / 15778MiB
Note: my OS (arch-linux) doesn't have python-sanic and dotnet installed.
@kassane there is something seriously strange going on on your machine. I just cannot wrap my head around the simplistic, blocking rust-bythebook example outperforming axum. Rust and performance will forever be a mystery to me. People argue theoretically why their rust implementation is bound to be fast (because of x, y, z), but when you put it to the test, theory and reality often don't agree.

A similar thing can be said about cpp-beast. It claims to be async and a beast, yet its performance seems to be below what I would expect from a "fast" C++ application. I more and more come to the conclusion that perf tests are a trap. How is it possible that "shitty" rust outperforms axum? My mind is blown!
@kassane have you used @alexpyattaev 's latest patch btw? I haven't because I want to get his "finished, ready for merging" signal first.
Just wondering... By my logic, that one is supposed to be faster than the bythebook example.
Why are all rust implementations performing similarly fast on your machine when their implementations are so different? This could mean they all hit the same limit of your system.
Can you also share your lscpu output?
@kassane have you used @alexpyattaev 's latest patch btw?
yes! replacing rust implementation only.
Can you also share your lscpu output?
$>lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 7 5700G with Radeon Graphics
CPU family: 25
Model: 80
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU(s) scaling MHz: 49%
CPU max MHz: 4672.0698
CPU min MHz: 1400.0000
BogoMIPS: 7588.80
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_
opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 f
ma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misa
lignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_
pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb
sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoi
nvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsa
ve_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 256 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 4 MiB (8 instances)
L3: 16 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Mitigation; safe RET, no microcode
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
Welcome to the Ryzen club btw 😁.
@alexpyattaev sorry to be such a pain but: do you consider the PR ready for merging? I desperately want to try it out, but better wait until it is finished.
Hi, yes, as far as I can tell it is all good to go.
Why are all rust implementations performing similarly fast on your machine when their implementations are so different? This could mean they all hit the same limit of your system.
The performance of these impls depends a lot on how the memory hierarchy works in your CPU. The one with the pre-made thread pool requires way less memory bandwidth, so it will perform somewhat better on a low-end CPU. On a high-end CPU it should all be about the same.
If you have an intel machine, you can dive deep into this with the pcm tool. On all machines there are also some perf counters available which give you an idea about the load on the memory bus (though on AMD these stats are lacking compared to intel, AFAIK).
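For reference, the "pre-made thread pool" idea is essentially the channel-based worker pool pattern (as in the Rust book's final project): threads are created once up front, and each request costs only a channel send instead of a fresh thread spawn with its new stack and scheduler churn. A rough sketch, illustrative rather than the PR's exact code:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

// Workers are created once and reused for every job, so the per-request
// cost is one channel send instead of a fresh thread spawn (new stack,
// page faults, scheduler churn) for each connection.
struct Pool {
    tx: mpsc::Sender<Job>,
}

impl Pool {
    fn new(n: usize) -> Self {
        let (tx, rx) = mpsc::channel::<Job>();
        // The receiver is shared: whichever idle worker grabs the lock
        // first takes the next job off the queue.
        let rx = Arc::new(Mutex::new(rx));
        for _ in 0..n {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // the guard is a temporary dropped when recv() returns,
                // so the job itself runs without holding the lock
                let job = match rx.lock().unwrap().recv() {
                    Ok(job) => job,
                    Err(_) => return, // sender dropped: pool shut down
                };
                job();
            });
        }
        Pool { tx }
    }

    fn execute(&self, job: impl FnOnce() + Send + 'static) {
        self.tx.send(Box::new(job)).unwrap();
    }
}
```

The only heap traffic per request is the `Box` for the closure; the threads and channel are amortized over the whole run, which is where the memory-bandwidth saving comes from.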
hang on @alexpyattaev: you also changed the bythebook example. I suppose it's not 'by the book' anymore? Or is it? Can you please elaborate a little?
So, I wanted to keep the by-the-book example for reference. In my eyes, we now have 4 rust implementations:
I would take your branch's bythebook and integrate it into the zap repo as 'bythebook-improved', use your clean as 'clean', and keep the old 'bythebook'.
Would you agree? Or am I getting something terribly wrong here?
Uh... yes... sorry for the confusion. I should have given it a better name =) The idea was that the literal "as in the book" solution is not suitable for this comparison (as it does not have keepalive and has terrible thread sync arrangements). The only thing that would illustrate to zap users is that it is indeed possible to write terrible rust =) I'll be making a fix to the actual rust book once I have some spare time.
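For context on the keepalive point: without it, wrk pays a fresh TCP handshake for every single request, and that dominates the measurement. A hedged sketch of what blocking keep-alive handling looks like (simplified request framing, not the PR code):

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

// Keep-alive sketch: instead of closing the socket after one response
// (forcing the client to pay a TCP handshake per request), the handler
// loops and answers every request arriving on the same connection.
fn serve_keepalive(stream: &mut TcpStream) -> std::io::Result<()> {
    let mut buf = [0u8; 1024];
    loop {
        let n = stream.read(&mut buf)?;
        if n == 0 {
            return Ok(()); // client closed the connection
        }
        // NOTE: a real server must parse Content-Length / request framing;
        // this sketch assumes one small GET request per read()
        stream.write_all(
            b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: keep-alive\r\n\r\nok",
        )?;
    }
}
```

The loop is the whole trick: the second and later requests on a connection skip connection setup entirely, which is what benchmark clients like wrk rely on.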
OK, thanks for clarifying. I will go ahead as outlined above if you don't mind, with the result being the 4 rust versions.
We could argue to leave the "classic bythebook" out of `measure_all.sh`, and only keep the clean and the bythebook-improved (as I would call it) versions, i.e. your current versions, in there.
It's entirely up to you which code to leave in vs out, my hope is simply to make sure that people using these benchmarks do not get misleading information.
To that end, I hope some cpp expert can come in and make a halfway decent idiomatic cpp implementation.
I think I will have to put a legend under or on top of the graphs explaining the shortcomings and inapplicability of the original book version, which should be underlined by the bythebook-improved version performing much better.
See my updated blazingly-fast.md: I don't like adding more benchmarks. The more there are, the more people get misled. And then, to set things straight, even more benchmarks need to be added, which will trigger even more reactions in the form of slightly better implementations, etc. Soon we'd have multiple versions for all programming languages and even frameworks. I would probably make an exception for one good C++ implementation though, as I fear cpp-beast is not the ideal candidate for representing C++. It was @kassane 's idea, and I had thought that with such a name it would be an absolute beast, so I approved before measuring.
I suppose leaving one prominent benchmark per language is reasonable, in this case I'd suggest you leave axum (as it is by far the most reasonable thing to compare with zap in a feature-to-feature sort of way). The others can stay out of sight "for the curious folks to find" =)
Have a look at this: axum performs worse than your other implementations:
You really did the rust community a huge favor!!!!!!!
OK, so I have an idea: we put all the details into the blazingly-fast document - and keep the one in the readme simplified (with a link to the other document)
@alexpyattaev I cannot thank you enough. The endless shitstorms I received because of the rust benchmark should be a part of history. :smile: :+1:
You have also done the rust community a huge service. Many have tried, but only you (and your friends) succeeded, diving into the details deep enough and coming up with really favorable results for rust.
Thanks again, I will close this PR now, it has all been merged, as you can see in the blazingly-fast (details) and README.
Thanks again for your valuable contributions!!!!! :tada:
I'm happy to help. Thanks for braving the shitstorm and actually getting to the truth=) Rust fans can be quite toxic at times, there is way too much hype and drama. Be prepared though, zig will have its fair share at some point too.
Ping me once c++ version becomes a reality=)
I sure will ping you!
So far, the zig community (... as if all users of languages were in big, respective communities - but for lack of a better word, I, too, use it) has been small enough, I suppose. Also, the maintainers are doing a brilliant job of leading by positive example. I hope it will stay this cozy. But once zig adoption becomes massive, no doubt it will also attract lower-character people, toxic ones, and radical fans riding on the hype wave. I see potential for an ongoing pun-like competition between rust and zig devs to sometimes get out of hand, creating drama. I also have my own vague, not-thought-through theories about the need to feel "safe" (as opposed to merely valuing safety) in one's programming language and certain personality traits. Could be totally wrong there, but it surely makes for a good conspiracy theory 😂.
Now all measurements force the server to run on exactly 4 cores, and wrk is forced to use the other 4 cores. Naturally, if someone has fewer than 8 cores, they should upgrade =)
Also, the rust example now does not waste time waiting for IO and does not have useless memory allocations. It is still 2x slower than go due to the lack of proper evented IO (which is covered by axum).
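To illustrate the "no useless memory allocations" part, here is a minimal sketch of an allocation-free hot path (illustrative only, not the merged code): the response is a compile-time constant and the read buffer lives on the stack, so serving a request touches no heap at all.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

// The response bytes are built once at compile time; nothing is
// formatted or concatenated per request.
const RESPONSE: &[u8] =
    b"HTTP/1.1 200 OK\r\nContent-Length: 13\r\n\r\nHello, World!";

fn handle(stream: &mut TcpStream) -> std::io::Result<()> {
    let mut buf = [0u8; 512]; // stack buffer, no Vec/String allocation
    let n = stream.read(&mut buf)?;
    if n == 0 {
        return Ok(()); // client closed the connection
    }
    stream.write_all(RESPONSE) // one write, no per-request formatting
}
```

Compare this with the naive pattern of building a `String` response with `format!` for every request, which hits the allocator (and thus the memory bus) on each of the ~130k requests per second.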