zigzap / zap

blazingly fast backends in zig

Improved measurement setup with taskset, removed crud from rust bench #37

Closed alexpyattaev closed 10 months ago

alexpyattaev commented 10 months ago

Now all measurements force the server to run on exactly 4 cores, and wrk is pinned to the other 4 cores. Naturally, if someone has fewer than 8 cores, they should upgrade =)

Also, the rust example now does not waste time waiting for IO and no longer does useless memory allocations. It is still 2x slower than go due to the lack of proper evented IO (which axum covers).

renerocksai commented 10 months ago

Thank you for the improvements. Really, so many people have complained, maybe even looked at the code, but nothing ever came of it. So I am really grateful you dug into this!

Your proposed changes got me thinking:

I certainly don't mind isolating the server to 4 cores and wrk to 4 other cores. But I'd argue that all subjects under test are influenced equally by the same test setup, so I am not sure whether the isolation is really necessary for relative comparisons.

Then, I revisited the 4 threads topic. I recall why the rust code uses only 4 threads: because the zap code also uses 4 threads:

    zap.start(.{
        .threads = 4,
        .workers = 4,
    });

This got me thinking: why waste 4 processes on 4 threads? The IPC might slow things down. By experimentation, I found that the ideal number of worker processes seems to be 2 for 4 threads:

    zap.start(.{
        .threads = 4,
        .workers = 2,
    });

With that I got the following results (3 runs):

Requests/sec: 686589.24
Transfer/sec:    104.11MB

Requests/sec: 658573.26
Transfer/sec:     99.86MB

Requests/sec: 681040.33
Transfer/sec:    103.27MB

Which averages out to:

avg. requests/sec: 675400.94
avg. transfer/sec:    102.41MB

compared to 4 workers:

avg. requests/sec: 649692.13
avg. transfer/sec:    98.52MB

Which is significant.

Now, the question to me is: is it fair to give the rust example 128 threads when the zap example gets only 4?

BTW, my tests above were not yet run with the 4 / 4 core split that you suggest.

renerocksai commented 10 months ago

Update: You've convinced me that the taskset stuff is good because it creates even more impressive numbers :smile:.

However, I just ran the new, updated rust example and still get no performance increase. Here are some example runs on my machine:

./wrk/measure.sh zig
Listening on 0.0.0.0:3000
========================================================================
                          zig
========================================================================
Running 10s test @ http://127.0.0.1:3000
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   588.74us  423.55us  17.88ms   92.34%
    Req/Sec   175.71k    26.96k  216.79k    52.75%
  Latency Distribution
     50%  487.00us
     75%  674.00us
     90%    0.93ms
     99%    2.12ms
  6991199 requests in 10.02s, 1.04GB read
Requests/sec: 698047.77
Transfer/sec:    105.85MB

contrib_zap on  master via ↯ v0.11.0 via  impure (nix-shell) took 11s 
➜ 
./wrk/measure.sh rust
    Finished release [optimized] target(s) in 0.03s
========================================================================
                          rust
========================================================================
Running 10s test @ http://127.0.0.1:7878
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   624.20us    1.49ms 208.31ms   99.90%
    Req/Sec    33.01k     5.21k   39.64k    64.25%
  Latency Distribution
     50%  579.00us
     75%  738.00us
     90%    0.93ms
     99%    1.38ms
  1313715 requests in 10.02s, 68.91MB read
  Socket errors: connect 0, read 1313668, write 0, timeout 0
Requests/sec: 131114.37
Transfer/sec:      6.88MB

contrib_zap on  master via ↯ v0.11.0 via  impure (nix-shell) took 11s 
➜ ./wrk/measure.sh go
listening on 0.0.0.0:8090
========================================================================
                          go
========================================================================
Running 10s test @ http://127.0.0.1:8090/hello
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.18ms  810.81us  13.95ms   75.24%
    Req/Sec    89.86k     1.04k   92.66k    73.25%
  Latency Distribution
     50%    1.09ms
     75%    1.54ms
     90%    2.12ms
     99%    4.18ms
  3577111 requests in 10.02s, 457.13MB read
Requests/sec: 356961.73
Transfer/sec:     45.62MB

In fact, it seems to have gotten a bit slower. Here is an example run of the "old" code:

➜ ./wrk/measure.sh rust
    Finished release [optimized] target(s) in 0.03s
========================================================================
                          rust
========================================================================
Running 10s test @ http://127.0.0.1:7878
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.26ms  625.30us  14.34ms   64.06%
    Req/Sec    34.14k     2.63k   39.02k    78.75%
  Latency Distribution
     50%    1.38ms
     75%    1.69ms
     90%    1.97ms
     99%    2.54ms
  1359392 requests in 10.01s, 71.30MB read
  Socket errors: connect 0, read 1359370, write 0, timeout 0
Requests/sec: 135751.05
Transfer/sec:      7.12MB

I repeated the rust runs a few times and the examples above are representative on my machine.

Speaking of my (testing) machine:

➜ neofetch
rs@ryzen
--------
OS: NixOS 23.05.997.ddf4688dc7a (Stoat) x86_64
Host: Micro-Star International Co., Ltd. B550-A PRO (MS-7C56)
Kernel: 6.3.7
Uptime: 14 days, 17 hours, 11 mins
Packages: 2094 (nix-system), 1356 (nix-user), 7 (flatpak)
Shell: bash 5.2.15
Resolution: 3840x2160
DE: none+i3
WM: i3
Terminal: tmux
CPU: AMD Ryzen 5 5600X (12) @ 3.700GHz
GPU: AMD ATI Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT
Memory: 4222MiB / 32028MiB

My AMD CPU shows up as 12 cores in htop.

I am willing to merge this, since the measure.sh improvements are great!

But the rust example's performance differences are, for some strange reason, not so great, and on top of that: negative. I will experiment with the thread count for a moment and give an update in a few minutes.

renerocksai commented 10 months ago

Update: I tried 8, 12, 16, 32, and 64 threads and did not get really different numbers out of it: most runs were ca. 131k req/s.

What stood out was 32 threads, where I got > 131k rather consistently:

➜ ./wrk/measure.sh rust
    Finished release [optimized] target(s) in 0.03s
========================================================================
                          rust
========================================================================
Running 10s test @ http://127.0.0.1:7878
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.92ms    1.02ms 203.00ms   96.12%
    Req/Sec    33.42k     4.73k   39.67k    54.75%
  Latency Distribution
     50%  801.00us
     75%    1.36ms
     90%    1.71ms
     99%    2.20ms
  1330109 requests in 10.02s, 69.77MB read
  Socket errors: connect 0, read 1330101, write 0, timeout 0
Requests/sec: 132749.58
Transfer/sec:      6.96MB

contrib_zap on  master [!] via ↯ v0.11.0 via  impure (nix-shell) took 11s 
➜ ./wrk/measure.sh rust
    Finished release [optimized] target(s) in 0.03s
========================================================================
                          rust
========================================================================
Running 10s test @ http://127.0.0.1:7878
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.93ms    1.03ms 203.78ms   96.54%
    Req/Sec    33.51k     4.88k   39.83k    51.75%
  Latency Distribution
     50%  824.00us
     75%    1.37ms
     90%    1.72ms
     99%    2.20ms
  1333909 requests in 10.02s, 69.97MB read
  Socket errors: connect 0, read 1333871, write 0, timeout 0
Requests/sec: 133136.87
Transfer/sec:      6.98MB

contrib_zap on  master [!] via ↯ v0.11.0 via  impure (nix-shell) took 11s 
➜ ./wrk/measure.sh rust
    Finished release [optimized] target(s) in 0.03s
========================================================================
                          rust
========================================================================
Running 10s test @ http://127.0.0.1:7878
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.92ms  606.17us  16.93ms   66.11%
    Req/Sec    33.28k     4.71k   39.83k    63.25%
  Latency Distribution
     50%  804.00us
     75%    1.37ms
     90%    1.72ms
     99%    2.24ms
  1325189 requests in 10.03s, 69.51MB read
  Socket errors: connect 0, read 1325154, write 0, timeout 0
Requests/sec: 132167.69
Transfer/sec:      6.93MB

contrib_zap on  master [!] via ↯ v0.11.0 via  impure (nix-shell) took 11s 
➜ 
renerocksai commented 10 months ago

Hang on. There's another difference: when I tested the old code, I also tested with the old measure.sh which did not limit the rust server to 4 threads.

With the new measure.sh and the old rust code, I get 132k, 136k, 134k, 138k req/s.

I am curious, how does the new rust code compare to the old one on your machine and what kind of machine is it?

Some more info:

$ rustc --version
rustc 1.69.0 (84c898d65 2023-04-16) (built from a source tarball)
$ cargo version
cargo 1.69.0
alexpyattaev commented 10 months ago

Ok, I think the most important bit is that you always have to isolate the load generation from the server to the best of your ability. So taskset should stay =)

As for the number of threads, this just constrains how many things you could run in parallel, not how much parallelism you actually get (the latter is achieved with taskset).

zap (and most other modern web frameworks) uses event-based IO, and as a result it can serve multiple connections with one thread, so it only needs as many threads as you have cores. The silly rust code uses blocking IO, and as a result needs more threads (though only 4 of them will be running at any given point in time).
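
To make that concrete, here is a minimal, hypothetical sketch (not the code from this PR) of a blocking server with a fixed pool of 4 threads in plain std Rust. Each thread parks on read() while its client is idle, so at most 4 connections are ever in flight, no matter how many wrk opens:

    use std::io::{Read, Write};
    use std::net::TcpListener;
    use std::thread;

    // Hypothetical sketch: with blocking IO, each thread serves exactly one
    // connection at a time, so a 4-thread pool caps concurrency at 4.
    fn main() -> std::io::Result<()> {
        let listener = TcpListener::bind("0.0.0.0:7878")?;
        let handles: Vec<_> = (0..4)
            .map(|_| {
                let listener = listener.try_clone().expect("clone listener");
                thread::spawn(move || {
                    for stream in listener.incoming().flatten() {
                        let mut stream = stream;
                        let mut buf = [0u8; 4096];
                        // The thread blocks here while the client is idle; an
                        // evented server would service other ready sockets.
                        while matches!(stream.read(&mut buf), Ok(n) if n > 0) {
                            let _ = stream.write_all(
                                b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok",
                            );
                        }
                    }
                })
            })
            .collect();
        for h in handles {
            let _ = h.join();
        }
        Ok(())
    }

An evented server would instead register all sockets with epoll/kqueue and let a single thread service whichever of them is ready.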

My machine is a Ryzen 5, though the relative numbers should translate well across machines. I actually could not compile zap due to some zig errors, so I was comparing with go and axum (which is about as fast as rust is able to go anyway).

I've also picked the brains of the folks in the rust community about why axum is so fast, and most people conclude it is some sort of dark magic that is not given to mere mortals.

renerocksai commented 10 months ago

Well put. Thanks for your clarifications and explanations.

Agree, the taskset should stay.

I used to do select() in the nineties, so IMHO old web / networking code is a kind of role model for modern event-based I/O, which often combines events & threads or (async) "coroutines".

I agree that the silly rust code using blocking I/O reaches its performance limits very quickly, as only 4 connections can ever be served concurrently (under the taskset constraints).

Interesting that you are on a Ryzen 5, too! 😊

Regarding compiling zap: I recently pinned zap to the 0.11.0 (= latest) release of zig, so it might not work with zig master anymore.

My question to you now is: should I merge the entire PR?

Can you compare the old rust code with your new rust code using (your adapted) measure.sh?

I am wondering: if the old (by-the-book) variant does unnecessary memory allocations and your clean version does not, why is the old version slightly faster on my machine?

Seems like a darker kind of dark magic to me 😊.

Alternative: I could provide both rust implementations, then automate the perf tests so it's easy for anyone to replicate them and see which one performs best.

Please let me know how you'd like me to proceed with this PR.

renerocksai commented 10 months ago

Update: I added your rust example as "rust-clean" and renamed the other one to "rust-bythebook".

I created a wrk/measure_all.sh convenience script that runs all perf tests 3 times.

Also, wrk/graph.py generates the performance graphs: req/sec, xfer/sec, based on the mean of the 3 runs each.

The nix flake provides all dependencies. For python, you need matplotlib for generating the graphs (and sanic if you want to test it).

I will run all this on my test machine when I return home. On my work laptop, I get these results, which I don't want to put in the README because they look terribly biased :facepalm: :

[graph: req/sec]

Maybe you can hack measure_all.sh to just run the 2 rust versions, then run python ./wrk/graph.py to check which of the two is faster on your machine. I am still curious.

Update: I found a workstation at work (not a laptop) that yields more realistic numbers:

[graph: req/sec]

Interesting!

alexpyattaev commented 10 months ago

Okay, with some help from the community, here are the reasons for the poor perf (in order of importance):

  1. No keepalive support, so the client had to reopen the connection for every request, while the other implementations reuse connections (at least they should, and wrk expects them to). See the sketch below.
  2. Having just 4 threads doing blocking IO resulted in poor utilization of the CPU due to stalls.
  3. A mutex around a single shared channel was a bad idea =) Having a bunch of channels is way better; also covered in the sketch below.
  4. Passing trait objects where a function pointer is sufficient creates some overhead (small as it is), in the sense that trait objects require memory allocation, which needs to be synced across threads.
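
For illustration, here is a minimal sketch of what fixing points 1 and 3 could look like in plain std Rust. This is a hypothetical reconstruction, not the code from this PR: a keep-alive loop serves many requests per connection, and each worker owns its own channel instead of all workers contending on one Mutex-guarded receiver:

    use std::io::{BufRead, BufReader, Write};
    use std::net::{TcpListener, TcpStream};
    use std::sync::mpsc;
    use std::thread;

    // Keep-alive: keep answering requests on the same connection until the
    // client closes it, instead of one request per accept().
    fn handle(stream: TcpStream) {
        let mut reader = BufReader::new(stream.try_clone().expect("clone stream"));
        let mut stream = stream;
        let mut line = String::new();
        loop {
            line.clear();
            if reader.read_line(&mut line).unwrap_or(0) == 0 {
                return; // client closed the connection
            }
            // Skip the remaining request headers (blank line terminates them).
            loop {
                let mut h = String::new();
                match reader.read_line(&mut h) {
                    Ok(0) | Err(_) => return,
                    Ok(_) if h == "\r\n" => break,
                    Ok(_) => {}
                }
            }
            if stream
                .write_all(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
                .is_err()
            {
                return;
            }
        }
    }

    fn main() -> std::io::Result<()> {
        let listener = TcpListener::bind("0.0.0.0:7878")?;
        // One channel per worker: the acceptor round-robins connections, so
        // workers never contend on a shared Mutex<Receiver>.
        let senders: Vec<mpsc::Sender<TcpStream>> = (0..4)
            .map(|_| {
                let (tx, rx) = mpsc::channel::<TcpStream>();
                thread::spawn(move || {
                    for stream in rx {
                        handle(stream);
                    }
                });
                tx
            })
            .collect();
        for (i, stream) in listener.incoming().enumerate() {
            if let Ok(s) = stream {
                let _ = senders[i % senders.len()].send(s);
            }
        }
        Ok(())
    }

A real implementation would also have to honor `Connection: close` and the request's HTTP version before reusing the socket.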
renerocksai commented 10 months ago

Just checking. Is it ripe for merging?

kassane commented 10 months ago

Awesome contrib, @alexpyattaev! :clap:

My Setup

$> neofetch --stdout
kassane@Catarino 
---------------- 
OS: Arch Linux x86_64 
Kernel: 6.4.11-arch2-1 
Uptime: 26 mins 
Packages: 992 (pacman) 
Shell: zsh 5.9 
Resolution: 1360x768 
DE: Plasma 5.27.7 
WM: KWin 
Theme: [Plasma], Breeze [GTK2/3] 
Icons: [Plasma], breeze-dark [GTK2/3] 
Terminal: konsole 
CPU: AMD Ryzen 7 5700G with Radeon Graphics (16) @ 3.800GHz 
GPU: NVIDIA Geforce RTX 3050 
Memory: 2813MiB / 15778MiB

Note: my OS (arch-linux) doesn't have python-sanic and dotnet installed.

Running test

[graphs: req/sec, xfer/sec]

renerocksai commented 10 months ago

@kassane there is something seriously strange going on on your machine. I just cannot wrap my head around the simplistic, blocking rust-bythebook example outperforming axum. Rust and performance will forever be a mystery to me. People argue theoretically why their rust implementation is bound to be fast (because of x, y, z), but when you put it to the test, theory and reality often don't like to agree. A similar thing can be said about cpp-beast: it claims to be async and a beast, yet its performance seems to be below what I would expect from a "fast" C++ application. More and more, I come to the conclusion that perf tests are a trap. How is it possible that "shitty" rust outperforms axum? My mind is blown!

renerocksai commented 10 months ago

@kassane have you used @alexpyattaev 's latest patch btw? I haven't because I want to get his "finished, ready for merging" signal first.

Just wondering... By my logic, that one is supposed to be faster than the bythebook example.

Why are all rust implementations performing similarly fast on your machine when their implementations are so different? This could mean they all hit the same limit of your system.

Can you also share your lscpu output?

kassane commented 10 months ago

> @kassane have you used @alexpyattaev 's latest patch btw?

Yes! Replacing the rust implementation only.

> Can you also share your lscpu output?

$>lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 7 5700G with Radeon Graphics
    CPU family:          25
    Model:               80
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            0
    Frequency boost:     enabled
    CPU(s) scaling MHz:  49%
    CPU max MHz:         4672.0698
    CPU min MHz:         1400.0000
    BogoMIPS:            7588.80
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_
                         opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 f
                         ma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misa
                         lignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_
                         pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb 
                         sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoi
                         nvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsa
                         ve_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   256 KiB (8 instances)
  L1i:                   256 KiB (8 instances)
  L2:                    4 MiB (8 instances)
  L3:                    16 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Mitigation; safe RET, no microcode
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
renerocksai commented 10 months ago

Welcome to the Ryzen club btw 😁.

@alexpyattaev sorry to be such a pain, but: do you consider the PR ready for merging? I desperately want to try it out, but I'd better wait until it is finished.

alexpyattaev commented 10 months ago

Hi, yes, as far as I can tell it is all good to go.

alexpyattaev commented 10 months ago

> Why are all rust implementations performing similarly fast on your machine when their implementations are so different? This could mean they all hit the same limit of your system.

The performance of these impls depends a lot on how the memory hierarchy works in your CPU. The one with the pre-made thread pool requires way less memory bandwidth, so it will perform somewhat better on a low-end CPU. On a high-end CPU they should all be about the same.

If you have an intel machine, you can dive deep into this with the pcm tool. On all machines there are also some perf counters available that give you an idea of the load on the memory bus (though on AMD these stats are lacking compared to intel, AFAIK).

renerocksai commented 10 months ago

Hang on @alexpyattaev: you also changed the bythebook example. I suppose it's not "by the book" anymore? Or is it? Can you please elaborate a little?

renerocksai commented 10 months ago

So, I wanted to keep the by-the-book example for reference. In my eyes, we now have 4 rust implementations.

I would take your branch's bythebook and integrate it into the zap repo as 'bythebook-improved', use your clean version as 'clean', and keep the old 'bythebook'.

Would you agree? Or am I getting something terribly wrong here?

alexpyattaev commented 10 months ago

Uh... yes... sorry for the confusion. I should have given it a better name =)

The idea was that the literal "as in the book" solution is not suitable for this comparison (as it does not have keepalive and has terrible thread sync arrangements). The only thing it would illustrate to zap users is that it is indeed possible to write terrible rust =) I'll be making a fix to the actual rust book once I have some spare time.

renerocksai commented 10 months ago

OK, thanks for clarifying. I will go ahead as outlined above if you don't mind, with the result being the 4 rust versions.

We could argue for leaving the classic "bythebook" out of `measure_all.sh` and only keeping the clean and the bythebook-improved (as I would call it) versions, i.e. your current versions, in there.

alexpyattaev commented 10 months ago

It's entirely up to you which code to leave in vs. out; my hope is simply to make sure that people using these benchmarks do not get misleading information.

To that end, I hope some cpp expert can come in and make a halfway decent idiomatic cpp implementation.

renerocksai commented 10 months ago

I think I will have to put a legend under or on top of the graphs explaining the shortcomings and inapplicability of the original book version, which should be underscored by the bythebook-improved version performing much better.

See my updated blazingly-fast.md: I don't like adding more benchmarks. The more there are, the more people get misled. And then, to set things straight, even more benchmarks need to be added, which will trigger even more reactions in the form of slightly better implementations, etc. Soon we'd have multiple versions for every programming language and even framework. I probably would make an exception for one good C++ implementation, though, as I fear cpp-beast is not the ideal candidate for representing C++. It was @kassane 's idea, and I had thought that with such a name it would be an absolute beast, so I approved before measuring.

alexpyattaev commented 10 months ago

I suppose leaving one prominent benchmark per language is reasonable; in this case I'd suggest you keep axum (as it is by far the most reasonable thing to compare with zap in a feature-to-feature sort of way). The others can stay out of sight "for the curious folks to find" =)

renerocksai commented 10 months ago

Have a look at this: axum performs worse than your other implementations:

[graph: req/sec]

[graph: xfer/sec]

You really did the rust community a huge favor!!!!!!!

renerocksai commented 10 months ago

OK, so I have an idea: we put all the details into the blazingly-fast document and keep the one in the README simplified (with a link to the other document).

renerocksai commented 10 months ago

@alexpyattaev I cannot thank you enough. The endless shitstorms I received because of the rust benchmark should now be a thing of the past. :smile: :+1:

You have also done the rust community a huge service. Many have tried, but only you (and your friends) succeeded, diving deep enough into the details and coming up with really favorable results for rust.

Thanks again. I will close this PR now; it has all been merged, as you can see in the blazingly-fast document (details) and the README.

Thanks again for your valuable contributions!!!!! :tada:

alexpyattaev commented 10 months ago

I'm happy to help. Thanks for braving the shitstorm and actually getting to the truth =) Rust fans can be quite toxic at times; there is way too much hype and drama. Be prepared, though: zig will have its fair share at some point too.

Ping me once c++ version becomes a reality=)

renerocksai commented 10 months ago

I sure will ping you!

So far, the zig community (... as if all users of a language formed one big community, but for lack of a better word, I, too, use it) has been small enough, I suppose. Also, the maintainers are doing a brilliant job of leading by positive example. I wish for it to stay this cozy. But once zig adoption becomes massive, it will no doubt also attract lower-character people, toxic ones, and radical fans riding the hype wave. I see potential for the ongoing pun-like competition between rust and zig devs to get out of hand sometimes, creating drama. I also have my own vague, not-thought-through theories about the need to feel "safe" (as opposed to merely valuing safety) in one's programming language and certain personality traits. I could be totally wrong there, but it surely makes for a good conspiracy theory 😂.