Closed — xhpohanka closed this issue 2 years ago
Indeed, there is a regression in TCP throughput. The problem with TCP is that we currently have no proper tests that would catch this kind of regression. This has been one of the reasons we have been building a new version of the TCP stack that supports proper testing and can be integrated into sanitycheck. TCP2 has been cooking for quite a long time and we are slowly getting to the point where it will be useful, but we are not there yet. Obviously 10 kB/sec is not good and this needs to be fixed.
I just compiled samples/net/sockets/dumb_http_server_mt for the Atmel sam-e70 board and got the following numbers:

```
$ ab -n100 http://192.0.2.1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.0.2.1 (be patient).....done

Server Software:
Server Hostname:        192.0.2.1
Server Port:            8080

Document Path:          /
Document Length:        2084 bytes

Concurrency Level:      1
Time taken for tests:   0.381 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      214000 bytes
HTML transferred:       208400 bytes
Requests per second:    262.67 [#/sec] (mean)
Time per request:       3.807 [ms] (mean)
Time per request:       3.807 [ms] (mean, across all concurrent requests)
Transfer rate:          548.95 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     1    3  20.7      1     208
Waiting:        1    3  20.7      1     208
Total:          1    4  20.6      2     208

Percentage of the requests served within a certain time (ms)
  50%      2
  66%      2
  75%      2
  80%      2
  90%      2
  95%      2
  98%      3
  99%    208
 100%    208 (longest request)
```
So definitely more than 10 kB/sec. Then using wrk:

```
$ ./wrk -d 20 -t 24 -c 500 --latency http://192.0.2.1:8080
Running 20s test @ http://192.0.2.1:8080
  24 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.74ms   62.79ms 616.12ms   97.41%
    Req/Sec     1.74      3.83    20.00     89.80%
  Latency Distribution
     50%    1.29ms
     75%    1.35ms
     90%    4.46ms
     99%  207.79ms
  116 requests in 20.10s, 242.42KB read
  Socket errors: connect 0, read 134, write 0, timeout 0
Requests/sec:      5.77
Transfer/sec:     12.06KB
```
Now the transfer speed is very poor.
I then increased `CONFIG_NET_TCP_BACKLOG_SIZE=2` and got considerably better results:

```
$ ./wrk -d 10 -t 2 -c 100 --latency http://192.0.2.1:8080
Running 10s test @ http://192.0.2.1:8080
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.61ms  421.92us   9.90ms   95.03%
    Req/Sec   375.32    153.56   610.00     72.80%
  Latency Distribution
     50%    1.41ms
     75%    1.88ms
     90%    1.98ms
     99%    2.17ms
  4768 requests in 10.01s, 9.73MB read
  Socket errors: connect 0, read 4774, write 0, timeout 0
Requests/sec:    476.10
Transfer/sec:      0.97MB
```
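For anyone wanting to reproduce the comparison, the option tweaked above lives in the sample's prj.conf. A minimal fragment (the value is the one used in the faster run above; the slower runs used a backlog of 1):

```
# Queue more than one incoming connection while the server thread
# is busy accepting - value from the run above.
CONFIG_NET_TCP_BACKLOG_SIZE=2
```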
I see this behavior on STM32F429ZI. With `CONFIG_NET_TCP_BACKLOG_SIZE=1`:

```
$ wrk -d 20 -t 24 -c 500 --latency http://192.0.2.1:8080
Running 20s test @ http://192.0.2.1:8080
  24 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.51ms    1.79ms   8.77ms   89.86%
    Req/Sec     1.86      3.43    10.00     86.96%
  Latency Distribution
     50%    1.93ms
     75%    1.95ms
     90%    6.70ms
     99%    8.77ms
  69 requests in 20.10s, 144.20KB read
  Socket errors: connect 0, read 69, write 0, timeout 0
Requests/sec:      3.43
Transfer/sec:      7.17KB
```
With `CONFIG_NET_TCP_BACKLOG_SIZE=2`:

```
$ wrk -d 20 -t 24 -c 500 --latency http://192.0.2.1:8080
Running 20s test @ http://192.0.2.1:8080
  24 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   196.18ms   36.93ms 407.71ms   96.32%
    Req/Sec     8.92      4.18    35.00     85.63%
  Latency Distribution
     50%  202.17ms
     75%  202.23ms
     90%  202.32ms
     99%  209.31ms
  896 requests in 20.09s, 1.84MB read
  Socket errors: connect 0, read 898, write 0, timeout 0
Requests/sec:     44.60
Transfer/sec:     93.92KB
```
I do not understand why the latency is so high.
The previous result was for v2.2.0-rc3; for 2.1.0-rc3 it is even worse...
With `CONFIG_NET_TCP_BACKLOG_SIZE=2`:

```
$ wrk -d 20 -t 24 -c 500 --latency http://192.0.2.1:8080
Running 20s test @ http://10.42.0.192:8080
  24 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   168.84ms   74.40ms 220.12ms   82.07%
    Req/Sec     5.78     10.02   100.00     98.40%
  Latency Distribution
     50%  202.34ms
     75%  202.38ms
     90%  202.58ms
     99%  218.44ms
  145 requests in 20.03s, 305.86KB read
  Socket errors: connect 0, read 789, write 0, timeout 0
Requests/sec:      7.24
Transfer/sec:     15.27KB
```
It also seems to me that the issue with the big_http_download sample is a bit different. The download is very slow for me regardless of CONFIG_NET_TCP_BACKLOG_SIZE or reverting commit d88f25bd763e2dfa70873b3c2321f2f8677d643d. Downloading a 52 kB file takes 8.5 seconds. You can check my wireshark log https://drive.google.com/open?id=163-v3MlK3Hgc4F47X05GSSJRo3EFpX-s; it is full of [TCP Window Full] packets.
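The "TCP Window Full" packets suggest the receiver's advertised window is the bottleneck: the sender can have at most one window of data in flight per round trip, so throughput is bounded by roughly window/RTT. A back-of-the-envelope check (the window and stall values here are illustrative assumptions, not measured):

```python
def max_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on TCP throughput when the receive window is the
    limiting factor: one full window of data per round trip."""
    return window_bytes * 8 / rtt_s

# With a small advertised window drained once per ~200 ms stall (as in
# the capture), the bound lands in the tens of kbit/s:
print(max_throughput_bps(1280, 0.2))  # -> 51200.0 bits/s
```

This is consistent with a 52 kB file taking seconds rather than milliseconds once the window stays full.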
I also checked zperf, again on v2.2.0-rc3, and this is what I get with upstream code:

```
$ iperf -c 192.0.2.1
------------------------------------------------------------
Client connecting to 192.0.2.1, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 192.0.2.2 port 51802 connected with 192.0.2.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec  13.6 KBytes  10.9 Kbits/sec
```
and this with d88f25bd763e2dfa70873b3c2321f2f8677d643d reverted (backlog size has no impact here):

```
$ iperf -c 192.0.2.1
------------------------------------------------------------
Client connecting to 192.0.2.1, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 192.0.2.2 port 51778 connected with 192.0.2.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec  16.1 MBytes  13.4 Mbits/sec
```
I'm very interested in getting this to work better. I can do more tests and even development if I get some guidance...
Some fixes for the memory leaks are in #23334. #23246 is related to this issue. The low performance numbers are probably the result of a lot of packets being dropped. Anyway, the memory leaks need to be fixed first, and then we can see how the performance numbers behave.
For UDP transfers I got around 10Mbps but for TCP the result was only 10kbps which is really bad.
Indeed, there is a regression with the TCP throughput.
I personally never saw other TCP speed figures from Zephyr with frdm_k64f/any of the qemu networking drivers. (Well, perhaps I saw 15 KBytes/s, but you get the point.) I actually wanted to add dumping of the download speed to big_http_download, but decided against that ;-).
I'm currently working on switching big_http_download over to localhost downloads (yep, to put it into CI and not rely on external hosts, which would add extra disturbance to the testing).
And I decided to share what I see with just an asciinema recording: https://asciinema.org/a/BXbcuYTQsrsPxDmdCnbfUz7Qz (as I'm not sure how soon they garbage-collect uploads, also attaching here big_http_download-localhost.zip ).
So, as you can see, it starts slow. Then at ~70KB (~00:19 on the cast) it finally starts to work "as it should", then at ~140KB (~00:21) it breaks again. You've already read my explanation of what happens (based on hours of peering into wireshark output, but months ago): it starts mis-synchronized with Linux TCP, sending rexmits and bumping into a large backoff delay, then manages to synchronize with Linux (this is literally the first time I see it), then loses sync again and goes back to the swamp for the rest of the download.
(Which makes me think: maybe add a calculation of "momentary" speed over e.g. the last second or two, and print each value on a separate line, not with "\r".)
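The "momentary speed" idea could be sketched like this (a hypothetical helper, not part of the sample, tracking bytes received over a short sliding window):

```python
import collections
import time

class MomentarySpeed:
    """Track download speed over a sliding window of the last few seconds."""

    def __init__(self, window_s: float = 2.0):
        self.window_s = window_s
        self.samples = collections.deque()  # (timestamp, nbytes) pairs

    def add(self, nbytes, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, nbytes))
        # Drop samples that have fallen out of the window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def bytes_per_sec(self) -> float:
        if not self.samples:
            return 0.0
        return sum(n for _, n in self.samples) / self.window_s

m = MomentarySpeed(window_s=2.0)
m.add(1000, now=0.0)
m.add(1000, now=1.0)
m.add(1000, now=2.5)      # the t=0.0 sample falls out of the window here
print(m.bytes_per_sec())  # -> 1000.0
```

Printing that once per second on its own line would make the slow/fast streaks in the asciinema cast visible at a glance.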
I tested this with nucleo_f767zi and see wildly different numbers depending on the tool used. ApacheBench gave around 470 kB/sec, which is quite reasonable. The wrk tool on the other hand pushes so much traffic to zephyr that it easily runs out of memory and starts to drop packets. This then affects the performance numbers a lot. I am lowering the priority of this one until we figure out what the correct numbers are.
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.
Recently, Zephyr switched to the newer "TCP2" implementation by default. I'm doing its triaging for the 2.4 release (#27876, #27982), and also would like to record some quick performance data:

dumb_http_server sample, qemu_x86, standard SLIP-based QEMU networking setup, `ab -n1000 http://192.0.2.1:8080/`: 60.78 req/s.
For reference, with TCP1 (`CONFIG_NET_TCP1=y`), it's 4.18 req/s.
So, there's a noticeable improvement in connection latency.
But not so bright with data transfer throughput. The sample is big_http_download, the rest is as above.
First, reference with CONFIG_NET_TCP1=y: 3108 b/s. That's pretty slow. I remember speeds around 10KB/s. Dunno what's up now, my inet connection is not ideal either.
So, with default TCP2, it's 6154 b/s (mind #27982, the transfer doesn't complete successfully, implied end time was used to calc speed).
For reference, trying @nzmichaelh's hack from https://github.com/zephyrproject-rtos/zephyr/issues/26330#issuecomment-656194045 (set NET_TCP_BUF_MAX_LEN to (4*1280)) didn't make a statistical difference to me (on qemu_x86 with SLIP, again).
Has anyone looked at the CPU usage when pushing the UDP throughput to 10 Mbps? I am able to go up to 8.5 Mbps with a modified echo_server sample but see ~70% CPU usage. Any tips on how to reduce that? I've modified the echo server to receive a message and then send a dummy packet of ~1200 bytes.
11075 b/s with big_http_download on frdm_k64f.
@pfalcon, first, thank you for reporting these metrics. For reference though, when you report measured results, do you have an expected value in mind that we can compare against?
@jukkar could you please settle on the benchmarks that we use to measure progress in this issue?
@jukkar I'm wondering whether this issue will be investigated further or not. In my test, TCP2 performance is still poor compared to FreeRTOS on the same hardware platform.
@jhqian my board is the i.mx rt1060 evb, do you have any suggestions? Have you done any TCP performance testing before?
I'm wondering whether this issue will be investigated further or not. In my test, TCP2 performance is still poor compared to FreeRTOS on the same hardware platform.
No major investigation is going on atm for the performance issue. One thing we are missing is a consistent and documented test environment i.e., how to test TCP performance, what tools to use etc. It is difficult to get consistent results as people are using different boards and software.
One thing that certainly affects TCP performance in zephyr is that currently Nagle algorithm and delayed acks are not implemented. There are certainly areas of improvement here.
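As a rough illustration of what the missing Nagle algorithm does on the sender side (a simplified decision sketch, not the Zephyr or Linux implementation):

```python
def nagle_can_send(segment_len: int, mss: int, unacked_data: bool,
                   nodelay: bool) -> bool:
    """Simplified Nagle decision: a full-sized segment always goes out;
    a small segment is sent immediately only when there is no
    unacknowledged data in flight (or TCP_NODELAY disables Nagle)."""
    if nodelay or segment_len >= mss:
        return True
    return not unacked_data

# A 10-byte write with data still in flight gets coalesced, not sent:
print(nagle_can_send(10, 1460, unacked_data=True, nodelay=False))   # -> False
# With nothing outstanding, the small segment goes out immediately:
print(nagle_can_send(10, 1460, unacked_data=False, nodelay=False))  # -> True
```

Without this (and without delayed ACKs on the receive side), every small write becomes its own packet and every segment gets its own ACK, which wastes both bandwidth and CPU.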
One thing that certainly affects TCP performance in zephyr is that currently Nagle algorithm and delayed acks are not implemented. There are certainly areas of improvement here.
Yes, I believe those would be some of the key factors. Is there any plan to do this in the next release, v2.5?
Is there any plan to do this in the next release, v2.5?
I have not seen any plans for enhancements in this area. Patches are welcome if you want to improve the code.
Just did a small test, and the TCP performance was really bad. I only got 2.5 kB/sec for qemu with echo-server/client. After a quick investigation with wireshark, there is a 200 ms timeout on the Linux side. I did not investigate yet why the timeout is there. This used to work much better earlier.
This used to work much better earlier.
For me, it always worked like that (with occasional very short streaks of non-slow performance). The timeout is due to the re-xmit timeout, i.e. there's some weird mis-synchronization between Zephyr and Linux, where the gradual re-xmit fallback always ends up on the "pit" side of it. (Writing from memory, didn't look into it for a while. Actually, heck, I last looked with TCP1.)
The timeout is due to re-xmit
Yes, I know. For some reason zephyr drops the packet, which leads to this 200 ms timeout (this is probably Nagle kicking in on the Linux side). TCP2 was working much better the last time I looked, dunno what has changed now.
For my quick qemu test, a packet from linux is not received by zephyr which leads to the 200ms timeout.
```
[00:05:06.800,000] <dbg> eth_e1000.e1000_tx: iow32 TDT 0x00000001
[00:05:06.800,000] <dbg> eth_e1000.e1000_tx: tx.sta: 0x01
[00:05:06.800,000] <dbg> eth_e1000.e1000_isr: ior32 ICR 0x00000043
[00:05:06.800,000] <dbg> eth_e1000.e1000_rx: rx.sta: 0x07
[00:05:06.800,000] <dbg> net_tcp.tcp_in: (rx_q[0]): ACK Seq=2221158581 Ack=1023855135 Len=0 [ESTA~
[00:05:06.800,000] <dbg> net_tcp.tcp_in: (rx_q[0]): Lowering send window from 64203 to 2730
[00:05:06.800,000] <dbg> net_tcp.tcp_in: (rx_q[0]): conn: 0x15eb90 len_acked=26
[00:05:06.800,000] <dbg> net_tcp.tcp_in: (rx_q[0]): conn: 0x15eb90 total=0, unacked_len=0, send_win=2730, mss=1460
```
We should have received an ACK here:
```
[00:05:06.800,000] <dbg> eth_e1000.e1000_isr: ior32 ICR 0x00000040
[00:05:06.800,000] <dbg> eth_e1000.e1000_rx: rx.sta: 0x07
[00:05:06.800,000] <dbg> net_tcp.tcp_in: (rx_q[0]): conn: 0x15eb90 send_data_timer=1, send_data_retries=0
[00:05:06.800,000] <dbg> net_tcp.tcp_unsent_len: (rx_q[0]): unsent_len=0
[00:05:06.800,000] <dbg> net_tcp.tcp_in: (rx_q[0]): ACK,PSH Seq=2221159117 Ack=1023855135 Len=536~
[00:05:06.800,000] <dbg> net_tcp.tcp_in: (rx_q[0]): Lowering send window from 64203 to 2730
```
We drop this ACK,PSH because the seq does not match, and currently zephyr does not support packet reordering. The next ACK msg should have arrived earlier but disappeared, so linux resends it after 200 ms:
```
[00:05:07.030,000] <dbg> eth_e1000.e1000_isr: ior32 ICR 0x00000040
[00:05:07.030,000] <dbg> eth_e1000.e1000_rx: rx.sta: 0x07
[00:05:07.030,000] <dbg> net_tcp.tcp_in: (rx_q[0]): ACK Seq=2221158581 Ack=1023855135 Len=536 [ES~
[00:05:07.030,000] <dbg> net_tcp.tcp_in: (rx_q[0]): Lowering send window from 64203 to 2730
[00:05:07.030,000] <dbg> net_tcp.tcp_out_ext: (rx_q[0]): ACK Seq=1023855135 Ack=2221159117 Len=0
[00:05:07.030,000] <dbg> net_tcp.tcp_send_process_no_lock: (rx_q[0]): ACK Seq=1023855135 Ack=2221159117 Len=0
[00:05:07.030,000] <dbg> net_tcp.tcp_send: (rx_q[0]): ACK Seq=1023855135 Ack=2221159117 Len=0
[00:05:07.030,000] <dbg> eth_e1000.e1000_tx: iow32 TDT 0x00000001
[00:05:07.030,000] <dbg> eth_e1000.e1000_tx: tx.sta: 0x01
[00:05:07.030,000] <dbg> eth_e1000.e1000_isr: ior32 ICR 0x00000043
[00:05:07.030,000] <dbg> eth_e1000.e1000_rx: rx.sta: 0x07
```
So the system worked ok, but the performance was very poor because of the lost packets. I will try this same thing on real hw to see if the situation changes.
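The drop described above comes down to a strict in-order check: a segment whose sequence number is not exactly the next expected byte is discarded instead of being queued for reassembly. A simplified sketch of that behavior, using the sequence numbers from the log (hypothetical function name, not the actual tcp_in() code):

```python
def accept_segment(seg_seq: int, seg_len: int, rcv_nxt: int):
    """In-order-only receive: accept a segment only if it starts exactly
    at the next expected sequence number (RCV.NXT); otherwise drop it and
    wait for the peer's retransmission (~200 ms RTO in the log above)."""
    if seg_seq == rcv_nxt:
        return True, rcv_nxt + seg_len   # consume data, advance RCV.NXT
    return False, rcv_nxt                # out of order -> dropped

# The ACK,PSH at Seq=2221159117 arrives while the segment starting at
# Seq=2221158581 is still missing, so the stack drops it:
print(accept_segment(2221159117, 536, rcv_nxt=2221158581))  # -> (False, 2221158581)
# Once Linux retransmits the missing segment 200 ms later, it is in
# order and RCV.NXT advances past it:
print(accept_segment(2221158581, 536, rcv_nxt=2221158581))  # -> (True, 2221159117)
```

Every such drop costs a full retransmission timeout, which is why a single lost packet per window collapses the throughput.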
With sam-e70 board, I got 75kb/sec which is not great either.
I'm seeing poor performance with TCP and UDP on both the K64F and STM32. Any plans on addressing this issue?
I'm seeing poor performance with TCP and UDP on both the K64F and STM32. Any plans on addressing this issue?
Can you please provide more info as to how you are testing and measuring, so we can reproduce?
I'm seeing poor performance with TCP and UDP on both the K64F and STM32. Any plans on addressing this issue?
Can you please provide more info as to how you are testing and measuring, so we can reproduce?
Hi Carles, here's the code I tested with: https://github.com/mdkf/ZephyrTCPSlow. I get 2 Mbps with the K64F and a hang on STM32F4 and STM32F7. I have attached a Wireshark capture of the hang. Zephyr STM32.pcapng.gz
I get 2 Mbps with the K64F and a hang on STM32F4 and STM32F7. I have attached a Wireshark capture of the hang.
Thanks!
Can you please open a separate issue for the "hang"? Regarding the throughput on K64F, I would ask @dleach02 or @danieldegrasse to take a look first see if they can identify a potential issue in the ethernet driver itself.
Also, our TCP implementation currently lacks RX window handling, which makes throughput suffer.
I think the open question is what is the expected throughput?
As @carlescufi comments, we appear to have some limitations in the TCP implementation. Is this captured anywhere in a ticket? I removed the stale label on this because I do think it is important for Zephyr to ensure that our TCP stack is robust and performant when compared to the alternative solutions.
Yes, there is an enhancement issue to implement Receive Window handling (https://github.com/zephyrproject-rtos/zephyr/issues/39132) - had to reopen it as it got closed by a wrongly linked PR.
Yes, there is an enhancement issue to implement Receive Window handling (#39132) - had to reopen it as it got closed by a wrongly linked PR.
That would help, but I don't think it explains the issue I am seeing.
Yes, there is an enhancement issue to implement Receive Window handling (#39132) - had to reopen it as it got closed by a wrongly linked PR.
@jukkar had a pull request for receive window update https://github.com/zephyrproject-rtos/zephyr/pull/32076
Yes, there is an enhancement issue to implement Receive Window handling (#39132) - had to reopen it as it got closed by a wrongly linked PR.
That would help, but I don't think it explains the issue I am seeing.
Unlimited traffic from the peer can end up with packet drop at the driver level, which leads to TCP retransmissions - effectively reducing the throughput. But that's TCP only, if you observe the impact on UDP as well it could be a driver thing.
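The cost of those retransmissions can be estimated with the classic Mathis et al. approximation, throughput ≈ (MSS/RTT) * (C/sqrt(p)), where p is the packet loss rate and C ≈ sqrt(3/2). The numbers below are illustrative assumptions, not measurements from this thread:

```python
import math

def mathis_throughput_bps(mss: int, rtt_s: float, loss_rate: float,
                          c: float = math.sqrt(3 / 2)) -> float:
    """Mathis et al. steady-state TCP throughput estimate in bits/s."""
    return (mss * 8 / rtt_s) * (c / math.sqrt(loss_rate))

# Quadrupling the loss rate halves the achievable throughput, so even a
# few percent of driver-level drops dominates everything else:
print(mathis_throughput_bps(1460, 0.01, 0.01))  # ~14 Mbit/s at 1% loss
print(mathis_throughput_bps(1460, 0.01, 0.10))  # ~4.5 Mbit/s at 10% loss
```

This is only a model for loss recovered by fast retransmit; with the in-order-only receive path each loss costs a full RTO, so the real impact is worse.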
Hi Carles, here's the code I tested with: https://github.com/mdkf/ZephyrTCPSlow. I get 2 Mbps with the K64F and a hang on STM32F4 and STM32F7. I have attached a Wireshark capture of the hang. Zephyr STM32.pcapng.gz
@mdkf Thanks, just discussed with @rlubos and he will take a look. Could you specify which exact boards you are testing this on please?
Yes, there is an enhancement issue to implement Receive Window handling (#39132) - had to reopen it as it got closed by a wrongly linked PR.
@jukkar had a pull request for receive window update #32076
@rlubos has now completed the implementation of RX window support and it has been merged: #43018
@carlescufi, @rlubos do you have any testing data that shows before/after performance with the RX window support?
@dleach02 I'm going to order a few Ethernet boards to perform tests with zperf/iperf on actual hardware. In the meantime, I've tested how the receive window affects the traffic using the `native_posix` platform, with the following results (`zperf` sample acting as a TCP server, flooding data with `iperf` from the host machine):

1. Default configuration:
```
$ iperf -l 1K -V -c 2001:db8::1 -p 5001
------------------------------------------------------------
Client connecting to 2001:db8::1, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 2001:db8::2 port 39930 connected with 2001:db8::1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec   117 KBytes  94.5 Kbits/sec
```
2. I've noticed however, that the `eth_native_posix` driver introduces an artificial delay on the RX path (https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/ethernet/eth_native_posix.c#L388) - reducing the delay already improves the throughput w/o changes in the sample configuration. That's why I'm eager to test on actual HW, where there should be no such delays, as data reception should be interrupt driven:
```
[  3] local 2001:db8::2 port 39940 connected with 2001:db8::1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec   315 KBytes   253 Kbits/sec
```
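The impact of such a fixed per-receive delay is easy to bound: a polling driver can deliver at most one poll's worth of data per delay period. The numbers below are hypothetical, just to show the shape of the limit:

```python
def polled_rx_bound_bps(bytes_per_poll: int, poll_delay_s: float) -> float:
    """Upper bound on RX throughput when the driver sleeps a fixed
    interval between polls of the host interface."""
    return bytes_per_poll * 8 / poll_delay_s

# e.g. one 1500-byte frame per 1 ms poll interval (hypothetical values):
print(polled_rx_bound_bps(1500, 0.001))  # -> 12000000.0 bits/s
```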
3. Next, I've tested with the receive window size set to the maximum possible value (note that this configuration also requires a significant increase in the available net buffers, to prevent packet drops at the driver level). This gives a significant boost in throughput:
```
[  3] local 2001:db8::2 port 39944 connected with 2001:db8::1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec  30.8 MBytes  25.7 Mbits/sec
```
4. Finally, I've tested with `CONFIG_NET_TC_RX_COUNT=0` additionally - this configuration allows skipping the RX queue between the driver and the net stack, effectively reducing the number of thread reschedules in the system. Again, this gives a significant boost on the native_posix platform:
```
[  3] local 2001:db8::2 port 39946 connected with 2001:db8::1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.04 GBytes   896 Mbits/sec
```
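Collected as a prj.conf fragment, the tweaks from points 3-4 might look roughly like this. Treat it as a sketch: the option names are the standard Zephyr networking ones, but the window size and buffer counts here are illustrative assumptions sized to the RAM budget, not the exact values used in the test:

```
# Enlarged receive window and RX buffer pools (illustrative values -
# size these to your RAM budget)
CONFIG_NET_TCP_MAX_RECV_WINDOW_SIZE=65535
CONFIG_NET_PKT_RX_COUNT=128
CONFIG_NET_BUF_RX_COUNT=256
# Bypass the RX traffic-class queue to avoid extra rescheduling
CONFIG_NET_TC_RX_COUNT=0
```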
Again, I'm eager to test on actual hardware, but I believe the above results give hope that decent numbers can be expected.
I would compare before and after adding this feature. We have started looking at this by testing 3.0 vs the tip of main, and then we started tweaking the parameters. In general, we are not seeing good performance results, so we are trying to figure out how to triage this some more.
ping @hakehuang
As the UDP performance is also poor, I think the issue lies beyond the TCP window size. On the NXP iMXRT1050, sending UDP packets as fast as possible got me 11.9 Mbps on Zephyr, while it reached 78.5 Mbps on Mbed.
I've noticed however, that the `eth_native_posix` driver introduces an artificial delay on the RX path (main/drivers/ethernet/eth_native_posix.c#L388) - reducing the delay already improves the throughput w/o changes in the sample configuration. That's why I'm eager to test on actual HW, where there should be no such delays, as data reception should be interrupt driven:
This is weird to have a delay like that or that big.
As the UDP performance is also poor, I think the issue lies beyond the TCP window size. On the NXP iMXRT1050, sending UDP packets as fast as possible got me 11.9 Mbps on Zephyr, while it reached 78.5 Mbps on Mbed.
@mdkf what test do you use to gather these numbers?
@mdkf what test do you use to gather these numbers?
I uploaded my test code here: https://github.com/mdkf/ZephyrTCPSlow. Swap lines 107 and 108 to test with UDP instead.
I continue playing with the Zephyr network stack and STM32, and I unfortunately found the next issue. With the `nucleo_f429zi` board and the `big_http_download` sample I got a very slow download speed. This pushed me to check network performance with `zperf`. For UDP transfers I got around 10 Mbps, but for TCP the result was only 10 kbps, which is really bad.

I tried whether some older version of Zephyr behaves better - fortunately v2.0.0 also got me around 10 Mbps for TCP in `zperf`. With bisecting I found that this issue starts with d88f25bd763e2dfa70873b3c2321f2f8677d643d.

I hoped that reverting it would also solve the slow `big_http_download`, but surprisingly the download speed is still suspiciously low. I will continue to investigate this tomorrow.

I do not know if these issues are related just to the STM32 platform. I have just the mentioned `nucleo_f429zi` and a custom board with an STM32F750, which has a slightly different Ethernet peripheral; my driver is also written using HAL. Both behave the same way.

The issues I have met so far with the Zephyr networking stack pose the question to me whether it is mature enough for production.