Closed — xhpohanka closed this issue 2 years ago
Indeed, there is a regression in TCP throughput. The problem with TCP is that we currently have no proper tests that would catch this kind of regression. This has been one of the reasons we have been building a new version of the TCP stack that supports proper testing and can be integrated into sanitycheck. TCP2 has been cooking for quite a long time and we are slowly getting to the point where it will be useful, but we are not there yet. Obviously 10 kB/sec is not good and this needs to be fixed.
I just compiled samples/net/sockets/dumb_http_server_mt for the Atmel sam-e70 board and got the following numbers:

```
$ ab -n100 http://192.0.2.1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.0.2.1 (be patient).....done

Server Software:
Server Hostname:        192.0.2.1
Server Port:            8080

Document Path:          /
Document Length:        2084 bytes

Concurrency Level:      1
Time taken for tests:   0.381 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      214000 bytes
HTML transferred:       208400 bytes
Requests per second:    262.67 [#/sec] (mean)
Time per request:       3.807 [ms] (mean)
Time per request:       3.807 [ms] (mean, across all concurrent requests)
Transfer rate:          548.95 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     1    3  20.7      1     208
Waiting:        1    3  20.7      1     208
Total:          1    4  20.6      2     208

Percentage of the requests served within a certain time (ms)
  50%      2
  66%      2
  75%      2
  80%      2
  90%      2
  95%      2
  98%      3
  99%    208
 100%    208 (longest request)
```
So definitely more than 10 kB/sec. Then using wrk:

```
$ ./wrk -d 20 -t 24 -c 500 --latency http://192.0.2.1:8080
Running 20s test @ http://192.0.2.1:8080
  24 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.74ms   62.79ms 616.12ms   97.41%
    Req/Sec     1.74      3.83    20.00     89.80%
  Latency Distribution
     50%    1.29ms
     75%    1.35ms
     90%    4.46ms
     99%  207.79ms
  116 requests in 20.10s, 242.42KB read
  Socket errors: connect 0, read 134, write 0, timeout 0
Requests/sec:      5.77
Transfer/sec:     12.06KB
```
Now the transfer speed is very poor.
I then increased `CONFIG_NET_TCP_BACKLOG_SIZE=2` and got considerably better results:

```
$ ./wrk -d 10 -t 2 -c 100 --latency http://192.0.2.1:8080
Running 10s test @ http://192.0.2.1:8080
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.61ms  421.92us   9.90ms   95.03%
    Req/Sec   375.32    153.56   610.00     72.80%
  Latency Distribution
     50%    1.41ms
     75%    1.88ms
     90%    1.98ms
     99%    2.17ms
  4768 requests in 10.01s, 9.73MB read
  Socket errors: connect 0, read 4774, write 0, timeout 0
Requests/sec:    476.10
Transfer/sec:      0.97MB
```
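For anyone wanting to reproduce the comparison, the option tweaked above lives in the sample's prj.conf. A minimal fragment (the value is the one used in the faster run above; the slower runs used a backlog of 1):

```
# Queue more than one incoming connection while the server thread
# is busy accepting - value from the run above.
CONFIG_NET_TCP_BACKLOG_SIZE=2
```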
I see this behavior on STM32F429ZI. With `CONFIG_NET_TCP_BACKLOG_SIZE=1`:

```
$ wrk -d 20 -t 24 -c 500 --latency http://192.0.2.1:8080
Running 20s test @ http://192.0.2.1:8080
  24 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.51ms    1.79ms   8.77ms   89.86%
    Req/Sec     1.86      3.43    10.00     86.96%
  Latency Distribution
     50%    1.93ms
     75%    1.95ms
     90%    6.70ms
     99%    8.77ms
  69 requests in 20.10s, 144.20KB read
  Socket errors: connect 0, read 69, write 0, timeout 0
Requests/sec:      3.43
Transfer/sec:      7.17KB
```
With `CONFIG_NET_TCP_BACKLOG_SIZE=2`:

```
$ wrk -d 20 -t 24 -c 500 --latency http://192.0.2.1:8080
Running 20s test @ http://192.0.2.1:8080
  24 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   196.18ms   36.93ms 407.71ms   96.32%
    Req/Sec     8.92      4.18    35.00     85.63%
  Latency Distribution
     50%  202.17ms
     75%  202.23ms
     90%  202.32ms
     99%  209.31ms
  896 requests in 20.09s, 1.84MB read
  Socket errors: connect 0, read 898, write 0, timeout 0
Requests/sec:     44.60
Transfer/sec:     93.92KB
```
I do not understand why the latency is so high.
The previous result was for v2.2.0-rc3; for 2.1.0-rc3 it is even worse...
With `CONFIG_NET_TCP_BACKLOG_SIZE=2`:

```
$ wrk -d 20 -t 24 -c 500 --latency http://192.0.2.1:8080
Running 20s test @ http://10.42.0.192:8080
  24 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   168.84ms   74.40ms 220.12ms   82.07%
    Req/Sec     5.78     10.02   100.00     98.40%
  Latency Distribution
     50%  202.34ms
     75%  202.38ms
     90%  202.58ms
     99%  218.44ms
  145 requests in 20.03s, 305.86KB read
  Socket errors: connect 0, read 789, write 0, timeout 0
Requests/sec:      7.24
Transfer/sec:     15.27KB
```
It also seems to me that the issue with the big_http_download sample is a bit different. The download is very slow for me regardless of CONFIG_NET_TCP_BACKLOG_SIZE or reverting commit d88f25bd763e2dfa70873b3c2321f2f8677d643d. Downloading a 52 kB file takes 8.5 seconds. You can check my wireshark log https://drive.google.com/open?id=163-v3MlK3Hgc4F47X05GSSJRo3EFpX-s; it is full of [TCP Window Full] packets.
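The "TCP Window Full" packets suggest the receiver's advertised window is the bottleneck: the sender can have at most one window of data in flight per round trip, so throughput is bounded by roughly window/RTT. A back-of-the-envelope check (the window and stall values here are illustrative assumptions, not measured):

```python
def max_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on TCP throughput when the receive window is the
    limiting factor: one full window of data per round trip."""
    return window_bytes * 8 / rtt_s

# With a small advertised window drained once per ~200 ms stall (as in
# the capture), the bound lands in the tens of kbit/s:
print(max_throughput_bps(1280, 0.2))  # -> 51200.0 bits/s
```

This is consistent with a 52 kB file taking seconds rather than milliseconds once the window stays full.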
I also checked zperf, again on v2.2.0-rc3, and this is what I get with upstream code:

```
$ iperf -c 192.0.2.1
------------------------------------------------------------
Client connecting to 192.0.2.1, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 192.0.2.2 port 51802 connected with 192.0.2.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec  13.6 KBytes  10.9 Kbits/sec
```
and this with d88f25bd763e2dfa70873b3c2321f2f8677d643d reverted (backlog size has no impact here):

```
$ iperf -c 192.0.2.1
------------------------------------------------------------
Client connecting to 192.0.2.1, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 192.0.2.2 port 51778 connected with 192.0.2.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec  16.1 MBytes  13.4 Mbits/sec
```
I'm very interested in getting this to work better. I can do more tests and even development if I get some guidance...
Some fixes for the memory leaks are in #23334. #23246 is related to this issue. The low performance numbers are probably the result of a lot of packets being dropped. Anyway, the memory leaks need to be fixed first, and then we can see how the performance numbers behave.
For UDP transfers I got around 10Mbps but for TCP the result was only 10kbps which is really bad.
Indeed, there is a regression with the TCP throughput.
I personally never saw other TCP speed figures from Zephyr with frdm_k64f/any of the qemu networking drivers. (Well, perhaps I saw 15 KBytes/s, but you get the point.) I actually wanted to add dumping of the download speed to big_http_download, but decided against that ;-).
I'm currently working on switching big_http_download over to localhost downloads (yep, to put it into CI and not rely on external hosts, which would add extra disturbance to the testing).
And I decided to share what I see with just an asciinema recording: https://asciinema.org/a/BXbcuYTQsrsPxDmdCnbfUz7Qz (as I'm not sure how soon they garbage-collect uploads, also attaching here big_http_download-localhost.zip ).
So, as you can see, it starts slow. Then at ~70KB (~00:19 on the cast) it finally starts to work "as it should", then at ~140KB (~00:21) it breaks again. You've already read my explanation of what happens (based on hours of peering into wireshark output, but months ago): it starts mis-synchronized with Linux TCP, sending rexmits and bumping into a large backoff delay, then manages to synchronize with Linux (this is literally the first time I see it), then loses sync again and goes back to the swamp for the rest of the download.
(Which makes me think: maybe add a calculation of "momentary" speed over e.g. the last second or two, and print each value on a separate line, not with "\r".)
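The "momentary speed" idea could be sketched like this (a hypothetical helper, not part of the sample, tracking bytes received over a short sliding window):

```python
import collections
import time

class MomentarySpeed:
    """Track download speed over a sliding window of the last few seconds."""

    def __init__(self, window_s: float = 2.0):
        self.window_s = window_s
        self.samples = collections.deque()  # (timestamp, nbytes) pairs

    def add(self, nbytes, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, nbytes))
        # Drop samples that have fallen out of the window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def bytes_per_sec(self) -> float:
        if not self.samples:
            return 0.0
        return sum(n for _, n in self.samples) / self.window_s

m = MomentarySpeed(window_s=2.0)
m.add(1000, now=0.0)
m.add(1000, now=1.0)
m.add(1000, now=2.5)      # the t=0.0 sample falls out of the window here
print(m.bytes_per_sec())  # -> 1000.0
```

Printing that once per second on its own line would make the slow/fast streaks in the asciinema cast visible at a glance.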
I tested this with nucleo_f767zi and see wildly different numbers depending on the tool used. ApacheBench gave around 470 kB/sec, which is quite reasonable. The wrk tool on the other hand pushes so much traffic to zephyr that it easily runs out of memory and starts to drop packets. This then affects the performance numbers a lot. I am lowering the priority of this one until we figure out what the correct numbers are.
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.
Recently, Zephyr switched to the newer "TCP2" implementation by default. I'm doing its triaging for the 2.4 release (#27876, #27982), and also would like to record some quick performance data:

dumb_http_server sample, qemu_x86, standard SLIP-based QEMU networking setup, `ab -n1000 http://192.0.2.1:8080/`: 60.78 req/s.
For reference, with TCP1 (`CONFIG_NET_TCP1=y`), it's 4.18 req/s.
So, there's a noticeable improvement in connection latency.
But not so bright with data transfer throughput. The sample is big_http_download, the rest is as above.
First, reference with CONFIG_NET_TCP1=y: 3108 b/s. That's pretty slow. I remember speeds around 10KB/s. Dunno what's up now, my inet connection is not ideal either.
So, with default TCP2, it's 6154 b/s (mind #27982, the transfer doesn't complete successfully, implied end time was used to calc speed).
For reference, trying @nzmichaelh's hack from https://github.com/zephyrproject-rtos/zephyr/issues/26330#issuecomment-656194045 (set NET_TCP_BUF_MAX_LEN to (4*1280)) didn't make a statistical difference to me (on qemu_x86 with SLIP, again).
Has anyone looked at the CPU usage when pushing the UDP throughput to 10 Mbps? I am able to go up to 8.5 Mbps with a modified echo_server sample but see ~70% CPU usage. Any tips on how to reduce that? I've modified the echo server to receive a message and then send a dummy packet of ~1200 bytes.
11075 b/s with big_http_download on frdm_k64f.
@pfalcon, first, thank you for reporting these metrics. For reference though, when you report measured results, do you have an expected value in mind that we can compare against?
@jukkar could you please settle on the benchmarks that we use to measure progress in this issue?
@jukkar I'm wondering whether this issue will be investigated further or not. In my test, TCP2 performance is still poor compared to FreeRTOS on the same hardware platform.
@jhqian my board is the i.mx rt1060 evb, do you have any suggestions? Have you done any TCP performance testing before?
I'm wondering whether this issue will be investigated further or not. In my test, TCP2 performance is still poor compared to FreeRTOS on the same hardware platform.
No major investigation is going on atm for the performance issue. One thing we are missing is a consistent and documented test environment i.e., how to test TCP performance, what tools to use etc. It is difficult to get consistent results as people are using different boards and software.
One thing that certainly affects TCP performance in zephyr is that currently Nagle algorithm and delayed acks are not implemented. There are certainly areas of improvement here.
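As a rough illustration of what the missing Nagle algorithm does on the sender side (a simplified decision sketch, not the Zephyr or Linux implementation):

```python
def nagle_can_send(segment_len: int, mss: int, unacked_data: bool,
                   nodelay: bool) -> bool:
    """Simplified Nagle decision: a full-sized segment always goes out;
    a small segment is sent immediately only when there is no
    unacknowledged data in flight (or TCP_NODELAY disables Nagle)."""
    if nodelay or segment_len >= mss:
        return True
    return not unacked_data

# A 10-byte write with data still in flight gets coalesced, not sent:
print(nagle_can_send(10, 1460, unacked_data=True, nodelay=False))   # -> False
# With nothing outstanding, the small segment goes out immediately:
print(nagle_can_send(10, 1460, unacked_data=False, nodelay=False))  # -> True
```

Without this (and without delayed ACKs on the receive side), every small write becomes its own packet and every segment gets its own ACK, which wastes both bandwidth and CPU.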
One thing that certainly affects TCP performance in zephyr is that currently Nagle algorithm and delayed acks are not implemented. There are certainly areas of improvement here.
Yes, I believe those would be some of the key factors. Is there any plan to do this in the next release, v2.5?
Is there any plan to do this in the next release, v2.5?
I have not seen any plans for enhancements in this area. Patches are welcome if you want to improve the code.
Just did a small test, and the TCP performance was really bad. I only got 2.5 kB/sec for qemu with echo-server/client. After a quick investigation with wireshark, there is a 200 ms timeout on the Linux side. I did not investigate yet why the timeout is there. This used to work much better earlier.
This used to work much better earlier.
For me, it always worked like that (with occasional very short streaks of non-slow performance). The timeout is due to the re-xmit timeout, i.e. there's some weird mis-synchronization between Zephyr and Linux, where the gradual re-xmit fallback always ends up on the "pit" side of it. (Writing from memory, didn't look into it for a while. Actually, heck, I last looked with TCP1.)
The timeout is due to re-xmit
Yes, I know. For some reason zephyr drops the packet, which leads to this 200 ms timeout (this is probably Nagle kicking in on the Linux side). TCP2 was working much better the last time I looked, dunno what has changed now.
For my quick qemu test, a packet from linux is not received by zephyr which leads to the 200ms timeout.
```
[00:05:06.800,000] <dbg> eth_e1000.e1000_tx: iow32 TDT 0x00000001
[00:05:06.800,000] <dbg> eth_e1000.e1000_tx: tx.sta: 0x01
[00:05:06.800,000] <dbg> eth_e1000.e1000_isr: ior32 ICR 0x00000043
[00:05:06.800,000] <dbg> eth_e1000.e1000_rx: rx.sta: 0x07
[00:05:06.800,000] <dbg> net_tcp.tcp_in: (rx_q[0]): ACK Seq=2221158581 Ack=1023855135 Len=0 [ESTA~
[00:05:06.800,000] <dbg> net_tcp.tcp_in: (rx_q[0]): Lowering send window from 64203 to 2730
[00:05:06.800,000] <dbg> net_tcp.tcp_in: (rx_q[0]): conn: 0x15eb90 len_acked=26
[00:05:06.800,000] <dbg> net_tcp.tcp_in: (rx_q[0]): conn: 0x15eb90 total=0, unacked_len=0, send_win=2730, mss=1460
```
We should have received an ACK here:
```
[00:05:06.800,000] <dbg> eth_e1000.e1000_isr: ior32 ICR 0x00000040
[00:05:06.800,000] <dbg> eth_e1000.e1000_rx: rx.sta: 0x07
[00:05:06.800,000] <dbg> net_tcp.tcp_in: (rx_q[0]): conn: 0x15eb90 send_data_timer=1, send_data_retries=0
[00:05:06.800,000] <dbg> net_tcp.tcp_unsent_len: (rx_q[0]): unsent_len=0
[00:05:06.800,000] <dbg> net_tcp.tcp_in: (rx_q[0]): ACK,PSH Seq=2221159117 Ack=1023855135 Len=536~
[00:05:06.800,000] <dbg> net_tcp.tcp_in: (rx_q[0]): Lowering send window from 64203 to 2730
```
We drop this ACK,PSH because the seq does not match, and currently zephyr does not support packet reordering. The next ACK msg should have arrived earlier but disappeared, so linux resends it after 200 ms:
```
[00:05:07.030,000] <dbg> eth_e1000.e1000_isr: ior32 ICR 0x00000040
[00:05:07.030,000] <dbg> eth_e1000.e1000_rx: rx.sta: 0x07
[00:05:07.030,000] <dbg> net_tcp.tcp_in: (rx_q[0]): ACK Seq=2221158581 Ack=1023855135 Len=536 [ES~
[00:05:07.030,000] <dbg> net_tcp.tcp_in: (rx_q[0]): Lowering send window from 64203 to 2730
[00:05:07.030,000] <dbg> net_tcp.tcp_out_ext: (rx_q[0]): ACK Seq=1023855135 Ack=2221159117 Len=0
[00:05:07.030,000] <dbg> net_tcp.tcp_send_process_no_lock: (rx_q[0]): ACK Seq=1023855135 Ack=2221159117 Len=0
[00:05:07.030,000] <dbg> net_tcp.tcp_send: (rx_q[0]): ACK Seq=1023855135 Ack=2221159117 Len=0
[00:05:07.030,000] <dbg> eth_e1000.e1000_tx: iow32 TDT 0x00000001
[00:05:07.030,000] <dbg> eth_e1000.e1000_tx: tx.sta: 0x01
[00:05:07.030,000] <dbg> eth_e1000.e1000_isr: ior32 ICR 0x00000043
[00:05:07.030,000] <dbg> eth_e1000.e1000_rx: rx.sta: 0x07
```
So the system worked ok, but the performance was very poor because of the lost packets. I will try this same thing on real hw to see if the situation changes.
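The drop described above comes down to a strict in-order check: a segment whose sequence number is not exactly the next expected byte is discarded instead of being queued for reassembly. A simplified sketch of that behavior, using the sequence numbers from the log (hypothetical function name, not the actual tcp_in() code):

```python
def accept_segment(seg_seq: int, seg_len: int, rcv_nxt: int):
    """In-order-only receive: accept a segment only if it starts exactly
    at the next expected sequence number (RCV.NXT); otherwise drop it and
    wait for the peer's retransmission (~200 ms RTO in the log above)."""
    if seg_seq == rcv_nxt:
        return True, rcv_nxt + seg_len   # consume data, advance RCV.NXT
    return False, rcv_nxt                # out of order -> dropped

# The ACK,PSH at Seq=2221159117 arrives while the segment starting at
# Seq=2221158581 is still missing, so the stack drops it:
print(accept_segment(2221159117, 536, rcv_nxt=2221158581))  # -> (False, 2221158581)
# Once Linux retransmits the missing segment 200 ms later, it is in
# order and RCV.NXT advances past it:
print(accept_segment(2221158581, 536, rcv_nxt=2221158581))  # -> (True, 2221159117)
```

Every such drop costs a full retransmission timeout, which is why a single lost packet per window collapses the throughput.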
With sam-e70 board, I got 75kb/sec which is not great either.
I'm seeing poor performance with TCP and UDP on both the K64F and STM32. Any plans on addressing this issue?
I'm seeing poor performance with TCP and UDP on both the K64F and STM32. Any plans on addressing this issue?
Can you please provide more info as to how you are testing and measuring, so we can reproduce?
I'm seeing poor performance with TCP and UDP on both the K64F and STM32. Any plans on addressing this issue?
Can you please provide more info as to how you are testing and measuring, so we can reproduce?
Hi Carles, here's the code I tested with: https://github.com/mdkf/ZephyrTCPSlow. I get 2 Mbps with the K64F and a hang on STM32F4 and STM32F7. I have attached a Wireshark capture of the hang. Zephyr STM32.pcapng.gz
I get 2 Mbps with the K64F and a hang on STM32F4 and STM32F7. I have attached a Wireshark capture of the hang.
Thanks!
Can you please open a separate issue for the "hang"? Regarding the throughput on K64F, I would ask @dleach02 or @danieldegrasse to take a look first see if they can identify a potential issue in the ethernet driver itself.
Also, our TCP implementation currently lacks RX window handling, which makes throughput suffer.
I think the open question is what is the expected throughput?
As @carlescufi comments, we appear to have some limitations in the TCP implementation. Is this captured anywhere in a ticket? I removed the stale label on this because I do think it is important for Zephyr to ensure that our TCP stack is robust and performant when compared to the alternative solutions.
Yes, there is an enhancement issue to implement Receive Window handling (https://github.com/zephyrproject-rtos/zephyr/issues/39132) - had to reopen it as it got closed by a wrongly linked PR.
Yes, there is an enhancement issue to implement Receive Window handling (#39132) - had to reopen it as it got closed by a wrongly linked PR.
That would help, but I don't think it explains the issue I am seeing.
Yes, there is an enhancement issue to implement Receive Window handling (#39132) - had to reopen it as it got closed by a wrongly linked PR.
@jukkar had a pull request for receive window update https://github.com/zephyrproject-rtos/zephyr/pull/32076
Yes, there is an enhancement issue to implement Receive Window handling (#39132) - had to reopen it as it got closed by a wrongly linked PR.
That would help, but I don't think it explains the issue I am seeing.
Unlimited traffic from the peer can end up with packet drop at the driver level, which leads to TCP retransmissions - effectively reducing the throughput. But that's TCP only, if you observe the impact on UDP as well it could be a driver thing.
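The cost of those retransmissions can be estimated with the classic Mathis et al. approximation, throughput ≈ (MSS/RTT) * (C/sqrt(p)), where p is the packet loss rate and C ≈ sqrt(3/2). The numbers below are illustrative assumptions, not measurements from this thread:

```python
import math

def mathis_throughput_bps(mss: int, rtt_s: float, loss_rate: float,
                          c: float = math.sqrt(3 / 2)) -> float:
    """Mathis et al. steady-state TCP throughput estimate in bits/s."""
    return (mss * 8 / rtt_s) * (c / math.sqrt(loss_rate))

# Quadrupling the loss rate halves the achievable throughput, so even a
# few percent of driver-level drops dominates everything else:
print(mathis_throughput_bps(1460, 0.01, 0.01))  # ~14 Mbit/s at 1% loss
print(mathis_throughput_bps(1460, 0.01, 0.10))  # ~4.5 Mbit/s at 10% loss
```

This is only a model for loss recovered by fast retransmit; with the in-order-only receive path each loss costs a full RTO, so the real impact is worse.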
Hi Carles, here's the code I tested with: https://github.com/mdkf/ZephyrTCPSlow. I get 2 Mbps with the K64F and a hang on STM32F4 and STM32F7. I have attached a Wireshark capture of the hang. Zephyr STM32.pcapng.gz
@mdkf Thanks, just discussed with @rlubos and he will take a look. Could you specify which exact boards you are testing this on please?
Yes, there is an enhancement issue to implement Receive Window handling (#39132) - had to reopen it as it got closed by a wrongly linked PR.
@jukkar had a pull request for receive window update #32076
@rlubos has now completed the implementation of RX window support and it has been merged: #43018
@carlescufi, @rlubos do you have any testing data that shows before/after performance with the RX window support?
@dleach02 I'm going to order a few Ethernet boards to perform tests with zperf/iperf on actual hardware. In the meantime, I've tested how the receive window affects the traffic using the `native_posix` platform, with the following results (`zperf` sample acting as a TCP server, flooding data with `iperf` from the host machine):

1. Default configuration:
```
$ iperf -l 1K -V -c 2001:db8::1 -p 5001
------------------------------------------------------------
Client connecting to 2001:db8::1, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 2001:db8::2 port 39930 connected with 2001:db8::1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec   117 KBytes  94.5 Kbits/sec
```
2. I've noticed however, that the `eth_native_posix` driver introduces an artificial delay on the RX path (https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/ethernet/eth_native_posix.c#L388) - reducing the delay already improves the throughput w/o changes in the sample configuration. That's why I'm eager to test on actual HW, where there should be no such delays, as data reception should be interrupt driven:
```
[  3] local 2001:db8::2 port 39940 connected with 2001:db8::1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec   315 KBytes   253 Kbits/sec
```
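The impact of such a fixed per-receive delay is easy to bound: a polling driver can deliver at most one poll's worth of data per delay period. The numbers below are hypothetical, just to show the shape of the limit:

```python
def polled_rx_bound_bps(bytes_per_poll: int, poll_delay_s: float) -> float:
    """Upper bound on RX throughput when the driver sleeps a fixed
    interval between polls of the host interface."""
    return bytes_per_poll * 8 / poll_delay_s

# e.g. one 1500-byte frame per 1 ms poll interval (hypothetical values):
print(polled_rx_bound_bps(1500, 0.001))  # -> 12000000.0 bits/s
```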
3. Next, I've tested with the receive window size set to the maximum possible value (note that this configuration also requires a significant increase in the available net buffers, to prevent packet drops at the driver level). This gives a significant boost in throughput:
```
[  3] local 2001:db8::2 port 39944 connected with 2001:db8::1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec  30.8 MBytes  25.7 Mbits/sec
```
4. Finally, I've tested with `CONFIG_NET_TC_RX_COUNT=0` additionally - this configuration allows skipping the RX queue between the driver and the net stack, effectively reducing the number of thread reschedules in the system. Again, this gives a significant boost on the native_posix platform:
```
[  3] local 2001:db8::2 port 39946 connected with 2001:db8::1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.04 GBytes   896 Mbits/sec
```
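Collected as a prj.conf fragment, the tweaks from points 3-4 might look roughly like this. Treat it as a sketch: the option names are the standard Zephyr networking ones, but the window size and buffer counts here are illustrative assumptions sized to the RAM budget, not the exact values used in the test:

```
# Enlarged receive window and RX buffer pools (illustrative values -
# size these to your RAM budget)
CONFIG_NET_TCP_MAX_RECV_WINDOW_SIZE=65535
CONFIG_NET_PKT_RX_COUNT=128
CONFIG_NET_BUF_RX_COUNT=256
# Bypass the RX traffic-class queue to avoid extra rescheduling
CONFIG_NET_TC_RX_COUNT=0
```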
Again, I'm eager to test on actual hardware, but I believe the above results give hope that decent numbers can be expected.
I would compare before and after adding this feature. We have started looking at this by testing 3.0 vs the tip of main, and then we started tweaking the parameters. In general, we are not seeing good performance results, so we are trying to figure out how to triage this some more.
ping @hakehuang
As the UDP performance is also poor, I think the issue lies beyond the TCP window size. On the NXP iMXRT1050, sending UDP packets as fast as possible got me 11.9 Mbps on Zephyr, while it reached 78.5 Mbps on Mbed.
I've noticed however, that the `eth_native_posix` driver introduces an artificial delay on the RX path (main/drivers/ethernet/eth_native_posix.c#L388) - reducing the delay already improves the throughput w/o changes in the sample configuration. That's why I'm eager to test on actual HW, where there should be no such delays, as data reception should be interrupt driven:
This is weird to have a delay like that or that big.
As the UDP performance is also poor, I think the issue lies beyond the TCP window size. On the NXP iMXRT1050, sending UDP packets as fast as possible got me 11.9 Mbps on Zephyr, while it reached 78.5 Mbps on Mbed.
@mdkf what test do you use to gather these numbers?
@mdkf what test do you use to gather these numbers?
I uploaded my test code here: https://github.com/mdkf/ZephyrTCPSlow. Swap lines 107 and 108 to test with UDP instead.
I continue playing with the Zephyr network stack and STM32, and I unfortunately found the next issue. With the `nucleo_f429zi` board and the `big_http_download` sample I got a very slow download speed. This pushed me to check network performance with `zperf`. For UDP transfers I got around 10 Mbps, but for TCP the result was only 10 kbps, which is really bad.

I tried whether some older version of Zephyr behaves better - fortunately v2.0.0 also got me around 10 Mbps for TCP in `zperf`. With bisecting I found that this issue starts with d88f25bd763e2dfa70873b3c2321f2f8677d643d.

I hoped that reverting it would also solve the slow `big_http_download`, but surprisingly the download speed is still suspiciously low. I will continue to investigate this tomorrow.

I do not know if these issues are related just to the STM32 platform. I have just the mentioned `nucleo_f429zi` and a custom board with an STM32F750, which has a slightly different Ethernet peripheral; my driver is also written using HAL. Both behave the same way.

The issues I have met so far with the Zephyr networking stack pose the question to me whether it is mature enough for production.