@hakehuang What test code did you use on the Zephyr side, zperf? Note, there's currently a bug, fixed in https://github.com/zephyrproject-rtos/zephyr/pull/43379, without it you won't get decent results (the recv window gets filled and the communication stalls).
Also, please note that increasing the window size means that you also need to increase the RX pkt/buf count in the system, otherwise the effort is futile (the net driver will start dropping packets, forcing retransmissions from the server, which has a great negative impact on the throughput). Increasing the TX pkt/buf count a bit also makes sense, as we need to acknowledge each TCP packet received (I wonder if there's room for improvement here; in theory it should be enough to acknowledge once with a larger ACK value).
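For illustration, a minimal prj.conf sketch of this kind of tuning (the counts below are placeholders for illustration, not tuned recommendations):

```
# Illustrative prj.conf fragment: enlarge the packet/buffer pools so that a
# larger TCP receive window is actually backed by real buffers.
CONFIG_NET_PKT_RX_COUNT=32
CONFIG_NET_BUF_RX_COUNT=32
# A few extra TX buffers so ACKs for received segments can always be sent.
CONFIG_NET_PKT_TX_COUNT=16
CONFIG_NET_BUF_TX_COUNT=16
```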
Thanks @rlubos, I will try the latest code, which was merged 2 hours ago. And if I set the WIN_SIZE to 0, it will automatically choose the right WINSIZE, am I right?
If the window size is set to 0, the actual size will be chosen automatically, based on the number of available buffers. The default value is pretty low though (~1k bytes), so I recommend increasing the buffer count anyway.
I've noticed however, that the ethernet_native_posix driver introduces an artificial delay on the RX path (https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/ethernet/eth_native_posix.c#L388) - reducing the delay already improves the throughput without changes in the sample configuration.
I had to add an artificial delay to the native_posix Ethernet driver because if there was none, the application just started spinning and refused to do any work. Perhaps the delay is no longer needed, or it could be made smaller; in any case I would say native_posix is not a good platform for any performance-related statistics.
Perhaps the delay is no longer needed, or it could be made smaller
From what I've experienced it can not be removed entirely, as the application then behaved exactly as you described (completely unresponsive).
@rlubos I updated to the latest code base and added RX_COUNT as below: https://github.com/hakehuang/zephyr/tree/zperf_qemu. However the zperf test result is still poor.
```
ubuntu@ubuntu-Latitude-E6420:~/code/iperf2-code$ sudo src/iperf -l 1K -c 192.0.2.1 -p 5001
Client connecting to 192.0.2.1, TCP port 5001
TCP window size: 16.0 KByte (default)
[  1] local 192.0.2.2 port 56078 connected with 192.0.2.1 port 5001 (icwnd/mss/irtt=14/1460/6735)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-20.59 sec 50.0 KBytes  19.9 Kbits/sec
```
@hakehuang e1000 tends to drop packets in case of large transfers (I tested this in the past and it seems that we don't even get an IRQ for the packet, so this could be a qemu thing). I don't have enough knowledge of qemu to investigate the problem efficiently. The thing is that this leads to multiple retransmissions and overall poor performance. The same configuration running SLIP gives considerably better results:
$ iperf -l 1K -V -c 2001:db8::1 -p 5001
------------------------------------------------------------
Client connecting to 2001:db8::1, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 2001:db8::2 port 42156 connected with 2001:db8::1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.1 sec 681 KBytes 551 Kbits/sec
I couldn't get better than 1 Mbps though with qemu + SLIP, no matter how large the RX window is; I'm guessing SLIP could be the bottleneck here.
@rlubos so do we have a golden platform for zperf?
I'm not aware of any "golden" platform for tests; I think we should aim for decent performance on any (non-emulated) Ethernet board. I've ordered a few on my own to perform some measurements.
Without a golden platform, we will have to take driver impacts into account, which may not be a good thing.
At Linaro, we standardized on frdm_k64f as the default networking platform from Zephyr's start. My records of Zephyr's networking testing against it (and qemu, which is everyone's favorite platform) are at https://docs.google.com/spreadsheets/d/1_8CsACPEXqrMIbxBKxPAds091tNAwnwdWkMKr3994QY/edit#gid=0 (it's pretty spotty, as maintaining such a spreadsheet is my personal initiative, so it was done on a best-effort basis). When I have a chance, I'll test current Zephyr using the testcases in the spreadsheet (I'm working on other projects now).
Why is there a sleep within the send loop (https://github.com/mdkf/ZephyrTCPSlow/blob/main/src/socketTest.c#L159)? A 2 ms sleep between each datagram sent doesn't sound like the best idea for a throughput-measuring test.
I inserted the sleep when testing the F746ZG. It died otherwise. It was not included in the other measurements.
Additionally, a 90-byte payload per datagram seems pretty small for throughput measurement; did you send such small packets with mbed as well?
I used the same test in Mbed and FreeRTOS. I also tried with a 1400-byte packet; it did not improve the performance in Zephyr. After the first 90-byte packet was sent, Wireshark shows the subsequent packets sent were closer to 1480 bytes. Basically the 90-byte packets accumulated and were sent together.
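For reference, a minimal sketch of what a throughput-oriented send loop could look like (hypothetical code, not the test from the linked repository): payloads close to the MTU are sent back to back, without a per-datagram sleep.

```c
/* Hypothetical bulk-send loop for throughput measurements: MTU-sized
 * chunks, no artificial sleep between sends.
 */
#include <errno.h>
#include <string.h>
#include <zephyr/net/socket.h>

#define PAYLOAD_SIZE 1400            /* close to Ethernet MTU minus IP/TCP headers */
#define TOTAL_BYTES  (1024 * 1024)   /* amount of data to push per run */

static char payload[PAYLOAD_SIZE];

int send_bulk(int sock)
{
	size_t sent = 0;

	memset(payload, 'x', sizeof(payload));

	while (sent < TOTAL_BYTES) {
		ssize_t ret = zsock_send(sock, payload, sizeof(payload), 0);

		if (ret < 0) {
			return -errno;   /* let the caller decide how to handle it */
		}
		sent += ret;
	}

	return 0;
}
```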
Increasing the TX pkt/buf count a bit also makes sense, as we need to acknowledge each TCP packet received (I wonder if there's room for improvement here; in theory it should be enough to acknowledge once with a larger ACK value).
@rlubos Great that you mentioned the ACK issue. Yes, there is room for improvement in terms of ACK-ing multiple packets with a single ACK, there is even a (somewhat stale) issue for that: https://github.com/zephyrproject-rtos/zephyr/issues/30366
@jukkar Probably wrong thread to ask, but is there a reason behind the magic constant 3? https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/ip/tcp.c#L1801
I'm asking because I observed that the max window size is quite critical. To avoid data buffer allocation errors, I actually had to change it to 4, or define a custom CONFIG_NET_TCP_MAX_SEND_WINDOW_SIZE.
Sorry if I'm missing something here.
Probably wrong thread to ask, but is there a reason behind the magic constant 3?
No specific reason, just a somewhat reasonable value when the code was written. If you find value 4 more suitable, please send a PR that changes it.
I don't think it makes sense to open a PR with just another magic constant; at the very least it should be a Kconfig flag.
It worked for me, yes, but I'd like to understand the meaning behind it. Ideally this coefficient should be calculated.
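As an illustration of the Kconfig-flag idea, a hypothetical option (the symbol name, prompt and help text below are invented, not existing Zephyr code) could look like:

```
# Hypothetical Kconfig entry replacing the hard-coded divisor used when
# deriving the maximum TCP send window from the TX buffer pool size.
config NET_TCP_SEND_WINDOW_DIVISOR
	int "Divisor applied to the TX buffer pool when sizing the send window"
	default 3
	help
	  The maximum TCP send window is derived from the total TX buffer
	  space divided by this value. A larger divisor leaves more headroom
	  for concurrent buffer allocations at the cost of a smaller window.
```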
With the latest patches in, in the field I get the cloud application dropping larger transfers as the transfer rate drops too low. This happens when there are fewer than 240 bytes transferred in the last 5 seconds.
Looking at the test results I can see something interesting happening.
Based on the qemu_cortex_a9 target (no different for qemu x86)
Transferring 60 kByte
With preemptive scheduling:
Without packet loss:
===================================================================
START - test_v4_send_recv_large
PASS - test_v4_send_recv_large in 19.84 seconds
With packet loss:
===================================================================
START - test_v4_send_recv_large
PASS - test_v4_send_recv_large in 25.102 seconds
===================================================================
With cooperative scheduling
Without packet loss:
===================================================================
START - test_v4_send_recv_large
PASS - test_v4_send_recv_large in 10.751 seconds
With packet loss:
===================================================================
START - test_v4_send_recv_large
PASS - test_v4_send_recv_large in 22.12 seconds
===================================================================
In the case of no packet loss, I would expect the elapsed time to be less than a second, as no timeouts need to occur. It is also interesting to see the cooperative scheduling being almost 2 times faster. In case of packet loss, timeouts may need to occur for re-transmission, so it could take longer. Nevertheless there is little difference in runtime between the cases with and without packet loss.
@rlubos mentioned that increasing the buffers (CONFIG_NET_BUF_RX_COUNT=64, CONFIG_NET_BUF_TX_COUNT=64) and putting a k_yield() after the send call helps significantly to accelerate the test; a sketch of that workaround follows below. Nevertheless, in a zero-delay, no-packet-loss testcase it should also be possible to get fast throughput with a smaller number of buffers.
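A minimal sketch of that k_yield() workaround (the surrounding loop is invented here for illustration):

```c
/* Illustrative send loop with the k_yield() workaround: after each send,
 * yield so the networking threads get a chance to run and drain their
 * queues before the next chunk is submitted.
 */
#include <zephyr/kernel.h>
#include <zephyr/net/socket.h>
#include <zephyr/sys/util.h>

void send_with_yield(int sock, const char *buf, size_t len, size_t chunk)
{
	for (size_t off = 0; off < len; off += chunk) {
		size_t n = MIN(chunk, len - off);

		if (zsock_send(sock, buf + off, n, 0) < 0) {
			break;   /* give up on error; error handling omitted */
		}
		k_yield();       /* let the net stack process the queued data */
	}
}
```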
I attempted to dive a little bit deeper into this in issue https://github.com/zephyrproject-rtos/zephyr/issues/45367.
I've finally managed to do some throughput tests on actual hardware. I had mimxrt1020_evk and nucleo_h723zg on the table.
TL;DR The results for nucleo_h723zg are good, but for mimxrt1020_evk they're rather poor.
For testing, I've used iperf on the Linux host side and the zperf sample on the Zephyr side. Let's focus on nucleo_h723zg first. As a reference, I've used UDP throughput, as in this case we avoid protocol-specific constraints (like TX/RX window size with TCP).
Initial results for zperf running in the default configuration are not bad, but not great either. When analyzing the eth_stm32_hal.c driver I noticed though, that on the TX path the driver blocks during the transmission, effectively negating any positive performance effect of using DMA. As a result, the total transmission time of a single frame consists not only of the time needed to actually transmit the frame; the time needed to process UDP/IP is added to the overall transmit time as well. As in the default configuration Zephyr does the L4/L3/L2 and the driver processing in a single thread, all of the processing times add up, affecting the final throughput.
I've managed to increase the throughput by enabling the TX queue (CONFIG_NET_TC_TX_COUNT=1). As a result, a packet, instead of being passed to L2 directly, is queued, and the actual L2 and driver processing is done in a separate thread. This allows for increased throughput, because while the driver blocks during the transmission, the other thread, which does the L4/L3 processing, is able to proceed with the next frame. I think this should be the default configuration in the zperf sample.
Another small throughput improvement can be achieved by setting the net buffer size to the actual network MTU (CONFIG_NET_BUF_DATA_SIZE=1500). In this case, L3/L4 processing takes less time, as the packet consists of a single buffer instead of a chain of buffers that the net stack needs to process. This also increases the default TCP TX/RX window size, which improves the TCP throughput both ways.
Finally, TCP throughput can be further increased by maximizing the TCP window sizes. I've achieved that by increasing the net_pkt/net_buf count and relying on the default window size set by Zephyr.
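Putting the pieces together, a prj.conf sketch of the throughput-oriented configuration described above (the counts are the ones from the last table row and may need adjusting per board):

```
# Process L2/driver TX in a dedicated traffic-class thread.
CONFIG_NET_TC_TX_COUNT=1
# One net_buf holds a full Ethernet frame, avoiding buffer chains.
CONFIG_NET_BUF_DATA_SIZE=1500
# Enlarge the packet/buffer pools so the default TCP window grows accordingly.
CONFIG_NET_PKT_RX_COUNT=80
CONFIG_NET_PKT_TX_COUNT=80
CONFIG_NET_BUF_RX_COUNT=80
CONFIG_NET_BUF_TX_COUNT=80
```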
The overall results are presented in the table below (the measurements were taken on the receiving node, i.e. iperf for upload, zperf for download):
| Configuration | TCP RX/TX window [bytes] | UDP upload | TCP upload | UDP download | TCP download |
|---|---|---|---|---|---|
| Default | 1194 | 51.2 Mbits/sec | 670 Kbits/sec | 88.12 Mbits/sec | 7.71 Mbits/sec |
| CONFIG_NET_TC_TX_COUNT=1 | 1194 | 73.6 Mbits/sec | 670 Kbits/sec | 88.03 Mbits/sec | 7.73 Mbits/sec |
| CONFIG_NET_TC_TX_COUNT=1, CONFIG_NET_BUF_DATA_SIZE=1500 | 14000 | 78.3 Mbits/sec | 69.8 Mbits/sec | 88.01 Mbits/sec | 75.03 Mbits/sec |
| CONFIG_NET_TC_TX_COUNT=1, CONFIG_NET_BUF_DATA_SIZE=1500, CONFIG_NET_PKT_RX/TX_COUNT=80, CONFIG_NET_BUF_RX/TX_COUNT=80 | 40000 | 77.9 Mbits/sec | 75.0 Mbits/sec | 88.12 Mbits/sec | 79.56 Mbits/sec |
As a side note, I was able to improve the UDP TX throughput even further by modifying eth_stm32_hal.c to block not after handing the packet to the HAL, but before (i.e. to block only if the previous transfer hasn't finished yet). This allowed reaching ~87 Mbits/sec, however I'm not confident enough to push those changes upstream, as there are other aspects to consider (for instance PTP is processed after the packet is transmitted, and I'm not sure that change wouldn't break it). I'll leave it to the driver maintainers to decide whether to improve this or not.
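For illustration only, the general pattern being described could look roughly like the sketch below (names and structure are invented; this is not the actual eth_stm32_hal.c change):

```c
/* Simplified "block before the next TX" pattern: instead of waiting for the
 * current frame to finish transmitting, wait only if the *previous* DMA
 * transfer is still in flight, so frame preparation overlaps transmission.
 * All names below are hypothetical, not taken from eth_stm32_hal.c.
 */
#include <zephyr/device.h>
#include <zephyr/kernel.h>
#include <zephyr/net/net_pkt.h>

/* Hypothetical HAL helpers. */
extern void copy_pkt_to_dma_buffer(struct net_pkt *pkt);
extern void start_dma_transfer(void);

static K_SEM_DEFINE(tx_done_sem, 1, 1);

static int eth_tx(const struct device *dev, struct net_pkt *pkt)
{
	ARG_UNUSED(dev);

	/* Blocks only if the previous transfer hasn't completed yet. */
	k_sem_take(&tx_done_sem, K_FOREVER);

	copy_pkt_to_dma_buffer(pkt);
	start_dma_transfer();

	/* Return without waiting for completion; the semaphore is given
	 * back from the TX-complete interrupt handler below.
	 */
	return 0;
}

static void tx_done_isr(void)
{
	k_sem_give(&tx_done_sem);
}
```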
Now when it comes to mimxrt1020_evk, the results are presented below:
| Configuration | TCP RX/TX window [bytes] | UDP upload | TCP upload | UDP download | TCP download |
|---|---|---|---|---|---|
| CONFIG_NET_TC_TX_COUNT=1, CONFIG_NET_BUF_DATA_SIZE=1500, CONFIG_NET_PKT_RX/TX_COUNT=80, CONFIG_NET_BUF_RX/TX_COUNT=80 | 40000 | 17.2 Mbits/sec | 7.88 Mbits/sec | 16.23 Mbits/sec | 456 Kbits/sec |
I've investigated this platform a bit, and the conclusions for the poor performance are as follows:
- The eth_mcux.c driver does the same thing as eth_stm32_hal.c, i.e. it blocks during transfer. In this case however there is an additional thread within the driver involved to unblock it, which adds extra overhead due to scheduling.
- The platform appears considerably slower than nucleo_h723zg (it took ~4 times longer to do the L4/L3 processing). This is a bit surprising to me, as both platforms appear to be running a Cortex M7 with similar CPU speed (500 MHz vs 550 MHz). @dleach02 Do you know perhaps what could be the reason for this?
- The driver reported errors during the tests (<err> eth_mcux: ENET_GetRxFrameSize return: 4001). I don't know the reason, but it could be a side effect of the above point.
To summarize, I think that the results achieved on nucleo_h723zg prove that it is possible to achieve competitive throughput with Zephyr, given proper configuration and a well-written Ethernet driver. Ideally it'd be good to test other platforms as well, but due to the limited availability of development kits in general I couldn't get some obvious choices like the super popular frdm_k64f. I therefore suggest closing this general issue, as it might be misleading given the above results, and opening board/driver-specific issues instead.
As for zperf, the sample uses the net_context API directly, which gave me a bit of a headache due to some issues with the TCP handling in the sample (the TCP context was freed too early as the sample did not add an extra ref to the net_context; it also does not take EAGAIN/ENOBUFS returned by the TCP layer into consideration). I'm thinking however, that instead of fixing those issues it'd be worthwhile to rewrite the sample to use the socket API instead, which is a more realistic scenario for actual apps. I plan to work on this in the near future.
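As an illustration of what a socket-based variant could look like, here is a minimal sketch of a TCP receiver using the Zephyr socket API (invented example, not the actual zperf rewrite):

```c
/* Minimal TCP receiver using the Zephyr socket API: accept one connection
 * and count the received bytes. Illustrative only, not the zperf sample.
 */
#include <errno.h>
#include <stdint.h>
#include <zephyr/net/socket.h>
#include <zephyr/sys/printk.h>

int run_tcp_receiver(uint16_t port)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(port),
		/* sin_addr left zeroed: listen on any address */
	};
	static char buf[1500];
	uint64_t total = 0;

	int srv = zsock_socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

	if (srv < 0) {
		return -errno;
	}

	if (zsock_bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    zsock_listen(srv, 1) < 0) {
		zsock_close(srv);
		return -errno;
	}

	int conn = zsock_accept(srv, NULL, NULL);

	if (conn < 0) {
		zsock_close(srv);
		return -errno;
	}

	for (;;) {
		ssize_t len = zsock_recv(conn, buf, sizeof(buf), 0);

		if (len <= 0) {
			break;   /* peer closed the connection or an error occurred */
		}
		total += len;
	}

	printk("received %llu bytes\n", (unsigned long long)total);

	zsock_close(conn);
	zsock_close(srv);

	return 0;
}
```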
Very interesting results. This clearly shows that in a happy-flow situation the performance can be pretty decent. You are using a point-to-point wired link, I assume. The polling implementation definitely helps to improve throughput quite a bit.
You increased the window by increasing CONFIG_NET_BUF_DATA_SIZE to 1500 bytes over the default 128; do you know if this has the same effect as increasing CONFIG_NET_BUF_RX/TX_COUNT by a factor of 12? Apart from maybe some processing overhead I would expect it to have the same effect, the only difference being that small packets will consume considerably less space.
On a wireless network (cellular or WiFi) that introduces some packet loss, with big latency (to the other side of the world), things will start to look quite different. First of all there is no congestion avoidance, so the fairness to other network traffic is pretty bad. Secondly, if one packet is lost along the way, the stack will start re-transmitting the complete transmit buffer. A fast-retransmit triggered by triple duplicate ACKs would help here.
You are using a point to point wired link I assume.
Yes, the whole point of this experiment was to see how the actual throughput compares to the theoretical maximum throughput over 100 Mbit Ethernet, and it seems we're pretty close to the limit.
do you know if this has the same effect as increasing the CONFIG_NET_BUF_RX/TX_COUNT by a factor of 12?
Yes, the default window size is calculated based on the buffer size and buffer count, i.e. the overall size of all of the buffers, so you could reach the same effect by increasing the buffer count. The sole reason to increase the buffer size here was to reduce the processing time of an individual frame. I would say however that this is only recommended if you really need to maximize your throughput; usually it's better to increase the buffer count, as you don't waste space on small packets.
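As a concrete illustration (assuming the window is derived from the total buffer space divided by the constant 3 discussed earlier in this thread): the last table row uses 80 buffers of 1500 bytes each, i.e. 80 × 1500 = 120000 bytes of buffer space, and 120000 / 3 = 40000 bytes, which matches the reported TCP RX/TX window.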
On a wireless network (cellular or WiFi) that introduces some packet loss, with big latency (to the other side of the world), things will start to look quite different.
Well yes, it is expected that the throughput will be worse in the case of lossy networks. If there are mechanisms specified in TCP that could help to improve performance in such a case, we should consider implementing them. I think though that those should be considered enhancements, not reported as "bugs" like this issue is.
A fast-retransmit triggered by triple duplicate ACKs would help here.
@rlubos and @ssharks can we create an enhancement issue for this?
@rlubos: Could you redo the upload tests with the small window of https://github.com/zephyrproject-rtos/zephyr/issues/23302#issuecomment-1142138602, with the fix of https://github.com/zephyrproject-rtos/zephyr/pull/46584 in? The figures will look very different, I believe.
@xhpohanka: PR https://github.com/zephyrproject-rtos/zephyr/pull/46584 was recently merged and I think it solves the issue you described. Are you in a position to check whether your problem has been fixed? If so, this issue can be closed. In fact, issue https://github.com/zephyrproject-rtos/zephyr/issues/45844 looks pretty similar to your description.
Hello @ssharks, I have not done zperf testing for a long time, but I have checked the recent updates to the TCP stack including #46584. In our application the performance really improved a lot. From my POV this issue can be closed :)
@ssharks Hmm, but the Silly Window shouldn't affect the upload as it's related to the RX window size? Did you mean download?
Anyways, I've run the test again: no difference on the upload side, the download throughput is slightly improved (in the low-window scenario) to 8.68 Mbps. When I tested the solution, the most significant performance boost happened in the case where we reported a zero window to the peer, as this no longer takes place with #46584. That didn't happen though in the initial test I performed here.
From my POV this issue can be closed :)
I suggest we thereby close this long-open issue.
I continue playing with the Zephyr network stack and STM32 and I unfortunately found the next issue. With the nucleo_f429zi board and the big_http_download sample I got a very slow download speed. This pushed me to check the network performance with zperf.
For UDP transfers I got around 10 Mbps, but for TCP the result was only 10 kbps, which is really bad.
I tried whether some older versions of Zephyr behave better - fortunately v2.0.0 also got me around 10 Mbps for TCP in zperf. With bisecting I found that this issue starts with d88f25bd763e2dfa70873b3c2321f2f8677d643d. I hoped that reverting it would also solve the slow big_http_download, but surprisingly the download speed is still suspiciously low. I will continue to investigate this tomorrow.
I do not know if these issues are related only to the STM32 platform. I have just the mentioned nucleo_f429zi and a custom board with STM32F750, which has a slightly different Ethernet peripheral and whose driver is also written using HAL. Both behave in the same way.
The issues I have met so far with the Zephyr networking stack pose a question to me: is it mature enough for production?