@hakehuang What test code did you use on the Zephyr side, zperf? Note, there's currently a bug, fixed in https://github.com/zephyrproject-rtos/zephyr/pull/43379, without it you won't get decent results (the recv window gets filled and the communication stalls).
Also, please note that increasing the window size means that you also need to increase the RX pkt/buf count in the system, otherwise the effort is futile (the net driver will start dropping packets, forcing retransmissions from the server, which has a great negative impact on the throughput). Increasing the TX pkt/buf count a bit also makes sense, as we need to acknowledge each TCP packet received (I wonder if there's room for improvement here; in theory it should be enough to acknowledge once with a larger ACK value).
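For illustration, a minimal prj.conf sketch of this kind of tuning (the counts below are placeholders for illustration, not tuned recommendations):

```
# Illustrative prj.conf fragment: enlarge the packet/buffer pools so that a
# larger TCP receive window is actually backed by real buffers.
CONFIG_NET_PKT_RX_COUNT=32
CONFIG_NET_BUF_RX_COUNT=32
# A few extra TX buffers so ACKs for received segments can always be sent.
CONFIG_NET_PKT_TX_COUNT=16
CONFIG_NET_BUF_TX_COUNT=16
```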
Thanks @rlubos, I will try the latest code, which was merged 2 hours ago. And if I set the WIN_SIZE to 0, it will automatically choose the right WINSIZE, am I right?
If the window size is set to 0, the actual size will be chosen automatically, based on the number of available buffers. The default value is pretty low though (~1k bytes), so I recommend increasing the buffer count anyway.
I've noticed however, that the ethernet_native_posix driver introduces an artificial delay on the RX path (https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/ethernet/eth_native_posix.c#L388) - reducing the delay already improves the throughput without changes in the sample configuration.
I had to add an artificial delay to the native_posix Ethernet driver because if there was none, the application just started spinning and refused to do any work. Perhaps the delay is no longer needed, or it could be made smaller; in any case I would say native_posix is not a good platform for any performance-related statistics.
Perhaps the delay is no longer needed, or it could be made smaller
From what I've experienced it can not be removed entirely, as the application then behaved exactly as you described (completely unresponsive).
@rlubos I updated to the latest code base and added RX_COUNT as below: https://github.com/hakehuang/zephyr/tree/zperf_qemu. However the zperf test result is still poor.
```
ubuntu@ubuntu-Latitude-E6420:~/code/iperf2-code$ sudo src/iperf -l 1K -c 192.0.2.1 -p 5001
Client connecting to 192.0.2.1, TCP port 5001
TCP window size: 16.0 KByte (default)
[  1] local 192.0.2.2 port 56078 connected with 192.0.2.1 port 5001 (icwnd/mss/irtt=14/1460/6735)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-20.59 sec 50.0 KBytes  19.9 Kbits/sec
```
@hakehuang e1000 tends to drop packets in case of large transfers (I tested this in the past and it seems that we don't even get an IRQ for the packet, so this could be a qemu thing). I don't have enough knowledge of qemu to investigate the problem efficiently. The thing is that this leads to multiple retransmissions and overall poor performance. The same configuration running SLIP gives considerably better results:
$ iperf -l 1K -V -c 2001:db8::1 -p 5001
------------------------------------------------------------
Client connecting to 2001:db8::1, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 2001:db8::2 port 42156 connected with 2001:db8::1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.1 sec 681 KBytes 551 Kbits/sec
I couldn't get better than 1 Mbps though with qemu + SLIP, no matter how large the RX window is; I'm guessing SLIP could be the bottleneck here.
@rlubos so do we have a golden platform for zperf?
I'm not aware of any "golden" platform for tests; I think we should aim for decent performance on any (non-emulated) Ethernet board. I've ordered a few on my own to perform some measurements.
Without a golden platform, we will have to take driver impacts into account, which may not be a good thing.
At Linaro, we standardized on frdm_k64f as the default networking platform from Zephyr's start. My records of Zephyr's networking testing against it (and qemu, which is everyone's favorite platform) are at https://docs.google.com/spreadsheets/d/1_8CsACPEXqrMIbxBKxPAds091tNAwnwdWkMKr3994QY/edit#gid=0 (it's pretty spotty, as maintaining such a spreadsheet is my personal initiative, so it was done on a best-effort basis). When I have a chance, I'll test current Zephyr using the testcases in the spreadsheet (I'm working on other projects now).
Why is there a sleep within the send loop (https://github.com/mdkf/ZephyrTCPSlow/blob/main/src/socketTest.c#L159)? A 2 ms sleep between each datagram sent doesn't sound like the best idea for a throughput-measuring test.
I inserted the sleep when testing the F746ZG. It died otherwise. It was not included in the other measurements.
Additionally, a 90-byte payload per datagram seems pretty small for throughput measurement; did you send such small packets with mbed as well?
I used the same test in Mbed and FreeRTOS. I also tried with a 1400-byte packet; it did not improve the performance in Zephyr. After the first 90-byte packet was sent, Wireshark shows the subsequent packets sent were closer to 1480 bytes. Basically the 90-byte packets accumulated and were sent together.
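For reference, a minimal sketch of what a throughput-oriented send loop could look like (hypothetical code, not the test from the linked repository): payloads close to the MTU are sent back to back, without a per-datagram sleep.

```c
/* Hypothetical bulk-send loop for throughput measurements: MTU-sized
 * chunks, no artificial sleep between sends.
 */
#include <errno.h>
#include <string.h>
#include <zephyr/net/socket.h>

#define PAYLOAD_SIZE 1400            /* close to Ethernet MTU minus IP/TCP headers */
#define TOTAL_BYTES  (1024 * 1024)   /* amount of data to push per run */

static char payload[PAYLOAD_SIZE];

int send_bulk(int sock)
{
	size_t sent = 0;

	memset(payload, 'x', sizeof(payload));

	while (sent < TOTAL_BYTES) {
		ssize_t ret = zsock_send(sock, payload, sizeof(payload), 0);

		if (ret < 0) {
			return -errno;   /* let the caller decide how to handle it */
		}
		sent += ret;
	}

	return 0;
}
```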
Increasing the TX pkt/buf count a bit also makes sense, as we need to acknowledge each TCP packet received (I wonder if there's room for improvement here; in theory it should be enough to acknowledge once with a larger ACK value).
@rlubos Great that you mentioned the ACK issue. Yes, there is room for improvement in terms of ACK-ing multiple packets with a single ACK, there is even a (somewhat stale) issue for that: https://github.com/zephyrproject-rtos/zephyr/issues/30366
@jukkar Probably wrong thread to ask, but is there a reason behind the magic constant 3? https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/ip/tcp.c#L1801
I'm asking because I observed that the max window size is quite critical. To avoid data buffer allocation errors, I actually had to change it to 4, or define a custom CONFIG_NET_TCP_MAX_SEND_WINDOW_SIZE.
Sorry if I'm missing something here.
Probably wrong thread to ask, but is there a reason behind the magic constant 3?
No specific reason, just a somewhat reasonable value when the code was written. If you find value 4 more suitable, please send a PR that changes it.
I don't think it makes sense to open a PR with just another magic constant; at the very least it should be a Kconfig flag.
It worked for me, yes, but I'd like to understand the meaning behind it. Ideally this coefficient should be calculated.
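As an illustration of the Kconfig-flag idea, a hypothetical option (the symbol name, prompt and help text below are invented, not existing Zephyr code) could look like:

```
# Hypothetical Kconfig entry replacing the hard-coded divisor used when
# deriving the maximum TCP send window from the TX buffer pool size.
config NET_TCP_SEND_WINDOW_DIVISOR
	int "Divisor applied to the TX buffer pool when sizing the send window"
	default 3
	help
	  The maximum TCP send window is derived from the total TX buffer
	  space divided by this value. A larger divisor leaves more headroom
	  for concurrent buffer allocations at the cost of a smaller window.
```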
With the latest patches in, in the field I get the cloud application dropping larger transfers as the transfer rate drops too low. This happens when there are fewer than 240 bytes transferred in the last 5 seconds.
Looking at the test results I can see something interesting happening.
Based on the qemu_cortex_a9 target (no different for qemu x86)
Transferring 60 kByte
With preemptive scheduling:
Without packet loss:
===================================================================
START - test_v4_send_recv_large
PASS - test_v4_send_recv_large in 19.84 seconds
With packet loss:
===================================================================
START - test_v4_send_recv_large
PASS - test_v4_send_recv_large in 25.102 seconds
===================================================================
With cooperative scheduling
Without packet loss:
===================================================================
START - test_v4_send_recv_large
PASS - test_v4_send_recv_large in 10.751 seconds
With packet loss:
===================================================================
START - test_v4_send_recv_large
PASS - test_v4_send_recv_large in 22.12 seconds
===================================================================
In the case of no packet loss, I would expect the elapsed time to be less than a second, as no timeouts need to occur. It is also interesting to see the cooperative scheduling being almost 2 times faster. In case of packet loss, timeouts may need to occur for re-transmission, so it could take longer. Nevertheless there is little difference in runtime between the cases with and without packet loss.
@rlubos mentioned that increasing the buffers (CONFIG_NET_BUF_RX_COUNT=64, CONFIG_NET_BUF_TX_COUNT=64) and putting a k_yield() after the send call helps significantly to accelerate the test; a sketch of that workaround follows below. Nevertheless, in a zero-delay, no-packet-loss testcase it should also be possible to get fast throughput with a smaller number of buffers.
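A minimal sketch of that k_yield() workaround (the surrounding loop is invented here for illustration):

```c
/* Illustrative send loop with the k_yield() workaround: after each send,
 * yield so the networking threads get a chance to run and drain their
 * queues before the next chunk is submitted.
 */
#include <zephyr/kernel.h>
#include <zephyr/net/socket.h>
#include <zephyr/sys/util.h>

void send_with_yield(int sock, const char *buf, size_t len, size_t chunk)
{
	for (size_t off = 0; off < len; off += chunk) {
		size_t n = MIN(chunk, len - off);

		if (zsock_send(sock, buf + off, n, 0) < 0) {
			break;   /* give up on error; error handling omitted */
		}
		k_yield();       /* let the net stack process the queued data */
	}
}
```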
I attempted to dive a little bit deeper into this in issue https://github.com/zephyrproject-rtos/zephyr/issues/45367.
I've finally managed to do some throughput tests on actual hardware. I had mimxrt1020_evk and nucleo_h723zg on the table.
TL;DR The results for nucleo_h723zg are good, but for mimxrt1020_evk they're rather poor.
For testing, I've used iperf on the Linux host side and the zperf sample on the Zephyr side. Let's focus on nucleo_h723zg first. As a reference, I've used UDP throughput, as in this case we avoid protocol-specific constraints (like TX/RX window size with TCP).
Initial results for zperf running in the default configuration are not bad, but not great either. When analyzing the eth_stm32_hal.c driver I noticed though, that on the TX path the driver blocks during the transmission, effectively negating any positive performance effect of using DMA. As a result, the total transmission time of a single frame consists not only of the time needed to actually transmit the frame; the time needed to process UDP/IP is added to the overall transmit time as well. As in the default configuration Zephyr does the L4/L3/L2 and the driver processing in a single thread, all of the processing times add up, affecting the final throughput.
I've managed to increase the throughput by enabling the TX queue (CONFIG_NET_TC_TX_COUNT=1). As a result, a packet, instead of being passed to L2 directly, is queued, and the actual L2 and driver processing is done in a separate thread. This allows for increased throughput, because while the driver blocks during the transmission, the other thread, which does the L4/L3 processing, is able to proceed with the next frame. I think this should be the default configuration in the zperf sample.
Another small throughput improvement can be achieved by setting the net buffer size to the actual network MTU (CONFIG_NET_BUF_DATA_SIZE=1500). In this case, L3/L4 processing takes less time, as the packet consists of a single buffer instead of a chain of buffers that the net stack needs to process. This also increases the default TCP TX/RX window size, which improves the TCP throughput both ways.
Finally, TCP throughput can be further increased by maximizing the TCP window sizes. I've achieved that by increasing the net_pkt/net_buf count and relying on the default window size set by Zephyr.
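Putting the pieces together, a prj.conf sketch of the throughput-oriented configuration described above (the counts are the ones from the last table row and may need adjusting per board):

```
# Process L2/driver TX in a dedicated traffic-class thread.
CONFIG_NET_TC_TX_COUNT=1
# One net_buf holds a full Ethernet frame, avoiding buffer chains.
CONFIG_NET_BUF_DATA_SIZE=1500
# Enlarge the packet/buffer pools so the default TCP window grows accordingly.
CONFIG_NET_PKT_RX_COUNT=80
CONFIG_NET_PKT_TX_COUNT=80
CONFIG_NET_BUF_RX_COUNT=80
CONFIG_NET_BUF_TX_COUNT=80
```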
The overall results are presented in the table below (the measurements were taken on the receiving node, i.e. iperf for upload, zperf for download):
| Configuration | TCP RX/TX window [bytes] | UDP upload | TCP upload | UDP download | TCP download |
|---|---|---|---|---|---|
| Default | 1194 | 51.2 Mbits/sec | 670 Kbits/sec | 88.12 Mbits/sec | 7.71 Mbits/sec |
| CONFIG_NET_TC_TX_COUNT=1 | 1194 | 73.6 Mbits/sec | 670 Kbits/sec | 88.03 Mbits/sec | 7.73 Mbits/sec |
| CONFIG_NET_TC_TX_COUNT=1, CONFIG_NET_BUF_DATA_SIZE=1500 | 14000 | 78.3 Mbits/sec | 69.8 Mbits/sec | 88.01 Mbits/sec | 75.03 Mbits/sec |
| CONFIG_NET_TC_TX_COUNT=1, CONFIG_NET_BUF_DATA_SIZE=1500, CONFIG_NET_PKT_RX/TX_COUNT=80, CONFIG_NET_BUF_RX/TX_COUNT=80 | 40000 | 77.9 Mbits/sec | 75.0 Mbits/sec | 88.12 Mbits/sec | 79.56 Mbits/sec |
As a side note, I was able to improve the UDP TX throughput even further by modifying eth_stm32_hal.c to block not after handing the packet to the HAL, but before (i.e. to block only if the previous transfer hasn't finished yet). This allowed reaching ~87 Mbits/sec, however I'm not confident enough to push those changes upstream, as there are other aspects to consider (for instance PTP is processed after the packet is transmitted, and I'm not sure that change wouldn't break it). I'll leave it to the driver maintainers to decide whether to improve this or not.
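For illustration only, the general pattern being described could look roughly like the sketch below (names and structure are invented; this is not the actual eth_stm32_hal.c change):

```c
/* Simplified "block before the next TX" pattern: instead of waiting for the
 * current frame to finish transmitting, wait only if the *previous* DMA
 * transfer is still in flight, so frame preparation overlaps transmission.
 * All names below are hypothetical, not taken from eth_stm32_hal.c.
 */
#include <zephyr/device.h>
#include <zephyr/kernel.h>
#include <zephyr/net/net_pkt.h>

/* Hypothetical HAL helpers. */
extern void copy_pkt_to_dma_buffer(struct net_pkt *pkt);
extern void start_dma_transfer(void);

static K_SEM_DEFINE(tx_done_sem, 1, 1);

static int eth_tx(const struct device *dev, struct net_pkt *pkt)
{
	ARG_UNUSED(dev);

	/* Blocks only if the previous transfer hasn't completed yet. */
	k_sem_take(&tx_done_sem, K_FOREVER);

	copy_pkt_to_dma_buffer(pkt);
	start_dma_transfer();

	/* Return without waiting for completion; the semaphore is given
	 * back from the TX-complete interrupt handler below.
	 */
	return 0;
}

static void tx_done_isr(void)
{
	k_sem_give(&tx_done_sem);
}
```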
Now when it comes to mimxrt1020_evk, the results are presented below:
| Configuration | TCP RX/TX window [bytes] | UDP upload | TCP upload | UDP download | TCP download |
|---|---|---|---|---|---|
| CONFIG_NET_TC_TX_COUNT=1, CONFIG_NET_BUF_DATA_SIZE=1500, CONFIG_NET_PKT_RX/TX_COUNT=80, CONFIG_NET_BUF_RX/TX_COUNT=80 | 40000 | 17.2 Mbits/sec | 7.88 Mbits/sec | 16.23 Mbits/sec | 456 Kbits/sec |
I've investigated this platform a bit, and the conclusions for the poor performance are as follows:
- The eth_mcux.c driver does the same thing as eth_stm32_hal.c, i.e. it blocks during transfer. In this case however there is an additional thread within the driver involved to unblock it, which adds extra overhead due to scheduling.
- The platform appears considerably slower than nucleo_h723zg (it took ~4 times longer to do the L4/L3 processing). This is a bit surprising to me, as both platforms appear to be running a Cortex M7 with similar CPU speed (500 MHz vs 550 MHz). @dleach02 Do you know perhaps what could be the reason for this?
- The driver reported errors during the tests (<err> eth_mcux: ENET_GetRxFrameSize return: 4001). I don't know the reason, but it could be a side effect of the above point.
To summarize, I think that the results achieved on nucleo_h723zg prove that it is possible to achieve competitive throughput with Zephyr, given proper configuration and a well-written Ethernet driver. Ideally it'd be good to test other platforms as well, but due to the limited availability of development kits in general I couldn't get some obvious choices like the super popular frdm_k64f. I therefore suggest closing this general issue, as it might be misleading given the above results, and opening board/driver-specific issues instead.
As for zperf, the sample uses the net_context API directly, which gave me a bit of a headache due to some issues with the TCP handling in the sample (the TCP context was freed too early as the sample did not add an extra ref to the net_context; it also does not take EAGAIN/ENOBUFS returned by the TCP layer into consideration). I'm thinking however, that instead of fixing those issues it'd be worthwhile to rewrite the sample to use the socket API instead, which is a more realistic scenario for actual apps. I plan to work on this in the near future.
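As an illustration of what a socket-based variant could look like, here is a minimal sketch of a TCP receiver using the Zephyr socket API (invented example, not the actual zperf rewrite):

```c
/* Minimal TCP receiver using the Zephyr socket API: accept one connection
 * and count the received bytes. Illustrative only, not the zperf sample.
 */
#include <errno.h>
#include <stdint.h>
#include <zephyr/net/socket.h>
#include <zephyr/sys/printk.h>

int run_tcp_receiver(uint16_t port)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(port),
		/* sin_addr left zeroed: listen on any address */
	};
	static char buf[1500];
	uint64_t total = 0;

	int srv = zsock_socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

	if (srv < 0) {
		return -errno;
	}

	if (zsock_bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    zsock_listen(srv, 1) < 0) {
		zsock_close(srv);
		return -errno;
	}

	int conn = zsock_accept(srv, NULL, NULL);

	if (conn < 0) {
		zsock_close(srv);
		return -errno;
	}

	for (;;) {
		ssize_t len = zsock_recv(conn, buf, sizeof(buf), 0);

		if (len <= 0) {
			break;   /* peer closed the connection or an error occurred */
		}
		total += len;
	}

	printk("received %llu bytes\n", (unsigned long long)total);

	zsock_close(conn);
	zsock_close(srv);

	return 0;
}
```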
Very interesting results. This clearly shows that in a happy-flow situation the performance can be pretty decent. You are using a point-to-point wired link, I assume. The polling implementation definitely helps to improve throughput quite a bit.
You increased the window by increasing CONFIG_NET_BUF_DATA_SIZE to 1500 bytes over the default 128; do you know if this has the same effect as increasing CONFIG_NET_BUF_RX/TX_COUNT by a factor of 12? Apart from maybe some processing overhead I would expect it to have the same effect, the only difference being that small packets will consume considerably less space.
On a wireless network (cellular or WiFi) that introduces some packet loss, with big latency (to the other side of the world), things will start to look quite different. First of all there is no congestion avoidance, so the fairness to other network traffic is pretty bad. Secondly, if one packet is lost along the way, the stack will start re-transmitting the complete transmit buffer. A fast-retransmit triggered by triple duplicate ACKs would help here.
You are using a point to point wired link I assume.
Yes, the whole point of this experiment was to see how the actual throughput compares to the theoretical maximum throughput over 100 Mbit Ethernet, and it seems we're pretty close to the limit.
do you know if this has the same effect as increasing the CONFIG_NET_BUF_RX/TX_COUNT by a factor of 12?
Yes, the default window size is calculated based on the buffer size and buffer count, i.e. the overall size of all of the buffers, so you could reach the same effect by increasing the buffer count. The sole reason to increase the buffer size here was to reduce the processing time of an individual frame. I would say however that this is only recommended if you really need to maximize your throughput; usually it's better to increase the buffer count, as you don't waste space on small packets.
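As a concrete illustration (assuming the window is derived from the total buffer space divided by the constant 3 discussed earlier in this thread): the last table row uses 80 buffers of 1500 bytes each, i.e. 80 × 1500 = 120000 bytes of buffer space, and 120000 / 3 = 40000 bytes, which matches the reported TCP RX/TX window.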
On a wireless network (cellular or WiFi) that introduces some packet loss, with big latency (to the other side of the world), things will start to look quite different.
Well yes, it is expected that the throughput will be worse in the case of lossy networks. If there are mechanisms specified in TCP that could help to improve performance in such a case, we should consider implementing them. I think though that those should be considered enhancements, not reported as "bugs" like this issue is.
A fast-retransmit triggered by triple duplicate ACKs would help here.
@rlubos and @ssharks can we create an enhancement issue for this?
@rlubos: Could you redo the upload tests with the small window of https://github.com/zephyrproject-rtos/zephyr/issues/23302#issuecomment-1142138602, with the fix of https://github.com/zephyrproject-rtos/zephyr/pull/46584 in? The figures will look very different, I believe.
@xhpohanka: PR https://github.com/zephyrproject-rtos/zephyr/pull/46584 was recently merged and I think it solves the issue you described. Are you in a position to check whether your problem has been fixed? If so, this issue can be closed. In fact, issue https://github.com/zephyrproject-rtos/zephyr/issues/45844 looks pretty similar to your description.
Hello @ssharks, I have not done zperf testing for a long time, but I have checked the recent updates to the TCP stack including #46584. In our application the performance really improved a lot. From my POV this issue can be closed :)
@ssharks Hmm, but the Silly Window shouldn't affect the upload as it's related to the RX window size? Did you mean download?
Anyways, I've run the test again: no difference on the upload side, the download throughput is slightly improved (in the low-window scenario) to 8.68 Mbps. When I tested the solution, the most significant performance boost happened in the case where we reported a zero window to the peer, as this no longer takes place with #46584. That didn't happen though in the initial test I performed here.
From my POV this issue can be closed :)
I suggest we thereby close this long-open issue.
I continue playing with the Zephyr network stack and STM32 and I unfortunately found the next issue. With the nucleo_f429zi board and the big_http_download sample I got a very slow download speed. This pushed me to check the network performance with zperf.
For UDP transfers I got around 10 Mbps, but for TCP the result was only 10 kbps, which is really bad.
I tried whether some older versions of Zephyr behave better - fortunately v2.0.0 also got me around 10 Mbps for TCP in zperf. With bisecting I found that this issue starts with d88f25bd763e2dfa70873b3c2321f2f8677d643d. I hoped that reverting it would also solve the slow big_http_download, but surprisingly the download speed is still suspiciously low. I will continue to investigate this tomorrow.
I do not know if these issues are related only to the STM32 platform. I have just the mentioned nucleo_f429zi and a custom board with STM32F750, which has a slightly different Ethernet peripheral and whose driver is also written using HAL. Both behave in the same way.
The issues I have met so far with the Zephyr networking stack pose a question to me: is it mature enough for production?