zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.9k stars 6.64k forks source link

TCP over subg on BeagleconnectFreedom is unusable #71191

Closed Ayush1325 closed 3 months ago

Ayush1325 commented 7 months ago

Describe the bug After investigating UDP over subg on beagleconnect freedom (https://github.com/zephyrproject-rtos/zephyr/issues/68674), I have moved to testing TCP. However, I cannot seem to get TCP reliably working on the echo_server and echo_client.

Note: The subg overlays present in echo_server and echo_client disable TCP by default, so to reproduce these results, it should be enabled.

Here are the logs from echo_server:

*** Booting Zephyr OS build v3.6.0-1935-gf87143d9198a ***
[00:00:00.045,104] <inf> net_config: Initializing network
[00:00:00.155,487] <inf> net_config: IPv6 address: 2001:db8::1
[00:00:00.157,012] <inf> net_echo_server_sample: Run echo server
[00:00:00.157,073] <inf> net_echo_server_sample: Network connected
[00:00:00.157,135] <inf> net_echo_server_sample: Starting...
[00:00:00.157,592] <inf> net_echo_server_sample: Waiting for TCP connection on port 4242 (IPv6)...
[00:00:03.049,682] <inf> net_echo_server_sample: TCP (IPv6): Accepted connection
[00:00:03.050,506] <inf> net_echo_server_sample: Waiting for TCP connection on port 4242 (IPv6)...
[00:00:03.173,492] <dbg> net_echo_server_sample: handle_data: TCP (IPv6): Received and replied with 65 bytes
[00:00:03.715,881] <dbg> net_echo_server_sample: handle_data: TCP (IPv6): Received and replied with 65 bytes
[00:00:03.739,776] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.755,676] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.755,676] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.773,010] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.773,010] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.773,040] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.790,039] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.790,069] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.790,100] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.790,130] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.804,321] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.804,351] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.804,351] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.804,382] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.809,295] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.809,295] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.809,326] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.809,356] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.809,356] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:03.958,648] <dbg> net_echo_server_sample: handle_data: TCP (IPv6): Received and replied with 130 bytes
[00:00:04.676,635] <dbg> net_echo_server_sample: handle_data: TCP (IPv6): Received and replied with 8 bytes
[00:00:21.128,753] <err> net_echo_server_sample: TCP (IPv6): Connection error 116
[00:01:00.157,775] <inf> net_echo_server_sample: IPv6 TCP: Received 4 B/sec

And here are the logs from echo_client:

*** Booting Zephyr OS build v3.6.0-1935-gf87143d9198a ***
[00:00:00.046,569] <inf> net_config: Initializing network
[00:00:00.164,703] <inf> net_config: IPv6 address: 2001:db8::2
[00:00:00.164,825] <inf> net_echo_client_sample: Run echo client
[00:00:00.164,886] <inf> net_echo_client_sample: Network connected
[00:00:00.164,947] <inf> net_echo_client_sample: Starting...
[00:00:00.427,185] <dbg> net_echo_client_sample: send_tcp_data: IPv6 TCP: Sent 268 bytes
[00:00:01.144,378] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.155,883] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.155,883] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.162,139] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.162,170] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.178,253] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.178,283] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.178,283] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.202,453] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.202,453] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.202,484] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.202,514] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.220,947] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.220,947] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.220,977] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.220,977] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.221,008] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.243,560] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.243,560] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.243,591] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.243,591] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.243,621] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:01.243,652] <wrn> ieee802154_cc13xx_cc26xx_subg: Cannot allocate packet
[00:00:17.344,635] <inf> net_echo_client_sample: Stopping...

Some additional config options I tried adding to fix this are as follows:

CONFIG_NET_L2_IEEE802154_FRAGMENT_REASS_CACHE_SIZE=32
CONFIG_IEEE802154_CC13XX_CC26XX_SUB_GHZ_NUM_RX_BUF=6

To Reproduce Steps to reproduce the behavior:

Expected behavior All packets to be received by the server.

Environment (please complete the following information):

jukkar commented 7 months ago

You are running out of network buffers according to the log output. The CONFIG_NET_TCP_MAX_SEND_WINDOW_SIZE option tuning could help the situation. See the TCP configuration guide for help https://docs.zephyrproject.org/latest/connectivity/networking/net_config_guide.html#tcp-options

ghost commented 7 months ago

It would be useful to see sniffer data. Any TI Dev Board can be transformed into a sniffer via their PacketSniffer 2 firmware. Analyzing the wireshark protocols could help us identify the root cause.

Ayush1325 commented 7 months ago

It would be useful to see sniffer data. Any TI Dev Board can be transformed into a sniffer via their PacketSniffer 2 firmware. Analyzing the wireshark protocols could help us identify the root cause.

Hi. Is there a way to use PacketSniffer 2 in Linux? I can only see stuff for windows.

Ayush1325 commented 7 months ago

You are running out of network buffers according to the log output. The CONFIG_NET_TCP_MAX_SEND_WINDOW_SIZE option tuning could help the situation. See the TCP configuration guide for help https://docs.zephyrproject.org/latest/connectivity/networking/net_config_guide.html#tcp-options

Currently, it is set to 0, which means it will be calculated automatically.

Also, I did try increasing both CONFIG_NET_TCP_MAX_RECV_WINDOW_SIZE and CONFIG_NET_TCP_MAX_TX_WINDOW_SIZE to 20480 but no improvement.

ghost commented 7 months ago

Hi. Is there a way to use PacketSniffer 2 in Linux? I can only see stuff for windows.

I'm afraid that's true - several of TI's proprietary tools do not exist for the Linux platform. I use them in a Windows VM running on Linux which works well. As you're in the SubG range and TI's SUN SubG PHY settings are not 100% standard conforming, you're pretty much depending on their tools currently AFAIK.

I think there's a way to capture packages in Zephyr and forward them on another interface as alternative to using a dedicated sniffer, but I've never worked with that arrangement. And you'll only always see one side of the equation.

Re your TCP problems - it might just be that you don't get responses reliably, especially if you use ALOHA without ACK switched off. CSMA/CA and ACK are there for a reason... You'll see traces of collisions and missing packages when sniffing or debugging the TCP protocol details directly.

jukkar commented 7 months ago

Zephyr has packet capture support, see this document https://docs.zephyrproject.org/latest/connectivity/networking/network_monitoring.html There is a sample that can show what can be done in your own application, https://docs.zephyrproject.org/latest/samples/net/capture/README.html I recently created support for cooked mode capture in this PR #70926. It allows user to capture low level non-IP based packets. The PR has example for capturing HDLC frames in PPP traffic and send them for analysis to external systems, see this commit 7fdebb5a917685b88208675a31271d0bceab6f69 for details. For 802.15.4 L2 packet capturing, some extra work would need to be done, basically adding packet capture calls to the proper places in 802.15.4 stack (similarly what I did for PPP), if one wants to see 802.15.4 L2 traffic. If normal IP traffic is enough, then that IP traffic can be captured already. If the generic packet capture is setup and enabed (either from net-shell or directly from capture API), the packets are then automatically captured for analysis. One caveat in the capturing packets inside a Zephyr device, is that it is not storing the packets locally but send them to outside system where Wireshark or similar could be used to show the data. This requires extra IP based network connection in the device. For normal IP traffic capturing, the same interface can be used for capturing and transferring the captured packets, but for cooked capture mode, the captured packets cannot be sent via the same interface (as that would cause recursion atm, there might be a way to avoid that but it requires some extra coding).

Ayush1325 commented 7 months ago

Re your TCP problems - it might just be that you don't get responses reliably, especially if you use ALOHA without ACK switched off. CSMA/CA and ACK are there for a reason... You'll see traces of collisions and missing packages when sniffing or debugging the TCP protocol details directly.

The initial testing is being done with CSMA/CA and ACK since I was not really testing the performance in this issue (it has to work reliably before worrying about performance).

I will try sniffing and a git bisect to pin point the problem.

ghost commented 7 months ago

The initial testing is being done with CSMA/CA and ACK since I was not really testing the performance in this issue (it has to work reliably before worrying about performance).

Totally agree.

I will try sniffing and a git bisect to pin point the problem.

Nice. You may of course contact me any time on Discord as well if there's something you think I could help you with. I'd really love to see this hardware to be better supported.

Ayush1325 commented 6 months ago

So did a git bisect and found that the commit responsible is most likely 128354ae17fb95ec6c61d73b9060f1a274715464 .

However, it seems that the problem is a bit different from what I initially considered. What seems to be happening is that for some reason the driver does not recover if it runs out of available rx and/or tx packets. That is to say, the samples does not recover after op->status == PROP_ERROR_RXBUF. The TCP Socket closes with error 116 (timeout).

I seem to be able to recover if I reset the rx buffers each time on drv_rx_start as was present before the commit (https://github.com/zephyrproject-rtos/zephyr/commit/128354ae17fb95ec6c61d73b9060f1a274715464#diff-8278a3f5641432256702d96822ad636a501df47e873791c58ccdd79b77cad1bcL429), but not completely sure why that is.

To speed up testing, I had to set CONFIG_NET_L2_IEEE802154_FRAGMENT_REASS_CACHE_SIZE=8. This causes the Could not get a cache error to not come up after the above mentioned commit (seems it runs out of packets before running out of cache).

Ayush1325 commented 6 months ago

So #72977 does seem to prevent timeout errors, but it is not a complete fix. While it does not time out now, it still stops in under 2 mins. The error is as follows:

[00:01:30.354,797] <err> net_echo_server_sample: TCP (IPv6): Failed to send, closing socket

On further investigation, it seems send fails with error 105, i.e. ENOBUFS

ghost commented 5 months ago

@Ayush1325 Puh - sorry to hear that. Sounds like we had to go back to digging deeper in the code. Debugging with a breakpoint on the non-zero error code path would give you a good indication where this comes from as you see the stack then. Probably you'll have to do several rounds to track it across threads. Can I help something to make this easier for you? (And thanks again for investing all that debugging time here, very much appreciated!)

Ayush1325 commented 5 months ago

I am reopening the issue since as mentioned in the previous comment, while I am no longer geting receive timeout errors, TCP is still not usable due errors while sending.

Ayush1325 commented 4 months ago

I have found the reliability for long running tasks has recently improved https://github.com/zephyrproject-rtos/zephyr/pull/73488 Will close this once that is merged.

ghost commented 3 months ago

@Ayush1325 Can you check please, whether this issue is still required and close otherwise?