zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

Data buffer allocation: TCP stops working #45426

Closed AndreyDodonov-EH closed 2 years ago

AndreyDodonov-EH commented 2 years ago

Describe the bug

Symptom: the TCP connection breaks and neither recovers nor closes (it remains ESTABLISHED). The indicative errors in the log are: conn: ... packet allocation failed and Data buffer ... allocation failed

Supposed reason: TCP TX packets and buffers are a shared resource. They are used for sending data over TCP, re-sending data over TCP, and a number of other things - DHCP, ICMP and probably some others.

In the case of sending and resending data, first net_context.c and then tcp.c allocate packets and buffers from the same memory slab and net_buf pool, respectively.

Normally this wouldn't be a problem, but TCP is usually used via sockets, and in many cases the socket is accessed again and again without any kind of polling. Even when polling is explicitly called, it doesn't change anything. See: https://github.com/zephyrproject-rtos/zephyr/issues/5842

If the socket is in blocking mode, the connection is slow because of sleeping (see e.g. https://github.com/zephyrproject-rtos/zephyr/issues/45367). If the socket is in non-blocking mode, net_context's context_sendto is called without any delay whatsoever.

context_sendto then keeps allocating memory from the TX buffers. If the window is full, the memory is released almost right away. However, since the call is repeated again and again (see above), it has a lot of impact.

At the same time, tcp_send_data tries to allocate buffers for the queue, and it might not be able to, since context_sendto keeps occupying the buffers. tcp_send_data is also called from tcp_resend_data; in that case lost packets are not resent.
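
To make the pattern concrete, here is a rough sketch (illustration only, not code from the stack) of the kind of non-blocking send loop that triggers it:

```c
/* Illustrative anti-pattern: a non-blocking send retried immediately on
 * -EAGAIN. Each attempt makes context_sendto() briefly allocate
 * net_pkt/net_buf from the shared TX pools before giving up, which can
 * starve tcp_send_data()/tcp_resend_data() of buffers.
 */
#include <zephyr/net/socket.h>
#include <stdint.h>
#include <errno.h>

static int flood_send(int sock, const uint8_t *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		ssize_t out = zsock_send(sock, buf + off, len - off, 0);

		if (out >= 0) {
			off += out;
		} else if (errno != EAGAIN) {
			return -errno;
		}
		/* On EAGAIN the loop just spins and calls send again. */
	}
	return 0;
}
```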

The problem is more likely to happen "in the wild", where networks are unstable. For us it was a showstopper with a weak Wi-Fi signal, where resends were necessary not only on the data link layer but also on the TCP layer. I.e., as stated above, big chunks to a non-blocking socket + TCP resends.

Proposed fix: lock TCP (using tcp::lock) before allocating the packet in net_context, preventing the race condition. Then, once the mutex is unlocked and a resend is pending, it is executed right away. If we instead try to execute it at the same time as context_sendto, the resend doesn't have memory.

A tested fix is implemented in https://github.com/zephyrproject-rtos/zephyr/pull/45437, and unlike timing/buffer adjustments, it seems to solve the issue rather than just reduce its probability.

To Reproduce Steps to reproduce the behavior:

I'll try to add a demo app or twister test for QEMU to reproduce that and will reference it here.

Expected behavior TCP keeps working, regardless of how aggressively we call the socket and with what amount of user data.

Impact Showstopper for applications which need long-term stability.

Logs and console output
[00:36:03.926,940] net_pkt: Data buffer (1328) allocation failed.
[00:36:03.926,971] net_tcp: conn: 0x2003595c packet allocation failed, len=1280
[00:36:04.227,142] net_pkt: Data buffer (1328) allocation failed.
[00:36:04.227,172] net_tcp: conn: 0x2003595c packet allocation failed, len=1280
[00:36:04.527,374] net_pkt: Data buffer (1328) allocation failed.
[00:36:04.527,404] net_tcp: conn: 0x2003595c packet allocation failed, len=1280
[00:36:04.827,545] net_pkt: Data buffer (1328) allocation failed.
[00:36:04.827,575] net_tcp: conn: 0x2003595c packet allocation failed, len=1280

Environment (please complete the following information):

Additional context

ssharks commented 2 years ago

This description looks much like https://github.com/zephyrproject-rtos/zephyr/issues/44018. A fix has been made for this issue. Did you test with these fixes?

AndreyDodonov-EH commented 2 years ago

@ssharks Thanks for the tip!

I cherry-picked those changes (I'm on Zephyr SDK NRF version 2.7.99, Nordic's version) as you suggested.

After that TCP is not dying, but I still get a fair share of Data buffer allocation errors - though they are no longer endless and the connection lives on.

AndreyDodonov-EH commented 2 years ago

@ssharks

As far as I understand, the fix from @rlubos in https://github.com/zephyrproject-rtos/zephyr/pull/44227 reuses packets, preventing the pool from running dry, and my PR https://github.com/zephyrproject-rtos/zephyr/pull/45437 ensures that a pending resend is called right away via mutex_unlock.

But apparently I introduced another deadlock.

It's a little bit hard for me to test, because my primary target is Zephyr SDK NRF version 2.7.99, and there the target hardware works much better with this change (tried both server and client modes), but I couldn't even build the socket/tcp test on that version of Zephyr...

I'll look further into why the deadlock remains even with https://github.com/zephyrproject-rtos/zephyr/pull/45018.

Update:

My code wasn't accounting for send and recv happening at the same time.

So now, by leveraging the context's lock, I just ensure that resend is triggered once it is due, without skipping iterations and getting Data buffer allocation errors.

ssharks commented 2 years ago

Thanks for sharing your progress. The matter is not easy. A few things to keep in mind:

Sending, re-sending and the send window are all competing for the same pool of buffers. Send packets are quickly freed up, but when the send window grabs buffers, it won't release them until the data is acknowledged by the receiver or the connection terminates. Using a too-big send window (by one or more connections) can result in a deadlock situation. At the moment, managing the size of the send window is the responsibility of the user. To prevent catastrophic failures, the total used send buffers should leave at least N MTU packets of remaining buffer space, where N is the number of copies needed to perform a packet send.

Your proposed locking for competing buffer allocations only works for this particular connection. Other TCP connections, or network protocols like DNS, are also competing and are not covered by this fix.

A third thing to keep in mind is that TCP uses a single workqueue. This almost looks like multithreading, but a single function blocking in the workqueue will result in the other operations in the workqueue not being scheduled any more. Therefore locking within the TCP stack should be used with great care.
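
As a toy illustration of this point (not actual stack code), two items on the same workqueue are enough to show the effect:

```c
/* Minimal sketch: if slow_handler() blocks (standing in for blocking on
 * a lock), resend_handler() queued on the same workqueue cannot run
 * until it returns, even if its deadline has long passed.
 */
#include <zephyr/kernel.h>

static void slow_handler(struct k_work *work)
{
	k_sleep(K_SECONDS(5)); /* stands in for blocking on a mutex */
}

static void resend_handler(struct k_work *work)
{
	/* Retransmission-like work; delayed by slow_handler() above. */
}

K_WORK_DEFINE(slow_work, slow_handler);
K_WORK_DEFINE(resend_work, resend_handler);

void demo(void)
{
	k_work_submit(&slow_work);   /* runs first and blocks the queue */
	k_work_submit(&resend_work); /* waits ~5 s before it can start  */
}
```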

I have been thinking about a similar problem (filling of the send queue) for some time. Both problems can become so complex that I'm starting to doubt whether you can solve them analytically (by some kind of mutexes). Currently I'm on the path of testing out polling with exponentially growing delays. This exploits low delays when possible, but prevents a system from locking up due to high polling rates when it becomes busier.
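
Roughly what I mean, as an application-side sketch (function name and bounds are made up for illustration):

```c
/* Sketch of the exponential-backoff polling idea, applied by the caller
 * of send(): retry on -EAGAIN, doubling the wait each time so a busy
 * system is not hammered with retries.
 */
#include <zephyr/kernel.h>
#include <zephyr/sys/util.h>
#include <zephyr/net/socket.h>
#include <errno.h>

static ssize_t send_with_backoff(int sock, const void *buf, size_t len)
{
	int delay_ms = 1;               /* short first wait             */
	const int max_delay_ms = 256;   /* cap so latency stays bounded */

	for (;;) {
		ssize_t out = zsock_send(sock, buf, len, 0);

		if (out >= 0 || errno != EAGAIN) {
			return out;
		}
		/* Window/buffers full: back off before the next attempt. */
		k_msleep(delay_ms);
		delay_ms = MIN(delay_ms * 2, max_delay_ms);
	}
}
```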

AndreyDodonov-EH commented 2 years ago

@ssharks

Thanks for your input!

Regarding the first consideration: there is a default send window. However, I totally agree that the user is responsible for explicitly setting the window as well, if needed.

But another point for the user to consider - and normally this point shouldn't need considering - is the size of the chunk the user is feeding to the socket send. In your socket test test_v4_send_recv_large, it's fairly small. The problem I'm stating is that the size of the chunk constantly occupies extra buffers: context_sendto() allocates buffers and calls net_tcp_queue_data(), which says that the window is full and returns -EAGAIN; the buffers are then released by context_sendto(), but it is called again and again, because we flood the socket (see https://github.com/zephyrproject-rtos/zephyr/issues/5842).
So the user has to keep in mind that more buffers need to be allocated than would normally be needed, proportional to the size of the chunk. Otherwise, there is a possibility of this issue.
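
As a rough illustration (values are assumptions and depend on the application and the Zephyr version, not recommendations), the relevant knobs would be something like:

```
# prj.conf sketch: keep headroom in the TX pools beyond what the
# send window can pin while waiting for ACKs.
CONFIG_NET_PKT_TX_COUNT=32
CONFIG_NET_BUF_TX_COUNT=64
# Limit how much unacknowledged data TCP may keep queued (bytes);
# option name as in newer trees, if your version has it.
CONFIG_NET_TCP_MAX_SEND_WINDOW_SIZE=4096
```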

Regarding the second consideration: I agree that my fix addresses only TCP. Correct me if I am wrong, but the lock used is a member of the TCP structure, https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/ip/tcp_private.h#L223, so the locking is connection-specific, as far as I understand. If you have any ideas how we can make it work for DNS, DHCP and other protocols using the same TX buffers, it would be great to discuss them. The easiest option, I think, is fixing https://github.com/zephyrproject-rtos/zephyr/issues/5842, and then the user should be using poll/select (another ToDo here would be to use it in built-in Zephyr wrappers, such as in http_client).

I totally agree with the third point, but the lock is used there anyway. Locking the context's mutex just ensures that the function is triggered once it's due, instead of being skipped, because resend getting buffers is vital for the TCP connection. But of course I agree that we have to be careful here, so that no new deadlocks slip in. That's why it would be nice if more experienced members, e.g. @rlubos, had a look at this.

As for solving the problem analytically: I do think it is possible. Let's consider a case outside the implementation, just the algorithm.

Basically, it is sufficient for send not to accept new data from the application until we have enough memory for it (i.e. enough outgoing length has been acknowledged), and for resend not to consume any new memory, just taking references to the data which has already been fed to the socket. That's it.

It's almost like that in Zephyr (@rlubos correct me if I am wrong), but with two differences:

  1. We do allocate memory anyway for a short period of time in context_sendto, even if we are not able to accept more data at the TCP level - between context_alloc_pkt and net_pkt_unref. And since the thread where context_sendto runs might have a lower priority than the workqueue from which resend is called, this issue arises.

  2. Resend doesn't really consume new memory, it just restructures it, by first allocating a packet, then copying, then unreferencing the copied ones. If this happens by taking priority over context_sendto, this issue arises (just rephrasing 1. here).

@ssharks As for polling, I think that instead of using exponential timeouts, it's better to leverage Zephyr kernel mechanisms such as the Polling API - by polling the queue (e.g. of buffers) or by polling for a signal which is raised once we have some space again. This is of course a very crude consideration, but I think it's doable. I think the polling part of the discussion can be continued here: https://github.com/zephyrproject-rtos/zephyr/issues/5842
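
To sketch what I mean (names and the hook where the stack would raise the signal are hypothetical):

```c
/* Very rough sketch of the signal idea: the stack raises the signal
 * with the amount of space that opened up, and the application waits
 * for it instead of spinning on -EAGAIN.
 */
#include <zephyr/kernel.h>

static struct k_poll_signal tx_space_sig;

void tx_space_init(void)
{
	k_poll_signal_init(&tx_space_sig);
}

/* Would be called by the stack when TX space frees up again. */
void tx_space_notify(int free_bytes)
{
	k_poll_signal_raise(&tx_space_sig, free_bytes);
}

/* Application side: wait for space, then read how much was reported. */
int tx_space_wait(k_timeout_t timeout, int *free_bytes)
{
	struct k_poll_event ev;
	int ret;

	k_poll_event_init(&ev, K_POLL_TYPE_SIGNAL,
			  K_POLL_MODE_NOTIFY_ONLY, &tx_space_sig);

	ret = k_poll(&ev, 1, timeout);
	if (ret == 0) {
		unsigned int signaled;

		k_poll_signal_check(&tx_space_sig, &signaled, free_bytes);
		k_poll_signal_reset(&tx_space_sig);
	}
	return ret;
}
```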

rlubos commented 2 years ago

We do allocate memory anyway for a short period of time in context_sendto, even if we are not able to accept more data at the TCP level - between context_alloc_pkt and net_pkt_unref. And since the thread where context_sendto runs might have a lower priority than the workqueue from which resend is called, this issue arises.

This sounds correct. That's why I think we should focus on implementing POLLOUT properly, so that a non-blocking socket will not unnecessarily occupy the buffers if it won't be able to send anyway. This would require some kind of signalling mechanism from TCP to net_context/socket, to notify that the TX window is available again. This could also be used in case of a blocking socket, so that it doesn't wait blindly before trying to requeue the data at the TCP level.

Resend doesn't really consume new memory, it just restructures it, by first allocating a packet, then copying, then unreferencing the copied ones. If this happens by taking priority over context_sendto, this issue arises (just rephrasing 1. here)

Still, there needs to be a buffer available to allocate the "copy" intended for sending. If some other module in the system hogs TX buffers (like a spinning TX thread with a non-blocking socket), it may fail to retransmit the data.

AndreyDodonov-EH commented 2 years ago

This would require some kind of signalling mechanism from TCP to net_context/socket, to notify that the TX window is available again.

@rlubos Yes, this would be great, eliminating the flooding of the socket in the non-blocking case and the need for sleep in the blocking case.

@ssharks That's what I meant when saying that ideally we shouldn't use the k_sleep(WAIT_BUFS); sleep at all, rather than just adjusting its value.

rlubos commented 2 years ago

I'm working on a semaphore-based solution to notify the upper layers whether transfers are possible. The idea is that the upper layer could monitor the semaphore status with k_poll() instead of waiting blindly. It should also allow monitoring the semaphore in poll() to implement POLLOUT.
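
Roughly the shape I have in mind, as a sketch (the semaphore name and where it is initialized/given are still open; only the monitoring side is shown):

```c
/* Sketch: a semaphore fully managed by the TCP layer that the socket
 * layer only observes via k_poll(). tx_window_sem is a hypothetical
 * name for illustration.
 */
#include <zephyr/kernel.h>

static struct k_sem tx_window_sem;

void tx_window_sem_setup(void)
{
	/* TCP side: kept "given" whenever there is room in the window. */
	k_sem_init(&tx_window_sem, 1, 1);
}

/* Socket side: block until the window opens. k_poll() only observes
 * availability, it does not take the semaphore.
 */
int wait_for_tx_window(k_timeout_t timeout)
{
	struct k_poll_event ev;

	k_poll_event_init(&ev, K_POLL_TYPE_SEM_AVAILABLE,
			  K_POLL_MODE_NOTIFY_ONLY, &tx_window_sem);

	return k_poll(&ev, 1, timeout);
}
```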

AndreyDodonov-EH commented 2 years ago

@rlubos Just a consideration: would it make sense to use a signal with k_poll_signal_raise, because then we can pass an extra integer?

Because ideally we need to know how much data we can send, not just that we can send. That is also consistent with the return codes of poll and select.

As a side note, I think there is also a mapping error in the VTABLE, in that socket poll calls Zephyr select.

rlubos commented 2 years ago

Just a consideration, would it make sense to use signal of k_poll_signal_raise, because then we can pass extra integer?

I considered using signals initially, but it'd be more complex to implement signalling properly due to synchronization issues. The socket layer would need to reset the signal with k_poll_signal_init() before using it in k_poll() (at least that's what the documentation states), so we'd have a potential race condition: if, for example, data is acknowledged at the TCP level before the signal is reset, we'd not catch it.

A semaphore, on the other hand, can be fully internally managed by the TCP layer, so the socket layer only needs to monitor it with k_poll().

Also, I don't think the socket layer should bother about the actual data length that can be sent; it only needs to know whether sending is possible or not. The net_context layer returns the information on how much data it was able to send, so any synchronization should take place between the TCP and net_context layers - ideally TCP should only queue as much data as it can. I need to check whether it's actually implemented properly.

As a side note, I think there is also a mapping error in VTABLE, that socket poll calls zephyr select.

Can you elaborate on that? The poll() function per se is not part of the socket vtable.

AndreyDodonov-EH commented 2 years ago

I checked POSIX, and you are right: it should just return the number of sockets available, and since we are doing it on a socket-per-socket basis anyway, returning just 1 when we can send data again is fine. Middlewares, e.g. civetweb, also expect that: https://github.com/civetweb/civetweb/blob/master/src/civetweb.c#L5966

As for the poll/select, I formulated it incorrectly, sorry. I just remembered there was a mismatch after seeing this discussion: https://github.com/zephyrproject-rtos/zephyr/discussions/36072#discussioncomment-845554, which states that select() is using poll() - so the other way around from what I originally stated.

ssharks commented 2 years ago

Great work guys, the k_poll implementation will be a huge improvement over the current wait loop. My proposed exponential wait time was a low-effort intermediate solution to gain a quick significant improvement, but definitely not a final solution. This will work for the case where the transmit buffer is full.

I'm a little skeptical whether it is feasible to make a rock-solid implementation for the situation where there are no buffers available any more. Apparently you guys are more experienced in this matter than I am, so I'm patiently waiting for positive results.

AndreyDodonov-EH commented 2 years ago

@ssharks If there are no buffers, then with the polling solution from @rlubos, poll will return only once there are free buffers again, or once the timeout expires (if specified). Then it's up to the app to retry.

Consider e.g. civetweb's usage of polling.

We send: https://github.com/civetweb/civetweb/blob/master/src/civetweb.c#L6083 Then, if there are no buffers, we poll until some buffers are free: https://github.com/civetweb/civetweb/blob/master/src/civetweb.c#L6137
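
In Zephyr socket terms, the equivalent pattern would look roughly like this (just a sketch, assuming POLLOUT gets reported for stream sockets as proposed; error handling trimmed):

```c
/* Sketch of the send-then-poll pattern with the Zephyr socket API. */
#include <zephyr/net/socket.h>
#include <errno.h>

static ssize_t send_or_wait(int sock, const void *buf, size_t len,
			    int timeout_ms)
{
	ssize_t out = zsock_send(sock, buf, len, 0);

	if (out < 0 && errno == EAGAIN) {
		struct zsock_pollfd pfd = {
			.fd = sock,
			.events = ZSOCK_POLLOUT,
		};

		/* Sleep until the stack says sending is possible again,
		 * then retry once.
		 */
		if (zsock_poll(&pfd, 1, timeout_ms) > 0 &&
		    (pfd.revents & ZSOCK_POLLOUT)) {
			out = zsock_send(sock, buf, len, 0);
		}
	}
	return out;
}
```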

rlubos commented 2 years ago

I've prepared a proof-of-concept implementation of the TX window signaling and POLLOUT: https://github.com/rlubos/zephyr/tree/tcp/tx-window-avaialble-signalling

I've tested it briefly and it seems to work fine, the transfer in the TCP test gets a significant boost, and I've also added a test that verifies POLLOUT for stream sockets. I'm still not confident enough though to propose it as a PR, for several reasons:

  1. The ZWP that we send is sent from within net_tcp_queue_data(), which isn't great: in case the peer reports a zero window and we're inside poll() monitoring POLLOUT, we might get stuck in there until the timeout. I need to reimplement this so that ZWP sending is triggered periodically by a delayed work instead (see the sketch below).
  2. TCP_DATA_MODE_RESEND gave me a real headache; IMO the retransmission logic in our TCP stack is really hard to follow. What I specifically dislike is that entering this mode clears the unacked_len counter, which is used to track whether the window is full or not - there may be corner cases that I did not consider. Therefore I think we need more tests to verify its functionality, or even consider reimplementing it.

Anyway, I wanted to share this already, it would be good to get some initial feedback or even test it.
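
For point 1, the reimplementation I have in mind is roughly the following (hypothetical names, a sketch and not the actual branch code):

```c
/* Sketch: move ZWP transmission to a delayed work item, so a
 * zero-window probe is sent on a timer rather than from
 * net_tcp_queue_data().
 */
#include <zephyr/kernel.h>

static struct k_work_delayable zwp_work;

static void zwp_handler(struct k_work *work)
{
	/* Send a zero-window probe here, then re-arm while the peer
	 * still reports a zero window.
	 */
	k_work_reschedule(&zwp_work, K_SECONDS(1));
}

void zwp_start(void)
{
	k_work_init_delayable(&zwp_work, zwp_handler);
	k_work_reschedule(&zwp_work, K_SECONDS(1));
}
```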

rlubos commented 2 years ago

Actually, I have to agree with @ssharks here regarding net_buf/net_pkt availability; I don't really see a good/easy way to monitor this, for several reasons:

  1. The mem slab used as the net_pkt pool is not monitorable by k_poll(),
  2. I don't see an easy way to predict how many net_bufs would actually be needed to successfully transfer a packet, due to all of the cloning taking place at the TCP layer,
  3. It'd be necessary to monitor both net_pkt and net_buf availability, which adds yet another level of complexity,
  4. And finally, since the net_pkt/net_buf pools are shared across the whole networking stack, it'd be really tricky to ensure that a net_buf/net_pkt is still available even if we detect that it is, i.e. in case the scheduler kicks in and some other thread grabs the packet between the poll() exit and the actual packet allocation in send().

So to summarize, providing a reasonable way to monitor buffer availability at the socket level would require a lot of work and a lot of consideration. I do think, however, that this might not even be that essential a feature - the transmit window is directly related to the number of buffers available in the system, and filling up the TX window means we have fewer buffers available, so monitoring the transmit window might be enough to write robust applications with non-blocking sockets. Note that even for a non-blocking socket the pkt/buffer allocation is blocking, so there's no risk of busy-looping and hogging the CPU, which was the core of this report as far as I understood.

AndreyDodonov-EH commented 2 years ago

I've prepared a proof-of-concept implementation of the TX window signaling and POLLOUT: https://github.com/rlubos/zephyr/tree/tcp/tx-window-avaialble-signalling

I've tested it briefly and it seems to work fine, the transfer in the TCP test gets a significant boost, and I've also added a test that verifies POLLOUT for stream sockets. I'm still not confident enough though to propose it as a PR, for several reasons:

  1. The ZWP that we send is sent from within net_tcp_queue_data(), which isn't great: in case the peer reports a zero window and we're inside poll() monitoring POLLOUT, we might get stuck in there until the timeout. I need to reimplement this so that ZWP sending is triggered periodically by a delayed work instead.
  2. TCP_DATA_MODE_RESEND gave me a real headache; IMO the retransmission logic in our TCP stack is really hard to follow. What I specifically dislike is that entering this mode clears the unacked_len counter, which is used to track whether the window is full or not - there may be corner cases that I did not consider. Therefore I think we need more tests to verify its functionality, or even consider reimplementing it.

Anyway, I wanted to share this already, it would be good to get some initial feedback or even test it.

I cherry-picked it on top of Zephyr-SDK 2.7.99, and it seems to work just fine. I did some stress tests with 4 KB chunks in send over a Wi-Fi AP with civetweb as the webserver, which were previously failing, and they are fine now.

Also, I extended http_client with zsock_poll, and it also seems to work stably.

AndreyDodonov-EH commented 2 years ago

Actually, I have to agree with @ssharks here regarding net_buf/net_pkt availability; I don't really see a good/easy way to monitor this, for several reasons:

  1. The mem slab used as the net_pkt pool is not monitorable by k_poll(),
  2. I don't see an easy way to predict how many net_bufs would actually be needed to successfully transfer a packet, due to all of the cloning taking place at the TCP layer,
  3. It'd be necessary to monitor both net_pkt and net_buf availability, which adds yet another level of complexity,
  4. And finally, since the net_pkt/net_buf pools are shared across the whole networking stack, it'd be really tricky to ensure that a net_buf/net_pkt is still available even if we detect that it is, i.e. in case the scheduler kicks in and some other thread grabs the packet between the poll() exit and the actual packet allocation in send().

So to summarize, providing a reasonable way to monitor buffer availability at the socket level would require a lot of work and a lot of consideration. I do think, however, that this might not even be that essential a feature - the transmit window is directly related to the number of buffers available in the system, and filling up the TX window means we have fewer buffers available, so monitoring the transmit window might be enough to write robust applications with non-blocking sockets. Note that even for a non-blocking socket the pkt/buffer allocation is blocking, so there's no risk of busy-looping and hogging the CPU, which was the core of this report as far as I understood.

The core issue was resend being called again and again forever to no avail.

Actually, when cherry-picking your changes, I found that the problem we had on NRF-SDK 2.7.99 was partly fixed here: https://github.com/zephyrproject-rtos/zephyr/commit/208927d640081c3d4d27c8a9adb960f543bf5bc8 So even just that fix already at least leads to the TCP connection closing, instead of staying dead forever.

Then, in combination with your polling suggestion, resend also gets the necessary memory, so it seems fine.

AndreyDodonov-EH commented 2 years ago

Side note: I also find the current memory-sharing system extremely complex and limiting in some aspects, and the resend logic somewhat intertwined. I have a feeling that, as it is, it will be quite hard to extend TCP.

And judging by https://github.com/zephyrproject-rtos/zephyr/issues/45379, as well as the observations I started to sum up here - https://github.com/zephyrproject-rtos/zephyr/issues/45367#issuecomment-1120905595 - there are a lot of extensions/fixes needed.

rlubos commented 2 years ago

Well, a common packet/buffer pool might bring some complexity and unpredictable behaviour when it comes to timing, but on the other hand it also lowers the overall memory consumption; if we had to keep a separate memory pool for each net_context, for instance, we'd need way more memory. With TCP we need to be especially careful with the buffer count selection, as first it allocates two net_pkts permanently per TCP context, and second it hogs net_bufs when storing data for retransmission. The latter can be fine-tuned with the TX window size, so we have tools to mitigate this.

As for the retransmission logic, I can only agree - when working with the new TCP stack I found this the most complicated part. That's why I think we need to consider whether it could be simplified somehow (if possible).

For now, I plan to rewrite the ZWP transmission logic, so we no longer rely on calling net_tcp_queue_data(), and do some more testing of the TX window notification when retransmission kicks in. I should be confident enough then to submit a PR.

AndreyDodonov-EH commented 2 years ago

Did some more testing with the cherry-picked polling implementation, and there are some issues with TLS.

The tls_mbedtls_handshake function prints a TLS handshake error; this was working previously.

rlubos commented 2 years ago

@AndreyDodonov-EH Can you elaborate a bit on how to reproduce the failure? I tried with tls_ext tests and echo samples with TLS enabled and they just work.

AndreyDodonov-EH commented 2 years ago

@rlubos I have to apologize - the TLS failure seems to have been caused by my incorrect cherry-picking. I'm working with NRF's Zephyr-SDK 2.7.99, so I had to cherry-pick.

Now I picked only the changes relevant to polling, and it works. Sorry for that.

Side note: I didn't manage to correctly cherry-pick the receive window handling from you: downloads of big files with the default packet counts and window size often just silently fail, downloading fewer bytes than expected. But as I said, it is most probably my fault.

It would be cool to back-port this to Zephyr-SDK, or at least ensure that it lands in nRF Connect SDK 2.0 (which I am extremely anticipating because of TF-M). I see that its main is already on Zephyr 3.0, which is great.

rlubos commented 2 years ago

@AndreyDodonov-EH I've reimplemented ZWP and tested the implementation a bit more with a retransmission scenario, and it seems to work fine, so I've opened https://github.com/zephyrproject-rtos/zephyr/pull/45626.

As for sdk-zephyr in NCS, I'll create a PR which brings all of the essential TCP fixes into the tree; hopefully we can squeeze it into NCS 2.0.

gsshewalkar commented 9 months ago

I've prepared a proof-of-concept implementation of the TX window signaling and POLLOUT: https://github.com/rlubos/zephyr/tree/tcp/tx-window-avaialble-signalling

I've tested it briefly and it seems to work fine, the transfer in the TCP test gets a significant boost, and I've also added a test that verifies POLLOUT for stream sockets. I'm still not confident enough though to propose it as a PR, for several reasons:

  1. The ZWP that we send is sent from within net_tcp_queue_data(), which isn't great: in case the peer reports a zero window and we're inside poll() monitoring POLLOUT, we might get stuck in there until the timeout. I need to reimplement this so that ZWP sending is triggered periodically by a delayed work instead.
  2. TCP_DATA_MODE_RESEND gave me a real headache; IMO the retransmission logic in our TCP stack is really hard to follow. What I specifically dislike is that entering this mode clears the unacked_len counter, which is used to track whether the window is full or not - there may be corner cases that I did not consider. Therefore I think we need more tests to verify its functionality, or even consider reimplementing it.

Anyway, I wanted to share this already, it would be good to get some initial feedback or even test it.

@rlubos, I am facing the same issue with net_pkt buffers and want to cherry-pick your PoC changes, but I am not able to access the https://github.com/rlubos/zephyr/tree/tcp/tx-window-avaialble-signalling branch in your space. Could you please assist?

rlubos commented 9 months ago

@gsshewalkar The changes have already been integrated into Zephyr main, hence the branch was deleted. The PR https://github.com/zephyrproject-rtos/zephyr/pull/45626 was merged and the commits can be found here.