bluetooth: Behavior change in host causing MESH to fail on sending messages

m-alperen-sener commented 3 months ago

Describe the bug

The changes introduced with this PR and additional PR to avoid deadlocks by overriding K_FOREVER with K_NO_WAIT in buffer allocation requested by syswq. Thus, this changes introduces a big behavior change for the layers that uses host and syswq. For MESH stack it causes to loose mesh messages send by using syswq.

Those changes propagates the failures caused by host and controller interaction mechanism to the upper layers and for the app side it does not make so much sense but confusion. That changes necessitates changes that should handles this particular corner case by every bt_host user.

The possible mitigations:

Increasing the resources; like increasing the count of the HCI cmd buffers however it is likely to introduce scaling issues for upper layers end application.
Another solution seems to eliminate the usage of the sysqw in MESH stack, it would eliminate the failure for most part of the mesh however still the end application is susceptible to the same mistake by calling certain MESH APIs from sysqw and failing directly, the same way we got failures from the host. Thus, this option does not seems like a complete solution.

Alongside this bug in MESH, the behavioral change is not visible enough for bt_host users in documentation.

m-alperen-sener commented 3 months ago

@PavelVPV and @jori-nordic can you make a comment on that issue.

jori-nordic commented 3 months ago

I had added this to the documentation. Since it's a net change, not only for the host: https://github.com/zephyrproject-rtos/zephyr/blob/ad3e941ad3ed9e3e2f9032b0151558a948e99fb9/include/zephyr/net/buf.h#L1333-L1335

I agree, it's not super visible. But such are a lot of upstream changes, that's why we have tests. The host is very synchronous, and as such very prone to deadlocks. Additionally, there are a lot of things that run on the syswq now: application items, various subsystems, and even some GPIO drivers. Blocking this "central" scheduler is not nice for the other users. That's the rationale behind those two PRs.

Since we can't just malloc and spawn worker threads like on a linux machine, we have to change the host API to be non-blocking and allow operations to fail. That's how it is on (multi-user) embedded systems.

Unfortunately, it does push the problems upwards, like you've seen. The two mitigations you highlight are good:

Is not scalable, as you point out.
Won't work without additional API changes: I would suggest making changes to MESH API to be more asynchronous too, and only allocate and queue controller commands from a separate thread. Sending from syswq should be ok, allocating is more dicey. Application should also handle not being able to send a message at any given time.
If you have memory, you can try a simple trick: reschedule your work item when you get a failure. Chances are, when you are scheduled again you will get the resources to send.
Since the host manages the buffer pools, we need a new API, that signals other users that an allocation is likely to succeed. We don't have that right now. I made https://github.com/zephyrproject-rtos/zephyr/issues/77249 to track this. Feel free to comment on it too.

An additional note: it has taken 3+ months for us to detect this issue. It would be good to have whatever tests that detected the issue run on the latest upstream. I would even suggest on any PR tagged with "Bluetooth" or at least "Bluetooth host".

alxelax commented 3 months ago

Shouldn't such kind of changes pass Architecture review in Bluetooth group first to consider impact on all existing subsystems within Zephyr? at least related to Bluetooth. It is quite well that Host improved behavior. However, it made any Mesh communication unreliable, even specification requires opposite behavior. It even cannot pass qualification tests with default configuration anymore. Seems to consider new Host changes Mesh requires architecture redesign(or at least detailed review) all places those use system work queue for sending messages. Until above, Mesh, it seems, stopped being reliable in Zephyr. Now, every single configuration needs to be tuned to be sure that Host has sufficient resources to handle all tx requests.

jori-nordic commented 3 months ago

Shouldn't such kind of changes pass Architecture review in Bluetooth group first

It did, I talked a whole bunch about the new arch in the public workgroup meetings, even had diagrams and stuff. There was some amount of discussion on the PR too. No-one objected in the end.

even specification requires opposite behavior

What do you mean, the spec has a say on the implementation details?

stopped being reliable in Zephyr

We can argue it never was. As a badly-tuned config + alloc at the wrong time/place could end up deadlocking the whole device. This is what we're trying to move away from, this kind of "works most of the time" behavior.

jori-nordic commented 3 months ago

WG meeting:

@alwa-nordic and @ubieda : Could work around this by Mesh having its own workqueue
@jori-nordic : Mesh could have a common check in API and return an error if run on syswq
@ubieda @alwa-nordic : The rationale behind the change is on the original PR
@jhedberg : What about memory constrained platforms? A separate wq might not fit available resources
@jhedberg @alwa-nordic : We should make an API like bt_hci_cmd_send_sync(), but in which the user also passes in struct cmd_data. This could be a fully async call, and should be used internally by bt_hci_cmd_send_sync().
@alwa-nordic will look into implementing this API
@alwa-nordic also have a way of "locking" the cmd tx queue to make sure a bunch of commands can be sent back to back
@jhedberg linux has this in the form of a queue/list. But they don't have return values / callbacks
@alwa-nordic a potential use would be updating the resolving list. You want to generate the commands JIT, to save on resources. So a simple queue of buffers is not powerful enough.

alxelax commented 3 months ago

Shouldn't such kind of changes pass Architecture review in Bluetooth group first

It did, I talked a whole bunch about the new arch in the public workgroup meetings, even had diagrams and stuff. There was some amount of discussion on the PR too. No-one objected in the end.

This is quite sad. Seems the architecture discrepancy between new Host approach after fix and existing Mesh design wasn't taken into account on review.

even specification requires opposite behavior

What do you mean, the spec has a say on the implementation details?

Nop. I meant changing in behavior. Applications do not expect that device cannot deliver data because of lack of buffers in Host due to previous solution didn't assume it in general. All Zephyr Mesh users didn't add anything related to solve this instability. If somebody sent frame reliably (with acknowledgement on transport layer for example) they expect close to 100% delivery ratio. They do not expect skipping in Host.

stopped being reliable in Zephyr

We can argue it never was. As a badly-tuned config + alloc at the wrong time/place could end up deadlocking the whole device. This is what we're trying to move away from, this kind of "works most of the time" behavior.

I do not want to argue about reliability of approaches. I'm not against about any improvements. My main concern is that this improvement was stopped in a middle of way. The Host was improved, but it added architecture discrepancy between Host and Mesh. That caused significant unreliability in Mesh(cannot pass qualification anymore without buffers tuning for different use cases) and impacted on all Zephyr BLE Mesh users.

Now it requires significant redesign of some places Mesh code(a lot of them). I'd like to know plans about further improvements to remove architecture discrepancy before next Zephyr release (or when?).

If there are no sources to remove architecture discrepancy, could you add (under KConfig option) the old behavior to not break application behavior for Mesh users. It will be a workaround until Mesh will be aligned to new Host approach.

@jori-nordic, @alwa-nordic, @ubieda, @jhedberg any ideas?

jori-nordic commented 3 months ago

Discussed IRL with @PavelVPV

Mesh is using the host advertiser module. They are not sending HCI commands on their own. That means that even if we provide the async API ASAP (we won't, we should design it well), we basically have to refactor adv.c AND id.c to use that API.

They also have never seen the deadlocks due to HCI commands in production. I think the propensity to deadlock is still there, though.

The easy fix is to add an additional work thread to send from. Unfortunately, the mesh applications barely fit the nRF52832, which is still the main chip customers use. Adding the extra workqueue (with a big stack _{because the host likes big stacks and cannot lie}) will overflow the RAM.

My proposition to pavel is then to:

As a "hotfix": Add a promptless kconfig that disables the overwriting of K_FOREVER to K_NO_WAIT in net_buf_alloc. Basically what @alxelax suggested. https://github.com/zephyrproject-rtos/zephyr/pull/77678
As a temporary fix, add an hci_cmd_buf_destroyed event in the host. So Mesh has a signal to when they should reschedule sending their data.
On a longer-timeline: Gather requirements from mesh, on how they use the host, what patterns etc. Use that input to re-design the host's advertising module API so that it gives more control to the app and allows the mesh stack to work safely on resource-constrained devices.

jori-nordic commented 3 months ago

They also have never seen the deadlocks due to HCI commands in production

Famous last words lol. The PR instantly deadlocks, also observed by @PavelVPV on hardware.

It's due to the conn_auto_initiate() function that's trying to allocate command buffers, at the same time as mesh is waiting for command buffers on the system workqueue.

So we're proposing a three-pronged fix instead:

@PavelVPV could you verify the mesh tests (HW + bsim) pass with all three patches combined? On my side, I'll try to backport to NCS and run the dragoon bluetooth testsuite tomorrow.

PavelVPV commented 3 months ago

-> > They also have never seen the deadlocks due to HCI commands in production

Famous last words lol. The PR instantly deadlocks, also observed by @PavelVPV on hardware.

Well, what I meant is it worked before these 2 PRs (and probably some other patches added after them (if there were some)). :)

@PavelVPV could you verify the mesh tests (HW + bsim) pass with all three patches combined? On my side, I'll try to backport to NCS and run the dragoon bluetooth testsuite tomorrow.

I've finally managed to check patches and I think we need to apply only these 2:

With them our tests that that caught this issue pass (I mean on CI with real hardware plus PTS) like before with CONFIG_BT_BUF_CMD_TX_COUNT=2.

I suggest to close https://github.com/zephyrproject-rtos/zephyr/pull/77678 as I came to a conclusion that it doesn't help us really, it doesn't revert the whole thing as the major changes were done in the second PR. It will create problems for customers (and for us eventually) if they catch the deadlock. And we will end up in a situation where we just have to increase CONFIG_BT_BUF_CMD_TX_COUNT value. I was about to do that until I found the issue fixed in https://github.com/zephyrproject-rtos/zephyr/pull/77735. Mesh should be able to recover even without having a knowledge of available cmd buffers in host, so we in mesh should deeply investigate all failures before bumping up CONFIG_BT_BUF_CMD_TX_COUNT value.

I haven't tested https://github.com/zephyrproject-rtos/zephyr/pull/77710 because I'm a bit uncertain about it. I still have concern that bt rx thread can also be the reason for the deadlocks as host keeps using K_FOREVER in this thread and processes incoming data from the controller, though I have no idea if there is a code that can cause such situation. But I think this can happen potentiality. In this regard, as a part of the proper fix I'd recommend to revert https://github.com/zephyrproject-rtos/zephyr/pull/71697 because this IMHO fixes bluetooth specific issue by altering behavior of another subsystem. There can be many implementations without bluetooth which may try to allocate a net buf in syswq with K_FOREVER for their own reasons and I'm not sure why they can't do this anymore. IMHO this should be fixed in the host instead.

jori-nordic commented 3 months ago

IMHO this should be fixed in the host instead.

Yes, after thinking about it and talking with you and alex, I changed my mind. We should allow applications to shoot themselves in the foot. But maybe print a warning or something if the thread has been blocked for a while? There is already a mechanism like that in net_bufs I think. Anyway, reverting that PR will have to be done when we properly fix the root cause(s) in the host like you said.

jori-nordic commented 3 months ago

I'll try to backport to NCS and run the dragoon bluetooth testsuite tomorrow

I've verified that the two patches you mention do not break our downstream tests.

PavelVPV commented 3 months ago

We should allow applications to shoot themselves in the foot. But maybe print a warning or something if the thread has been blocked for a while?

Agree, could this be a part of Thread Analyzer perhaps? It already gathers statistic about CPU usage, perhaps it could also gather statistic about how much a thread was pending on a object (semaphore or something else)?

github-actions[bot] commented 1 month ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

LingaoM commented 4 weeks ago

I haven't tested #77710 because I'm a bit uncertain about it. I still have concern that bt rx thread can also be the reason for the deadlocks as host keeps using K_FOREVER in this thread and processes incoming data from the controller, though I have no idea if there is a code that can cause such situation.

https://github.com/zephyrproject-rtos/zephyr/pull/80718

@PavelVPV Almost bbsim testcase failed due to attempt call K_FOREVER in BT Rx.

PavelVPV commented 4 weeks ago

I haven't tested #77710 because I'm a bit uncertain about it. I still have concern that bt rx thread can also be the reason for the deadlocks as host keeps using K_FOREVER in this thread and processes incoming data from the controller, though I have no idea if there is a code that can cause such situation.

80718

@PavelVPV Almost bbsim testcase failed due to attempt call K_FOREVER in BT Rx.

I'd rather revert some changes and think thoroughly why deadlocks happen. Now I see we start patching code here and there including internal zephyr modules, people also unhappy with net_buf change (https://github.com/zephyrproject-rtos/zephyr/issues/80167).

LingaoM commented 4 weeks ago

people also unhappy with net_buf change (#80167).

I think the restrictions on sysworkq are more severe than the side effects of net_buf, because sysworkq is a system work queue that can be called by any subsys or application, but now due to the mandatory dependency of Bluetooth host on sysworkq, sysworkq cannot be blocked elsewhere, also include alloc net_buf.

PavelVPV commented 4 weeks ago

people also unhappy with net_buf change (#80167).

I think the restrictions on sysworkq are more severe than the side effects of net_buf, because sysworkq is a system work queue that can be called by any subsys or application, but now due to the mandatory dependency of Bluetooth host on sysworkq, sysworkq cannot be blocked elsewhere, also include alloc net_buf.

It all depends on a use case. But what I see, instead of fixing this in the host (or at least trying to understand what exactly happens), we are patching the kernel code and 3rd party modules. The next step then is to change K_FOREVER to K_NO_WAIT in k_sem_take and k_mutex_lock functions.

zephyrproject-rtos / zephyr

bluetooth: Behavior change in host causing MESH to fail on sending messages #77241

80718