zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.49k stars 6.42k forks source link

bluetooth: Behavior change in host causing MESH to fail on sending messages #77241

Open m-alperen-sener opened 3 weeks ago

m-alperen-sener commented 3 weeks ago

Describe the bug

The changes introduced with this PR and additional PR to avoid deadlocks by overriding K_FOREVER with K_NO_WAIT in buffer allocation requested by syswq. Thus, this changes introduces a big behavior change for the layers that uses host and syswq. For MESH stack it causes to loose mesh messages send by using syswq.

Those changes propagates the failures caused by host and controller interaction mechanism to the upper layers and for the app side it does not make so much sense but confusion. That changes necessitates changes that should handles this particular corner case by every bt_host user.

The possible mitigations:

Alongside this bug in MESH, the behavioral change is not visible enough for bt_host users in documentation.

m-alperen-sener commented 3 weeks ago

@PavelVPV and @jori-nordic can you make a comment on that issue.

jori-nordic commented 3 weeks ago

I had added this to the documentation. Since it's a net change, not only for the host: https://github.com/zephyrproject-rtos/zephyr/blob/ad3e941ad3ed9e3e2f9032b0151558a948e99fb9/include/zephyr/net/buf.h#L1333-L1335

I agree, it's not super visible. But such are a lot of upstream changes, that's why we have tests. The host is very synchronous, and as such very prone to deadlocks. Additionally, there are a lot of things that run on the syswq now: application items, various subsystems, and even some GPIO drivers. Blocking this "central" scheduler is not nice for the other users. That's the rationale behind those two PRs.

Since we can't just malloc and spawn worker threads like on a linux machine, we have to change the host API to be non-blocking and allow operations to fail. That's how it is on (multi-user) embedded systems.

Unfortunately, it does push the problems upwards, like you've seen. The two mitigations you highlight are good:

  1. Is not scalable, as you point out.
  2. Won't work without additional API changes: I would suggest making changes to MESH API to be more asynchronous too, and only allocate and queue controller commands from a separate thread. Sending from syswq should be ok, allocating is more dicey. Application should also handle not being able to send a message at any given time.
  3. If you have memory, you can try a simple trick: reschedule your work item when you get a failure. Chances are, when you are scheduled again you will get the resources to send.
  4. Since the host manages the buffer pools, we need a new API, that signals other users that an allocation is likely to succeed. We don't have that right now. I made https://github.com/zephyrproject-rtos/zephyr/issues/77249 to track this. Feel free to comment on it too.

An additional note: it has taken 3+ months for us to detect this issue. It would be good to have whatever tests that detected the issue run on the latest upstream. I would even suggest on any PR tagged with "Bluetooth" or at least "Bluetooth host".

alxelax commented 3 weeks ago

Shouldn't such kind of changes pass Architecture review in Bluetooth group first to consider impact on all existing subsystems within Zephyr? at least related to Bluetooth. It is quite well that Host improved behavior. However, it made any Mesh communication unreliable, even specification requires opposite behavior. It even cannot pass qualification tests with default configuration anymore. Seems to consider new Host changes Mesh requires architecture redesign(or at least detailed review) all places those use system work queue for sending messages. Until above, Mesh, it seems, stopped being reliable in Zephyr. Now, every single configuration needs to be tuned to be sure that Host has sufficient resources to handle all tx requests.

jori-nordic commented 3 weeks ago

Shouldn't such kind of changes pass Architecture review in Bluetooth group first

It did, I talked a whole bunch about the new arch in the public workgroup meetings, even had diagrams and stuff. There was some amount of discussion on the PR too. No-one objected in the end.

even specification requires opposite behavior

What do you mean, the spec has a say on the implementation details?

stopped being reliable in Zephyr

We can argue it never was. As a badly-tuned config + alloc at the wrong time/place could end up deadlocking the whole device. This is what we're trying to move away from, this kind of "works most of the time" behavior.

jori-nordic commented 3 weeks ago

WG meeting:

alxelax commented 2 weeks ago

Shouldn't such kind of changes pass Architecture review in Bluetooth group first

It did, I talked a whole bunch about the new arch in the public workgroup meetings, even had diagrams and stuff. There was some amount of discussion on the PR too. No-one objected in the end.

This is quite sad. Seems the architecture discrepancy between new Host approach after fix and existing Mesh design wasn't taken into account on review.

even specification requires opposite behavior

What do you mean, the spec has a say on the implementation details?

Nop. I meant changing in behavior. Applications do not expect that device cannot deliver data because of lack of buffers in Host due to previous solution didn't assume it in general. All Zephyr Mesh users didn't add anything related to solve this instability. If somebody sent frame reliably (with acknowledgement on transport layer for example) they expect close to 100% delivery ratio. They do not expect skipping in Host.

stopped being reliable in Zephyr

We can argue it never was. As a badly-tuned config + alloc at the wrong time/place could end up deadlocking the whole device. This is what we're trying to move away from, this kind of "works most of the time" behavior.

I do not want to argue about reliability of approaches. I'm not against about any improvements. My main concern is that this improvement was stopped in a middle of way. The Host was improved, but it added architecture discrepancy between Host and Mesh. That caused significant unreliability in Mesh(cannot pass qualification anymore without buffers tuning for different use cases) and impacted on all Zephyr BLE Mesh users.

Now it requires significant redesign of some places Mesh code(a lot of them). I'd like to know plans about further improvements to remove architecture discrepancy before next Zephyr release (or when?).

If there are no sources to remove architecture discrepancy, could you add (under KConfig option) the old behavior to not break application behavior for Mesh users. It will be a workaround until Mesh will be aligned to new Host approach.

@jori-nordic, @alwa-nordic, @ubieda, @jhedberg any ideas?

jori-nordic commented 2 weeks ago

Discussed IRL with @PavelVPV

Mesh is using the host advertiser module. They are not sending HCI commands on their own. That means that even if we provide the async API ASAP (we won't, we should design it well), we basically have to refactor adv.c AND id.c to use that API.

They also have never seen the deadlocks due to HCI commands in production. I think the propensity to deadlock is still there, though.

The easy fix is to add an additional work thread to send from. Unfortunately, the mesh applications barely fit the nRF52832, which is still the main chip customers use. Adding the extra workqueue (with a big stack because the host likes big stacks and cannot lie) will overflow the RAM.

My proposition to pavel is then to:

jori-nordic commented 2 weeks ago

They also have never seen the deadlocks due to HCI commands in production

Famous last words lol. The PR instantly deadlocks, also observed by @PavelVPV on hardware.

image

It's due to the conn_auto_initiate() function that's trying to allocate command buffers, at the same time as mesh is waiting for command buffers on the system workqueue.

So we're proposing a three-pronged fix instead:

@PavelVPV could you verify the mesh tests (HW + bsim) pass with all three patches combined? On my side, I'll try to backport to NCS and run the dragoon bluetooth testsuite tomorrow.

PavelVPV commented 2 weeks ago

-> > They also have never seen the deadlocks due to HCI commands in production

Famous last words lol. The PR instantly deadlocks, also observed by @PavelVPV on hardware.

Well, what I meant is it worked before these 2 PRs (and probably some other patches added after them (if there were some)). :)

@PavelVPV could you verify the mesh tests (HW + bsim) pass with all three patches combined? On my side, I'll try to backport to NCS and run the dragoon bluetooth testsuite tomorrow.

I've finally managed to check patches and I think we need to apply only these 2:

With them our tests that that caught this issue pass (I mean on CI with real hardware plus PTS) like before with CONFIG_BT_BUF_CMD_TX_COUNT=2.

I suggest to close https://github.com/zephyrproject-rtos/zephyr/pull/77678 as I came to a conclusion that it doesn't help us really, it doesn't revert the whole thing as the major changes were done in the second PR. It will create problems for customers (and for us eventually) if they catch the deadlock. And we will end up in a situation where we just have to increase CONFIG_BT_BUF_CMD_TX_COUNT value. I was about to do that until I found the issue fixed in https://github.com/zephyrproject-rtos/zephyr/pull/77735. Mesh should be able to recover even without having a knowledge of available cmd buffers in host, so we in mesh should deeply investigate all failures before bumping up CONFIG_BT_BUF_CMD_TX_COUNT value.

I haven't tested https://github.com/zephyrproject-rtos/zephyr/pull/77710 because I'm a bit uncertain about it. I still have concern that bt rx thread can also be the reason for the deadlocks as host keeps using K_FOREVER in this thread and processes incoming data from the controller, though I have no idea if there is a code that can cause such situation. But I think this can happen potentiality. In this regard, as a part of the proper fix I'd recommend to revert https://github.com/zephyrproject-rtos/zephyr/pull/71697 because this IMHO fixes bluetooth specific issue by altering behavior of another subsystem. There can be many implementations without bluetooth which may try to allocate a net buf in syswq with K_FOREVER for their own reasons and I'm not sure why they can't do this anymore. IMHO this should be fixed in the host instead.

jori-nordic commented 2 weeks ago

IMHO this should be fixed in the host instead.

Yes, after thinking about it and talking with you and alex, I changed my mind. We should allow applications to shoot themselves in the foot. But maybe print a warning or something if the thread has been blocked for a while? There is already a mechanism like that in net_bufs I think. Anyway, reverting that PR will have to be done when we properly fix the root cause(s) in the host like you said.

jori-nordic commented 2 weeks ago

I'll try to backport to NCS and run the dragoon bluetooth testsuite tomorrow

I've verified that the two patches you mention do not break our downstream tests.

PavelVPV commented 2 weeks ago

We should allow applications to shoot themselves in the foot. But maybe print a warning or something if the thread has been blocked for a while?

Agree, could this be a part of Thread Analyzer perhaps? It already gathers statistic about CPU usage, perhaps it could also gather statistic about how much a thread was pending on a object (semaphore or something else)?