Open m-alperen-sener opened 3 months ago
@PavelVPV and @jori-nordic can you make a comment on that issue.
I had added this to the documentation. Since it's a net change, not only for the host: https://github.com/zephyrproject-rtos/zephyr/blob/ad3e941ad3ed9e3e2f9032b0151558a948e99fb9/include/zephyr/net/buf.h#L1333-L1335
I agree, it's not super visible. But such are a lot of upstream changes, that's why we have tests. The host is very synchronous, and as such very prone to deadlocks. Additionally, there are a lot of things that run on the syswq now: application items, various subsystems, and even some GPIO drivers. Blocking this "central" scheduler is not nice for the other users. That's the rationale behind those two PRs.
Since we can't just malloc and spawn worker threads like on a linux machine, we have to change the host API to be non-blocking and allow operations to fail. That's how it is on (multi-user) embedded systems.
Unfortunately, it does push the problems upwards, like you've seen. The two mitigations you highlight are good:
An additional note: it has taken 3+ months for us to detect this issue. It would be good to have whatever tests that detected the issue run on the latest upstream. I would even suggest on any PR tagged with "Bluetooth" or at least "Bluetooth host".
Shouldn't such kind of changes pass Architecture review in Bluetooth group first to consider impact on all existing subsystems within Zephyr? at least related to Bluetooth. It is quite well that Host improved behavior. However, it made any Mesh communication unreliable, even specification requires opposite behavior. It even cannot pass qualification tests with default configuration anymore. Seems to consider new Host changes Mesh requires architecture redesign(or at least detailed review) all places those use system work queue for sending messages. Until above, Mesh, it seems, stopped being reliable in Zephyr. Now, every single configuration needs to be tuned to be sure that Host has sufficient resources to handle all tx requests.
Shouldn't such kind of changes pass Architecture review in Bluetooth group first
It did, I talked a whole bunch about the new arch in the public workgroup meetings, even had diagrams and stuff. There was some amount of discussion on the PR too. No-one objected in the end.
even specification requires opposite behavior
What do you mean, the spec has a say on the implementation details?
stopped being reliable in Zephyr
We can argue it never was. As a badly-tuned config + alloc at the wrong time/place could end up deadlocking the whole device. This is what we're trying to move away from, this kind of "works most of the time" behavior.
WG meeting:
bt_hci_cmd_send_sync()
, but in which the user also passes in struct cmd_data
. This could be a fully async call, and should be used internally by bt_hci_cmd_send_sync()
.Shouldn't such kind of changes pass Architecture review in Bluetooth group first
It did, I talked a whole bunch about the new arch in the public workgroup meetings, even had diagrams and stuff. There was some amount of discussion on the PR too. No-one objected in the end.
This is quite sad. Seems the architecture discrepancy between new Host approach after fix and existing Mesh design wasn't taken into account on review.
even specification requires opposite behavior
What do you mean, the spec has a say on the implementation details?
Nop. I meant changing in behavior. Applications do not expect that device cannot deliver data because of lack of buffers in Host due to previous solution didn't assume it in general. All Zephyr Mesh users didn't add anything related to solve this instability. If somebody sent frame reliably (with acknowledgement on transport layer for example) they expect close to 100% delivery ratio. They do not expect skipping in Host.
stopped being reliable in Zephyr
We can argue it never was. As a badly-tuned config + alloc at the wrong time/place could end up deadlocking the whole device. This is what we're trying to move away from, this kind of "works most of the time" behavior.
I do not want to argue about reliability of approaches. I'm not against about any improvements. My main concern is that this improvement was stopped in a middle of way. The Host was improved, but it added architecture discrepancy between Host and Mesh. That caused significant unreliability in Mesh(cannot pass qualification anymore without buffers tuning for different use cases) and impacted on all Zephyr BLE Mesh users.
Now it requires significant redesign of some places Mesh code(a lot of them). I'd like to know plans about further improvements to remove architecture discrepancy before next Zephyr release (or when?).
If there are no sources to remove architecture discrepancy, could you add (under KConfig option) the old behavior to not break application behavior for Mesh users. It will be a workaround until Mesh will be aligned to new Host approach.
@jori-nordic, @alwa-nordic, @ubieda, @jhedberg any ideas?
Discussed IRL with @PavelVPV
Mesh is using the host advertiser module. They are not sending HCI commands on their own.
That means that even if we provide the async API ASAP (we won't, we should design it well), we basically have to refactor adv.c
AND id.c
to use that API.
They also have never seen the deadlocks due to HCI commands in production. I think the propensity to deadlock is still there, though.
The easy fix is to add an additional work thread to send from. Unfortunately, the mesh applications barely fit the nRF52832, which is still the main chip customers use. Adding the extra workqueue (with a big stack because the host likes big stacks and cannot lie) will overflow the RAM.
My proposition to pavel is then to:
K_FOREVER
to K_NO_WAIT
in net_buf_alloc
. Basically what @alxelax suggested. https://github.com/zephyrproject-rtos/zephyr/pull/77678hci_cmd_buf_destroyed
event in the host. So Mesh has a signal to when they should reschedule sending their data.They also have never seen the deadlocks due to HCI commands in production
Famous last words lol. The PR instantly deadlocks, also observed by @PavelVPV on hardware.
It's due to the conn_auto_initiate()
function that's trying to allocate command buffers, at the same time as mesh is waiting for command buffers on the system workqueue.
So we're proposing a three-pronged fix instead:
@PavelVPV could you verify the mesh tests (HW + bsim) pass with all three patches combined? On my side, I'll try to backport to NCS and run the dragoon bluetooth testsuite tomorrow.
-> > They also have never seen the deadlocks due to HCI commands in production
Famous last words lol. The PR instantly deadlocks, also observed by @PavelVPV on hardware.
Well, what I meant is it worked before these 2 PRs (and probably some other patches added after them (if there were some)). :)
@PavelVPV could you verify the mesh tests (HW + bsim) pass with all three patches combined? On my side, I'll try to backport to NCS and run the dragoon bluetooth testsuite tomorrow.
I've finally managed to check patches and I think we need to apply only these 2:
With them our tests that that caught this issue pass (I mean on CI with real hardware plus PTS) like before with CONFIG_BT_BUF_CMD_TX_COUNT=2
.
I suggest to close https://github.com/zephyrproject-rtos/zephyr/pull/77678 as I came to a conclusion that it doesn't help us really, it doesn't revert the whole thing as the major changes were done in the second PR. It will create problems for customers (and for us eventually) if they catch the deadlock. And we will end up in a situation where we just have to increase CONFIG_BT_BUF_CMD_TX_COUNT
value. I was about to do that until I found the issue fixed in https://github.com/zephyrproject-rtos/zephyr/pull/77735. Mesh should be able to recover even without having a knowledge of available cmd buffers in host, so we in mesh should deeply investigate all failures before bumping up CONFIG_BT_BUF_CMD_TX_COUNT
value.
I haven't tested https://github.com/zephyrproject-rtos/zephyr/pull/77710 because I'm a bit uncertain about it. I still have concern that bt rx thread can also be the reason for the deadlocks as host keeps using K_FOREVER
in this thread and processes incoming data from the controller, though I have no idea if there is a code that can cause such situation. But I think this can happen potentiality. In this regard, as a part of the proper fix I'd recommend to revert https://github.com/zephyrproject-rtos/zephyr/pull/71697 because this IMHO fixes bluetooth specific issue by altering behavior of another subsystem. There can be many implementations without bluetooth which may try to allocate a net buf in syswq with K_FOREVER
for their own reasons and I'm not sure why they can't do this anymore. IMHO this should be fixed in the host instead.
IMHO this should be fixed in the host instead.
Yes, after thinking about it and talking with you and alex, I changed my mind. We should allow applications to shoot themselves in the foot. But maybe print a warning or something if the thread has been blocked for a while? There is already a mechanism like that in net_bufs I think. Anyway, reverting that PR will have to be done when we properly fix the root cause(s) in the host like you said.
I'll try to backport to NCS and run the dragoon bluetooth testsuite tomorrow
I've verified that the two patches you mention do not break our downstream tests.
We should allow applications to shoot themselves in the foot. But maybe print a warning or something if the thread has been blocked for a while?
Agree, could this be a part of Thread Analyzer perhaps? It already gathers statistic about CPU usage, perhaps it could also gather statistic about how much a thread was pending on a object (semaphore or something else)?
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.
I haven't tested #77710 because I'm a bit uncertain about it. I still have concern that bt rx thread can also be the reason for the deadlocks as host keeps using
K_FOREVER
in this thread and processes incoming data from the controller, though I have no idea if there is a code that can cause such situation.
https://github.com/zephyrproject-rtos/zephyr/pull/80718
@PavelVPV Almost bbsim testcase failed due to attempt call K_FOREVER
in BT Rx.
I haven't tested #77710 because I'm a bit uncertain about it. I still have concern that bt rx thread can also be the reason for the deadlocks as host keeps using
K_FOREVER
in this thread and processes incoming data from the controller, though I have no idea if there is a code that can cause such situation.80718
@PavelVPV Almost bbsim testcase failed due to attempt call
K_FOREVER
in BT Rx.
I'd rather revert some changes and think thoroughly why deadlocks happen. Now I see we start patching code here and there including internal zephyr modules, people also unhappy with net_buf change (https://github.com/zephyrproject-rtos/zephyr/issues/80167).
people also unhappy with net_buf change (#80167).
I think the restrictions on sysworkq
are more severe than the side effects of net_buf, because sysworkq
is a system work queue that can be called by any subsys or application, but now due to the mandatory dependency of Bluetooth host on sysworkq
, sysworkq
cannot be blocked elsewhere, also include alloc net_buf.
people also unhappy with net_buf change (#80167).
I think the restrictions on
sysworkq
are more severe than the side effects of net_buf, becausesysworkq
is a system work queue that can be called by any subsys or application, but now due to the mandatory dependency of Bluetooth host onsysworkq
,sysworkq
cannot be blocked elsewhere, also include alloc net_buf.
It all depends on a use case. But what I see, instead of fixing this in the host (or at least trying to understand what exactly happens), we are patching the kernel code and 3rd party modules. The next step then is to change K_FOREVER
to K_NO_WAIT
in k_sem_take
and k_mutex_lock
functions.
Describe the bug
The changes introduced with this PR and additional PR to avoid deadlocks by overriding K_FOREVER with K_NO_WAIT in buffer allocation requested by syswq. Thus, this changes introduces a big behavior change for the layers that uses host and syswq. For MESH stack it causes to loose mesh messages send by using syswq.
Those changes propagates the failures caused by host and controller interaction mechanism to the upper layers and for the app side it does not make so much sense but confusion. That changes necessitates changes that should handles this particular corner case by every bt_host user.
The possible mitigations:
Alongside this bug in MESH, the behavioral change is not visible enough for bt_host users in documentation.