Closed MariuszSkamra closed 5 years ago
So we been blocking on system wq since btgatt* do block when BT_ATT_ENFORCE_FLOW is set, we could perhaps make att.c queue if the semaphore cannot be taken and we don't really need to save the callback as the semaphore is only used when it is set to NULL, then we resume when the previous packet completes.
Actually that still likely to block if we hold on the buffer into a queue it will likely block on bt_conn_create_pdu waiting for buffer to become available, so this one maybe more complicated to deal with, perhaps we will have to have the bt_att_req for notification as well so we can queue them up without requiring a buffer though the bt_att_req is something Id like to make internal to the stack so we can queue things up on its own without requiring the client to manage the pool of request.
I'm trying to use smp_svr on zephyr 802718a773 with nrf52840_pca10056 and have problems I'm told are related to this issue.
When I run sudo mcumgr --conntype ble --connstring ctlr_name=hci0,peer_name='Zephyr' echo hello
, I eventually see Error: can't discover services: ATT request timeout: req timeout
.
Here is a btmon trace: https://gist.github.com/mbolivar/d0ea16d55ffaf9d05482eea36175a967
The problem persists even if I add # CONFIG_BT_ATT_ENFORCE_FLOW is not set
to my prj.conf, as suggested by @carlescufi. Enabling bluetooth debugging with CONFIG_BT_DEBUG_LOG=y
doesn't produce much difference either:
***** Booting Zephyr OS zephyr-v1.14.0-1473-g6e3083fe310f *****
Bluetooth initialized
Advertising successfully started
[00:00:00.008,850] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.008,850] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.008,880] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 1.14 Build 99
[00:00:00.010,742] <inf> bt_hci_core: Identity: e4:ce:3c:20:bb:b2 (random)
[00:00:00.010,772] <inf> bt_hci_core: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1
[00:00:00.010,772] <inf> bt_hci_core: LMP: version 5.0 (0x09) subver 0xffff
uart:~$ Connected
Disconnected (reason 19)
Advertising successfully started
Edit: note that the reason the boot banner says a different zephyr sha is because i have some unrelated west changes on top to fix the west sign
command, which I'm using to produce the signed mcuboot binary:
https://github.com/mbolivar/zephyr/tree/west-sign-fix-improve
They don't touch any BT code, and if they weren't producing a signed binary correctly, Zephyr never would have booted.
@mbolivar could you try reverting 8a7615f4a3d8932481277dffd41a4a72febd0a61 as well?
@mbolivar could you try reverting 8a7615f as well?
You may want to revert 2699d05e45f58897edd7d6d3718c254224390667 as well.
@mbolivar could you try reverting 8a7615f as well? You may want to revert 2699d05 as well.
If the issue is flow enforcement related I'd say leave commit 2699d05 as-is and revert 8a7615f
That way you're making TX callbacks happen sooner while keeping the RX callbacks later, which is essential for avoiding the race condition where we get new incoming data before we've been notified that some related piece of outgoing data has been transmitted over the air.
Btw, should there be a k_yield after this line:
https://github.com/zephyrproject-rtos/zephyr/blob/master/subsys/mgmt/smp.c#L277
could you try reverting 8a7615f as well?
same timeout result.
You may want to revert 2699d05 as well.
Same result as well (with both reverted).
Edit: I should note the following diff is applied to smp_svr/prj.conf in both cases:
+
+# CONFIG_BT_ATT_ENFORCE_FLOW is not set
+CONFIG_BT_DEBUG_LOG=y
+CONFIG_BT_DEBUG_ATT=y
could you try reverting 8a7615f as well?
same timeout result.
You may want to revert 2699d05 as well.
Same result as well (with both reverted).
I guess it is a different problem then, do you actually get any traffic? Or perhaps there is something with the user_data that has been changed, thought it had been increased from 7 to 8 bytes.
I guess it is a different problem then, do you actually get any traffic? Or perhaps there is something with the user_data that has been changed, thought it had been increased from 7 to 8 bytes.
I assume so; the device does say that it's connected and disconnected. Does the btmon trace I posted earlier help answer the question?
Here is another one for 8a7615f reverted, with the same diff on smp_svr/prj.conf: https://gist.github.com/mbolivar/c5d36baf581fc47a6d51257164adfe78
The device logs:
***** Booting Zephyr OS zephyr-v1.14.0-1467-g8cb8b2553045 *****
Bluetooth initialized
Advertising successfully started
[00:00:00.008,880] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.008,880] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.008,911] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 1.14 Build 99
[00:00:00.010,803] <inf> bt_hci_core: Identity: e4:ce:3c:20:bb:b2 (random)
[00:00:00.010,803] <inf> bt_hci_core: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1
[00:00:00.010,803] <inf> bt_hci_core: LMP: version 5.0 (0x09) subver 0xffff
uart:~$ Connected
[00:00:57.810,913] <dbg> bt_att.bt_att_accept: conn 0x20000630 handle 0
[00:00:57.810,913] <dbg> bt_att.bt_att_connected: chan 0x2000077c cid 0x0004
[00:00:57.818,695] <dbg> bt_att.bt_att_recv: Received ATT code 0x02 len 2
[00:00:57.818,695] <dbg> bt_att.att_mtu_req: Client MTU 512
[00:00:57.818,725] <dbg> bt_att.att_mtu_req: Server MTU 252
[00:00:57.818,725] <dbg> bt_att.att_mtu_req: Negotiated MTU 252
[00:00:57.833,099] <dbg> bt_att.att_rsp_sent: conn 0x20000630 att 0x2000077c
[00:00:57.833,221] <dbg> bt_att.bt_att_recv: Received ATT code 0x10 len 6
[00:00:57.833,282] <dbg> bt_att.att_read_group_req: start_handle 0x0001 end_handle 0xffff type 2800
[00:00:57.833,312] <dbg> bt_att.read_group_cb: handle 0x0001
[00:00:57.833,343] <dbg> bt_att.read_group_cb: handle 0x0009
[00:00:57.833,404] <dbg> bt_att.read_group_cb: handle 0x0010
[00:00:57.833,709] <err> log: argument 3 in log message "%s: start_handle 0x%04x end_handle 0x%04x type %s" missing log_strdup().
uart:~$ Disconnected (reason 19)
Advertising successfully started
[00:01:27.938,323] <dbg> bt_att.bt_att_disconnected: chan 0x2000077c cid 0x0004
It's a bit unclear from your logs, @mbolivar do I understand correctly that just reverting 8a7615f didn't help?
As far as I see there are two issues:
The reason I suggested reverting 8a7615f is that it should improve on both issues: with both RX and TX callbacks being on sys wq means you can't currently send more data from the callback without risking a deadlock. Reverting the patch also makes the TX callbacks happen sooner, since there's one less deferral to another thread, so the race condition should be less likely.
As for the log_strdup warning, I've opened a PR that should hopefully fix it: #16819
It's a bit unclear from your logs, @mbolivar do I understand correctly that just reverting 8a7615f didn't help?
Yes, that is correct.
The btmon trace in https://github.com/zephyrproject-rtos/zephyr/issues/16803#issuecomment-501808742 is with just 8a7615f reverted (edit: and the other diffs on prj.conf posted earlier disabling flow control and turning on debug logs). The timeout result is the same. The device logs in that comment are from the same test run.
@mbolivar ok, then I agree with @Vudentz that further studying is needed since this sounds like something else than we've originally been looking at.
Should I open a separate bug?
@mbolivar try with https://github.com/zephyrproject-rtos/zephyr/pull/16820 maybe it help with you issue as well.
@mbolivar try with #16820 maybe it help with you issue as well.
@Vudentz I tried with c01c68ab48 (and a clean tree, no prj.conf changes) and got similar results.
Mesh pb-gatt sends several GATT notifications in a row. It runs in system workqueue thread context, so that it is possible to cause deadlock after recent changes that deferred
bt_conn_tx
callback to system workqueue.While sending several notifications in a row,
bt_att_send
takestx_sem
that is later given by
bt_conn_tx
.As a result running
bt_att_send
andbt_conn_tx
from the same thread causes deadlock in ATT.This regression was introduced with: 362497235df575586a049dd92216910c99e79094