zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.52k stars 6.44k forks source link

Bluetooth: Deadlock with TX of ACL data and HCI commands (command blocked by data) #25917

Closed cvinayak closed 1 year ago

cvinayak commented 4 years ago

Describe the bug Performing GATT write commands from a peripheral and a central as fast as possible from the main loop while the peer central is performing security procedure to bond, causes the following assertion fail on both the peripheral and central device

In peripheral:

*** Booting Zephyr OS build v2.3.0-rc1-368-g8fecd4b32d9d  ***
Bluetooth initialized
[00:00:00.007,507] <inf> fs_nvs: 6 Sectors of 4096 bytes
[00:00:00.007,507] <inf> fs_nvs: alloc wra: 0, ff0
[00:00:00.007,507] <inf> fs_nvs: data wra: 0, 0
[00:00:00.008,941] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.008,941] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.008,941] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 2.3 Build 0
[00:00:00.009,216] <inf> bt_hci_core: No ID address. App must call settings_load()
[00:00:00.031,829] <inf> bt_hci_core: Identity: c1:ab:66:13:11:2e (random)
[00:00:00.031,860] <inf> bt_hci_core: HCI: version 5.2 (0x0b) revision 0x0000, manufacturer 0x05f1
[00:00:00.031,860] <inf> bt_hci_core: LMP: version 5.2 (0x0b) subver 0xffff
Advertising successfully started
Connected: ea:15:23:d0:ac:03 (random)
MTU: 23
Exchange pending...Write: count= 0, len= 0, rate= 0 bps.
MTU exchange successful (247)
ASSERTION FAIL [err == 0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/host/hci_core.c:370
        k_sem_take failed with err -11
[00:00:27.460,693] <err> os: r0/a1:  0x00000003  r1/a2:  0x00000001  r2/a3:  0x00000001
[00:00:27.460,693] <err> os: r3/a4:  0x00024c28 r12/ip:  0x00000000 r14/lr:  0x00007a01
[00:00:27.460,693] <err> os:  xpsr:  0x61000000
[00:00:27.460,723] <err> os: Faulting instruction address (r15/pc): 0x00007a0c
[00:00:27.460,723] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:00:27.460,723] <err> os: Current thread: 0x200014f0 (unknown)
[00:00:27.807,037] <err> os: Halting system

In central_hr:

*** Booting Zephyr OS build v2.3.0-rc1-368-g8fecd4b32d9d  ***
Bluetooth initialized
Scanning successfully started
[DEVICE]: 22:52:48:f8:a5:91 (random), AD evt type 3, AD data len 31, RSSI -45
[DEVICE]: 63:47:03:75:bc:8f (random), AD evt type 0, AD data len 29, RSSI -41
[AD]: 1 data_len 1
[AD]: 3 data_len 6
Connected: 63:47:03:75:bc:8f (random)
MTU: 23
Exchange pending...[00:00:00.006,042] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)MTU exchange successful (247)

[00:00:00.006,042] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.006,042] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 2.3 Build 0
[00:00:00.006,683] <inf> bt_hci_core: Identity: ea:15:23:d0:ac:03 (random)
[00:00:00.006,683] <inf> bt_hci_core: HCI: version 5.2 (0x0b) revision 0x0000, manufacturer 0x05f1
[00:00:00.006,683] <inf> bt_hci_core: LMP: version 5.2 (0x0b) subver 0xffff
[00:00:00.688,507] <err> bt_conn: Unable to allocate TX context
ASSERTION FAIL [err == 0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/host/hci_core.c:370
        k_sem_take failed with err -11
[00:00:11.095,764] <err> os: r0/a1:  0x00000003  r1/a2:  0x00000001  r2/a3:  0x00000001
[00:00:11.095,764] <err> os: r3/a4:  0x0002142a r12/ip:  0x00000000 r14/lr:  0x0000666d
[00:00:11.095,764] <err> os:  xpsr:  0x61000000
[00:00:11.095,764] <err> os: Faulting instruction address (r15/pc): 0x00006678
[00:00:11.095,794] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:00:11.095,794] <err> os: Current thread: 0x2000117c (unknown)
[00:00:11.702,239] <err> os: Halting system

To Reproduce Use the branch: https://github.com/cvinayak/zephyr/commit/a2289565363a210e33b8956eac46df55e19a790a

Steps to reproduce the behavior: Build and flash peripheral

  1. mkdir -p build/peripheral; cd build/peripheral
  2. cmake -GNinja -DBOARD=nrf52dk_nrf52832 ../../samples/bluetooth/peripheral
  3. ninja
  4. ninja flash

Build and flash central_hr

  1. mkdir -p build/central_hr; cd build/central_hr

  2. cmake -GNinja -DBOARD=nrf52dk_nrf52832 ../../samples/bluetooth/central_hr

  3. ninja

  4. ninja flash

  5. Open two terminals, minicom -D /dev/ttyACMx

  6. Observe both peripheral and central_hr fail after connection

Expected behavior peripheral and central_hr should connect, transfer write commands, perform SMP pairing, be encrypted and continue to transder write commands.

Impact showstopper

Environment (please complete the following information):

joerchan commented 4 years ago

Anylsis: Symmetrical deadlock in both devices in the connection between TX and RX thread. TX thread has filled up the all the buffers in the controller and has blocked waiting for num complete events. Controller has filled up all it's receive buffers and the RX thread is delivering ACL packets as fast as it can. The RX thread receives an HCI event (phy update complete) and blocks on sending an HCI command waiting for the TX thread to send the command. The TX thread will not be able to proceed until a num complete arrives on both sides. The RX thread will not be able to proceed until TX thread is unblocked on both sides. Receive buffer is full on both sides so no new packets can be transmitted, meaning no new completes will happen.

galak commented 3 years ago

is this fixed?

joerchan commented 3 years ago

@galak No.

xudongzheng commented 3 years ago

I ran into this issue and my workaround was to do the sending from a workqueue thread with a lower priority.

joerchan commented 3 years ago

@xudongzheng Could you elaborate on your workaround? What exactly did you move to the workqueue? Sending of data, or sending of the command?

xudongzheng commented 3 years ago

@joerchan

I was occasionally getting the "Unable to allocate TX context" error when calling bt_gatt_notify() from a work running on the system workqueue. However moving my bt_gatt_notify() call to a work running on my custom workqueue (with a lower priority) did not give the same "Unable to allocate TX context" error.

I haven't looked into the details of this specific bug but it's possible that https://github.com/cvinayak/zephyr/blob/a2289565363a210e33b8956eac46df55e19a790a/samples/bluetooth/peripheral/src/main.c#L476 or one of the other calls is running into the same issue.

joerchan commented 3 years ago

@xudongzheng That behavior is documented here: https://github.com/zephyrproject-rtos/zephyr/blob/main/include/bluetooth/gatt.h#L927L930

That is however not the same as is happening in this case, you haven't encountered this issue.

jefffhaynes commented 3 years ago

I'm seeing the same thing when I spam bt_gatt_notify. This seems like a dangerous bug since there's no way to really know if you'll exceed the threshold for overflow. Much better would be to simply return an error that could optionally be ignored for non-critical notifications. @galak can you comment on why you feel this is low priority?

LingaoM commented 3 years ago

The RX thread receives an HCI event (phy update complete) and blocks on sending an HCI command waiting for the TX thread to send the command.

I can't understand this, can you point out the relevant code?

LingaoM commented 3 years ago

@joerchan Why we hci_le_read_max_data_len not use bt_hci_cmd_send to avoiding race condition ?

langestefan commented 3 years ago

I'm seeing the same thing when I spam bt_gatt_notify. This seems like a dangerous bug since there's no way to really know if you'll exceed the threshold for overflow. Much better would be to simply return an error that could optionally be ignored for non-critical notifications. @galak can you comment on why you feel this is low priority?

I also see this. Whenever I see Unable to allocate TX context the nRF52 I I control over HCI completely freezes. My bt_gatt_notify is now in a workqueue but if I decrease the notification speed to anything below 500ms the system still deadlocks. It's totally bizarre.

prathje commented 3 years ago

@cvinayak and I could generate the same assertion error using the sample in #37577

joerchan commented 3 years ago

@joerchan Why we hci_le_read_max_data_len not use bt_hci_cmd_send to avoiding race condition ?

@LingaoM Because the problem would still be there for all the other places where we use bt_hci_cmd_send_sync. Also you can't simply replace it, because you want to get the output of the command complete, which you wouldn't if you don't use send_sync.

prathje commented 2 years ago

So I think I could isolate causes of the mentioned assertion fail (at least in my case). First, I am running BabbleSim and probably need to enable flow control manually:

CONFIG_BT_HCI_ACL_FLOW_CONTROL=y

Second (and more important), I was sending whole SDU segments over an L2CAP channel which I assumed would not be fragmented as I was using the channel's MTU as the maximum payload size. As it turns out, the packets did in fact not fit into the ACL buffers and were thus also getting fragmented again. With a restricted amount of buffers available for fragmentation (see https://docs.zephyrproject.org/latest/reference/kconfig/CONFIG_BT_L2CAP_TX_FRAG_COUNT.html) this seems to have favored the sem_take timeout.

I am currently using the following config:

CONFIG_BT_L2CAP_TX_MTU=247
CONFIG_BT_BUF_ACL_RX_SIZE=256
CONFIG_BT_BUF_ACL_TX_SIZE=251
CONFIG_BT_HCI_ACL_FLOW_CONTROL=y

And changed my mtu calculation to actually incorporate the BT_L2CAP_SDU_TX_MTU:

uint32_t mtu = MIN(le_chan.tx.mtu, BT_L2CAP_SDU_TX_MTU);

Another reason might be mentioned in https://github.com/zephyrproject-rtos/zephyr/issues/34600: if many transmissions are scheduled rapidly, the number of TX buffers could get exhausted (have not yet confirmed this).

CONFIG_BT_L2CAP_TX_BUF_COUNT=128
carlescufi commented 2 years ago

Related Nordic Devzone thread: https://devzone.nordicsemi.com/f/nordic-q-a/89888/ncs-nordic-uart-service-bt-rx-hci-timeout

github-actions[bot] commented 1 year ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

klapro commented 1 year ago

Is it still worked on ? Its still having the same issue.. Im on Zephyr 3.2.0

jori-nordic commented 1 year ago

@klapro this is supposed to be fixed. Could you open a new bug report with all the relevant details (how to reproduce etc) ? Maybe it's something different.