zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.55k stars 6.46k forks source link

Bluetooth: bt_recv deadlock on supervision timeout with pending GATT Write Commands #23364

Closed cvinayak closed 4 years ago

cvinayak commented 4 years ago

Describe the bug Performing GATT write commands from a peripheral as fast as possible from the main loop while the peer central is performing GATT service discovery and a supervision timeout puts the system in a deadlock.

Preliminary analysis shows the below callpath leading to the deadlock:

recv_thread
- bt_recv
-- hci_acl
--- bt_conn_recv
---- bt_l2cap_recv
----- l2cap_chan_recv
------ chan->ops->recv
------- bt_att_recv
-------- handler->func
--------- att_read_req
---------- bt_att_create_pdu
----------- bt_l2cap_send_cb
------------ bt_conn_send_cb
------------- conn_tx_alloc

The deadlock is caused by use of all free_tx by the below callpath:

bt_gatt_write_without_response_cb
- bt_att_create_pdu
- bt_att_send (with cb)
-- att_send
--- bt_l2cap_send_cb
---- bt_conn_send_cb
----- conn_tx_alloc

To Reproduce Steps to reproduce the behavior:

  1. Apply the attached patch to the upstream/master repo
  2. mkdir build; cd build
  3. cmake -GNinja -DBOARD=nrf52_pca10040 ../samples/bluetooth/peripheral
  4. ninja
  5. ninja flash

On a linux PC use BlueZ and hci_usb (nrf52840_pca10056) as the controller. (The issue should be reproducible using internal controller too, not tried though).

  1. bluetoothctl scan on
  2. <ctrl-c>
  3. bluetoothctrl connect <bdaddr of the peripheral sample>
  4. Take the peripheral device away to cause supervision timeout
  5. Try few times (issue reproducible in 1 in 5 supervision timeout tries)

Expected behavior Peripheral sample should gracefully disconnect with reason as supervision timeout and resume to connectable advertisements.

Impact showstopper

Screenshots or console output Below log is for the modified host by adding timeout in conn_tx_alloc

*** Booting Zephyr OS build v2.2.0-rc3-68-gbf8caf395646  ***
Bluetooth initialized
Advertising successfully started
Connected: e5:e0:45:33:29:e6 (random)
[00:00:00.007,354] <inf> fs_nvs: 6 Sectors of 4096 bytes
[00:00Write: count= 0, len= 0, rate= 0 bps.
:00.007,354] <inf> fs_nvs: alloc wra: 0, f70
[00:00:00.007,354] <inf> fs_nvs: data wra: 0, 124
[00:00:00.008,789] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.008,789] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.008,819] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 2.2 Build 0
[00:00:00.009,094] <inf> bt_hci_core: No ID address. App must call settings_load()
[00:00:00.011,230] <err> settings: set-value failure. key: bt/ccc/38cada5db6360 error(-2)
[00:00:00.012,359] <inf> bt_hci_core: Identity: c1:ab:66:13:11:2e (random)
[00:00:00.012,359] <inf> bt_hci_core: HCI: version 5.1 (0x0a) revision 0x0000, manufacturer 0x05f1
[00:00:00.012,359] <inf> bt_hci_core: LMP: version 5.1 (0x0a) subver 0xffff
Write: count= 58, len= 13256, rate= 107932 bps.

D1

D0

D3

D2

D4

D5
conn_tx_alloc timeout.
Write cmd failed (-55).
conn_tx_alloc timeout.

D6

D99
Disconnected (reason 0x08)
Write cmd failed (-57).

Environment (please complete the following information):

Additional context Add any other context about the problem here.

Patch file: 0001-Peripheral-with-Write-Commands.patch.txt

btmon log: btmon_recv_thread_stall.txt

cvinayak commented 4 years ago

Console log with reproduced with no callback assigned:

*** Booting Zephyr OS build v2.2.0-rc3-69-gd8697ba32e2b  ***
Bluetooth initialized
Advertising successfully started
Connected: e5:e0:45:33:29:e6 (random)
[00:00:00.007,324] <inf> fs_nvs: 6 Sectors of 4096 bytes
[00:00:00.007,324] <inf> fs_nvs: alloc wra: 0, f70
[00:00:00.007,354] <inf> fs_nvs: data wra: 0, 124
[00:00:00.008,758] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.008,758] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.008,789] <inf> bt_hci_core: Firmware: Standard Bluetooth controller  App must call settings_load()
[00:00:00.011,199] <err> settings: set-value failure. key: bt/ccc/38cada5db6360 error(-2)
[00:00:00.012,329] <inf> bt_hci_core: Identity: c1:ab:66:13:11:2e (random)
[00:00:00.012,329] <inf> bt_hci_core: HCI: version 5.1 (0x0a) revision 0x0000, manufacturer 0x05f1
[00:00:00.012,329] <inf> bt_hci_core: LMP: version 5.1 (0x0a) subver 0xfff
D1

D0

D3

D2

D4

D5
conn_tx_alloc timeout.

D6

D6

D99
Disconnected (reason 0x08)
Write cmd failed (-57).

And the btmon log: btmon_20200310_nocb.log

Vudentz commented 4 years ago

It is probably due to this line:

return k_fifo_get(&free_tx, K_FOREVER);

But I see it has been modified to follow the supervision timeout so it timeout properly, does that solves the problem or it still deadlocks after that?

cvinayak commented 4 years ago

@Vudentz yes, its the K_FOREVER. My modifications where to identify where the deadlock was. Using a timeout releases the deadlock, but not sure if thats a solution or workaround. Using a timeout means the disconnect complete is delayed by that timeout, hence use of timeout is only a workaround.

Vudentz commented 4 years ago

We may actually have to introduce a timeout bt_conn_send_cb or somehow detect what is the supervision timeout, @jhedberg thoughts? But I wonder if during that we receive a disconnect from the remote side does the RX thread is unblocked? We may have to treat disconnect event differently in order to handle this properly.

jhedberg commented 4 years ago

We may have to treat disconnect event differently in order to handle this properly.

One proposal from @joerchan was to use bt_recv_prio() for the disconnect event, however that comes with its own issues related to ordering (you could receive packets for a connection after processing its disconnect).

Vudentz commented 4 years ago

We may have to treat disconnect event differently in order to handle this properly.

One proposal from @joerchan was to use bt_recv_prio() for the disconnect event, however that comes with its own issues related to ordering (you could receive packets for a connection after processing its disconnect).

Right and it might be important data so we can't just discard, are there any other solutions on the table currently? Btw, we want to default BT_CONN_TX_MAX to BT_L2CAP_TX_BUF_COUNT + 1 to try to avoid blocking at bt_conn_send_cb.

jhedberg commented 4 years ago

Right and it might be important data so we can't just discard, are there any other solutions on the table currently? Btw, we want to default BT_CONN_TX_MAX to BT_L2CAP_TX_BUF_COUNT + 1 to try to avoid blocking at bt_conn_send_cb.

One idea that comes to mind is to split the disconnect processing into two parts: one that's safe to do in bt_recv_prio() (e.g. marking the TX path defunct, so that you can't send anymore but the connection can still receive packets), and another that's done in the normal RX thread (until which point you can process RX packets but not send anything, and here we finally set the disconnected state on the bt_conn object).

jhedberg commented 4 years ago

Btw, my suggestion above is not so different from how other bt_recv_prio() handling is done: the all the existing handlers do is to release some resource (e.g. give a semaphore) which releases some other thread that does the "finalisation" or second stage of the event handling.

carlescufi commented 4 years ago

@Vudentz @cvinayak @joerchan @jhedberg This is a high bug so we need to fix it during the 2.3 stabilization period. Is there something we can do here, either as a patch or a mitigation, during the next few days?

joerchan commented 4 years ago

I can look into it.

cvinayak commented 4 years ago

https://github.com/cvinayak/zephyr/commit/0b01ebdd559c009416458c583f122647d98d2bd1

Above branch will make it easy to reproduce (its a branch with the patch in the issue description applied).

Steps to reproduce (timing related):

  1. build and flash the peripheral sample from the branch.
  2. build and flash a vanilla central_hr sample in another board.
  3. When central_hr connects and peripheral sample prints out Connected: <bdaddr>, reset the central_hr board to cause linkloss (linkloss needs to occur during service discovery).
  4. peripheral sample will get stuck in conn_tx_alloc function in host/conn.c

Reproducible in latest master too (62b985409e): https://github.com/cvinayak/zephyr/commit/55153bed706b662018e5a07e20a7e26d4f6ed463