Closed by cvinayak 4 years ago
Console log, reproduced with no callback assigned:
*** Booting Zephyr OS build v2.2.0-rc3-69-gd8697ba32e2b ***
Bluetooth initialized
Advertising successfully started
Connected: e5:e0:45:33:29:e6 (random)
[00:00:00.007,324] <inf> fs_nvs: 6 Sectors of 4096 bytes
[00:00:00.007,324] <inf> fs_nvs: alloc wra: 0, f70
[00:00:00.007,354] <inf> fs_nvs: data wra: 0, 124
[00:00:00.008,758] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.008,758] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.008,789] <inf> bt_hci_core: Firmware: Standard Bluetooth controller
App must call settings_load()
[00:00:00.011,199] <err> settings: set-value failure. key: bt/ccc/38cada5db6360 error(-2)
[00:00:00.012,329] <inf> bt_hci_core: Identity: c1:ab:66:13:11:2e (random)
[00:00:00.012,329] <inf> bt_hci_core: HCI: version 5.1 (0x0a) revision 0x0000, manufacturer 0x05f1
[00:00:00.012,329] <inf> bt_hci_core: LMP: version 5.1 (0x0a) subver 0xfff
D1
D0
D3
D2
D4
D5
conn_tx_alloc timeout.
D6
D6
D99
Disconnected (reason 0x08)
Write cmd failed (-57).
And the btmon log: btmon_20200310_nocb.log
It is probably due to this line:
return k_fifo_get(&free_tx, K_FOREVER);
But I see it has been modified to follow the supervision timeout so it times out properly. Does that solve the problem, or does it still deadlock after that?
@Vudentz yes, it's the K_FOREVER. My modifications were to identify where the deadlock was. Using a timeout releases the deadlock, but I am not sure if that is a solution or a workaround. Using a timeout means the disconnect complete is delayed by that timeout, hence the use of a timeout is only a workaround.
We may actually have to introduce a timeout in bt_conn_send_cb, or somehow detect what the supervision timeout is. @jhedberg thoughts? But I wonder: if, during that wait, we receive a disconnect from the remote side, does the RX thread get unblocked? We may have to treat the disconnect event differently in order to handle this properly.
One proposal from @joerchan was to use bt_recv_prio() for the disconnect event; however, that comes with its own issues related to ordering (you could receive packets for a connection after processing its disconnect).
Right, and it might be important data, so we can't just discard it. Are there any other solutions on the table currently? Btw, we want to default BT_CONN_TX_MAX to BT_L2CAP_TX_BUF_COUNT + 1 to try to avoid blocking in bt_conn_send_cb.
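Assuming the standard Zephyr Kconfig symbols, the defaulting suggestion above would correspond to something like the following project configuration sketch (the values are illustrative, not recommendations):

```
# prj.conf sketch (illustrative values): keep one more conn TX context
# than L2CAP TX buffers, so bt_conn_send_cb() does not end up blocking
# on a TX context while net buffers are still available.
CONFIG_BT_L2CAP_TX_BUF_COUNT=3
CONFIG_BT_CONN_TX_MAX=4
```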
One idea that comes to mind is to split the disconnect processing into two parts: one that is safe to do in bt_recv_prio() (e.g. marking the TX path defunct, so that you can't send anymore but the connection can still receive packets), and another that is done in the normal RX thread (until which point you can process RX packets but not send anything; here we finally set the disconnected state on the bt_conn object).
Btw, my suggestion above is not so different from how other bt_recv_prio() handling is done: all the existing handlers do is release some resource (e.g. give a semaphore), which releases some other thread that does the "finalisation" or second stage of the event handling.
@Vudentz @cvinayak @joerchan @jhedberg This is a high-priority bug, so we need to fix it during the 2.3 stabilization period. Is there something we can do here, either as a patch or a mitigation, during the next few days?
I can look into it.
https://github.com/cvinayak/zephyr/commit/0b01ebdd559c009416458c583f122647d98d2bd1
The above branch makes it easy to reproduce (it's a branch with the patch from the issue description applied).
Steps to reproduce (timing related): once Connected: <bdaddr> is printed, reset the central_hr board to cause link loss (the link loss needs to occur during service discovery).
Reproducible in latest master too (62b985409e): https://github.com/cvinayak/zephyr/commit/55153bed706b662018e5a07e20a7e26d4f6ed463
Describe the bug
Performing GATT write commands from a peripheral as fast as possible from the main loop, while the peer central is performing GATT service discovery and a supervision timeout occurs, puts the system in a deadlock.
Preliminary analysis shows the below callpath leading to the deadlock:
The deadlock is caused by exhaustion of all free_tx buffers in the below callpath:
To Reproduce
Steps to reproduce the behavior:
On a Linux PC, use BlueZ and hci_usb (nrf52840_pca10056) as the controller. (The issue should be reproducible using the internal controller too, though this has not been tried.)
bluetoothctl scan on
<ctrl-c>
bluetoothctl connect <bdaddr of the peripheral sample>
Expected behavior
The peripheral sample should gracefully disconnect with reason supervision timeout and resume connectable advertising.
Impact
showstopper
Screenshots or console output
The below log is from the host modified to add a timeout in conn_tx_alloc.
Additional context
Patch file: 0001-Peripheral-with-Write-Commands.patch.txt
btmon log: btmon_recv_thread_stall.txt