HCI_UART: Hang on high rate of notifications

KlemenDEV commented 1 year ago

Describe the bug

I am trying to use HCI_UART (https://docs.zephyrproject.org/latest/samples/bluetooth/hci_uart/README.html) for the controller on nRF52833 DK (https://www.nordicsemi.com/Products/Development-hardware/nrf52833-dk).

For the host, I am using the STM32L476 chip with BTStack (https://github.com/bluekitchen/btstack).

I have contacted the author of the project for some guidelines and successfully got it working as per https://github.com/bluekitchen/btstack/issues/475. It all seems to work right, I am able to use nRF52833 as both host and peripheral.

To Reproduce

I have observed the following scenarios:

(nRF52833 as peripheral) If nRF52833 is set to send a notification to characteristic X upon receiving notification on characteristic Y, it stops operating if the rate of such notifications is high (eg. 50Hz+). Notifications are sent from python based BLE client and my laptop.
(nRF52833 as client) If nRF52833 is made to send notifications to a Python-based client that acts as described in the above scenario, the nRF52833 hangs again when the rate of sending is high.

Sending from the Python BLE client to the Python BLE server works, so either nRF52833 as peripheral or as a client needs to be involved for this to happen, ruling out any issues with my test BLE Python scripts.

Reducing the rate lowers the chance of Zephyr/nRF52833 hanging, seems at around 10-20Hz the system operates reliably.

Expected behavior A 50Hz notification rate should be possible and if not, the HCI transport layer should handle this. I use mechanisms provided by the BTStack library to check on the host if sending data (notification) is possible or not and do not send if the library communicates the system is busy.

I have ruled out this being a bug of BTStack as the reset of the host does not help. A reset of the Zephyr controller is needed for the setup to start working again.

Impact Using the Zephyr with HCI_UART and high notification rates does not seem possible

Logs and console output

I have enabled RTT logs/console and I can see INF data from the HCI_UART main.c being printed, but when the system hangs, no further logs are printed and no errors are printed either. I am not that familiar with debugging Zephyr or nRF5 MCUs to be able to see if core halts or anything of such nature happens.

Environment (please complete the following information):

OS: Windows
Toolchain: Zephyr SDK west
Zephyr commit hash: 3e6bf0359bc65764fe830f6e6fa07b3a02996f37

More context

Some issues I have found that may be related: https://github.com/zephyrproject-rtos/zephyr/issues/30378, but I don't get any errors so this may be a complete miss.

github-actions[bot] commented 1 year ago

Hi @KlemenDEV! We appreciate you submitting your first issue for our open-source project. 🌟

Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙

KlemenDEV commented 1 year ago

As suggested by @mringwal, here are PKLG files for scenario 1, where nRF52833 runs as peripheral (nRF52833 controller with STM32 with BTStack for the host) and BLE Python client that sends data to peripheral with high data rate and reads responses on another characteristic.

HCI dump at the time the nRF52833 hangs (complete dump from the setup, connection establishment, and data sending): hci dump at time of crash.zip

HCI dump after resetting host (Zephyr seems to completely hang at this point): later attempt once zephyr hangs.zip

I have zipped PKLG files as GitHub does not support PKLG files.

KlemenDEV commented 1 year ago

So I have enabled

CONFIG_BT_DEBUG_LOG=y
CONFIG_BT_HCI_CORE_LOG_LEVEL_DBG=n
CONFIG_BT_HCI_DRIVER_LOG_LEVEL_DBG=n
CONFIG_BT_LOG_LEVEL_DBG=y

and also added another thread that just prints a word every second to main.c to test if the core hangs and here are the findings:

the core/kernel does not hang as the ping thread still prints data
last data logged by the BLE debug is this:

00> [00:49:56.671,325] <dbg> bt_ctlr_hci_driver: recv_thread: Packet in: type:3 len:21
00> [00:49:56.671,325] <dbg> bt_hci_raw: bt_recv: buf 0x20007f2c len 21
00> [00:49:56.671,356] <dbg> bt_ctlr_hci_driver: recv_thread: blocking
00> [00:49:56.680,236] <dbg> bt_hci_raw: bt_send: buf 0x20006fe0 len 21
00> [00:49:56.680,267] <dbg> bt_ctlr_hci_driver: hci_driver_send: enter
00> [00:49:56.680,297] <dbg> bt_ctlr_hci_driver: hci_driver_send: exit: 0
00> [00:49:56.701,049] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem taken
00> [00:49:56.701,080] <dbg> bt_ctlr_hci_driver: prio_recv_thread: Num Complete: 0x0000:1
00> [00:49:56.701,110] <dbg> bt_hci_raw: bt_recv: buf 0x20007f2c len 7
00> [00:49:56.701,141] <dbg> bt_ctlr_hci_driver: prio_recv_thread: RX node enqueue
00> [00:49:56.701,171] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem take...
00> [00:49:56.701,171] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem taken
00> [00:49:56.701,202] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem take...
00> [00:49:56.701,263] <dbg> bt_ctlr_hci_driver: recv_thread: unblocked
00> [00:49:56.701,293] <dbg> bt_ctlr_hci_driver: recv_thread: Packet in: type:3 len:21
00> [00:49:56.701,324] <dbg> bt_hci_raw: bt_recv: buf 0x20007f4c len 21
00> [00:49:56.701,354] <dbg> bt_ctlr_hci_driver: recv_thread: blocking
00> [00:49:56.706,939] <dbg> bt_hci_raw: bt_send: buf 0x20006fe0 len 21
00> [00:49:56.706,970] <dbg> bt_ctlr_hci_driver: hci_driver_send: enter
00> [00:49:56.707,000] <inf> bt_ctlr_ull_conn: ll_tx_mem_enqueue: incoming Tx: count= 30, len= 498, rate= 4073 bps.
00> [00:49:56.707,000] <dbg> bt_ctlr_hci_driver: hci_driver_send: exit: 0
00> [00:49:56.731,658] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem taken
00> [00:49:56.731,658] <dbg> bt_ctlr_hci_driver: prio_recv_thread: Num Complete: 0x0000:1
00> [00:49:56.731,689] <dbg> bt_hci_raw: bt_recv: buf 0x20007f2c len 7
00> [00:49:56.731,719] <dbg> bt_ctlr_hci_driver: prio_recv_thread: RX node enqueue
00> [00:49:56.731,750] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem take...
00> [00:49:56.731,781] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem taken
00> [00:49:56.731,781] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem take...
00> [00:49:56.731,903] <dbg> bt_ctlr_hci_driver: recv_thread: unblocked
00> [00:49:56.731,933] <dbg> bt_ctlr_hci_driver: recv_thread: Packet in: type:3 len:21
00> [00:49:56.731,933] <dbg> bt_hci_raw: bt_recv: buf 0x20007f6c len 21
00> [00:49:56.731,964] <dbg> bt_ctlr_hci_driver: recv_thread: blocking
00> [00:49:56.761,047] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem taken
00> [00:49:56.761,077] <dbg> bt_ctlr_hci_driver: prio_recv_thread: RX node enqueue
00> [00:49:56.761,108] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem take...
00> [00:49:56.761,138] <dbg> bt_ctlr_hci_driver: recv_thread: unblocked
00> [00:49:56.791,625] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem taken
00> [00:49:56.791,656] <dbg> bt_ctlr_hci_driver: prio_recv_thread: RX node enqueue
00> [00:49:56.791,687] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem take...
00> [00:49:56.821,502] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem taken
00> [00:49:56.821,533] <dbg> bt_ctlr_hci_driver: prio_recv_thread: RX node enqueue
00> [00:49:56.821,533] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem take...
00> [00:49:56.851,043] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem taken
00> [00:49:56.851,074] <dbg> bt_ctlr_hci_driver: prio_recv_thread: RX node enqueue
00> [00:49:56.851,104] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem take...
00> [00:49:56.881,042] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem taken
00> [00:49:56.881,072] <dbg> bt_ctlr_hci_driver: prio_recv_thread: RX node enqueue
00> [00:49:56.881,103] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem take...
00> [00:49:56.911,041] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem taken
00> [00:49:56.911,071] <dbg> bt_ctlr_hci_driver: prio_recv_thread: RX node enqueue
00> [00:49:56.911,071] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem take...
00> [00:49:56.956,054] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem taken
00> [00:49:56.956,054] <dbg> bt_ctlr_hci_driver: prio_recv_thread: RX node enqueue
00> [00:49:56.956,085] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem take...
00> [00:50:22.441,284] <dbg> bt_ctlr_hci_driver: prio_recv_thread: sem taken

00> [00:50:22.441,284] bt_ctlr_hci_driver: prio_recv_thread: sem taken

is the last log related to BT and sending any new HCI packets from host to controller does not result in any logs, so some BLE related thread seems to be hanging

KlemenDEV commented 1 year ago

Some more findings: I have added a counter to the hci_uart/src/main.c and print counter value from another thread and the while(1) loop gets stuck after the hang, probably waiting on net_buf_get

KlemenDEV commented 1 year ago

It also seems the nRF52833 is still detected by BLE scanners, so advertisements still run (connecting is ofc not possible) even after the hang, but no further HCI communication seems to be possible.

Resetting the host (STM32) makes nRF52833 no longer advertise, and nRF52833 does no longer respond to HCI commands at that point and seems fully stuck (even my custom thread no longer runs)

KlemenDEV commented 1 year ago

I have found out this was a race condition in my code running HCI methods from different threads, probably resulting in wrong HCI commands being sent.

This does not change the fact that it seems one can send data via HCI that makes Zephyr irreversibly hang, not sure if this is ok or not so I will leave this issue open for now.

mringwal commented 1 year ago

@KlemenDEV BTstack is not thread-safe. It's recommended to have all Bluetooth logic on a dedicated thread and use a thread-safe IPC queue to trigger Bluetooth actions etc. - please contact us directly or ask on the BTstack mailing list for additional details).

Anyway, if the Zephyr HCI firmware still hangs, please post updated pklg / Zephyr debug output files to facilitate analysis.

KlemenDEV commented 1 year ago

BTstack is not thread-safe. It's recommended to have all Bluetooth logic on a dedicated thread and use a thread-safe IPC queue to trigger Bluetooth actions etc.

I was aware of this, but I had a bug/oversight in my proof-of-concept code where I still did access from another thread.

The pklg files from above are from the time Zephyr hangs for the Zephyr developers in case hang should not happen with invalid data.

If it is normal for the controller to hang if there is invalid data according to BT spec, then this is probably not a problem at all.

carlescufi commented 1 year ago

This does not change the fact that it seems one can send data via HCI that makes Zephyr irreversibly hang, not sure if this is ok or not so I will leave this issue open for now.

Thanks for getting back to us. Could you please create a new issue with, if possible, a simple way to reproduce the issue. However do bear in mind that HCI H4, the protocol used by the hci_uart sample, is not resilient, so if you send garbage data to it you may indeed hang it forever and that's expected.

zephyrproject-rtos / zephyr

HCI_UART: Hang on high rate of notifications #58236