zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.86k stars 6.62k forks source link

subsystem: Bluetooth LLL: ASSERTION FAIL [!link->next] @ ZEPHYR_BASE/subsys/bluetooth/controller/ll_sw/ull_conn.c:1952 #45314

Closed kuikka closed 2 years ago

kuikka commented 2 years ago

Describe the bug BLE link layer asserts and causes a kernel Oops. (see console log below)

HW: nRF52840-DK SW: Zephyr v3.0.0 This unit is a BLE Central, connecting to a peripheral.

I have a strong suspicion that the ongoing SMP procedure (see smp_security_request and others) with concurrent link usage is triggering this.

In the log below GATT service discovery is started immediately upon connection establishment, if I remove this, the issues is bypassed.

Similarly, if the BLE peripheral starts for example connection interval update procedure upon connection, the same issue happens.

If I wait on either, the issue will likely not happen.

To Reproduce Connect to a peripheral that immediately upon connection initiates security procedure while central is using the BLE link for GATT discovery.

Expected behavior No assert is expected.

Impact Assert and oops are showstoppers.

Logs and console output

*** Booting Zephyr OS build   ***
[00:00:03.084,899] <inf> usb: usb_init
[00:00:03.085,174] <dbg> bt_settings: bt_settings_init:
[00:00:03.089,202] <inf> usb_cdc_acm: Device suspended
[00:00:03.091,308] <inf> fs_nvs: 8 Sectors of 4096 bytes
[00:00:03.091,308] <inf> fs_nvs: alloc wra: 0, e98
[00:00:03.091,339] <inf> fs_nvs: data wra: 0, 344
[00:00:03.093,231] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:03.093,261] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:03.093,292] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.0 Build 0
[00:00:03.093,688] <dbg> bt_smp: bt_smp_init: LE SC enabled
[00:00:03.093,841] <inf> bt_hci_core: No ID address. App must call settings_load()
[00:00:03.093,872] <inf> ble: Bluetooth initialized
[00:00:03.094,177] <dbg> bt_settings: bt_settings_decode_key: Decoded c0147a35d1281 as C0:14:7A:35:D1:28 (random)
[00:00:03.094,818] <dbg> bt_settings: commit:
[00:00:03.095,245] <inf> bt_hci_core: Identity: F3:03:DE:E2:62:AF (random)
[00:00:03.095,306] <inf> bt_hci_core: HCI: version 5.3 (0x0c) revision 0x0000, manufacturer 0x05f1
[00:00:03.095,336] <inf> bt_hci_core: LMP: version 5.3 (0x0c) subver 0xffff
[00:00:03.353,881] <dbg> bt_smp: bt_smp_pkey_ready:
[00:00:03.477,111] <inf> usb_cdc_acm: Device resumed
[00:00:03.477,111] <inf> usb_cdc_acm: from suspend
[00:00:03.531,616] <inf> usb_cdc_acm: Device configured
[00:00:04.094,604] <inf> ble: Scanning started successfully
[00:00:04.174,530] <inf> ble: Device found: C0:14:7A:35:D1:28 (random) (RSSI -22)
[00:00:04.211,090] <dbg> bt_smp: bt_smp_accept: conn 0x20001ac8 handle 0
[00:00:04.211,090] <dbg> bt_smp: bt_smp_connected: chan 0x20001e88 cid 0x0006
[00:00:04.211,273] <inf> ble: Connected: C0:14:7A:35:D1:28 (random)
[00:00:04.211,303] <inf> ble: interval=40, rx_phy=1, rx_phy=1, rx_data_len=27, tx_data_len=27
[00:00:04.211,547] <dbg> bt_settings: bt_settings_encode_key: Encoded path bt/ccc/c0147a35d1281
[00:00:04.212,463] <dbg> bt_smp: bt_smp_recv: Received SMP code 0x0b len 1
[00:00:04.212,463] <dbg> bt_smp: smp_security_request:
ASSERTION FAIL [!link->next] @ ZEPHYR_BASE/subsys/bluetooth/controller/ll_sw/ull_conn.c:1952
[00:00:04.313,079] <inf> gatt: Found 1 00002800-0000-1000-8000-00805f9b34fb 00001800-0000-1000-8000-00805f9b34fb
[00:00:04.313,476] <inf> gatt: Found 10 00002800-0000-1000-8000-00805f9b34fb 00001801-0000-1000-8000-00805f9b34fb
[00:00:04.313,842] <inf> gatt: Found 11 00002800-0000-1000-8000-00805f9b34fb 0000180f-0000-1000-8000-00805f9b34fb
[00:00:04.370,513] <err> os: r0/a1:  0x00000003  r1/a2:  0x00000000  r2/a3:  0x00000001
[00:00:04.370,544] <err> os: r3/a4:  0x000233e1 r12/ip:  0xa0000000 r14/lr:  0x0001ce4b
[00:00:04.370,544] <err> os:  xpsr:  0x4100001b
[00:00:04.370,544] <err> os: Faulting instruction address (r15/pc): 0x0001ce56
[00:00:04.370,574] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:00:04.370,605] <err> os: Fault during interrupt handling

[00:00:04.370,635] <err> os: Current thread: 0x200023c8 (idle 00)
[00:00:04.476,470] <err> os: Halting system

Environment (please complete the following information):

Additional context N/A

kuikka commented 2 years ago

Any advise on the root cause of said assert and what may trigger is would also be appreciated.

cvinayak commented 2 years ago

@kuikka Please use the thread analyzer to profile your application threads program stack runtime requirements, check if all threads are listed by the analyzer and have enough margins.

carlescufi commented 2 years ago

You can use something like this:

CONFIG_THREAD_NAME=y
CONFIG_THREAD_ANALYZER=y
CONFIG_THREAD_ANALYZER_AUTO=y
CONFIG_THREAD_ANALYZER_RUN_UNLOCKED=y
CONFIG_THREAD_ANALYZER_USE_PRINTK=y
CONFIG_THREAD_ANALYZER_AUTO_INTERVAL=5
carlescufi commented 2 years ago

Since the suspicion is of a stack overflow, downgrading to Low until we get more information from the reporter.

carlescufi commented 2 years ago

Closing this for inactivity, please reopen if it's still an issue

kuikka commented 2 years ago

I don't seem to have the permission to reopen the case so I created another one: https://github.com/zephyrproject-rtos/zephyr/issues/46235.