zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.34k stars 6.33k forks source link

Bluetooth: Non functional on nRF51 with fault #74345

Open thedjnK opened 2 months ago

thedjnK commented 2 months ago

Describe the bug It seems that bluetooth is non functional on nRF51 devices, running the peripheral_hr sample application gives the following output:

*** Booting Zephyr OS build v3.6.0-6522-g90ecdf0dab8c ***
[00:00:00.316,192] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.316,345] <inf> bt_hci_core: HW Variant: nRF51x (0x0001)
[00:00:00.316,467] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.6 Build 99
[00:00:00.320,556] <inf> bt_hci_core: Identity: C1:58:CF:2B:AB:12 (random)
[00:00:00.320,739] <inf> bt_hci_core: HCI: version 5.4 (0x0d) revision 0x0000, manufacturer 0x05f1
[00:00:00.320,922] <inf> bt_hci_core: LMP: version 5.4 (0x0d) subver 0xffff
Bluetooth initialized
Starting Legacy Advertising (connectable and scannable)
Advertising successfully started
Connected
[00:00:04.741,607] <err> os: ***** HARD FAULT *****
[00:00:04.741,699] <err> os: r0/a1:  0x02000400  r1/a2:  0x20001acc  r2/a3:  0x000100b9
[00:00:04.741,790] <err> os: r3/a4:  0x200025b7 r12/ip:  0xff774201 r14/lr:  0x00013429
[00:00:04.741,821] <err> os:  xpsr:  0x01000011
[00:00:04.741,882] <err> os: Faulting instruction address (r15/pc): 0x000100ba
[00:00:04.742,034] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:04.742,095] <err> os: Fault during interrupt handling

[00:00:04.742,248] <err> os: Current thread: 0x20001010 (unknown)
[00:00:04.815,399] <err> os: Halting system

This address is zephyr/subsys/bluetooth/controller/ll_sw/ull_llcp_common.c:936. I have also tried this in a custom application which has more logic and that faulted with zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_conn.c:583 (discriminator 3) instead, that one is shown here:

E: ***** HARD FAULT *****
E: ARCH_EXCEPT with reason 3

E: r0/a1:  0x00000003  r1/a2:  0x0203001b  r2/a3:  0x40001000
E: r3/a4:  0x00000003 r12/ip:  0x000000a5 r14/lr:  0x0001eae9
E:  xpsr:  0x01000011
E: Faulting instruction address (r15/pc): 0x0001eaf8
E: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
E: Fault during interrupt handling

E: Current thread: 0x200010a8 (unknown)
E: Halting system

To Reproduce Run peripheral_hr with default config on an nrf51-based board

Expected behavior Bluetooth to work

Impact Showstopper

Environment (please complete the following information):

thedjnK commented 2 months ago

Have had a look back and seemingly issue has been present since zephyr 3.2, in 3.1 it seems stable but in 3.2 it starts faulting

thedjnK commented 2 months ago

e1c2c36f569a8e06184a2c5574f1d25ba00df8d7 is the offending commit

cvinayak commented 2 months ago

Tried the peripheral_hr sample on a nrf51dongle/nrf51822, I do not see an issue:

[2024-06-16 06:42:00.006] Booting Zephyr OS build v3.6.0-6486-gae61a77781fc [2024-06-16 06:42:00.013] [00:00:00.007,873] bt_hci_core: HW Platform: Nordic Semiconductor (0x0002) [2024-06-16 06:42:00.025] [00:00:00.007,995] bt_hci_core: HW Variant: nRF51x (0x0001) [2024-06-16 06:42:00.034] [00:00:00.008,148] bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.6 Build 99 [2024-06-16 06:42:02.702] [00:00:00.012,084] bt_hci_core: Identity: D1:54:BC:A1:F7:9B (random) [2024-06-16 06:42:02.711] [00:00:00.012,298] bt_hci_core: HCI: version 5.4 (0x0d) revision 0x0000, manufacturer 0x05f1 [2024-06-16 06:42:02.724] [00:00:00.012,481] bt_hci_core: LMP: version 5.4 (0x0d) subver 0xffff [2024-06-16 06:42:02.733] Bluetooth initialized [2024-06-16 06:42:02.735] Starting Legacy Advertising (connectable and scannable) [2024-06-16 06:42:02.741] Advertising successfully started [2024-06-16 06:42:02.745] Checking LED device...done. [2024-06-16 06:42:02.748] Configuring GPIO pin...done. [2024-06-16 06:42:02.751] Start blinking LED... [2024-06-16 06:42:08.035] Connected [2024-06-16 06:42:08.036] Stop blinking LED. [2024-06-16 06:42:26.764] [00:00:25.776,153] hrs: HRS notifications enabled [2024-06-16 06:42:26.771] HRS notification status changed: enabled [2024-06-16 06:42:40.915] [00:00:39.926,025] bas: BAS Notifications enabled [2024-06-16 06:42:50.816] [00:00:49.827,148] bas: BAS Notifications disabled [2024-06-16 06:42:50.822] [00:00:49.827,392] hrs: HRS notifications disabled [2024-06-16 06:42:50.829] HRS notification status changed: disabled [2024-06-16 06:42:50.835] Disconnected (reason 0x13) [2024-06-16 06:42:50.838] Starting Legacy Advertising (connectable and scannable) [2024-06-16 06:42:50.843] Start blinking LED...

  1. What toolchain are you using? I am on upstream zephyr 3b85e8b8ea2 and using the Zephyr SDK: zephyr-sdk-0.16.5-
  2. Which peer central device did you use to connect from? I used LightBlue v5.0.6, iPhone 13 Pro, iOS 17.5.1
thedjnK commented 2 months ago

Try connecting and disconnecting a few times, from trying many commits trying to find the issue it can happen instantly or it can be on the 2nd or 3rd connection. Current commit: 66630961627f4d8625e2dbe1cb7afe4abd690b2e but it happens on every commit after and including e1c2c36f569a8e06184a2c5574f1d25ba00df8d7. Was using sdk-ng 0.16.5-1 where I saw issue, updated to 0.16.8 and still see issue. Remote device is nRF connect 2.4.9, iPhone SE 2020, iOS 13.4.1 (same configuration for connecting to other zephyr devices e.g. nrf52 and nrf53 without issue)

thedjnK commented 2 months ago

Retried with LED support enabled:

*** Booting Zephyr OS build v3.6.0-6528-g66630961627f ***
[00:00:00.317,077] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.317,230] <inf> bt_hci_core: HW Variant: nRF51x (0x0001)
[00:00:00.317,352] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.7 Build 0
[00:00:00.321,441] <inf> bt_hci_core: Identity: C1:58:CF:2B:AB:12 (random)
[00:00:00.321,624] <inf> bt_hci_core: HCI: version 5.4 (0x0d) revision 0x0000, manufacturer 0x05f1
[00:00:00.321,807] <inf> bt_hci_core: LMP: version 5.4 (0x0d) subver 0xffff
Bluetooth initialized
Starting Legacy Advertising (connectable and scannable)
Advertising successfully started
Checking LED device...done.
Configuring GPIO pin...done.
Start blinking LED...
Connected
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/ull_llcp_common.c:707
[00:00:22.656,494] <err> os: ***** HARD FAULT *****
[00:00:22.656,524] <err> os: ARCH_EXCEPT with reason 3

[00:00:22.656,616] <err> os: r0/a1:  0x00000003  r1/a2:  0x0000007c  r2/a3:  0x00000000
[00:00:22.656,707] <err> os: r3/a4:  0x00000003 r12/ip:  0x8100efeb r14/lr:  0x0001810b
[00:00:22.656,738] <err> os:  xpsr:  0x01000011
[00:00:22.656,799] <err> os: Faulting instruction address (r15/pc): 0x000100b4
[00:00:22.656,951] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:00:22.657,012] <err> os: Fault during interrupt handling

[00:00:22.657,165] <err> os: Current thread: 0x20001040 (unknown)
[00:00:22.747,955] <err> os: Halting system

That line is:

static void lp_comm_send_req(struct ll_conn *conn, struct proc_ctx *ctx, uint8_t evt, void *param)
{
        switch (ctx->proc) {
...
        default:
                /* Unknown procedure */
                LL_ASSERT(0); <----
        }
cvinayak commented 2 months ago

Is this reproducible on nRF development boards or is this custom board?

thedjnK commented 2 months ago

Is this reproducible on nRF development boards or is this custom board?

This is a third party nRF51-based module (rm1xx_dvk) with XLR3, all that's different in the config for this application is the UART mapping, I was able to fault it also by building for nrf51dk and flashing that instead (albeit no UART output). Only nordic nRF51 dev board I have is from 2014 and pre-XLR3 silicon which I assume will not work at all

cvinayak commented 2 months ago

Can’t think of anything other than to increase call stack for Bluetooth rx thread. Not sure which kconfig, stack has been undergoing changes lately.

thedjnK commented 2 months ago

Have tried increasing stack sizes, no difference. Though changing enabled drivers changes where it happens, having a look at it it's generally on lines setting a variable. Are the structs all properly setup for cortex M0 i.e. 4-byte boundaries or packed, since it doesn't support non-aligned memory access?

cvinayak commented 2 months ago

Are the structs all properly setup for cortex M0 i.e. 4-byte boundaries or packed, since it doesn't support non-aligned memory access?

Your referenced line accesses 8-bit value: https://github.com/zephyrproject-rtos/zephyr/blob/90ecdf0dab8c5d63f4c5aff15a11802e25a0c57f/subsys/bluetooth/controller/ll_sw/ull_llcp_common.c#L936-L936

will not cause a hardfault due to memory alignment.

~I have nRF51422 QFACA1 on the dongle year 2015 week 13. I am not able to reproduce the reported issue.~

strike that... I got the hardfault, 66630961627:

[2024-06-16 21:29:31.167] Booting Zephyr OS build v3.7.0-rc1 [2024-06-16 21:29:31.173] [00:00:00.356,414] bt_hci_core: HW Platform: Nordic Semiconductor (0x0002) [2024-06-16 21:29:31.184] [00:00:00.356,536] bt_hci_core: HW Variant: nRF51x (0x0001) [2024-06-16 21:29:31.194] [00:00:00.356,689] bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.7 Build 0 [2024-06-16 21:29:33.864] [00:00:00.360,687] bt_hci_core: Identity: D1:54:BC:A1:F7:9B (random) [2024-06-16 21:29:33.873] [00:00:00.360,870] bt_hci_core: HCI: version 5.4 (0x0d) revision 0x0000, manufacturer 0x05f1 [2024-06-16 21:29:33.885] [00:00:00.361,083] bt_hci_core: LMP: version 5.4 (0x0d) subver 0xffff [2024-06-16 21:29:33.894] Bluetooth initialized [2024-06-16 21:29:33.896] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:29:33.903] Advertising successfully started [2024-06-16 21:29:33.907] Checking LED device...done. [2024-06-16 21:29:33.910] Configuring GPIO pin...done. [2024-06-16 21:29:33.913] Start blinking LED... [2024-06-16 21:29:37.509] Connected [2024-06-16 21:29:37.510] Stop blinking LED. [2024-06-16 21:29:40.829] Disconnected (reason 0x13) [2024-06-16 21:29:40.832] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:29:40.838] Start blinking LED... [2024-06-16 21:29:48.220] Connected [2024-06-16 21:29:48.221] Stop blinking LED. [2024-06-16 21:29:51.689] Disconnected (reason 0x13) [2024-06-16 21:29:51.692] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:29:51.698] Start blinking LED... [2024-06-16 21:29:55.245] Connected [2024-06-16 21:29:55.248] Stop blinking LED. [2024-06-16 21:29:57.839] Disconnected (reason 0x13) [2024-06-16 21:29:57.842] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:29:57.848] Start blinking LED... [2024-06-16 21:30:00.317] Connected [2024-06-16 21:30:00.319] ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/ull_llcp_common.c:707 [2024-06-16 21:30:00.329] [00:00:29.486,816] os: HARD FAULT [2024-06-16 21:30:00.336] [00:00:29.486,846] os: ARCH_EXCEPT with reason 3 [2024-06-16 21:30:00.343] [2024-06-16 21:30:00.343] [00:00:29.486,968] os: r0/a1: 0x00000003 r1/a2: 0x000000dc r2/a3: 0x00000000 [2024-06-16 21:30:00.354] [00:00:29.487,030] os: r3/a4: 0x00000003 r12/ip: 0x14817eed r14/lr: 0x00017ffb [2024-06-16 21:30:00.365] [00:00:29.487,091] os: xpsr: 0x01000011 [2024-06-16 21:30:00.372] [00:00:29.487,121] os: Faulting instruction address (r15/pc): 0x000100b4 [2024-06-16 21:30:00.382] [00:00:29.487,304] os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0 [2024-06-16 21:30:00.391] [00:00:29.487,335] os: Fault during interrupt handling [2024-06-16 21:30:00.399] [2024-06-16 21:30:00.400] [00:00:29.487,518] os: Current thread: 0x20001040 (unknown) [2024-06-16 21:30:00.408] [00:00:29.578,155] os: Halting system

investigating...

cvinayak commented 2 months ago

hmmm...Does your board have external 32KHz crystal? if not, are you using CONFIG_CLOCK_CONTROL_NRF_K32SRC_RC=y ?

My dongle has a broken 32KHz external crystal, and my hardfault reproduced was when I did not have CONFIG_CLOCK_CONTROL_NRF_K32SRC_RC=y ...

Otherwise, may be I am not hitting a condition that would hardfault:

[2024-06-16 21:36:56.826] Booting Zephyr OS build v3.7.0-rc1 [2024-06-16 21:36:56.832] [00:00:00.007,904] bt_hci_core: HW Platform: Nordic Semiconductor (0x0002) [2024-06-16 21:36:56.843] [00:00:00.008,056] bt_hci_core: HW Variant: nRF51x (0x0001) [2024-06-16 21:36:56.852] [00:00:00.008,209] bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.7 Build 0 [2024-06-16 21:36:59.519] [00:00:00.012,207] bt_hci_core: Identity: D1:54:BC:A1:F7:9B (random) [2024-06-16 21:36:59.528] [00:00:00.012,420] bt_hci_core: HCI: version 5.4 (0x0d) revision 0x0000, manufacturer 0x05f1 [2024-06-16 21:36:59.540] [00:00:00.012,603] bt_hci_core: LMP: version 5.4 (0x0d) subver 0xffff [2024-06-16 21:36:59.550] Bluetooth initialized [2024-06-16 21:36:59.552] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:36:59.558] Advertising successfully started [2024-06-16 21:36:59.562] Checking LED device...done. [2024-06-16 21:36:59.565] Configuring GPIO pin...done. [2024-06-16 21:36:59.568] Start blinking LED... [2024-06-16 21:37:07.058] Connected [2024-06-16 21:37:07.059] Stop blinking LED. [2024-06-16 21:37:09.864] Disconnected (reason 0x13) [2024-06-16 21:37:09.867] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:37:09.873] Start blinking LED... [2024-06-16 21:37:12.935] Connected [2024-06-16 21:37:12.936] Stop blinking LED. [2024-06-16 21:37:15.085] Disconnected (reason 0x13) [2024-06-16 21:37:15.088] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:37:15.094] Start blinking LED... [2024-06-16 21:37:18.047] Connected [2024-06-16 21:37:18.049] Stop blinking LED. [2024-06-16 21:37:20.724] Disconnected (reason 0x13) [2024-06-16 21:37:20.727] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:37:20.733] Start blinking LED... [2024-06-16 21:37:23.821] Connected [2024-06-16 21:37:23.822] Stop blinking LED. [2024-06-16 21:38:15.803] Disconnected (reason 0x13) [2024-06-16 21:38:15.806] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:38:15.813] Start blinking LED... [2024-06-16 21:38:19.322] Connected [2024-06-16 21:38:19.323] Stop blinking LED. [2024-06-16 21:38:22.283] Disconnected (reason 0x13) [2024-06-16 21:38:22.286] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:38:22.293] Start blinking LED... [2024-06-16 21:38:31.707] Connected [2024-06-16 21:38:31.708] Stop blinking LED. [2024-06-16 21:38:38.913] Disconnected (reason 0x13) [2024-06-16 21:38:38.916] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:38:38.922] Start blinking LED... [2024-06-16 21:38:42.105] Connected [2024-06-16 21:38:42.106] Stop blinking LED. [2024-06-16 21:38:46.703] Disconnected (reason 0x13) [2024-06-16 21:38:46.706] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:38:46.712] Start blinking LED... [2024-06-16 21:38:55.989] Connected [2024-06-16 21:38:55.990] Stop blinking LED. [2024-06-16 21:39:00.383] Disconnected (reason 0x13) [2024-06-16 21:39:00.386] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:39:00.392] Start blinking LED... [2024-06-16 21:39:13.445] Connected [2024-06-16 21:39:13.447] Stop blinking LED. [2024-06-16 21:39:17.392] Disconnected (reason 0x13) [2024-06-16 21:39:17.395] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:39:17.401] Start blinking LED...

thedjnK commented 2 months ago

hmmm...Does your board have external 32KHz crystal? if not, are you using CONFIG_CLOCK_CONTROL_NRF_K32SRC_RC=y ?

My dongle has a broken 32KHz external crystal, and my hardfault reproduced was when I did not have CONFIG_CLOCK_CONTROL_NRF_K32SRC_RC=y ...

Otherwise, may be I am not hitting a condition that would hardfault:

[2024-06-16 21:36:56.826] Booting Zephyr OS build v3.7.0-rc1 [2024-06-16 21:36:56.832] [00:00:00.007,904] bt_hci_core: HW Platform: Nordic Semiconductor (0x0002) [2024-06-16 21:36:56.843] [00:00:00.008,056] bt_hci_core: HW Variant: nRF51x (0x0001) [2024-06-16 21:36:56.852] [00:00:00.008,209] bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.7 Build 0 [2024-06-16 21:36:59.519] [00:00:00.012,207] bt_hci_core: Identity: D1:54:BC:A1:F7:9B (random) [2024-06-16 21:36:59.528] [00:00:00.012,420] bt_hci_core: HCI: version 5.4 (0x0d) revision 0x0000, manufacturer 0x05f1 [2024-06-16 21:36:59.540] [00:00:00.012,603] bt_hci_core: LMP: version 5.4 (0x0d) subver 0xffff [2024-06-16 21:36:59.550] Bluetooth initialized [2024-06-16 21:36:59.552] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:36:59.558] Advertising successfully started [2024-06-16 21:36:59.562] Checking LED device...done. [2024-06-16 21:36:59.565] Configuring GPIO pin...done. [2024-06-16 21:36:59.568] Start blinking LED... [2024-06-16 21:37:07.058] Connected [2024-06-16 21:37:07.059] Stop blinking LED. [2024-06-16 21:37:09.864] Disconnected (reason 0x13) [2024-06-16 21:37:09.867] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:37:09.873] Start blinking LED... [2024-06-16 21:37:12.935] Connected [2024-06-16 21:37:12.936] Stop blinking LED. [2024-06-16 21:37:15.085] Disconnected (reason 0x13) [2024-06-16 21:37:15.088] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:37:15.094] Start blinking LED... [2024-06-16 21:37:18.047] Connected [2024-06-16 21:37:18.049] Stop blinking LED. [2024-06-16 21:37:20.724] Disconnected (reason 0x13) [2024-06-16 21:37:20.727] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:37:20.733] Start blinking LED... [2024-06-16 21:37:23.821] Connected [2024-06-16 21:37:23.822] Stop blinking LED. [2024-06-16 21:38:15.803] Disconnected (reason 0x13) [2024-06-16 21:38:15.806] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:38:15.813] Start blinking LED... [2024-06-16 21:38:19.322] Connected [2024-06-16 21:38:19.323] Stop blinking LED. [2024-06-16 21:38:22.283] Disconnected (reason 0x13) [2024-06-16 21:38:22.286] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:38:22.293] Start blinking LED... [2024-06-16 21:38:31.707] Connected [2024-06-16 21:38:31.708] Stop blinking LED. [2024-06-16 21:38:38.913] Disconnected (reason 0x13) [2024-06-16 21:38:38.916] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:38:38.922] Start blinking LED... [2024-06-16 21:38:42.105] Connected [2024-06-16 21:38:42.106] Stop blinking LED. [2024-06-16 21:38:46.703] Disconnected (reason 0x13) [2024-06-16 21:38:46.706] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:38:46.712] Start blinking LED... [2024-06-16 21:38:55.989] Connected [2024-06-16 21:38:55.990] Stop blinking LED. [2024-06-16 21:39:00.383] Disconnected (reason 0x13) [2024-06-16 21:39:00.386] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:39:00.392] Start blinking LED... [2024-06-16 21:39:13.445] Connected [2024-06-16 21:39:13.447] Stop blinking LED. [2024-06-16 21:39:17.392] Disconnected (reason 0x13) [2024-06-16 21:39:17.395] Starting Legacy Advertising (connectable and scannable) [2024-06-16 21:39:17.401] Start blinking LED...

It has the 32KHz crystal yes, and that option is selected in Kconfig. Have tried falling back to RC and I still get the fault

cvinayak commented 2 months ago

FYI... i have been able to reproduce all your reported occurrence of hardfaults/assertions in variety of compilations I was doing... investigating them...

cvinayak commented 2 months ago

sniffer log with assertion feature_ex_collision_assertion_20240617.zip

@erbr-ot @thoh-ot is there any reason for an assertion when feature exchange crossover in the same connection event?

cvinayak commented 2 months ago

v3.7.0-rc1 is definitely broken for CM0 for thread swapping (i may be wrong though). I made two branch of changes to make the Controller more robust, and I see that changes over v3.6.0 is stable and has no weird callstack... below is weird callstack at hardfault when based on v3.7.0-rc1:

image

two branches: v3.6.0 based: https://github.com/cvinayak/zephyr/tree/github_llcp_lock_data_v3_6_0_base (https://github.com/cvinayak/zephyr/commit/6a085876957ac336346c9dce93f13a4c897a5cff) v3.7.0-rc1 based: https://github.com/cvinayak/zephyr/tree/github_llcp_remote_fault_fix (https://github.com/cvinayak/zephyr/commit/5f166aa42a98bc88611f2dc312eab4142ef0ef71)

thedjnK commented 2 months ago

3.7 introduced changes to the code by arm, which happened to break a lot of things in the first implementation because it was not done right but the second iteration supposedly fixed them (changing ASM to C), trying to remember who it was...

thedjnK commented 2 months ago

https://github.com/zephyrproject-rtos/zephyr/pull/65071 here it is, try reverting these and see if it improves

thedjnK commented 2 months ago

v3.7.0-rc1 is definitely broken for CM0 for thread swapping (i may be wrong though).

ping @ithinuel

ithinuel commented 2 months ago

I ran Zephyr’s tests on a yet-to-upstream target mps2/an383 configured with 8 MPU regions, SYSTick & VTOR using Arm FVP. The only failures I have to report/investigate are around llext, So I’m fairly confident things work as expected on Cortex-M0/+ 👍 (appart for llext).

I may have a bbc:microbit somewhere, I’ll try to find it and reproduce the failure reported here.

cvinayak commented 2 months ago

@ithinuel am I seeing right below? later has more instructions and possible compiler instructions re-ordering, compare to former v2.6.0 based code?

v2.6.0 based: image v3.7.0-rc1 based: image

I am also seeing store instructions interleaved, like here: image

Are these intentional?

ithinuel commented 2 months ago

I cannot comment on what was there back in v2.6, I don’t know what was the state of Zephyr back then. It seems like the z_pend_curr function is doing a few more things than before, so a slightly longer function is to be expected I guess. Order and interleaving of instructions is up to the compiler’s optimisations.

cvinayak commented 2 months ago

z_pend_curr is just an example, but I believe that not much has changed in c code for this function between v3.6 and v3.7 as evident in the screenshot of list file i pasted earlier.

Also, could you (add) explanation like before for why b/bx has been replaced with bl/blx with c implementation of isr_wrapper? image

Also please take a look at this, cost of maintainability versus efficient implementation that @stephanosio has already stated here https://github.com/zephyrproject-rtos/zephyr/pull/65071#pullrequestreview-2020028109

image

Every added instruction (branches) cost current consumption on a, say CR2032 (~200mAh), battery! I would think it affected all varient of ARM cortex SoC using the new C implementations.

Assigning the issue to @ithinuel to analyze the issue at hand and also provide CPU usage comparisons. (we do not have automated test to catch such regressions, but do have some manual way like observing the in-system profiling like CONFIG_BT_CTLR_PROFILE_ISR in Bluetooth Controller. (again, i do not run them manually often, hence, would not notice issues until there is dramatic change leading to system faults/assertion failures).

FYI @carlescufi

ithinuel commented 2 months ago

The reason for using bx/blx instead of b/bl can be found in https://github.com/zephyrproject-rtos/zephyr/pull/11189, and see https://github.com/zephyrproject-rtos/zephyr/pull/11189#issuecomment-437285901 for a more specific reason as to why applying this to more occurences. We should not sacrifice correctness for speed and using C removes that burden from us and let the compiler/linker choose the best instruction/memory mapping to accomodate the constraints.

In your last screenshots, we can see that the jump to z_arm_int_exit is infered as a direct branch rather than the indirect one using an extra ldr. There must be a good reason for the compiler not to have infered a tail-call optimisation there, but I have not the bandwidth to investigate this choice (even if I’m curious about it).

The fact that z_arm_int_exit is longer is because it does more things (see https://github.com/zephyrproject-rtos/zephyr/issues/61761).

Regardless, I fail to see how this is related to this issue.

cvinayak commented 2 months ago

Regardless, I fail to see how this is related to this issue.

Bactrace observed during debugging lead to inspection of the swap implementation. I do not rule out bugs in other part of the sample/application written in C, which already is very difficult to follow when backtrace skip function call path like in the case observed in this issue.

Can the debugger be fooled into thinking that it is in exception when returned from the thread mode function?

We should not sacrifice correctness for speed and using C removes that burden from us

Correct efficient code for the CPU, I hardly read a stable implementation.

The reason for using bx/blx instead of b/bl can be found in #11189, and see #11189 (comment) for a more specific reason as to why applying this to more occurences.

Are you sure you are correctly justify my question? "why b/bx has been replaced with bl/blx" .

(i will be on vacation, but the issue can easily be reproduced on nRF51 SoC based boards anyone willing to investigate).

ithinuel commented 2 months ago

Can the debugger be fooled into thinking that it is in exception when returned from the thread mode function?

Debugger can only be mildly trusted as the system can be corrupted in very confusing ways for the debugger. Backtrace particularly are dependent on several factors such as (but not limited to):

Are you sure you are correctly justify my question? "why b/bx has been replaced with bl/blx" .

The replacement I know about is b/bl with bx/blx in ASM files and inline ASM. the change from b/bx to bl/blx in compiled C code might be due to inlining, tailcall opt etc. I am not aware of such systematic substitution in ASM code.

cvinayak commented 2 months ago

Bingo! Sorry for all the noise... seems I may be onto something... self induced ISR latency, could be due to more CPU use in exception entry/exit?

[2024-06-19 12:54:36.684] [00:03:17.297,241] bt_ctlr_hci: l: 9, 8, 33; t: 40, 19, 60; cpu: 110, 0, 245, 0. [2024-06-19 12:54:36.695] [00:03:17.298,004] bt_ctlr_hci: l: 13, 8, 33; t: 19, 19, 60; cpu: 110, 0, 245, 0. [2024-06-19 12:54:36.708] [00:03:17.298,736] bt_ctlr_hci: l: 13, 8, 33; t: 20, 19, 60; cpu: 110, 0, 245, 0. [2024-06-19 12:54:36.719] [00:03:17.399,261] bt_ctlr_hci: l: 9, 8, 33; t: 40, 19, 60; cpu: 110, 0, 245, 0. [2024-06-19 12:54:36.730] [00:03:17.399,902] bt_ctlr_hci: l: 9, 8, 33; t: 40, 19, 60; cpu: 110, 0, 245, 0. [2024-06-19 12:54:36.740] [00:03:17.400,634] bt_ctlr_hci: l: 13, 8, 33; t: 19, 19, 60; cpu: 110, 0, 245, 0. [2024-06-19 12:54:36.932] [00:03:17.500,701] bt_ctlr_hci: l: 13, 8, 33; t: 20, 19, 60; cpu: 110, 0, 245, 0. [2024-06-19 12:54:36.942] [00:03:17.501,312] bt_ctlr_hci: l: 9, 8, 33; t: 40, 19, 60; cpu: 110, 0, 245, 0. [2024-06-19 12:54:36.953] [00:03:17.502,044] bt_ctlr_hci: l: 9, 8, 33; t: 40, 19, 60; cpu: 110, 0, 245, 0. [2024-06-19 12:54:36.964] [00:03:17.606,353] bt_ctlr_hci: l: 13, 8, 33; t: 19, 19, 60; cpu: 110, 0, 245, 0. [2024-06-19 12:54:36.974] [00:03:17.606,903] bt_ctlr_hci: l: 12, 8, 33; t: 20, 19, 60; cpu: 110, 0, 245, 0. [2024-06-19 12:54:36.985] Connected [2024-06-19 12:54:36.986] ASSERTION FAIL [!radio_is_ready()] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_conn.c:569 [2024-06-19 12:54:36.999] [00:03:17.631,866] os: HARD FAULT [2024-06-19 12:54:37.006] [00:03:17.631,896] os: ARCH_EXCEPT with reason 3 [2024-06-19 12:54:37.012] [2024-06-19 12:54:37.013] [00:03:17.631,988] os: r0/a1: 0x00000003 r1/a2: 0x000000f6 r2/a3: 0x00000000 [2024-06-19 12:54:37.024] [00:03:17.632,110] os: r3/a4: 0x00000003 r12/ip: 0x00000000 r14/lr: 0x000188cb [2024-06-19 12:54:37.035] [00:03:17.632,141] os: xpsr: 0x01000011 [2024-06-19 12:54:37.042] [00:03:17.632,171] os: Faulting instruction address (r15/pc): 0x00012e14 [2024-06-19 12:54:37.052] [00:03:17.632,354] os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0 [2024-06-19 12:54:37.061] [00:03:17.632,385] os: Fault during interrupt handling [2024-06-19 12:54:37.069] [2024-06-19 12:54:37.069] [00:03:17.632,537] os: Current thread: 0x200012a0 (unknown) [2024-06-19 12:54:37.078] [00:03:17.779,144] os: Halting system

Enabling CONFIG_BT_CTLR_PROFILE_ISR may be indicating to me that Radio IRQ after transmitting a PDU if has a latency higher than 150 us to vector in the Tx Radio ISR, the Radio DMA could be receiving a packet at the previous setup empty PDU buffer...

which looks like this in my map file:

fill 0x000000002000250f 0x1 .bss._pkt_empty 0x0000000020002510 0x3 zephyr/subsys/bluetooth/controller/ll_sw/nordic/libsubsysbluetoothcontrollerll_sw__nordic.a(radio.c.obj) fill 0x0000000020002513 0x1 .bss.isr_cb_param 0x0000000020002514 0x4 zephyr/subsys/bluetooth/controller/ll_sw/nordic/libsubsysbluetoothcontrollerll_swnordic.a(radio.c.obj) .bss.isr_cb 0x0000000020002518 0x4 zephyr/subsys/bluetooth/controller/ll_sw/nordic/libsubsysbluetoothcontrollerll_swnordic.a(radio.c.obj) .bss.link.0 0x000000002000251c 0x8 zephyr/subsys/bluetooth/controller/ll_sw/nordic/libsubsysbluetoothcontroller__ll_swnordic.a(ticker.c.obj)

isr_cb_param and isr_cb is being corrupted?.... will investigate further.

cvinayak commented 1 month ago

@thedjnK Could you give a try with https://github.com/zephyrproject-rtos/zephyr/pull/74916, this should avoid buffer overrun under high ISR latencies and hopefully let the Controller be able to assert when Radio has started tx/rx without radio ISR setting up new radio packet pointer.

Thinking aloud here, Controller implementation uses CPU in the priority value of 0 (highest) in nRF51 for all scheduling, control procedure and radio packet management (LLL_PRIO = ULL_HIGH_PRIO = ULL_LOW_PRIO for nRF51). This means, as a peripheral after reception of a PDU in a connection event, when transmitting back empty PDU (smallest) then next Radio ISR (tx completed) arrives in 230 us (150 tIFS + 80 us for empty PDU). In-system profiling (only valid inside radio event), indicates the ULL_HIGH CPU use has increased from ~104 us to ~215 us when we switched to redesigned LLCP in the controller, this is tight for nRF51 between tx/rx radio ISR execution.

snippet from testing on nRF51:

LLCP: [00:01:26.369,049] bt_ctlr_hci: l: 13, 8, 38; t: 20, 19, 60; cpu: 110, 0, 215, 0. [00:01:26.369,812] bt_ctlr_hci: l: 12, 8, 38; t: 20, 19, 60; cpu: 110, 0, 215, 0. [00:01:26.471,649] bt_ctlr_hci: l: 13, 8, 38; t: 19, 19, 60; cpu: 110, 0, 215, 0. [00:01:26.472,381] bt_ctlr_hci: l: 9, 8, 38; t: 40, 19, 60; cpu: 110, 0, 215, 0. [00:01:26.472,747] bt_ctlr_hci: l: 9, 8, 38; t: 40, 19, 60; cpu: 110, 0, 215, 0. [00:01:26.473,571] bt_ctlr_hci: l: 13, 8, 38; t: 20, 19, 60; cpu: 110, 0, 215, 0. [00:01:26.577,850] bt_ctlr_hci: l: 13, 8, 38; t: 19, 19, 60; cpu: 110, 0, 215, 0. [00:01:26.578,491] bt_ctlr_hci: l: 9, 8, 38; t: 40, 19, 60; cpu: 110, 0, 215, 0. [00:01:26.579,193] bt_ctlr_hci: l: 9, 8, 38; t: 40, 19, 60; cpu: 110, 0, 215, 0. [00:01:26.685,119] bt_ctlr_hci: l: 13, 8, 38; t: 20, 19, 60; cpu: 110, 0, 215, 0. [00:01:26.685,852] bt_ctlr_hci: l: 12, 8, 38; t: 20, 19, 60; cpu: 110, 0, 215, 0. [00:01:26.686,187] bt_ctlr_hci: l: 9, 8, 38; t: 41, 19, 60; cpu: 110, 0, 215, 0. [00:01:26.687,042] bt_ctlr_hci: l: 12, 8, 38; t: 20, 19, 60; cpu: 110, 0, 215, 0. LLCP_LEGACY: [00:00:10.574,554] bt_ctlr_hci: l: 9, 8, 23; t: 51, 19, 60; cpu: 110, 0, 104, 0. [00:00:10.874,542] bt_ctlr_hci: l: 9, 8, 23; t: 45, 19, 60; cpu: 110, 0, 104, 0. [00:00:10.904,693] bt_ctlr_hci: l: 9, 8, 23; t: 51, 19, 60; cpu: 110, 0, 104, 0. [00:00:11.534,393] bt_ctlr_hci: l: 9, 8, 23; t: 56, 19, 60; cpu: 110, 0, 104, 0. [00:00:11.535,766] bt_ctlr_hci: l: 9, 8, 23; t: 60, 19, 60; cpu: 110, 0, 104, 0. [00:00:11.564,849] bt_ctlr_hci: l: 9, 8, 23; t: 57, 19, 60; cpu: 110, 0, 104, 0. [00:00:11.594,696] bt_ctlr_hci: l: 9, 8, 23; t: 51, 19, 60; cpu: 110, 0, 104, 0. [00:00:11.894,714] bt_ctlr_hci: l: 10, 8, 23; t: 50, 19, 60; cpu: 110, 0, 104, 0. [00:00:14.414,367] bt_ctlr_hci: l: 9, 8, 23; t: 56, 19, 60; cpu: 110, 0, 104, 0. [00:00:14.474,548] bt_ctlr_hci: l: 9, 8, 23; t: 51, 19, 60; cpu: 110, 0, 104, 0. [00:00:17.964,569] bt_ctlr_hci: l: 9, 8, 23; t: 45, 19, 60; cpu: 110, 0, 104, 0. [00:00:18.014,709] bt_ctlr_hci: l: 9, 8, 23; t: 51, 19, 60; cpu: 110, 0, 104, 0.

cvinayak commented 1 month ago

Found a regression related to nRF51 encrypted connections, fix added as https://github.com/zephyrproject-rtos/zephyr/pull/74916/commits/7787fc7f7daee570005e23202d06c90c2701eab1

@thoh-ot or @erbr-ot is there in anyway possible to split up the implementation of ull_cp_rx() function?, on a CM0 16MHz nRF51 it is taking ~215 us compared to ~104 us used by legacy LLCP. In nRF51 due to limited interrupt levels, LLL and ULL need to be co-operative in the time available between Radio IRQs, and ideally would like the CPU use by LLCP to be limited to ~100 us.

erbr-ot commented 1 month ago

@cvinayak I'd say that it would be quite non-trivial to split up ull_cp_rx() Would be interesting to profile to understand more - for instance are there specific scenarios that push it past 200us?! Perhaps that would indicate a possible path for optim/re-factor

cvinayak commented 1 month ago

for instance are there specific scenarios that push it past 200us?!

@erbr-ot when there is both local and remote initiated control procedures active then the CPU use is the highest.

Its takes 230 us (150 + 80) between Rx PDU end to Tx Empty PDU end IRQs, there is then additional 150 us by which the radio would need to have its packet ptr assigned for the following reception of new PDU. That means, the Tx complete ISR can tolerate its execution completion by 380 us since the reception of first PDU that it will process.

Possibly running local and remote state machines mutually exclusive could reduce the CPU usage that would yield for the Radio ISR.

aescolar commented 1 month ago

@cvinayak, thanks for your work so far on this. Is there any update? Please note that rc3/hard freeze is planned for this Friday.

cvinayak commented 1 month ago

nRF51 cannot be supported upstream with the current CPU usages in the Controller's LLCP implementation.

Possibilities:

  1. Instead of currently used one IRQ priority by the Controller for nRF51, switch to using two IRQ priority levels; this will need that default IRQ priority value for nRF SoC peripherals be lowered to 2 (from current value of 1). This needs applications to migrate, and be restricted to only 2 of max. 4 IRQ priority levels in the cortex-m0 implementations in nRF51 SoCs.
  2. Drop support for boards using nRF51 (shame, nRF51 was truly resource constrained; low RAM, low flash, slow CPU speed, low on h/w IRQ priority levels and lack of unaligned memory access).

A fork with legacy LLCP implementation with lower CPU usage can be found here for those willing to support it for nRF51: https://github.com/zephyrproject-rtos/zephyr/pull/75197

aescolar commented 1 month ago

Brief of a voice talk with @cvinayak : Fixing this for the LTS does not seem possible (see just above https://github.com/zephyrproject-rtos/zephyr/issues/74345#issuecomment-2213575679 ). Therefore he suggests that for this 3.7LTS release we mention in the release notes that this is a known issue (nRF51 BT LE stack not working). Later an attempt can be done for the proposed workaround but it may not be possible to make it work in a sane manner. If that is the case, the issue would be just accepted and support for the nRF51 BT stack dropped in Zephyr.

Note that Nordic has recommended not making new products with the nRF51 for some years now, and this feature works fine in the previous LTS.