zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.48k stars 6.41k forks source link

Crash in ull.c when stressing periodic advertising sync (scanner side) #40317

Closed jakkra closed 2 years ago

jakkra commented 2 years ago

Describe the bug Having 10 periodic advertisers sending periodic advertisements at a "faster" rate triggers various asserts often:

ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/ull.c:1195

Edit: I saw another crash using this app: ASSERTION FAIL [e] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_sync.c:892

To Reproduce Steps to reproduce the behavior:

  1. Make sure to cherry-pick this commit (unless it's been merged) 06d7e47e58580b17e1d04ed94f0525209752cae8 (pull) otherwise the application will hang every now and then (you can still get the crash both with and without that, just easier with as it won't hang the test app)
  2. Build the attached code (pretty much samples/bluetooth/direction_finding_connectionless_tx) and samples/bluetooth/direction_finding_connectionless_rx. _Note that for some reason CONFIG_BT_CTLR_DF_PER_SCAN_CTE_NUM_MAX=1 in prj.conf does not stick, so when you compile you will get bss overflow. You need to manually go in and change to =1 in the generated .config in the build folder and build again. Not sure if it's a bug or something weird in the prj.conf, maybe you can check if that's a bug._
  3. Flash one device with rx code and 10+ with tx code (less may work, but will take longer).
  4. Wait for the receiver code to crash.

Expected behavior No crashes

Impact Moderate, it is related with periodic advertising only.

Logs and console output

Environment (please complete the following information):

Additional context Code to reproduce: reproduce.zip

ppryga-nordic commented 2 years ago

@jakkra since you have got running setup, can you add printk("rx type: %d\n", rx->type) before LL_ASSERT in ull.c:1195 and provide us output? That would lead to rx node type that is failing.

jakkra commented 2 years ago

Sure here it is.

(76257) Found device sending per adv: 2C:6B:CB:FF:EB:F7 (random)
(76262) Creating Periodic Advertising Sync...
success.
Sync index: 7
0       1       2       3       4       5       6       7       8       9
40      38      39      39      39      39      0       0       -       -
Free: 20, num_used_iq_bufs: 0, max_1s_used_iq_bufs: 3, max_used_total: 3
------------------------------------------------------------------------------------
Enable receiving of CTE...
---->rx type: 124
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/ull.c:1196

I cannot find 124 in the node_rx_type. Ran a few more times also got rx type: 156 and rx type: 252 so something seems corrupt Capture_crash .

cvinayak commented 2 years ago

@jakkra please check your testing alongwith changes in this PR: https://github.com/zephyrproject-rtos/zephyr/pull/40308

jakkra commented 2 years ago

Hi @cvinayak still getting at least this crash with that PR ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/ull.c:1195

cvinayak commented 2 years ago

@jakkra to understand, is the crash reproducible without direction finding? i.e. does stressing using periodic_adv and periodic_sync samples too lead to this assertion? (sorry, have not been able to allocate time to reproduce using your samples due to limited device availability, hope investigate in the coming days).

jakkra commented 2 years ago

It is with direction finding, I haven't not tried without df. My guess its reproduceable without df.

cvinayak commented 2 years ago

@jakkra I have simulated 10 periodic advertisers synchronized without DF and do not see any crash. Your sample and the simulation log is present in this draft PR: https://github.com/zephyrproject-rtos/zephyr/pull/40655

@ppryga Memory corruption in periodic synchronization is ruled out, please reproduce the same in DF scenario, I suspect probable memory corruption in allocation/release of rx/link buffers used for CTE reports.

jakkra commented 2 years ago

Hi @cvinayak thanks alot for looking into the issue and setting up a simulation for it. I have been running the modfications you made without DF and so far no crash, so it seems as it's caused by DF.

ppryga-nordic commented 2 years ago

Note that for some reason CONFIG_BT_CTLR_DF_PER_SCAN_CTE_NUM_MAX=1 in prj.conf does not stick, so when you compile you will get bss overflow. You need to manually go in and change to =1 in the generated .config in the build folder and build again. Not sure if it's a bug or something weird in the prj.conf, maybe you can check if that's a bug.

There is no issue with CONFIG_BT_CTLR_DF_PER_SCAN_CTE_NUM_MAX. @jakkra set it twice, first time in prj.conf and then (probably you didn't spot that) here: https://github.com/zephyrproject-rtos/zephyr/blob/fad78a2c078cb753c0d9ddc84b9abb0abee7581f/samples/bluetooth/direction_finding_connectionless_rx/boards/nrf52833dk_nrf52833.conf#L14

At the end of the day, CONFIG_BT_CTLR_DF_PER_SCAN_CTE_NUM_MAX=16. Times CONFIG_BT_PER_ADV_SYNC_MAX=10 and times two (two complete chains, one reported and one waiting for receive) gives 320. That is beyond of capacity of MFIFO used to store pointers to memory pool elements struct node_rx_iq_report: https://github.com/zephyrproject-rtos/zephyr/blob/fad78a2c078cb753c0d9ddc84b9abb0abee7581f/subsys/bluetooth/controller/ll_sw/ull_df.c#L72 Also extended number of bytes allocated for struct node_rx_iq_report consumes so much SRAM that .bss cannot fit in and linker complains about it.

I'm currently trying to reproduce the issue with use of limited number of DKs. As of now, I've got no luck about it.

jakkra commented 2 years ago

There is no issue with CONFIG_BT_CTLR_DF_PER_SCAN_CTE_NUM_MAX. @jakkra set it twice, first time in prj.conf and then (probably you didn't spot that) here:

Ah I see, missed that, maybe could be moved from direction_finding_connectionless_rx/boards/nrf52833dk_nrf52833.conf to the prj.conf to avoid confusion like this.

Anyway, how many DKs have you got? I think I can reproduce with like 6-7, but I can confirm if you let me know how many you have. Also try resetting the receiver as it seems to be a problem when syncing is started.

ppryga-nordic commented 2 years ago

@jakkra

Ah I see, missed that, maybe could be moved from direction_finding_connectionless_rx/boards/nrf52833dk_nrf52833.conf to the prj.conf to avoid confusion like this.

The controller configuration is deliberately stored in boards/*.conf because for nRF53 it will be part of child image flashed to NET core. In such case it will cause problem with build and flash of the application that goes for APP core.

I'd be thankful if you check the lowest required number of DKs required. I have 4nRF52 but got two nRF53 so can try to use them also.

jakkra commented 2 years ago

@ppryga tried with 4 tags, got it once, tried with 6, got it a bit more often. So I'd say try to get 6 tags, it's easier the more tags you have defenately, with 10 it's quite easy to get the asserts. As issue seems to happen after Enable receiving of CTE... I think the key is to loose sync with devices and re-sync again as this happens quite a bit when you run with 10 tags. So try resetting the tags whenever receiver manages to sync with all.

Re-ran on latest main branch also now when testing, some assert logs:

(267548) Found device sending per adv: 28:D4:EA:2A:6D:B8 (random)
(267554) Creating Periodic Advertising Sync...
success.
Sync index: 3
0       1       2       3       4       5       6       7       8       9
40      40      39      0       0       -       -       -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
36      39      40      0       0       -       -       -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
38      40      40      0       0       -       -       -       -       -
------------------------------------------------------------------------------------
Enable receiving of CTE...
ASSERTION FAIL [rx->type != NODE_RX_TYPE_NONE] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/ull.c:1221
[00:04:31.071,411] <err> os: r0/a1:  0x00000003  r1/a2:  0x20000448  r2/a3:  0x00000000
[00:04:31.071,441] <err> os: r3/a4:  0x200064de r12/ip:  0x00000000 r14/lr:  0x0000955d
[00:04:31.071,441] <err> os:  xpsr:  0x61000000
[00:04:31.071,441] <err> os: EXC_RETURN: 0x0
[00:04:31.071,441] <err> os: Faulting instruction address (r15/pc): 0x00009568
[00:04:31.071,441] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:04:31.071,472] <err> os: Current thread: 0x20000988 (unknown)
[00:04:31.921,539] <err> os: Halting system

And another:

(225469) sync_create_timeout for 3E:05:8E:DF:8A:08 (random)
Sync deleted
(225708) Found device sending per adv: 3E:05:8E:DF:8A:08 (random)
(225713) Creating Periodic Advertising Sync...
success.
Sync index: 5
0       1       2       3       4       5       6       7       8       9
29      39      39      41      0       0       -       -       -       -
------------------------------------------------------------------------------------
Enable receiving of CTE...
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/ull.c:1225
[00:03:46.927,185] <err> os: r0/a1:  0x00000003  r1/a2:  0x20000448  r2/a3:  0x00000000
[00:03:46.927,215] <err> os: r3/a4:  0x200064de r12/ip:  0x00000000 r14/lr:  0x00009579
[00:03:46.927,215] <err> os:  xpsr:  0x61000000
[00:03:46.927,215] <err> os: EXC_RETURN: 0x0
[00:03:46.927,215] <err> os: Faulting instruction address (r15/pc): 0x00009584
[00:03:46.927,215] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:03:46.927,246] <err> os: Current thread: 0x20000988 (unknown)
[00:03:47.776,733] <err> os: Halting system

Another:

Enable receiving of CTE...
(78856) success. CTE receive enabled.
[00:01:18.981,079] <err> os: ***** MPU FAULT *****
[00:01:18.981,109] <err> os:   Data Access Violation
[00:01:18.981,109] <err> os:   MMFAR Address: 0x9a
[00:01:18.981,109] <err> os: r0/a1:  0x00000000  r1/a2:  0x20001580  r2/a3:  0x00000000
[00:01:18.981,109] <err> os: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x000096f9
[00:01:18.981,140] <err> os:  xpsr:  0x61000000
[00:01:18.981,140] <err> os: r4/v1:  0x00000000  r5/v2:  0x00000000  r6/v3:  0x0000003e
[00:01:18.981,140] <err> os: r7/v4:  0x20006964  r8/v5:  0x20002cfc  r9/v6:  0x00000000
[00:01:18.981,140] <err> os: r10/v7: 0x00000000  r11/v8: 0x00000000    psp:  0x20006920
[00:01:18.981,140] <err> os: EXC_RETURN: 0xfffffffd
[00:01:18.981,170] <err> os: Faulting instruction address (r15/pc): 0x0000aa24
[00:01:18.981,170] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:01:18.981,170] <err> os: Current thread: 0x20000a08 (unknown)
[00:01:20.378,204] <err> os: Halting system

Another thing I noted now was this new print randomly printed that I have not seen before: [00:00:06.741,851] <err> bt_scan: Unexpected per adv sync established event

cvinayak commented 2 years ago

@jakkra I have made a consolidated PR of all the open changes to the controller, please test this PR with your application: https://github.com/zephyrproject-rtos/zephyr/pull/41091

jakkra commented 2 years ago

Hi @cvinayak and @ppryga I have started now to test it out, I'm not done testing but here are some feedback so far. First I'd like to say it works way better, so that is great. So far I have gotten the two following crashes. I attached the application I get the crashes in, but it's pretty much the same as I have provided earlier in this ticket but I may have made som minor changes so here it is again: aoa_receiver_multiple.zip

Also note there is a log: <wrn> bt_hci_core: opcode 0x2045 status 0x0c which sometimes happends, I have not looked into it, but I think it is an error in the application, I will check that.

1st crash:

(1179382) sync_create_timeout for E3:26:55:1E:0F:2D (random)
Sync deleted
0       1       2       3       4       5       6       7       8       9
41      39      38      21      -       39      38      -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
38      40      38      21      -       39      42      -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
41      39      36      15      -       40      40      -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
37      40      39      19      -       40      40      -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
39      38      38      24      -       39      38      -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
39      39      39      19      -       40      38      -       -       -
------------------------------------------------------------------------------------
Scan is running...
0       1       2       3       4       5       6       7       8       9
40      39      39      21      -       39      40      -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
39      40      40      18      -       41      38      -       -       -
------------------------------------------------------------------------------------
(1187604) Found device sending per adv: E3:26:55:1E:0F:2D (random)
(1187611) Creating Periodic Advertising Sync...
suASSERTION FAIL [node_rx] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_df.c:231
[00:19:47.624,755] <err> os: r0/a1:  0x00000003  r1/a2:  0x20000578  r2/a3:  0x00000000
[00:19:47.624,786] <err> os: r3/a4:  0x2000767e r12/ip:  0x00000000 r14/lr:  0x0000f58d
[00:19:47.624,786] <err> os:  xpsr:  0x61000028
[00:19:47.624,786] <err> os: EXC_RETURN: 0x0
[00:19:47.624,786] <err> os: Faulting instruction address (r15/pc): 0x0000f598
[00:19:47.624,786] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:19:47.624,816] <err> os: Fault during interrupt handling

[00:19:47.624,816] <err> os: Current thread: 0x20000ba0 (unknown)
[00:19:47.702,941] <err> os: Halting system
  1. seconds crash
    
    term_cb: clear device index: 7
    0       1       2       3       4       5       6       7       8       9
    40      -       40      39      39      38      25      -       -       -
    ------------------------------------------------------------------------------------
    0       1       2       3       4       5       6       7       8       9
    39      -       40      40      40      37      29      -       -       -
    ------------------------------------------------------------------------------------
    0       1       2       3       4       5       6       7       8       9
    39      -       40      38      40      40      32      -       -       -
    ------------------------------------------------------------------------------------
    0       1       2       3       4       5       6       7       8       9
    39      -       38      38      38      39      28      -       -       -
    ------------------------------------------------------------------------------------
    (106199) Found device sending per adv: CB:23:0B:96:BC:A9 (random)
    (106205) Creating Periodic Advertising Sync...
    success.
    Sync index: 1
    Scan is running...
    0       1       2       3       4       5       6       7       8       9
    40      sync    33      41      32      32      31      -       -       -
    ------------------------------------------------------------------------------------
    0       1       2       3       4       5       6       7       8       9
    40      sync    31      33      33      30      26      -       -       -
    ------------------------------------------------------------------------------------
    ASSERTION FAIL [(ret == 0) || (ret == 2)] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:723
    [00:01:47.747,772] <err> os: r0/a1:  0x00000003  r1/a2:  0x20000578  r2/a3:  0x00000000
    [00:01:47.747,802] <err> os: r3/a4:  0x200076ae r12/ip:  0x00000000 r14/lr:  0x0000d0cf
    [00:01:47.747,802] <err> os:  xpsr:  0x61000011
    [00:01:47.747,802] <err> os: EXC_RETURN: 0x0
    [00:01:47.747,802] <err> os: Faulting instruction address (r15/pc): 0x0000d0da
    [00:01:47.747,802] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
    [00:01:47.747,833] <err> os: Fault during interrupt handling

[00:01:47.747,833] os: Current thread: 0x20000c58 (unknown) [00:01:48.686,370] os: Halting system

jakkra commented 2 years ago

@cvinayak @ppryga Just realised new commits was pushed yesterday that I did not test with. So don't dig too deep into the above crashes. Will come back with more input after I test with the latest commits.

cvinayak commented 2 years ago

@jakkra Thank you once again for a quick response. In my PR, buffer leaks and bugs related to conformance testing were covered.

It appears the newest crashes in lll_df.c and lll.c is probably not covered in my PR. I will continue my investigations based on these new crashes.

ppryga-nordic commented 2 years ago

ASSERTION FAIL [node_rx] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_df.c:231 This assertion is caused by lack of free node_rx_iq_report when attempt to enable reception of CTE. This may be caused by one of following:

cvinayak commented 2 years ago

@ppryga Fix for the lll_df.c assertion on node_rx would be to add a ull_df_iq_report_alloc_peek(2) check here: https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_sync.c#L944-L950

so that there is always one spare node_rx for the lll_df_conf_cte_rx_enable() to not fail.

We are close to a stable implementation :-)

ppryga-nordic commented 2 years ago

@cvinayak yes, indeed. There is more to be done there. That is what I'm going to do when I start my task with memory usage optimization. This assert will be fixed then also.

cvinayak commented 2 years ago

Hi @cvinayak @ppryga so far running with latest commits I only so far manage to get this crash:

@jakkra did you forget to paste something here ^^

jakkra commented 2 years ago

Hi @cvinayak and @ppryga I have now tested more see below for details:

@jakkra did you forget to paste something here ^^

Yes I somehow accadently pressed comment to early :)

1. Happens every ~20 min

ASSERTION FAIL [next] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:648
[00:03:04.679,046] <err> os: r0/a1:  0x00000003  r1/a2:  0x20000578  r2/a3:  0x00000000
[00:03:04.679,046] <err> os: r3/a4:  0x200076a6 r12/ip:  0x00000000 r14/lr:  0x0000d0f3
[00:03:04.679,046] <err> os:  xpsr:  0x61000028
[00:03:04.679,077] <err> os: EXC_RETURN: 0x0
[00:03:04.679,077] <err> os: Faulting instruction address (r15/pc): 0x0000d0fe
[00:03:04.679,077] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:03:04.679,077] <err> os: Fault during interrupt handling

2. This assert only happend once so far:

------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
39      38      41      40      40      sync... -       -       -       -
------------------------------------------------------------------------------------
Enable receiving of CTE...
(244454) success. CTE receive enabled.
0       1       2       3       4       5       6       7       8       9
39      37      37      39      39      37      -       -       -       -
------------------------------------------------------------------------------------
(245464) Found device sending per adv: D8:06:6D:59:68:72 (random)
(245470) Creating Periodic Advertising Sync...
success.
Sync index: 6
ASSERTION FAIL [e] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_sync.c:925
[00:04:05.900,604] <err> os: r0/a1:  0x00000003  r1/a2:  0x20000578  r2/a3:  0x00000000
[00:04:05.900,634] <err> os: r3/a4:  0x200076fe r12/ip:  0x00000000 r14/lr:  0x0000ebdf
[00:04:05.900,634] <err> os:  xpsr:  0x61000011
[00:04:05.900,634] <err> os: EXC_RETURN: 0x0
[00:04:05.900,634] <err> os: Faulting instruction address (r15/pc): 0x0000ebea
[00:04:05.900,634] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:04:05.900,665] <err> os: Fault during interrupt handling

[00:04:05.900,665] <err> os: Current thread: 0x20000c58 (unknown)
[00:04:06.839,202] <err> os: Halting system

3. A new behaviour: the sync term_cb is sometimes called before my "create sync timeout" hits with error code 62 (0x3E) It comes from here: https://github.com/cvinayak/zephyr/blob/github_periodic_chain_fix/subsys/bluetooth/host/scan.c#L741 This could be expected behaviour, but wanted to mention it anyway. This is also the reason for <wrn> bt_hci_core: opcode 0x2045 status 0x0c I mentioned earlier. I now uppdated test application so that it handles term_cb called for a none synced device and this warn message is not output anymore. Attached updates application. Log:

(534745) Found device sending per adv: 15:58:F0:CD:7B:96 (random)
(534751) Creating Periodic Advertising Sync...
success.
Sync index: 4
0       1       2       3       4       5       6       7       8       9
40      38      39      40      sync... 37      -       -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
38      40      38      37      sync... 41      -       -       -       -
------------------------------------------------------------------------------------
------->Device not synced cancel sync tmo:<--------
term_cb: clear device index: 4, reason: 62
0       1       2       3       4       5       6       7       8       9
36      38      39      38      -       37      -       -       -       -
------------------------------------------------------------------------------------
(537371) Found device sending per adv: 15:58:F0:CD:7B:96 (random)
(537377) Creating Periodic Advertising Sync...
success.
Sync index: 4
------->Device not synced cancel sync tmo:<--------
term_cb: clear device index: 4, reason: 62
0       1       2       3       4       5       6       7       8       9
36      39      36      39      -       39      -       -       -       -
------------------------------------------------------------------------------------

4. I also sometimes see this crash:

------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
38      39      40      39      39      28      sync... -       -       -
------------------------------------------------------------------------------------
Enable receiving of CTE...
[00:00:36.092,926] <err> os: ***** MPU FAULT *****
[00:00:36.092,956] <err> os:   Data Access Violation
[00:00:36.092,956] <err> os:   MMFAR Address: 0x11
[00:00:36.092,956] <err> os: r0/a1:  0x00000000  r1/a2:  0x2d000041  r2/a3:  0x00000001
[00:00:36.092,956] <err> os: r3/a4:  0x2d000041 r12/ip:  0x00000000 r14/lr:  0x0001982b
[00:00:36.092,987] <err> os:  xpsr:  0x01000000
[00:00:36.092,987] <err> os: r4/v1:  0x20000ab8  r5/v2:  0x00000000  r6/v3:  0x00000020
[00:00:36.092,987] <err> os: r7/v4:  0x20008740  r8/v5:  0x00000000  r9/v6:  0x00000000
[00:00:36.092,987] <err> os: r10/v7: 0x00000000  r11/v8: 0x00000000    psp:  0x200068b0
[00:00:36.092,987] <err> os: EXC_RETURN: 0xfffffffd
[00:00:36.093,017] <err> os: Faulting instruction address (r15/pc): 0x0001983c
[00:00:36.093,017] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:36.093,017] <err> os: Current thread: 0x20000ba0 (unknown)
[00:00:37.565,795] <err> os: Halting system
  1. A question This is not something new, but I have a question regarding the bt_le_per_adv_sync_param skip parameter: In the BT spec is states this:

The Skip parameter specifies the maximum number of consecutive periodic advertising events that the receiver may skip after successfully receiving a periodic advertising packet.

The behaviour of the controller for me is that it will always skip the skip count of periodic advertising events. Which probably is not invalid as the spec. says "may". But how I interpret it and how I think would make most sense is that based on how "busy" the radio is it may skip some events, and if the radio is not doing anything, then it should not skip any events. Do you know if this is the idea of it, or that the current implementation to always skip the skip count is correct?

Updated application: aoa_receiver_multiple.zip

More updates to come, I will now start testing with our own application, so far I have mostly tested with the more basic application I have attached to this ticket.

Thanks alot!

cvinayak commented 2 years ago

@jakkra

Asserts 1 and 2: These occur when the lower link layer radio event scheduling pipeline overflows. Would happen if many radio events are getting scheduled too quickly, will need to review the time reservations used.

For 3. Yes, term_cb will be called when sync failed to be established. If there are no buffers to receive a radio PDU during sync create, and 6 periodic intervals have passed, this callback will be raised. This is Controller specific implementation, other option was to infinitely wait until free rx buffer availability to generate the sync established.

For 4. Please use arm-none-eabi-addr2line -e zephyr/zephyr.elf 0x0001983c to get the file and line number of the faulting instruction address, this will help me narrow the cause.

For 5.

Which probably is not invalid as the spec. says "may".

may does not mean a violation, as reports can be missed for other reasons (like incorrect reception). The implementation decision was to reduce power consumption by not scheduling to listen to radio events that is permitted to be skipped. An application always has the choice to not use 'skip` value to try receiving without skipping.

Thank you for reporting and helping debug the issues.

cvinayak commented 2 years ago

More updates to come, I will now start testing with our own application, so far I have mostly tested with the more basic application I have attached to this ticket.

@jakkra I have fixed an incorrect buffer usage, please use latest changes in the PR https://github.com/zephyrproject-rtos/zephyr/pull/41091 for your testing

jakkra commented 2 years ago

Hi @cvinayak thanks for the update and answered quiestions! Makes sense. I'm now using your latest updates. Adding this crash to the list and I have also attached the zephyr.elf so that you can decode if I missed to decode something: zephyr.zip (elf is inside the zip as GitHub does not let me upload .elf)

r15/pc: 0x0001583e: zephyr-main/zephyrproject/zephyr/lib/libc/minimal/source/string/string.c:338
MMFAR Address: 0x3b35: zephyr-main/zephyrproject/zephyr/arch/arm/core/aarch32/isr_wrapper.S:50
r14/lr: 0x00015deb: zephyr-main/zephyrproject/zephyr/subsys/bluetooth/controller/util/mem.c:82
r8/v5: 0x20002de0: zephyr-main/zephyrproject/zephyr/subsys/bluetooth/controller/ll_sw/ull_scan.c:72
Current thread: 0x20000ba0: zephyr-main/zephyrproject/zephyr/subsys/bluetooth/controller/hci/hci_driver.c:72
(844976) Found device sending per adv: 26:BB:A5:D9:5F:C4 (random)
(844977) Creating Periodic Advertising Sync...
success.
Sync index: 4
0       1       2       3       4       5       6       7       8       9
38      37      35      37      sync... -       -       -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
37      36      36      38      sync... -       -       -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
39      35      39      38      sync... -       -       -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
37      37      38      36      sync... -       -       -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
37      35      37      37      sync... -       -       -       -       -
------------------------------------------------------------------------------------
(849979) sync_create_timeout for 26:BB:A5:D9:5F:C4 (random)
Sync deleted
0       1       2       3       4       5       6       7       8       9
36      36      37      36      -       -       -       -       -       -
------------------------------------------------------------------------------------
(850774) Found device sending per adv: 26:BB:A5:D9:5F:C4 (random)
(850775) Creating Periodic Advertising Sync...
success.
Sync index: 4
0       1       2       3       4       5       6       7       8       9
39      38      37      36      sync... -       -       -       -       -
------------------------------------------------------------------------------------
___Device not synced cancel sync tmo:
term_cb: clear device index: 4
0       1       2       3       4       5       6       7       8       9
39      39      34      36      -       -       -       -       -       -
------------------------------------------------------------------------------------
(853399) Found device sending per adv: 26:BB:A5:D9:5F:C4 (random)
(853400) Creating Periodic Advertising Sync...
success.
Sync index: 4
0       1       2       3       4       5       6       7       8       9
33      39      38      39      sync... -       -       -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
37      38      38      36      sync... -       -       -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
36      36      36      39      sync... -       -       -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
36      33      37      37      sync... -       -       -       -       -
------------------------------------------------------------------------------------
0       1       2       3       4       5       6       7       8       9
37      36      35      35      sync... -       -       -       -       -
------------------------------------------------------------------------------------
(858402) sync_create_timeout for 26:BB:A5:D9:5F:C4 (random)
Failed cancel sync: -5
[00:14:18.404,083] <wrn> bt_hci_core: opcode 0x2045 status 0x0c
[00:14:18.460,723] <err> os: ***** MPU FAULT *****
[00:14:18.460,723] <err> os:   Data Access Violation
[00:14:18.460,723] <err> os:   MMFAR Address: 0x3b35
[00:14:18.460,723] <err> os: r0/a1:  0x00003b35  r1/a2:  0x2000149d  r2/a3:  0x00000004
[00:14:18.460,723] <err> os: r3/a4:  0x00003b35 r12/ip:  0x00000000 r14/lr:  0x00015deb
[00:14:18.460,754] <err> os:  xpsr:  0x21000200
[00:14:18.460,754] <err> os: r4/v1:  0x00000068  r5/v2:  0x00003b35  r6/v3:  0x00000010
[00:14:18.460,754] <err> os: r7/v4:  0x20006aa4  r8/v5:  0x20002de0  r9/v6:  0x00000000
[00:14:18.460,754] <err> os: r10/v7: 0x00000000  r11/v8: 0x00000000    psp:  0x20006a48
[00:14:18.460,784] <err> os: EXC_RETURN: 0xfffffffd
[00:14:18.460,784] <err> os: Faulting instruction address (r15/pc): 0x0001583e
[00:14:18.460,784] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:14:18.460,784] <err> os: Current thread: 0x20000ba0 (unknown)
[00:14:19.881,805] <err> os: Halting system
jakkra commented 2 years ago

Ok realised I did not have CONFIG_ASSERT=y on in my test application so I got some new asserts which I saw when testing the more advanced application, however can be reproduced with the sample I attached earlier, just set CONFIG_ASSERT=y

ASSERTRTION FAIL [index >= 0 && ((long) (((int) sizeof(char[1 - 2 * !(!__builtin_types_compatible_p(__typeof__(per_adv_sync_pool), __typeof__(&(per_adv_sync_pool)[0])))]) - 1) + (sizeof(per_adv_sync_pool) / sizeof((per_adv_sync_pool)[0])))) > index] @ WEST_TOPDIR/zephyr/subsys/bluetooth/host/scan.c:1114
        Invalid per_adv_sync pointer

You also need CONFIG_BT_BUF_ACL_RX_SIZE=300 otherwise you get this assert also: ASSERTION FAIL [net_buf_simple_tailroom(buf) >= len] @ WEST_TOPDIR/zephyr/subsys/net/buf.c:808

So here: https://github.com/cvinayak/zephyr/blob/44bf63e2bf39f4ac7342d06fc120b6bd044839e8/subsys/bluetooth/host/scan.c#L1114

It seems as the IQ reports are currupt quite often and not only when it asserts, for example I added a simple check to verify that the sample count IQs always is 82 (should always be that for the config set), but every ~10s I see 10 or 255 as IQ sample_count. @ppryga @cvinayak this crash is not new since patches by @cvinayak, I still have them on main branch. crash_corrupt

cvinayak commented 2 years ago

@jakkra The PR #41091 is updated (force pushed), please pull the latest set of commits in that PR. Two fixes added, increase event done count and to ensure auxiliary context parent field is reset to NULL.

I believe the fixes should mitigate the lll.c and lll_syn.c assertions (points 1 and 2 here https://github.com/zephyrproject-rtos/zephyr/issues/40317#issuecomment-996675933). Please test and revert back. Thank you.

jakkra commented 2 years ago

@cvinayak I have now been running 3 instances for a few hours and it seems as the ll.c and ll_sync.c crashes are resolved. To summarize the remaining crashes:

  1. Crashes related to currupt/invalid IQ reports: https://github.com/zephyrproject-rtos/zephyr/issues/40317#issuecomment-997927667

  2. 
    <wrn> bt_hci_core: opcode 0x2045 status 0x0c
    [00:28:24.319,091] <wrn> u_df_scan: Failed cancel sync: -5

[00:28:24.369,598] u_df_scan: PER_ADV_SYNC[0]: [DEVICE]: 13:FD:EE:FA:80:DF (random) sync terminated index 62

[00:28:24.500,854] u_df_scan: Create new sync, use syncedDevice_t i: 0

[00:28:24.501,007] os: BUS FAULT [00:28:24.501,007] os: Imprecise data bus error [00:28:24.501,037] os: r0/a1: 0x0000eae9 r1/a2: 0x0bb672bc r2/a3: 0x0bb672b5 [00:28:24.501,037] os: r3/a4: 0x20001b4c r12/ip: 0x20003504 r14/lr: 0x00016607 [00:28:24.501,037] os: xpsr: 0x2100f000

[00:28:24.501,129] os: fpscr: 0x00000000 [00:28:24.501,129] os: r4/v1: 0x00000000 r5/v2: 0x20003524 r6/v3: 0x20017651 [00:28:24.501,129] os: r7/v4: 0x20003e78 r8/v5: 0x00000000 r9/v6: 0x00000001 [00:28:24.501,159] os: r10/v7: 0x0000000a r11/v8: 0x20003608 psp: 0x20008498 [00:28:24.501,159] os: EXC_RETURN: 0xfffffffd [00:28:24.501,159] os: Faulting instruction address (r15/pc): 0x00032c82 [00:28:24.501,190] os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0 [00:28:24.501,190] os: Current thread: 0x20000938 (unknown) [00:28:24.720,642] os: Halting system

(r15/pc): 0x00032c82: zephyrproject/zephyr/subsys/bluetooth/controller/util/mem.c:66 r14/lr: 0x00016607: zephyrproject/zephyr/subsys/bluetooth/controller/ll_sw/ull_sync.c:121 Current thread: 0x20000938: zephyrproject/zephyr/subsys/bluetooth/host/hci_core.c:68

3.  https://github.com/zephyrproject-rtos/zephyr/issues/40317#issuecomment-997732185

0 1 2 3 4 5 6 7 8 9 37 35 sync... 37 38 - - - - -

(1467431) sync_create_timeout for 2E:6D:A0:74:BF:F1 (random) sync_create_timeout: Failed cancel sync: -5 [00:24:27.437,103] bt_hci_core: opcode 0x2045 status 0x0c [00:24:27.478,546] os: MPU FAULT [00:24:27.478,546] os: Data Access Violation [00:24:27.478,546] os: MMFAR Address: 0x4699 [00:24:27.478,576] os: r0/a1: 0x00004699 r1/a2: 0x20001505 r2/a3: 0x00000004 [00:24:27.478,576] os: r3/a4: 0x00004699 r12/ip: 0x00000000 r14/lr: 0x0001aea5 [00:24:27.478,576] os: xpsr: 0x21000200 [00:24:27.478,576] os: r4/v1: 0x00000018 r5/v2: 0x00004699 r6/v3: 0x00000010 [00:24:27.478,607] os: r7/v4: 0x20006b44 r8/v5: 0x20002e48 r9/v6: 0x00000000 [00:24:27.478,607] os: r10/v7: 0x00000000 r11/v8: 0x00000000 psp: 0x20006ae8 [00:24:27.478,637] os: EXC_RETURN: 0xfffffffd [00:24:27.478,637] os: Faulting instruction address (r15/pc): 0x0001a928 [00:24:27.478,637] os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0 [00:24:27.478,637] os: Current thread: 0x20000c00 (unknown) [00:24:28.983,001] os: Halting system

MMFAR Address: 0x4699: zephyrproject/zephyr/arch/arm/core/aarch32/isr_wrapper.S:50 (r15/pc): 0x0001a928: zephyrproject/zephyr/lib/libc/minimal/source/string/string.c:338 Current thread: 0x20000c00: zephyrproject/zephyr/subsys/bluetooth/controller/hci/hci_driver.c:72 r14/lr: 0x0001aea5: zephyrproject/zephyr/subsys/bluetooth/controller/util/mem.c:82

cvinayak commented 2 years ago

Thank you @jakkra

I will work towards analyzing/reproducing the remain crashes (I do not have DF setup, hence will have to see if just modified periodic_adv and periodic_sync can reproduce them).

cvinayak commented 2 years ago

@jakkra I believe issue 1 here https://github.com/zephyrproject-rtos/zephyr/issues/40317#issuecomment-997927667 is solved by using CONFIG_BT_BUF_ACL_RX_SIZE=300, right?

jakkra commented 2 years ago

Hi, @cvinayak you don't need any special setup to reproduce using DF. You can run it on any HW pretty much, like a few nRF52833 DKs. Obviously you won't have an antenna array connected, so that the IQ data you get won't make sense, but you still get them, that is what matters only. All issues can reproduced using the aoa_receiver_multiple.zip I attached to this comment: https://github.com/zephyrproject-rtos/zephyr/issues/40317#issuecomment-996675933 (remember to turn on CONFIG_ASSERT=y in prj.conf)

@jakkra I believe issue 1 here #40317 (comment) is solved by using CONFIG_BT_BUF_ACL_RX_SIZE=300, right?

The real issue in 1. https://github.com/zephyrproject-rtos/zephyr/issues/40317#issuecomment-997927667 is below, not the CONFIG_BT_BUF_ACL_RX_SIZE crash, that is fine.

ASSERTRTION FAIL [index >= 0 && ((long) (((int) sizeof(char[1 - 2 * !(!__builtin_types_compatible_p(__typeof__(per_adv_sync_pool), __typeof__(&(per_adv_sync_pool)[0])))]) - 1) + (sizeof(per_adv_sync_pool) / sizeof((per_adv_sync_pool)[0])))) > index] @ WEST_TOPDIR/zephyr/subsys/bluetooth/host/scan.c:1114
        Invalid per_adv_sync pointer

(ASSERT is caused by IQ reports that are corrupt).

cvinayak commented 2 years ago

@jakkra ok. I have 2 nRF52833 DK and 2 nRF5340 DK. Will use them next week (FYI, I will be away until 28th Dec) and respond here with my findings then.

I want to address any issues related to memory corruption i.e. issue 2 and 3 before 1 noted here https://github.com/zephyrproject-rtos/zephyr/issues/40317#issuecomment-998672279

I will sync up with @ppryga next week on issue with iq report. (I suspect memory corruption could be a common cause).

jakkra commented 2 years ago

@cvinayak Great, that will be enough, flash 2 nRF5340 and 1 nRF52833 with DF transmitter and 1 nRF52833 DK with receiver and you should be able to reproduce the issues. FYI I'll also be out of office until Jan 3, so won't have any feedback until then. Thank you!

saleh-unikie commented 2 years ago

Hi all

I was working on DF samples and I encountered a problem that I think is relevant to the problem reported here, similarly I see "Assertion failure" at ull.c, and the behavior is random. Also sometimes assertion failure is seen in another files, ull_scan_aux.c for example, but mostly it raises from ull.c file. BTW, my problem has happened in another scenario as it has been described here: https://github.com/zephyrproject-rtos/zephyr/issues/41465

cvinayak commented 2 years ago

@jakkra and @saleh-unikie Wish you a very good New Year.

There has been some updates and rebasing of the PR #41091, I believe to have fixed a corruption in a receive queue that caused internal memory leak of contexts used in reception of auxliary PDUs.

(I also believe there is one another race condition causing an assert in ull_scan.c)

Hope you will be able to test your application, and revert back with your observations, on top of this commit in the PR branch: feee1f6b3662f6750f385196b1a741af07f44924

Thanks once again on following up and supporting towards resolving the issues.

saleh-unikie commented 2 years ago

Thanks @cvinayak , I hope a good year, full of happiness for you too.

I've tested and "ASSERTION FAIL" problem is not seen anymore, but still there is another issue which I've explained it some days ago, you can find it here: https://github.com/zephyrproject-rtos/zephyr/issues/41465#issuecomment-1001943926

In summary, when BT_LE_PER_ADV_SYNC_OPT_DONT_SYNC_AOA or BT_LE_PER_ADV_SYNC_OPT_DONT_SYNC_AOD_1US or BT_LE_PER_ADV_SYNC_OPT_DONT_SYNC_AOD_2US (or any combination of them) options has been used, then the receiver (AoA_Locator or AoD_Mobile) doesn't work properly if 1) more than one transmitter is in the environment 2) different types of transmitter are available (e.g. one AoA_Tag + one AoD_Beacon)

cvinayak commented 2 years ago

@saleh-unikie Thanks for the quick testing and quick response on confirming that assertion issue is resolved by using the work in PR #41091

I have been running some long duration tests and discovered couple more development assertions related to race when terminate is initiated by application. I have not pushed the latest changes yet, will do next week and I would like you and @jakkra to do respective testing and revert back if he still is facing issues.

I have not started analysis of #41465 yet, but hopefully I should be able to start on it in week 3.

jakkra commented 2 years ago

Hi @cvinayak wish you a good new year also!

I have today started testing out your updates, I'll continue rest of the day but so far I have the following observations: I'm using commit 8a815484e813425646181448e287de7638a3eaa7 from your PR.

  1. On transmitter side I hade to change bt_df_adv_cte_tx_param: .cte_type: BT_HCI_LE_AOA_CTE => BT_DF_CTE_TYPE_AOA
  2. Receiver is not finding any transmitters if the transmitter have set extended adv. data. If I remove the bt_le_ext_adv_set_data from my transmitter code then it work. Maybe unrelated to your changes and some config changed in the upstream zephyr repo? I searched through kconfig but didn't find anything related so far.
    
    bt_le_ext_adv_create...

printk("Set ext adv data..."); err = bt_le_ext_adv_set_data(adv_set, ad, ARRAY_SIZE(ad), NULL, 0); if (err) { printk("Failed setting ext adv data: %d\n", err); } printk("success\n");

bt_df_set_adv_cte_tx_param ...

3. This assert I got after a few minutes.

(65003) Found device sending per adv: 0D:65:67:83:4D:4A (random) (65010) Creating Periodic Advertising Sync... success. Sync index: 0 ___Device not synced cancel sync tmo info->reason: 62 term_cb: clear device index: 0 0 1 2 3 4 5 6 7 8 9

[00:01:06.124,694] os: Current thread: 0x20000cb8 (unknown) [00:01:07.063,415] os: Halting system

4. Same with this one

(124810) Found device sending per adv: 38:F9:D6:61:48:66 (random) (124816) Creating Periodic Advertising Sync... success. Sync index: 6 0 1 2 3 4 5 6 7 8 9 16 49 47 49 47 24 sync... - - -

0 1 2 3 4 5 6 7 8 9 8 42 46 43 46 26 sync... - - -

0 1 2 3 4 5 6 7 8 9 8 48 46 47 43 16 sync... - - -

0 1 2 3 4 5 6 7 8 9 8 44 47 45 45 8 sync... - - -

0 1 2 3 4 5 6 7 8 9 15 46 47 45 45 24 sync... - - -

ASSERTION FAIL [next] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:648 [00:02:09.442,504] os: r0/a1: 0x00000003 r1/a2: 0x2000059c r2/a3: 0x00000000 [00:02:09.442,504] os: r3/a4: 0x20007846 r12/ip: 0x00000000 r14/lr: 0x0000e73b [00:02:09.442,535] os: xpsr: 0x61000028 [00:02:09.442,535] os: EXC_RETURN: 0x0 [00:02:09.442,535] os: Faulting instruction address (r15/pc): 0x0000e746 [00:02:09.442,535] os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0 [00:02:09.442,565] os: Fault during interrupt handling

[00:02:09.442,565] os: Current thread: 0x20000cb8 (unknown) [00:02:10.381,195] os: Halting system

cvinayak commented 2 years ago

@jakkra Thank you for testing.

1 and 2 could be upstream changes.

3 and 4 is something worrying me. I suspect it to be outside the influence by periodic sync implementation and relates with #40916. I will continue the investigation and revert back to you when I have a definite fix.

jakkra commented 2 years ago

Hi @cvinayak Thanks for continuing working on those issues. I'll see if I can figure out the reason for 1 and 2. In the mean time I have collected some data and made some experiments with periodic advertising since it now is quite stable using your branch! I have data and plots using 8 tags and different periodic adv. intervals (many iterations, different per_adv_int, random offset between tags etc.) For example I see about 50% failure to sync all 8 devices within 60s using per_adv_int 100ms. So I have some questions about this, what to expect, what you guys have tested etc. Would you like me to take it here in this issue, or create a new one?

FYI this is what I get alot when it fails to sync.

(75536) Found device sending per adv: 09:29:EA:68:BB:ED (random)
(75542) Creating Periodic Advertising Sync...success.Sync index: 7
(76217) Device not synced per_sync_term_cb, info->reason: 62, term_cb: clear device index: 7
jakkra commented 2 years ago

Hi again @cvinayak , I found another thing, it does not seem to be a big issue, as it seems to recover itself, but still looks like not expected behaviour. I sometime see this in the logs: Check my comments starting with <------

TAGS:1,0,1,0,0,1,1,-,-,-,'
['1', '0', '1', '0', '0', '1', '1', '-', '-', '-']
Num synced: 7

Device was synced: term_cb: clear device index: 1
(409715) Found device sending per adv: 16:42:9F:36:70:41 (random)
(409721) Creating Periodic Advertising Sync...success.Sync index: 1
Device was synced: term_cb: clear device index: 3
TAGS:0,sync...,1,-,1,1,1,-,-,-,'
Scan is running...
TAGS:0,sync...,0,-,1,1,1,-,-,-,'
['0', 'sync...', '0', '-', '1', '1', '1', '-', '-', '-']
Num synced: 5

TAGS:1,sync...,1,-,1,1,1,-,-,-,'
['1', 'sync...', '1', '-', '1', '1', '1', '-', '-', '-']
Num synced: 5

TAGS:1,sync...,0,-,1,1,1,-,-,-,'
['1', 'sync...', '0', '-', '1', '1', '1', '-', '-', '-']
Num synced: 5

TAGS:1,sync...,1,-,1,1,1,-,-,-,'
['1', 'sync...', '1', '-', '1', '1', '1', '-', '-', '-']
Num synced: 5

(414727) sync_create_timeout for 16:42:9F:36:70:41 (random) <------ Application sync timout after 5 seconds, calls bt_le_per_adv_sync_delete
sync_create_timeout: Failed cancel sync: -5 <------ bt_le_per_adv_sync_delete fails here with err -5, expected to succeed
(414980) Found device sending per adv: 3E:AC:09:A5:2B:43 (random)
(414986) Creating Periodic Advertising Sync...Sync for some device already in progress.failed (err -16)
(415365) Found device sending per adv: 25:B2:69:48:E2:2B (random)
(415370) Creating Periodic Advertising Sync...Sync for some device already in progress.failed (err -16)
TAGS:1,sync...,1,-,1,1,0,-,-,-,'
(415474) Found device sending per adv: 25:B2:69:48:E2:2B (random)
(415480) Creating Periodic Advertising Sync...Sync for some device already in progress.failed (err -16)
(415583) Found device sending per adv: 25:B2:69:48:E2:2B (random)
(415589) Creating Periodic Advertising Sync...Sync for some device already in progress.failed (err -16)
(415693) Found device sending per adv: 25:B2:69:48:E2:2B (random)
(415699) Creating Periodic Advertising Sync...Sync for some device already in progress.failed (err -16)
[00:06:54.733,489] \x1b[1;33m<wrn> bt_hci_core: opcode 0x2045 status 0x0c\x1b[0m <------ Warning log from stack
Scan is running...
(415802) Found device sending per adv: 25:B2:69:48:E2:2B (random)
(415808) Creating Periodic Advertising Sync...Sync for some device already in progress.failed (err -16)
(415912) Found device sending per adv: 25:B2:69:48:E2:2B (random)
(415918) Creating Periodic Advertising Sync...Sync for some device already in progress.failed (err -16)
(415928) Device not synced per_sync_term_cb, info->reason: 62, term_cb: clear device index: 1 <------ Here it solves itself
(416021) Found device sending per adv: 25:B2:69:48:E2:2B (random)
(416027) Creating Periodic Advertising Sync...success.Sync index: 1
TAGS:1,sync...,1,-,1,1,1,-,-,-,'
Enable receiving of CTE...
(416602) success. CTE receive enabled.
(417387) Found device sending per adv: 3E:AC:09:A5:2B:43 (random)
(417393) Creating Periodic Advertising Sync...success.Sync index: 3
TAGS:1,0,1,sync...,1,1,1,-,-,-,'
Enable receiving of CTE...
(418078) success. CTE receive enabled.
TAGS:0,0,1,0,1,1,1,-,-,-,'
['0', '0', '1', '0', '1', '1', '1', '-', '-', '-']
Num synced: 7
cvinayak commented 2 years ago

In the mean time I have collected some data and made some experiments with periodic advertising since it now is quite stable using your branch! I have data and plots using 8 tags and different periodic adv. intervals (many iterations, different per_adv_int, random offset between tags etc.) For example I see about 50% failure to sync all 8 devices within 60s using per_adv_int 100ms.

As there is high traffic, I have seen that application processing if slow would starve free buffer availability in the controller, and if this happens during synchronization establishment then at the 6th synchronization interval the synchronization failed to be established (reason 0x3E) HCI event is generated.

So I have some questions about this, what to expect, what you guys have tested etc. Would you like me to take it here in this issue, or create a new one?

what to expect

Sync failure when controller does not have enough free rx buffer during new synchronization establishment.

what you guys have tested etc.

  1. Babblesim test of simple system requirement of Periodic Advertising and Synchronization.
  2. Full Bluetooth Core Specification conformance test cases for the full Controller supported features over 500 testcases over multiple iterations over the last few months.
  3. Long duration tests using update samples (multiple period_adv/periodic_sync samples) like for overnight or few days of continued uptime everytime an identified bug is fixed.

Would you like me to take it here in this issue, or create a new one?

Yes, separate issue as unrelated to crash in ull.c which is the current issue being focused here.

cvinayak commented 2 years ago

(414727) sync_create_timeout for 16:42:9F:36:70:41 (random) <------ Application sync timout after 5 seconds, calls bt_le_per_adv_sync_delete sync_create_timeout: Failed cancel sync: -5 <------ bt_le_per_adv_sync_delete fails here with err -5, expected to succeed

When the controller has generated Synchronization Failed to be Established, it is not necessary for the host to call bt_le_per_adv_sync_delete (for the Zephyr Controller). This error/warning you see can be ignored, the controller maintains correct states in this scenario of sync failure.

cvinayak commented 2 years ago

@jakkra Two days back I believe to have analyzed and solved the lll.c/lll_sync.c assertions, I have been running a modified periodic_sync sample that previously reproduced the assertion, now for over 30 hours without any assertions. I will let the board continue running and in the next days will update the PR with the additional that will make the Controller stable.

Once again thank you for your quick responses and help solve many instability issues in the Controller.

I will leave a message here when I am satisfied with my testing and would update the PR.

jakkra commented 2 years ago

Hi @cvinayak thanks alot for your input!

As there is high traffic, I have seen that application processing if slow would starve free buffer availability in the controller, and if this happens during synchronization establishment then at the 6th synchronization interval the synchronization failed to be established (reason 0x3E) HCI event is generated.

Application processing should not be slow, I'm doing pretty much nothing except keeping track of the currently tracked tags and how many CTE samples I receive. But I'll try without the logs as it does print something when it's starts syncing, etc.

I removed enable_cte_rx and without that I have no problems syncing with 8 devices, even with 25ms per. adv. interval. Which I guess confirms that the failure is due to no memory when CTE sampling is enabled and trying to sync new.

Ok so that term_cb with info->reason: 62 (0x3E) is due to no free buffers? If so can you point me to where I can increase those (controller free rx buffers) to see if it can handle more syncronized tags better?

So this is probably related to https://github.com/zephyrproject-rtos/zephyr/issues/39376 so I will bring this info to that ticket instead.

When the controller has generated Synchronization Failed to be Established, it is not necessary for the host to call bt_le_per_adv_sync_delete (for the Zephyr Controller). This error/warning you see can be ignored, the controller maintains correct states in this scenario of sync failure.

But in this case the controller have not generated Synchronization Failed to be Established, check the logs. I cancel the sync myself after a 5s timeout and get error back from bt_le_per_adv_sync_delete. Controller have not generated any failure yet. You can see the controller generating a failure however about 1000 ms later in the logs.

1.Babblesim test of simple system requirement of Periodic Advertising and Synchronization. 2.Full Bluetooth Core Specification conformance test cases for the full Controller supported features over 500 testcases over multiple iterations over the last few months. 3.Long duration tests using update samples (multiple period_adv/periodic_sync samples) like for overnight or few days of continued uptime everytime an identified bug is fixed.

I was thinking more of how many periodic advertiser can be synced with realistically for example, so I guess maybe related to your point 3. Unless there are tests for max number devices synced etc. in the Core specification tests. How many devices was synchronized with in the test in 3. and at what periodic adv. interval? Anyway it does not seem to be an issue in periodic advertisements, instead memory requirement when CTE sampling is enabled, hence you can ignore this.

cvinayak commented 2 years ago

Ok so that term_cb with info->reason: 62 (0x3E) is due to no free buffers? If so can you point me to where I can increase those (controller free rx buffers) to see if it can handle more syncronized tags better?

Increase CONFIG_BT_CTLR_RX_BUFFERS=10 or more as necessary.

But in this case the controller have not generated Synchronization Failed to be Established, check the logs. I cancel the sync myself after a 5s timeout and get error back from bt_le_per_adv_sync_delete. Controller have not generated any failure yet. You can see the controller generating a failure however about 1000 ms later in the logs.

Ok, then it is this case https://github.com/zephyrproject-rtos/zephyr/issues/40204 This is something I have not started work on it yet. We can track it there then.

in the Core specification tests. How many devices was synchronized with in the test in 3. and at what periodic adv. interval?

Only one supported periodic sync set, as the hci_uart sample build for conformance tests is configured with one. But if the configuration is set to 6, the conformance tester would test a maximum of upto 6 simultaneous synchronizations, this is something I have not tested yet (only one configured for now). Conformance tester is a limited resource to lay hands on, but the next reservation to use I will experiment with running those tests with simultaneous syncs.

jakkra commented 2 years ago

Increase CONFIG_BT_CTLR_RX_BUFFERS=10 or more as necessary.

Put it to 18 (max) but did not make any difference unfortunately.

cvinayak commented 2 years ago

Put it to 18 (max) but did not make any difference unfortunately.

Ok, you should try reducing the printk over UART that could slow down the thread that consumes periodic advertising reports, or you can disable periodic advertising reports when creating sync so that the periodic advertising reports do not hog the CPU, (or turn on duplicate filtering with use of the ADI support feature).

I have collected some data and made some experiments with periodic advertising since it now is quite stable using your branch!

Are you not getting the assertions in lll_sync.c and lll.c any more?

jakkra commented 2 years ago

Put it to 18 (max) but did not make any difference unfortunately.

Ok, you should try reducing the printk over UART that could slow down the thread that consumes periodic advertising reports, or you can disable periodic advertising reports when creating sync so that the periodic advertising reports do not hog the CPU, (or turn on duplicate filtering with use of the ADI support feature).

Even with those changes I don't really see any difference.

I have also tried during create sync to disable CTE sampling, but I don't see any significant change. I also tried to only have CTE sampling enabled once all 8 of my tags are synced sync, and as soon as a sync was lost, disable CTE sampling, in this case I mostly loose 1-3 syncs immediately after I enable CTE sampling after all 8 got synced and then slow/hard to get all synced again.

I think there is something more going on with term_cb info->reason: 62, as I still sometimes get it even with tags 1s per.adv.int.

So with 18 RX buffers, CTE sampling disabled, regular scanning disabled, I still get info->reason: 62 in sync. term_cb.

I have collected some data and made some experiments with periodic advertising since it now is quite stable using your branch!

Are you not getting the assertions in lll_sync.c and lll.c any more?

I have not seen any asserts lately, but I think it's due to me running my current tests with tag per.adv.int. of >= 100, rather than 25ms.