zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.55k stars 6.46k forks source link

reel board: Mesh badge demo fails to send BT Mesh message #54198

Closed jonas-rem closed 1 year ago

jonas-rem commented 1 year ago

Describe the bug When flashing and setting up two reel boards with the mesh badge demo (samples/boards/reel-board/mesh_badge), sending a message via BT Mesh on button press fails with "Sending Failed!" on the reel board display.

Debug Log shows the error: [00:00:55.785,339] <err> bt_mesh_access: bt_mesh_model_send: Model not bound to AppKey 0x0000

I am building the sample with the most recent main branch (5a62f2592f) and Zephyr SDK 0.16.0-beta1: west build -b reel_board_v2 samples/boards/reel_board/mesh_badge/ -p && west flash

Connecting and sending the Name via the nRF-Connect works as described in the sample description. Just the BT Mesh send fails.

Please also mention any information which could help others to understand the problem you're facing:

To Reproduce Steps to reproduce the behavior:

  1. west build -b reel_board_v2 samples/boards/reel_board/mesh_badge/
  2. west flash (for two reel boards)
  3. Connect with the nRF Connect App as described and set up the name
  4. Disconnect from the nRF Connect App
  5. BT Mesh network got successfully created
  6. Press the button 'a Button' on one of the boards

Expected behavior The two boards should communicate with each other on a button press, showing the name of the sending board in the receiving boards display.

Impact BT Mesh feature of the Demo does not work

Logs and console output

*** Booting Zephyr OS build v3.3.0-rc1-2-g5a62f2592ffe ***
Starting Board Demo
[00:00:00.879,760] <inf> fs_nvs: nvs_mount: 8 Sectors of 4096 bytes
[00:00:00.879,791] <inf> fs_nvs: nvs_mount: alloc wra: 0, ed8
[00:00:00.879,791] <inf> fs_nvs: nvs_mount: data wra: 0, 1b8
[00:00:00.884,368] <inf> bt_hci_core: hci_vs_init: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.884,399] <inf> bt_hci_core: hci_vs_init: HW Variant: nRF52x (0x0002)
[00:00:00.884,429] <inf> bt_hci_core: hci_vs_init: Firmware: Standard Bluetooth controller (0x00) Version 3.2 Build 99
[00:00:00.884,796] <inf> bt_hci_core: bt_init: No ID address. App must call settings_load()
Bluetooth initialized
Mesh initialized
[00:00:00.895,965] <inf> bt_hci_core: bt_dev_show_info: Identity: EB:7C:3B:78:30:F1 (random)
[00:00:00.895,996] <inf> bt_hci_core: bt_dev_show_info: HCI: version 5.3 (0x0c) revision 0x0000, manufacturer 0x05f1
[00:00:00.896,026] <inf> bt_hci_core: bt_dev_show_info: LMP: version 5.3 (0x0c) subver 0xffff
Already provisioned
Board started
Button pressed
Button released
[00:00:06.972,442] <err> bt_mesh_access: bt_mesh_model_send: Model not bound to AppKey 0x0000
motion!

Environment:

Additional context

carlescufi commented 1 year ago

@jremmert-phytec-iot could you please bisect the regression please?

carlescufi commented 1 year ago

@PavelVPV could you please take a look at this regression, did we break an API somewhere, given that this application used to work and now it no longer does?

PavelVPV commented 1 year ago

Could be that default transport configuration has changed and one of these functions return error: https://github.com/zephyrproject-rtos/zephyr/blob/74c44dc273c11da3836f9dc77b1921fd0c694484/samples/boards/reel_board/mesh_badge/src/mesh.c#L540-L560

@jremmert-phytec-iot, could you please add error code checking of these functions, then erase the device, flash the demo again and check if they succeed after provisioning.

As I see the demo provisions itself after disconnection from GATT here: https://github.com/zephyrproject-rtos/zephyr/blob/74c44dc273c11da3836f9dc77b1921fd0c694484/samples/boards/reel_board/mesh_badge/src/main.c#L138-L142

So after disconnect you should see one of these messages: https://github.com/zephyrproject-rtos/zephyr/blob/cc9d2102f2c2e8d3001c846f7a9f7f524490bb5f/samples/boards/reel_board/mesh_badge/src/mesh.c#L572-L580

jonas-rem commented 1 year ago

@PavelVPV I added error checking and provisioned again, but could not observe an error with those function calls.

I can observe the text "Mesh Started\n Addr: 0x0bd3"

However, while testing I can sometimes observe that the devices crashes when initially connecting with my Android Smartphone or during paring after entering the PassKey in my Smartphone. I'm not sure if this is an other bug or if it is related.

As @carlescufi suggested, I will try narrow it down and git bisect the regression.

Error during connecting with BLE:

ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:473
    lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 6896
[00:00:17.116,363] <err> os: esf_dump: r0/a1:  0x00000003  r1/a2:  0x00000000  r2/a3:  0x00000003
[00:00:17.116,394] <err> os: esf_dump: r3/a4:  0x20001c00 r12/ip:  0x00000010 r14/lr:  0x00003d81
[00:00:17.116,394] <err> os: esf_dump:  xpsr:  0x41000028
[00:00:17.116,424] <err> os: esf_dump: Faulting instruction address (r15/pc): 0x00003d8c
[00:00:17.116,455] <err> os: z_fatal_error: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:00:17.116,455] <err> os: z_fatal_error: Fault during interrupt handling

[00:00:17.116,516] <err> os: z_fatal_error: Current thread: 0x20002d40 (unknown)
[00:00:17.191,680] <err> os: k_sys_fatal_error_handler: Halting system

Error during pairing after PassKey entry:

ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:473
    lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 8850
[00:00:49.516,571] <err> os: esf_dump: r0/a1:  0x00000003  r1/a2:  0x00000000  r2/a3:  0x00000002
[00:00:49.516,571] <err> os: esf_dump: r3/a4:  0x20001c00 r12/ip:  0x00000010 r14/lr:  0x00003d81
[00:00:49.516,601] <err> os: esf_dump:  xpsr:  0x41000028
[00:00:49.516,601] <err> os: esf_dump: Faulting instruction address (r15/pc): 0x00003d8c
[00:00:49.516,632] <err> os: z_fatal_error: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:00:49.516,662] <err> os: z_fatal_error: Fault during interrupt handling

[00:00:49.516,693] <err> os: z_fatal_error: Current thread: 0x20003610 (unknown)
[00:00:49.589,996] <err> os: k_sys_fatal_error_handler: Halting system
jonas-rem commented 1 year ago

git bisect showed that after the following commit the described error can be observed.

commit 867b88635d4c6963ab37dfe131b5bebc094c4f28
Author: Pavel Vasilyev <pavel.vasilyev@nordicsemi.no>
Date:   Thu Jun 30 10:58:41 2022 +0200

    Bluetooth: Mesh: Don't do bulk tx of seg msgs

    In the current implementation of segmented messages transmission, the
    transport layer will put as many segments as it can into the advertiser
    buffer. If a long segmented message is transmitted, there is a chance
    that an ack will be received before the transport layer finishes sending
    the first batch of segments. This will trigger retransmissions of those
    segments that are marked as missing in the ack, that are also the
    segments that are scheduled (or about to be scheduled), but not sent
    out by the advertiser yet. Thus, the receiver may receive already
    received segments.

    This commit changes 2 things:
    - it prevents rescheduling transmission of missing segments upon
    reception of ack. Thus, ack doesn't cause segments duplication in the
    advertiser buffer;
    - it doesn't put all segments to the advertiser buffer. Thus, ack that
    acknowledges all segments can immediately stop rescheduling segments as
    they are already received;

    Signed-off-by: Pavel Vasilyev <pavel.vasilyev@nordicsemi.no>

Note: While checking those relevant revisions you have to apply the following commit as well, otherwise the app won't start.

From the following commit the app generally starts again, but does not work
commit 0514c3b139bffd1813e862a1de42aece0f454c1e
Author: Maureen Helm <maureen.helm@intel.com>
Date:   Mon Aug 15 11:45:41 2022 -0500

    drivers: sensor: Fix apds9960 device pointer initialization

    Fixes a bus fault observed in samples/boards/reel_board/mesh_badge
    caused by dereferencing an uninitialized device pointer in the apds9960
    sensor driver. The device pointer needs to be initialized regardless of
    how CONFIG_APDS9960_TRIGGER is configured.

    Signed-off-by: Maureen Helm <maureen.helm@intel.com>
PavelVPV commented 1 year ago

@jremmert-phytec-iot could you clarify: is the commit that you posted makes the assertion to happen or "Model not bound to AppKey" error?

jonas-rem commented 1 year ago

Commit 1 (867b88635d4c6963ab37dfe131b5bebc094c4f28) causes the "Model not bound to AppKey" error.

Commit 2 (0514c3b139bffd1813e862a1de42aece0f454c1e) ~causes~ fixes the assertion error.

PavelVPV commented 1 year ago

Could you please repeat the provisioning again with enabled debug logs in cfg_srv.c (CONFIG_BT_MESH_DEBUG_MODEL in the failing commit)? Please enable option in logging subsystem to print function names (it is called CONFIG_LOG_FUNC_NAME_PREFIX_DBG in the latest zephyr).

jonas-rem commented 1 year ago
git log --oneline
22c7c864fb (HEAD) drivers: sensor: Fix apds9960 device pointer initialization
867b88635d Bluetooth: Mesh: Don't do bulk tx of seg msgs

Additional settings in prj.conf: CONFIG_LOG_FUNC_NAME_PREFIX_DBG=y
CONFIG_BT_MESH_DEBUG_MODEL=y
CONFIG_LOG_BUFFER_SIZE=4096

Starting Board Demo
Bluetooth initialized
Mesh initialized
Board started
[00:00:01.673,187] <inf> TI_HDC: Initialized device successfully
[00:00:01.688,446] <inf> fs_nvs: 8 Sectors of 4096 bytes
[00:00:01.688,446] <inf> fs_nvs: alloc wra: 0, fe8
[00:00:01.688,476] <inf> fs_nvs: data wra: 0, 0
[00:00:01.690,002] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:01.690,032] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:01.690,063] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.1 Build 99
[00:00:01.690,429] <inf> bt_hci_core: No ID address. App must call settings_load()
[00:00:01.709,503] <inf> bt_hci_core: Identity: EB:7C:3B:78:30:F1 (random)
[00:00:01.709,533] <inf> bt_hci_core: HCI: version 5.3 (0x0c) revision 0x0000, manufacturer 0x05f1
[00:00:01.709,564] <inf> bt_hci_core: LMP: version 5.3 (0x0c) subver 0xffff
Connected (err 0x00)
Passkey:
872284
Pairing Failed (1)
Disconnected (reason 0x13)
Connected (err 0x00)
Passkey:
014574
Pairing Complete
Disconnected (reason 0x13)
[00:01:09.113Configuring...
Configuration complete
,647] <inf> bt_mesh_main: Primary Element: 0x4192
[00:01:09.113,647] <dbg> bt_mesh_main: bt_mesh_provision: net_idx 0x0000 flags 0x00 iv_index 0x0000
[00:01:09.197,235] <dbg> bt_mesh_cfg_srv: get_model: Company 0x05f1 ID 0x0000 addr 0x4192
[00:01:09.197,265] <dbg> bt_mesh_cfg_srv: mod_bind: model 0x20000308 key_idx 0x000
[00:01:09.197,265] <dbg> bt_mesh_cfg_srv: mod_app_bind: status 0x03
[00:01:09.197,509] <dbg> bt_mesh_cfg_srv: get_model: ID 0x1000 addr 0x4192
[00:01:09.197,540] <dbg> bt_mesh_cfg_srv: mod_bind: model 0x20000384 key_idx 0x000
[00:01:09.197,540] <dbg> bt_mesh_cfg_srv: mod_app_bind: status 0x03
[00:01:09.197,784] <dbg> bt_mesh_cfg_srv: get_model: ID 0x1100 addr 0x4192
[00:01:09.197,814] <dbg> bt_mesh_cfg_srv: mod_bind: model 0x200003a0 key_idx 0x000
[00:01:09.197,814] <dbg> bt_mesh_cfg_srv: mod_app_bind: status 0x03
[00:01:09.198,059] <dbg> bt_mesh_cfg_srv: get_model: ID 0x0002 addr 0x4192
[00:01:09.198,059] <dbg> bt_mesh_cfg_srv: mod_bind: model 0x20000368 key_idx 0x000
[00:01:09.198,089] <dbg> bt_mesh_cfg_srv: mod_app_bind: status 0x03
[00:01:09.198,333] <dbg> bt_mesh_cfg_srv: mod_sub_add: elem_addr 0x4192, sub_addr 0xc123
[00:01:09.198,364] <dbg> bt_mesh_cfg_srv: get_model: Company 0x05f1 ID 0x0000 addr 0x4192
[00:01:09.198,364] <dbg> bt_mesh_cfg_srv: send_mod_sub_status: status 0x00 elem_addr 0x4192 sub_addr 0xc123
[00:01:09.198,638] <dbg> bt_mesh_cfg_cli: mod_app_status: net_idx 0x0000 app_idx 0xfffe src 0x4192 len 9: 0392410000f1050000
[00:01:09.198,791] <dbg> bt_mesh_cfg_cli: mod_app_status: net_idx 0x0000 app_idx 0xfffe src 0x4192 len 7: 03924100000010
[00:01:09.198,944] <dbg> bt_mesh_cfg_cli: mod_app_status: net_idx 0x0000 app_idx 0xfffe src 0x4192 len 7: 03924100000011
[00:01:09.199,096] <dbg> bt_mesh_cfg_cli: mod_app_status: net_idx 0x0000 app_idx 0xfffe src 0x4192 len 7: 03924100000200
[00:01:09.199,249] <dbg> bt_mesh_cfg_cli: mod_sub_status: net_idx 0x0000 app_idx 0xfffe src 0x4192 len 9: 00924123c1f1050000
[00:01:09.199,523] <dbg> bt_mesh_cfg_srv: app_key_add: AppIdx 0x0000 NetIdx 0x0000
[00:01:09.201,263] <dbg> bt_mesh_cfg_cli: app_key_status: net_idx 0x0000 app_idx 0xfffe src 0x4192 len 4: 00000000
[00:01:09.201,599] <dbg> bt_mesh_cfg_srv: mod_pub_set: elem_addr 0x4192 pub_addr 0xc123 cred_flag 0
[00:01:09.201,599] <dbg> bt_mesh_cfg_srv: mod_pub_set: pub_app_idx 0x000, pub_ttl 31 pub_period 0x4a
[00:01:09.201,629] <dbg> bt_mesh_cfg_srv: mod_pub_set: retransmit 0x00 (count 0 interval 50ms)
[00:01:09.201,629] <dbg> bt_mesh_cfg_srv: get_model: Company 0x05f1 ID 0x0000 addr 0x4192
[00:01:09.202,148] <dbg> bt_mesh_cfg_cli: mod_pub_status: net_idx 0x0000 app_idx 0xfffe src 0x4192 len 14: 03924100000000000000f1050000
jonas-rem commented 1 year ago

As a comparison the debug log, without 867b88635d4c6963ab37dfe131b5bebc094c4f28 and otherwise same settings.

** Booting Zephyr OS build zephyr-v3.1.0-1332-g8a9eeafb0357  ***
Starting Board Demo
Bluetooth initialized
Mesh initialized
Board started
[00:00:01.675,109] <inf> TI_HDC: Initialized device successfully
[00:00:01.690,368] <inf> fs_nvs: 8 Sectors of 4096 bytes
[00:00:01.690,399] <inf> fs_nvs: alloc wra: 0, fe8
[00:00:01.690,399] <inf> fs_nvs: data wra: 0, 0
[00:00:01.691,955] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:01.691,955] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:01.691,986] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.1 Build 99
[00:00:01.692,352] <inf> bt_hci_core: No ID address. App must call settings_load()
[00:00:01.711,425] <inf> bt_hci_core: Identity: EB:7C:3B:78:30:F1 (random)
[00:00:01.711,456] <inf> bt_hci_core: HCI: version 5.3 (0x0c) revision 0x0000, manufacturer 0x05f1
[00:00:01.711,486] <inf> bt_hci_core: LMP: version 5.3 (0x0c) subver 0xffff
Connected (err 0x00)
Passkey:
843355
Pairing Failed (1)
Disconnected (reason 0x13)
Connected (err 0x00)
Passkey:
105182
Pairing Complete
Disconnected (reason 0x13)
Configuring...
Configuration complete
[00:00:59.018,310] <inf> bt_mesh_main: Primary Element: 0x1741
[00:00:59.018,341] <dbg> bt_mesh_main: bt_mesh_provision: net_idx 0x0000 flags 0x00 iv_index 0x0000
[00:00:59.101,623] <dbg> bt_mesh_cfg_srv: app_key_add: AppIdx 0x0000 NetIdx 0x0000
[00:00:59.103,271] <dbg> bt_mesh_cfg_srv: get_model: Company 0x05f1 ID 0x0000 addr 0x1741
[00:00:59.103,271] <dbg> bt_mesh_cfg_srv: mod_bind: model 0x20000308 key_idx 0x000
[00:00:59.103,302] <dbg> bt_mesh_cfg_srv: mod_app_bind: status 0x00
[00:00:59.103,546] <dbg> bt_mesh_cfg_srv: get_model: ID 0x1000 addr 0x1741
[00:00:59.103,576] <dbg> bt_mesh_cfg_srv: mod_bind: model 0x20000384 key_idx 0x000
[00:00:59.103,576] <dbg> bt_mesh_cfg_srv: mod_app_bind: status 0x00
[00:00:59.103,820] <dbg> bt_mesh_cfg_srv: get_model: ID 0x1100 addr 0x1741
[00:00:59.103,851] <dbg> bt_mesh_cfg_srv: mod_bind: model 0x200003a0 key_idx 0x000
[00:00:59.103,881] <dbg> bt_mesh_cfg_srv: mod_app_bind: status 0x00
[00:00:59.104,125] <dbg> bt_mesh_cfg_srv: get_model: ID 0x0002 addr 0x1741
[00:00:59.104,125] <dbg> bt_mesh_cfg_srv: mod_bind: model 0x20000368 key_idx 0x000
[00:00:59.104,156] <dbg> bt_mesh_cfg_srv: mod_app_bind: status 0x00
[00:00:59.104,400] <dbg> bt_mesh_cfg_srv: mod_sub_add: elem_addr 0x1741, sub_addr 0xc123
[00:00:59.104,400] <dbg> bt_mesh_cfg_srv: get_model: Company 0x05f1 ID 0x0000 addr 0x1741
[00:00:59.104,431] <dbg> bt_mesh_cfg_srv: send_mod_sub_status: status 0x00 elem_addr 0x1741 sub_addr 0xc123
[00:00:59.104,736] <dbg> bt_mesh_cfg_cli: app_key_status: net_idx 0x0000 app_idx 0xfffe src 0x1741 len 4: 00000000
[00:00:59.104,888] <dbg> bt_mesh_cfg_cli: mod_app_status: net_idx 0x0000 app_idx 0xfffe src 0x1741 len 9: 0041170000f1050000
[00:00:59.105,010] <dbg> bt_mesh_cfg_cli: mod_app_status: net_idx 0x0000 app_idx 0xfffe src 0x1741 len 7: 00411700000010
[00:00:59.105,163] <dbg> bt_mesh_cfg_cli: mod_app_status: net_idx 0x0000 app_idx 0xfffe src 0x1741 len 7: 00411700000011
[00:00:59.105,316] <dbg> bt_mesh_cfg_cli: mod_app_status: net_idx 0x0000 app_idx 0xfffe src 0x1741 len 7: 00411700000200
[00:00:59.105,468] <dbg> bt_mesh_cfg_cli: mod_sub_status: net_idx 0x0000 app_idx 0xfffe src 0x1741 len 9: 00411723c1f1050000
[00:00:59.105,804] <dbg> bt_mesh_cfg_srv: mod_pub_set: elem_addr 0x1741 pub_addr 0xc123 cred_flag 0
[00:00:59.105,834] <dbg> bt_mesh_cfg_srv: mod_pub_set: pub_app_idx 0x000, pub_ttl 31 pub_period 0x4a
[00:00:59.105,834] <dbg> bt_mesh_cfg_srv: mod_pub_set: retransmit 0x00 (count 0 interval 50ms)
[00:00:59.105,834] <dbg> bt_mesh_cfg_srv: get_model: Company 0x05f1 ID 0x0000 addr 0x1741
[00:00:59.105,865] <dbg> bt_mesh_cfg_srv: _mod_pub_set: period 10000 ms
[00:00:59.106,384] <dbg> bt_mesh_cfg_cli: mod_pub_status: net_idx 0x0000 app_idx 0xfffe src 0x1741 len 14: 00411723c100001f4a00f1050000
PavelVPV commented 1 year ago

Ok, now I see. Thanks for the log. Previously (before https://github.com/zephyrproject-rtos/zephyr/commit/867b88635d4c6963ab37dfe131b5bebc094c4f28), the transport layer would push as much segments to the network layer as it can allocate using bt_mesh_adv_create. Because the default mesh stack configuration had enough adv buffers (CONFIG_BT_MESH_ADV_BUF_COUNT), the transport layer could push a complete Config AppKey Add message at once to the network layer without rescheduling inside. And the network layer could process the message immediately. So that when a next message comes (that is Config Model App Bind in the sample), the previous one would have an effect already. Therefore, adding appkey and binding worked when 2 functions are called in a row without waiting for the response.

Now, the transport layer pushes one segment at a time. And because the Configuration Client APIs are used asynchronous manner, the next function doesn't wait for the previous one to complete, tries to bind the not added appkey yet, and fails (mod_app_bind: status 0x03 in the first log). And the appkey is added after all binds attempts.

However, the previous implementation won't work if adv buffers were not enough to push a complete segmented message to the network layer. Then the same issue would happen as it happens with new implementation. So in my opinion, the application code is not correct anyway.

To solve this in the sample we need to either use Configuration Client API in a synchronous manner, but then we need to run start_mesh function from a different cooperative thread (not from the system workqueue thread, otherwise the API will timeout). Or assign app_key_status callback to Configuration Client, and then bind the appkeys to models in from the callback. Ideally, all methods should wait for the corresponding callback before calling the next API.

jonas-rem commented 1 year ago

@PavelVPV thanks for explaining the issue! I liked your idea to use the callback and gave it a try to make a fix. Can you have a look at it?