Closed pdunaj closed 5 years ago
We fixed this recently in the scope of https://github.com/zephyrproject-rtos/zephyr/pull/15335, can you check if you have those commits?
See this comment in particular
Hi @carlescufi , I have these changes (ncs zephyr a288e8b8ab54c667c6ed9776042200411710a9cf)
In that case this sounds like a new issue. @cvinayak can you please take a look?
@pdunaj could you provide more details for the below:
When security fails and peripheral disconnects observe devices go back to point 4
I was not able to reproduce the crash you observe after the following experiments
[2019-04-22 21:58:02.427] ***** Booting Zephyr OS zephyr-v1.14.0-1-gae02726ac3b9 *****
[2019-04-22 21:58:02.436] Bluetooth initialized
[2019-04-22 21:58:02.438] [00:00:00.Advertising successfully started
[2019-04-22 21:58:02.683] 008,178] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[2019-04-22 21:58:02.970] [00:00:00.008,209] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[2019-04-22 21:58:02.977] [00:00:00.008,209] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 1.14 Build 0
[2019-04-22 21:58:02.987] [00:00:00.008,605] <wrn> bt_hci_core: No ID address. App must call settings_load()
[2019-04-22 21:58:02.994] [00:00:00.012,481] <inf> bt_hci_core: Identity: ea:15:23:d0:ac:03 (random)
[2019-04-22 21:58:03.002] [00:00:00.012,481] <inf> bt_hci_core: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1
[2019-04-22 21:58:03.011] [00:00:00.012,512] <inf> bt_hci_core: LMP: version 5.0 (0x09) subver 0xffff
[2019-04-22 21:58:19.022] Connected
[2019-04-22 21:58:23.036] Passkey for 67:b0:92:41:72:49 (random): 700241
[2019-04-22 21:58:33.285] Security changed: 4
[2019-04-22 21:58:33.437] Disconnected (reason 22)
[2019-04-22 21:58:47.196] Connected
[2019-04-22 21:58:58.476] Disconnected (reason 19)
[2019-04-22 21:59:01.443] Connected
[2019-04-22 21:59:07.581] Disconnected (reason 19)
[2019-04-22 21:59:20.272] Connected
[2019-04-22 21:59:24.187] Disconnected (reason 19)
[2019-04-22 22:07:49.541] Connected
[2019-04-22 22:07:55.892] Passkey for 67:b0:92:41:72:49 (random): 717207
[2019-04-22 22:07:59.108] Pairing cancelled: 67:b0:92:41:72:49 (random)
[2019-04-22 22:07:59.826] [00:09:56.699,707] <err> bt_smp: reason 0x8
[2019-04-22 22:08:01.741] Disconnected (reason 19)
[2019-04-22 22:08:07.726] Connected
[2019-04-22 22:08:11.842] Passkey for 67:b0:92:41:72:49 (random): 357756
[2019-04-22 22:08:12.981] Pairing cancelled: 67:b0:92:41:72:49 (random)
[2019-04-22 22:08:13.127] [00:10:10.573,394] <err> bt_smp: reason 0x8
[2019-04-22 22:08:14.796] Disconnected (reason 19)
[2019-04-22 22:08:17.979] Connected
[2019-04-22 22:08:22.570] Passkey for 67:b0:92:41:72:49 (random): 006498
[2019-04-22 22:08:23.654] Pairing cancelled: 67:b0:92:41:72:49 (random)
[2019-04-22 22:08:24.214] [00:10:21.247,589] <err> bt_smp: reason 0x8
[2019-04-22 22:08:25.750] Disconnected (reason 19)
[2019-04-22 22:08:30.789] Connected
[2019-04-22 22:08:35.318] Passkey for 67:b0:92:41:72:49 (random): 481417
[2019-04-22 22:08:36.307] Pairing cancelled: 67:b0:92:41:72:49 (random)
[2019-04-22 22:08:37.315] [00:10:33.902,221] <err> bt_smp: reason 0x8
[2019-04-22 22:08:38.111] Disconnected (reason 19)
[2019-04-22 22:08:40.645] Connected
[2019-04-22 22:08:44.451] Passkey for 67:b0:92:41:72:49 (random): 130707
[2019-04-22 22:08:45.469] Pairing cancelled: 67:b0:92:41:72:49 (random)
[2019-04-22 22:08:46.386] [00:10:43.063,903] <err> bt_smp: reason 0x8
[2019-04-22 22:08:47.449] Disconnected (reason 19)
[2019-04-22 22:08:58.368] Connected
[2019-04-22 22:09:02.359] Passkey for 67:b0:92:41:72:49 (random): 197670
[2019-04-22 22:09:12.883] Security changed: 4
[2019-04-22 22:09:13.029] Disconnected (reason 22)
[2019-04-22 22:09:18.362] Connected
[2019-04-22 22:09:22.516] Passkey for 67:b0:92:41:72:49 (random): 087678
[2019-04-22 22:09:33.192] Security changed: 4
[2019-04-22 22:09:33.338] Disconnected (reason 22)
[2019-04-22 22:09:38.885] Connected
[2019-04-22 22:09:45.253] Passkey for 67:b0:92:41:72:49 (random): 760333
[2019-04-22 22:09:56.609] Security changed: 4
[2019-04-22 22:09:56.756] Disconnected (reason 22)
[2019-04-22 22:10:03.243] Connected
[2019-04-22 22:10:07.851] Passkey for 67:b0:92:41:72:49 (random): 125522
[2019-04-22 22:10:18.537] Security changed: 4
[2019-04-22 22:10:18.684] Disconnected (reason 22)
@pdunaj On which callback do you initiate disconnect in your peripheral application?
Hi @cvinayak , I replicate the problem with pca20041 (mouse) and pca10059 (dongle). Our project supports DK so I will check if problem replicates there too. I guess it should as this code is common.
Let me add some more info to each step:
bt_unpair(cur_peer_id, BT_ADDR_LE_ANY);
after user confirms. cur_peer_id
is current identity and you can assume it is 0 (default). As far as I know this causes disconnect all.connected
callback by calling bt_conn_security(conn, BT_SECURITY_MEDIUM);
bt_unpair(cur_peer_id, BT_ADDR_LE_ANY);
. This causes disconnect of active connection.Issue happens every time. I don't see it in other scenarios.
I checked if problem can be replicated on pca10056 and it does. There is however one problem - on DK we don't support a bond control button so bonds have to be removed manually (I erase them by clearing settings with nrfjprog -f NRF52 --erasepage 0x000f8000-0x00100000
). After this is done I see dongle connecting with DK but no security, disconnect and then another attempt. I do the same on dongle as I did before, that is clear bonds and I see crash on DK.
We have checked replication route together with @cvinayak . There are two observations worth noting.
First, apparently it is enough to simply wait (i.e. bonds on dongle do not have to be removed) as after 5 retries crash is observed. The error reported on disconnection is 61. Logs below. Second, when I do the same while working with windows 7 (Intel drivers stack) there is no crash. I still get disconnected with error 61 but this repeats over and over (no crash was seen after around 10 cycles).
[00897530] <inf> ble_state: Connected to f6:7d:a2:fa:f6:a8 (random)
[00897530] <inf> ble_state: Identity 0 has 0 bonds
[00897531] <inf> ble_state: Set security level
[00897538] <inf> event_manager: e: ble_peer_event id=0x2000262c CONNECTED
[00897539] <inf> ble_adv: Advertising stopped
[00897540] <inf> hid_state: Subscriber 0x2000262c connected
[00897540] <inf> hid_state: Active subscriber 0x2000262c
[00897548] <inf> event_manager: e: led_event led_id:1 mode:ON period:0 color: < 150 100 120 >
[00960226] <inf> battery: Battery level: 86% (4034 mV)
[00960228] <inf> event_manager: e: battery_level_event level=86
[01061409] <inf> ble_state: Disconnected from f6:7d:a2:fa:f6:a8 (random) (reason 61)
[01061416] <inf> event_manager: e: ble_peer_event id=0x2000262c DISCONNECTED
[01061418] <inf> ble_adv: Advertising stopped
[01061418] <inf> ble_adv: Use fast advertising
[01061443] <inf> ble_adv: Advertising started
[01061443] <inf> hid_state: Subscriber 0x2000262c disconnected
[01061443] <wrn> hid_state: No subscriber of type BLE
[01061443] <inf> hid_state: No active subscriber
[01061448] <inf> event_manager: e: led_event led_id:1 mode:BLINKING period:1000 color: < 150 100 120 >
[01193062] <inf> ble_state: Connected to f6:7d:a2:fa:f6:a8 (random)
[01193062] <inf> ble_state: Identity 0 has 0 bonds
[01193064] <inf> ble_state: Set security level
[01193070] <inf> event_manager: e: ble_peer_event id=0x2000262c CONNECTED
[01193072] <inf> ble_adv: Advertising stopped
[01193072] <inf> hid_state: Subscriber 0x2000262c connected
[01193072] <inf> hid_state: Active subscriber 0x2000262c
[01193080] <inf> event_manager: e: led_event led_id:1 mode:ON period:0 color: < 150 100 120 >
[01193120] <err> bt_ctlr_hci: Tx Buffer Overflow
[01193120] <err> bt_hci_core: assert: 'bt_hci_evt_is_prio(hdr->evt)' failed
***** Kernel OOPS! *****
Current thread ID = 0x200025a0
Faulting instruction address = 0x2b03a
Fatal fault in thread 0x200025a0! Aborting.
[0ASSERTION FAIL [out_ctx->control_block->offset <= out_ctx->size] @ /home/pdunaj/work/ncs/zephyr/subsys/logging/log_output.c:101
***** Kernel Panic! *****
Current thread ID = 0x20003fbc
Faulting instruction address = 0x188e4
@carlescufi on slack:
I think we should start by determining if this is a Host or Controller issue because at this point it's confusing it really looks like https://github.com/zephyrproject-rtos/zephyr/pull/15335#issuecomment-482082539
It is not host, to my understanding for now, as data buffer is leaked in the controller. I still need to try updates to my test setup (i need to try adding a bt_conn_security call in my peripheral after bt_unpair scenario).
It is not related to #15335 as that code path is not (to be) exercised in peripheral. Initially while debugging I thought it could be a bug that memory is leaked to wrong pools in the controller (but then the bug has to be since the initial contribution, as not much has changed in the data/ctrl path implementation). But then again testing with Window 7 /Intel implementation, for the same disconnect reason the peripheral is not crashing, hence, I need to find the root cause/reproduction of the MIC failure scenario that leads to the leak.
Case 1: Central and Peripheral try to establish encryption. Central has bond information for peripheral Peripheral has no bond information.
Sniffer log: (1 and 2) peripheral_no_bond_kernel_panic.pcapng
Problem 1: Encryption procedure is not terminated on peripheral side when sending reject ind (PIN OR KEY MISSING) function event_enc_prep is still using:
/* procedure request acked */
conn->llcp_ack = conn->llcp_req;
Peripheral initiates connection parameter update request Central accepts parameters Peripheral when it receives CONN_PARAM_UPD_RSP it disconnects with MIC error because this control PDU has "collided" with the encryption procedure.
The devices reconnect, but the peripheral crashes at this point, with TX buffer overflow. Most likely this happens when it tries to enqueue the Security Request. The peripheral previously had paused transmission of packets. When it disconnected the packet queue is likely full, and freeing of these packets was not successful.
Problem 2: Peripheral initiated the connection parameter request procedure while the encryption procedure was "ongoing". If the procedure was actually active the central would have disconnected the link at this point.
Case 2: Central and Peripheral try to establish encryption Both devices have bonding information. Peripheral disconnects the link with MIC failure after receiving LL_LENGTH_RSP during encryption procedure. Happens every time, devices are stuck in reconnect loop.
Problem 3: Central initiates Encryption procedure before it has finished it's previously initiated Length Update Procedure
Problem 4: Central Responds to peer initiated Length Procedure after it has sent LL_ENC_REQ PDU. The central shall only send Disconnect or Encryption related PDUs at this point.
Problem 5: Peripheral should have disconnected the link when receiving the Encryption request, before it has responded to the Length request.
Sniffer log: (3, 4 and 5) mic_error_disconnect_on_reconnect
@joerchan PR for problem one: #15727 . This will fix the MIC failure happening after reject_ext_ind for an encryption procedure.
The subsequent problem (2) of connection parameter request being sent is a bit weird because whats unset is pause_rx flag (due to the bug, missing reset of the flag), which should prevent CPR prepared/dispatch in the event_connection_prepare function.
Could you make a new issue or add to #15256, I can handle the fixes related to central (problem 3,4 and 5) in #15320 ?
@joerchan can you post a quick update on this one? Is it still reproducible?
any news?
The problems related to control procedures has been fixed. The TX Buffer overflow problem I could not figure out why happened. I have no way to reproduce it anymore, could be that it was provoked by the other issues. I'd suggest to close this issue.
Let's close this issue unless @pdunaj can reproduce it on the latest master.
The problems related to control procedures has been fixed. The TX Buffer overflow problem I could not figure out why happened. I have no way to reproduce it anymore, could be that it was provoked by the other issues. I'd suggest to close this issue.
This fix #31813 can explain the Tx Buffer Overflow. The compiler instruction reordering depends on optimization and could change based on changes in the code. Also, refer to #30378
Describe the bug
To Reproduce Steps to reproduce the behavior:
Expected behavior No crash. Central should be able to erase its bonds without effect on peripheral.
Impact Showstopper
Screenshots or console output N/A
Environment (please complete the following information): master
Additional context N/A