zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.86k stars 6.62k forks source link

Unbonded peripheral gets 'Tx Buffer Overflow' when erasing bond on master #15570

Closed pdunaj closed 5 years ago

pdunaj commented 5 years ago

Describe the bug

[00709206] <err> bt_ctlr_hci: Tx Buffer Overflow
[00709206] <err> bt_hci_core: assert: 'bt_hci_evt_is_prio(hdr->evt)' failed
***** Kernel OOPS! *****
Current thread ID = 0x200025a0
Faulting instruction address = 0x2b15a
Fatal fault in thread 0x200025a0! Aborting.
[00961529] <err> bt_ctlr_llsw_ctrl: assert: '_radio.ticker_id_prepare == (7 + conn->handle)' failed
***** Kernel OOPS! *****
Current thread ID = 0x20002e44
Faulting instruction address = 0x364d4
Fatal fault in ISR! Spinning...

To Reproduce Steps to reproduce the behavior:

  1. Bond master and peripheral
  2. Remove bonds on peripheral
  3. Observe peripheral disconnecting
  4. Since master still has peripheral bonded it tries to reconnect
  5. Connection is established and security start is called (that will fail as peripheral does not have bond)
  6. When security fails and peripheral disconnects observe devices go back to point 4
  7. At some point once devices are connected and security start is in progress remove bonds on central
  8. Observe crash on peripheral; central is unaffected (disconnects and has bonds removed)

Expected behavior No crash. Central should be able to erase its bonds without effect on peripheral.

Impact Showstopper

Screenshots or console output N/A

Environment (please complete the following information): master

Additional context N/A

carlescufi commented 5 years ago

We fixed this recently in the scope of https://github.com/zephyrproject-rtos/zephyr/pull/15335, can you check if you have those commits?

See this comment in particular

pdunaj commented 5 years ago

Hi @carlescufi , I have these changes (ncs zephyr a288e8b8ab54c667c6ed9776042200411710a9cf)

carlescufi commented 5 years ago

In that case this sounds like a new issue. @cvinayak can you please take a look?

cvinayak commented 5 years ago

@pdunaj could you provide more details for the below:

When security fails and peripheral disconnects observe devices go back to point 4

I was not able to reproduce the crash you observe after the following experiments

[2019-04-22 21:58:02.427] ***** Booting Zephyr OS zephyr-v1.14.0-1-gae02726ac3b9 *****
[2019-04-22 21:58:02.436] Bluetooth initialized
[2019-04-22 21:58:02.438] [00:00:00.Advertising successfully started
[2019-04-22 21:58:02.683] 008,178] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[2019-04-22 21:58:02.970] [00:00:00.008,209] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[2019-04-22 21:58:02.977] [00:00:00.008,209] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 1.14 Build 0
[2019-04-22 21:58:02.987] [00:00:00.008,605] <wrn> bt_hci_core: No ID address. App must call settings_load()
[2019-04-22 21:58:02.994] [00:00:00.012,481] <inf> bt_hci_core: Identity: ea:15:23:d0:ac:03 (random)
[2019-04-22 21:58:03.002] [00:00:00.012,481] <inf> bt_hci_core: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1
[2019-04-22 21:58:03.011] [00:00:00.012,512] <inf> bt_hci_core: LMP: version 5.0 (0x09) subver 0xffff
[2019-04-22 21:58:19.022] Connected
[2019-04-22 21:58:23.036] Passkey for 67:b0:92:41:72:49 (random): 700241
[2019-04-22 21:58:33.285] Security changed: 4
[2019-04-22 21:58:33.437] Disconnected (reason 22)
[2019-04-22 21:58:47.196] Connected
[2019-04-22 21:58:58.476] Disconnected (reason 19)
[2019-04-22 21:59:01.443] Connected
[2019-04-22 21:59:07.581] Disconnected (reason 19)
[2019-04-22 21:59:20.272] Connected
[2019-04-22 21:59:24.187] Disconnected (reason 19)
[2019-04-22 22:07:49.541] Connected
[2019-04-22 22:07:55.892] Passkey for 67:b0:92:41:72:49 (random): 717207
[2019-04-22 22:07:59.108] Pairing cancelled: 67:b0:92:41:72:49 (random)
[2019-04-22 22:07:59.826] [00:09:56.699,707] <err> bt_smp: reason 0x8
[2019-04-22 22:08:01.741] Disconnected (reason 19)
[2019-04-22 22:08:07.726] Connected
[2019-04-22 22:08:11.842] Passkey for 67:b0:92:41:72:49 (random): 357756
[2019-04-22 22:08:12.981] Pairing cancelled: 67:b0:92:41:72:49 (random)
[2019-04-22 22:08:13.127] [00:10:10.573,394] <err> bt_smp: reason 0x8
[2019-04-22 22:08:14.796] Disconnected (reason 19)
[2019-04-22 22:08:17.979] Connected
[2019-04-22 22:08:22.570] Passkey for 67:b0:92:41:72:49 (random): 006498
[2019-04-22 22:08:23.654] Pairing cancelled: 67:b0:92:41:72:49 (random)
[2019-04-22 22:08:24.214] [00:10:21.247,589] <err> bt_smp: reason 0x8
[2019-04-22 22:08:25.750] Disconnected (reason 19)
[2019-04-22 22:08:30.789] Connected
[2019-04-22 22:08:35.318] Passkey for 67:b0:92:41:72:49 (random): 481417
[2019-04-22 22:08:36.307] Pairing cancelled: 67:b0:92:41:72:49 (random)
[2019-04-22 22:08:37.315] [00:10:33.902,221] <err> bt_smp: reason 0x8
[2019-04-22 22:08:38.111] Disconnected (reason 19)
[2019-04-22 22:08:40.645] Connected
[2019-04-22 22:08:44.451] Passkey for 67:b0:92:41:72:49 (random): 130707
[2019-04-22 22:08:45.469] Pairing cancelled: 67:b0:92:41:72:49 (random)
[2019-04-22 22:08:46.386] [00:10:43.063,903] <err> bt_smp: reason 0x8
[2019-04-22 22:08:47.449] Disconnected (reason 19)
[2019-04-22 22:08:58.368] Connected
[2019-04-22 22:09:02.359] Passkey for 67:b0:92:41:72:49 (random): 197670
[2019-04-22 22:09:12.883] Security changed: 4
[2019-04-22 22:09:13.029] Disconnected (reason 22)
[2019-04-22 22:09:18.362] Connected
[2019-04-22 22:09:22.516] Passkey for 67:b0:92:41:72:49 (random): 087678
[2019-04-22 22:09:33.192] Security changed: 4
[2019-04-22 22:09:33.338] Disconnected (reason 22)
[2019-04-22 22:09:38.885] Connected
[2019-04-22 22:09:45.253] Passkey for 67:b0:92:41:72:49 (random): 760333
[2019-04-22 22:09:56.609] Security changed: 4
[2019-04-22 22:09:56.756] Disconnected (reason 22)
[2019-04-22 22:10:03.243] Connected
[2019-04-22 22:10:07.851] Passkey for 67:b0:92:41:72:49 (random): 125522
[2019-04-22 22:10:18.537] Security changed: 4
[2019-04-22 22:10:18.684] Disconnected (reason 22)

@pdunaj On which callback do you initiate disconnect in your peripheral application?

pdunaj commented 5 years ago

Hi @cvinayak , I replicate the problem with pca20041 (mouse) and pca10059 (dongle). Our project supports DK so I will check if problem replicates there too. I guess it should as this code is common.

Let me add some more info to each step:

  1. Remove bonds on peripheral In our case we do it with bt_unpair(cur_peer_id, BT_ADDR_LE_ANY); after user confirms. cur_peer_id is current identity and you can assume it is 0 (default). As far as I know this causes disconnect all.
  2. Observe peripheral disconnecting As an effect of disconnect all from previous step.
  3. Since master still has peripheral bonded it tries to reconnect Master has a bond on its side. Dongle firmware is so scanning for mouse and when one is found and address matches it connects.
  4. Connection is established and security start is called (that will fail as peripheral does not have bond) In our case peripheral is starting the security enable process. We do it from connected callback by calling bt_conn_security(conn, BT_SECURITY_MEDIUM);
  5. When security fails and peripheral disconnects observe devices go back to point 4 Master was bonded with the peripheral so it will not want to pair again. We don't do any specific disconnection so I guess disconnect happens after timeout. If nothing is done this will loop as disconnect will cause dongle to scan for mouse again. Mouse is present and it will connect and fail security again.
  6. At some point once devices are connected and security start is in progress remove bonds on central. On dongle side you can remove bonds the same way you remove them on mouse. After user confirms we do bt_unpair(cur_peer_id, BT_ADDR_LE_ANY);. This causes disconnect of active connection.
  7. Observe crash on peripheral; central is unaffected (disconnects and has bonds removed) Disconnection done by unpairing clearly causes something on mouse to go nuts. Maybe mouse expects information about security that never comes (normally gets timeout but if disconnect happens it crashes?).

Issue happens every time. I don't see it in other scenarios.

pdunaj commented 5 years ago

I checked if problem can be replicated on pca10056 and it does. There is however one problem - on DK we don't support a bond control button so bonds have to be removed manually (I erase them by clearing settings with nrfjprog -f NRF52 --erasepage 0x000f8000-0x00100000). After this is done I see dongle connecting with DK but no security, disconnect and then another attempt. I do the same on dongle as I did before, that is clear bonds and I see crash on DK.

pdunaj commented 5 years ago

We have checked replication route together with @cvinayak . There are two observations worth noting.

First, apparently it is enough to simply wait (i.e. bonds on dongle do not have to be removed) as after 5 retries crash is observed. The error reported on disconnection is 61. Logs below. Second, when I do the same while working with windows 7 (Intel drivers stack) there is no crash. I still get disconnected with error 61 but this repeats over and over (no crash was seen after around 10 cycles).

[00897530] <inf> ble_state: Connected to f6:7d:a2:fa:f6:a8 (random)
[00897530] <inf> ble_state: Identity 0 has 0 bonds
[00897531] <inf> ble_state: Set security level
[00897538] <inf> event_manager: e: ble_peer_event id=0x2000262c CONNECTED
[00897539] <inf> ble_adv: Advertising stopped
[00897540] <inf> hid_state: Subscriber 0x2000262c connected
[00897540] <inf> hid_state: Active subscriber 0x2000262c
[00897548] <inf> event_manager: e: led_event led_id:1 mode:ON period:0 color: < 150 100 120 >
[00960226] <inf> battery: Battery level: 86% (4034 mV)
[00960228] <inf> event_manager: e: battery_level_event level=86
[01061409] <inf> ble_state: Disconnected from f6:7d:a2:fa:f6:a8 (random) (reason 61)
[01061416] <inf> event_manager: e: ble_peer_event id=0x2000262c DISCONNECTED
[01061418] <inf> ble_adv: Advertising stopped
[01061418] <inf> ble_adv: Use fast advertising
[01061443] <inf> ble_adv: Advertising started
[01061443] <inf> hid_state: Subscriber 0x2000262c disconnected
[01061443] <wrn> hid_state: No subscriber of type BLE
[01061443] <inf> hid_state: No active subscriber
[01061448] <inf> event_manager: e: led_event led_id:1 mode:BLINKING period:1000 color: < 150 100 120 >
[01193062] <inf> ble_state: Connected to f6:7d:a2:fa:f6:a8 (random)
[01193062] <inf> ble_state: Identity 0 has 0 bonds
[01193064] <inf> ble_state: Set security level
[01193070] <inf> event_manager: e: ble_peer_event id=0x2000262c CONNECTED
[01193072] <inf> ble_adv: Advertising stopped
[01193072] <inf> hid_state: Subscriber 0x2000262c connected
[01193072] <inf> hid_state: Active subscriber 0x2000262c
[01193080] <inf> event_manager: e: led_event led_id:1 mode:ON period:0 color: < 150 100 120 >
[01193120] <err> bt_ctlr_hci: Tx Buffer Overflow
[01193120] <err> bt_hci_core: assert: 'bt_hci_evt_is_prio(hdr->evt)' failed
***** Kernel OOPS! *****
Current thread ID = 0x200025a0
Faulting instruction address = 0x2b03a
Fatal fault in thread 0x200025a0! Aborting.
[0ASSERTION FAIL [out_ctx->control_block->offset <= out_ctx->size] @ /home/pdunaj/work/ncs/zephyr/subsys/logging/log_output.c:101
***** Kernel Panic! *****
Current thread ID = 0x20003fbc
Faulting instruction address = 0x188e4
sixoctets commented 5 years ago

@carlescufi on slack:

I think we should start by determining if this is a Host or Controller issue because at this point it's confusing it really looks like https://github.com/zephyrproject-rtos/zephyr/pull/15335#issuecomment-482082539

It is not host, to my understanding for now, as data buffer is leaked in the controller. I still need to try updates to my test setup (i need to try adding a bt_conn_security call in my peripheral after bt_unpair scenario).

It is not related to #15335 as that code path is not (to be) exercised in peripheral. Initially while debugging I thought it could be a bug that memory is leaked to wrong pools in the controller (but then the bug has to be since the initial contribution, as not much has changed in the data/ctrl path implementation). But then again testing with Window 7 /Intel implementation, for the same disconnect reason the peripheral is not crashing, hence, I need to find the root cause/reproduction of the MIC failure scenario that leads to the leak.

joerchan commented 5 years ago

Case 1: Central and Peripheral try to establish encryption. Central has bond information for peripheral Peripheral has no bond information.

Sniffer log: (1 and 2) peripheral_no_bond_kernel_panic.pcapng

Problem 1: Encryption procedure is not terminated on peripheral side when sending reject ind (PIN OR KEY MISSING) function event_enc_prep is still using:

/* procedure request acked */ conn->llcp_ack = conn->llcp_req;

Peripheral initiates connection parameter update request Central accepts parameters Peripheral when it receives CONN_PARAM_UPD_RSP it disconnects with MIC error because this control PDU has "collided" with the encryption procedure.

The devices reconnect, but the peripheral crashes at this point, with TX buffer overflow. Most likely this happens when it tries to enqueue the Security Request. The peripheral previously had paused transmission of packets. When it disconnected the packet queue is likely full, and freeing of these packets was not successful.

Problem 2: Peripheral initiated the connection parameter request procedure while the encryption procedure was "ongoing". If the procedure was actually active the central would have disconnected the link at this point.

Case 2: Central and Peripheral try to establish encryption Both devices have bonding information. Peripheral disconnects the link with MIC failure after receiving LL_LENGTH_RSP during encryption procedure. Happens every time, devices are stuck in reconnect loop.

Problem 3: Central initiates Encryption procedure before it has finished it's previously initiated Length Update Procedure

Problem 4: Central Responds to peer initiated Length Procedure after it has sent LL_ENC_REQ PDU. The central shall only send Disconnect or Encryption related PDUs at this point.

Problem 5: Peripheral should have disconnected the link when receiving the Encryption request, before it has responded to the Length request.

Sniffer log: (3, 4 and 5) mic_error_disconnect_on_reconnect

joerchan commented 5 years ago

mic_failure_sniffer_traces.zip

cvinayak commented 5 years ago

@joerchan PR for problem one: #15727 . This will fix the MIC failure happening after reject_ext_ind for an encryption procedure.

The subsequent problem (2) of connection parameter request being sent is a bit weird because whats unset is pause_rx flag (due to the bug, missing reset of the flag), which should prevent CPR prepared/dispatch in the event_connection_prepare function.

Could you make a new issue or add to #15256, I can handle the fixes related to central (problem 3,4 and 5) in #15320 ?

carlescufi commented 5 years ago

@joerchan can you post a quick update on this one? Is it still reproducible?

pdunaj commented 5 years ago

any news?

joerchan commented 5 years ago

The problems related to control procedures has been fixed. The TX Buffer overflow problem I could not figure out why happened. I have no way to reproduce it anymore, could be that it was provoked by the other issues. I'd suggest to close this issue.

carlescufi commented 5 years ago

Let's close this issue unless @pdunaj can reproduce it on the latest master.

cvinayak commented 3 years ago

The problems related to control procedures has been fixed. The TX Buffer overflow problem I could not figure out why happened. I have no way to reproduce it anymore, could be that it was provoked by the other issues. I'd suggest to close this issue.

This fix #31813 can explain the Tx Buffer Overflow. The compiler instruction reordering depends on optimization and could change based on changes in the code. Also, refer to #30378