zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.51k stars 6.44k forks source link

BLE HID sample fails to reconnect on Windows 10 tablets #12362

Closed Olivier-ProGlove closed 5 years ago

Olivier-ProGlove commented 5 years ago

Describe the bug

BLE HID sample fails to reconnect on (some?) Windows 10 tablets. I only have two Windows 10 tablets (including the Windows Surface Pro 4) to check. I cannot duplicate the issue on a Windows 10 laptop (Lenovo laptop in my case), Linux and Android.

The two bluetooth controllers are (Windows 10 says I am using the latest drivers):

Marvell AVASTAR Bluetooth Radio Adapter
    Driver Date: 2/9/2017
    Driver Version: 15.68.9114.29

Realtek Semiconductor Driver
    Driver Date: 7/16/2015
    Driver Version: 1.0.183.3

The Surface Pro has USB slots. Connecting the ASUS BT-400 USB controller on the Surface Pro makes the BLE reconnection successful.

I only tested with Nordic nRF52 MCUs (particularly nRF52840) as a BLE peripheral.

To Reproduce

  1. Build and flash peripheral_hids on Nordic nRF52840 dev kit (PCA10056).
  2. Pair the Windows 10 tablet to the device (all good!)
  3. Reset the Nordic dev kit by pressing 'BOOT/RESET'

Expected behavior Windows 10 should automatically reconnect to the dev kit as it advertises on restart

I tested on Zephyr (from 4 days ago):

commit 41dfc4d478992fc83c9987b46d280e5442254a51
Author: Alistair Francis <alistair.francis@wdc.com>
Date:   Tue Dec 11 13:27:53 2018 -0800

    libc: riscv: Fix the RISC-V ifdef

I added logging traces in the bluetooth controller to get an idea of the issue:

diff --git a/subsys/bluetooth/controller/ll_sw/ctrl.c b/subsys/bluetooth/controller/ll_sw/ctrl.c
index 8f6a281..9fc24c5 100644
--- a/subsys/bluetooth/controller/ll_sw/ctrl.c
+++ b/subsys/bluetooth/controller/ll_sw/ctrl.c
@@ -3452,6 +3452,11 @@ isr_rx_conn_pkt(struct radio_pdu_node_rx *node_rx,
                                 * radio started tx */
                                LL_ASSERT(!radio_is_ready());

+                               LOG_ERR("-------- BT_HCI_ERR_TERM_DUE_TO_MIC_FAIL (%d && %d) || (%d && %d)",
+                                      _radio.conn_curr->enc_rx,
+                                      !radio_ccm_mic_is_valid(),
+                                      _radio.conn_curr->pause_rx,
+                                      isr_rx_conn_enc_unexpected(_radio.conn_curr, pdu_data_rx));
                                terminate_ind_rx_enqueue(_radio.conn_curr,
                                        BT_HCI_ERR_TERM_DUE_TO_MIC_FAIL);

@@ -3459,6 +3464,12 @@ isr_rx_conn_pkt(struct radio_pdu_node_rx *node_rx,
                                _radio.conn_curr = NULL;

                                return 1; /* terminated */
+                       } else {
+                               LOG_WRN("-------- else (%d && %d) || (%d && %d)",
+                                                                      _radio.conn_curr->enc_rx,
+                                                                      !radio_ccm_mic_is_valid(),
+                                                                      _radio.conn_curr->pause_rx,
+                                                                      isr_rx_conn_enc_unexpected(_radio.conn_curr, pdu_data_rx));
                        }

And here is the result:

***** Booting Zephyr OS zephyr-v1.13.0-2997-g92bb2db *****
Bluetooth initialized
Advertising successfully started
Main application running ...[00:00:00.005,889] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.005,920] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.005,920] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 1.13 Build 99
[00:00:00.006,195] <wrn> bt_hci_core: No ID address. Expecting one to come from storage.
[00:00:00.008,483] <inf> bt_hci_core: Identity: fc:c5:96:68:fa:98 (random)
[00:00:00.008,483] <inf> bt_hci_core: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1
[00:00:00.008,514] <inf> bt_hci_core: LMP: version 5.0 (0x09) subver 0xffff
Connected bc:83:85:0c:f3:ba (public)
advertise stopped
[00:05:42.527,069] <wrn> bt_ctlr_llsw_ctrl: -------- else (0 && 1) || (0 && 1)
[00:05:42.527,557] <wrn> bt_ctlr_llsw_ctrl: -------- else (0 && 1) || (0 && 1)
[00:05:42.587,097] <wrn> bt_ctlr_llsw_ctrl: -------- else (0 && 1) || (0 && 1)
[00:05:42.707,122] <wrn> bt_ctlr_llsw_ctrl: -------- else (0 && 1) || (0 && 1)
[00:05:43.127,258] <wrn> bt_ctlr_llsw_ctrl: -------- else (0 && 1) || (0 && 1)
[00:05:43.187,255] <wrn> bt_ctlr_llsw_ctrl: -------- else (0 && 1) || (0 && 1)
[00:05:43.187,835] <wrn> bt_ctlr_llsw_ctrl: -------- else (0 && 1) || (0 && 1)
[00:05:43.427,215] <wrn> bt_ctlr_llsw_ctrl: -------- else (0 && 1) || (0 && 1)
Security changed: bc:83:85:0c:f3:ba (public) level 2
[00:05:43.787,231] <wrn> bt_ctlr_llsw_ctrl: -------- else (0 && 1) || (0 && 1)
[00:05:43.907,257] <wrn> bt_ctlr_llsw_ctrl: -------- else (0 && 1) || (0 && 1)
[00:05:44.027,130] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (1 && 0)
[00:05:44.267,272] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:44.267,852] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:44.327,178] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:44.447,204] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:44.567,230] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:44.687,225] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:44.807,220] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:44.927,215] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:45.047,241] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:45.167,236] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:45.287,231] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:45.407,226] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:45.587,249] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:45.707,244] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:45.827,239] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:45.947,235] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:46.067,230] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:46.187,255] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:46.307,220] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:46.427,246] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:46.547,241] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:46.667,236] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:46.787,231] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:46.907,257] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:47.027,252] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:47.147,277] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:47.207,275] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:47.617,889] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:47.637,268] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:47.657,287] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:47.697,265] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:47.737,274] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:47.777,252] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:47.817,260] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:47.857,238] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:47.897,247] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:47.957,244] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.037,261] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.077,270] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.117,279] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.157,287] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.197,296] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.237,274] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.277,282] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.317,260] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.357,299] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.397,308] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.437,286] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.477,264] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.517,272] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.557,281] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.597,290] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.637,298] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.677,307] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.717,285] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.757,263] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.797,271] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.837,310] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.877,319] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
input notification changed 1
[00:05:48.917,297] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:48.977,294] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:49.017,272] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:49.057,312] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:49.077,270] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:49.227,294] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
[00:05:49.377,288] <wrn> bt_ctlr_llsw_ctrl: -------- else (1 && 0) || (0 && 1)
***** Booting Zephyr OS zephyr-v1.13.0-2997-g92bb2db *****
Bluetooth initialized
Advertising successfully started
Main application running ...[00:00:00.006,042] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.006,042] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.006,072] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 1.13 Build 99
[00:00:00.006,347] <wrn> bt_hci_core: No ID address. Expecting one to come from storage.
[00:00:00.009,033] <inf> bt_hci_core: Identity: fc:c5:96:68:fa:98 (random)
[00:00:00.009,033] <inf> bt_hci_core: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1
[00:00:00.009,033] <inf> bt_hci_core: LMP: version 5.0 (0x09) subver 0xffff
input notification changed 1
Connected bc:83:85:0c:f3:ba (public)
advertise stopped
input notification changed 0
Disconnected from bc:83:85:0c:f3:ba (public) (reason 61)
advertise restarted
[00:00:01.318,847] <wrn> bt_ctlr_llsw_ctrl: -------- else (0 && 1) || (0 && 1)
[00:00:01.319,885] <err> bt_ctlr_llsw_ctrl: -------- BT_HCI_ERR_TERM_DUE_TO_MIC_FAIL (0 && 1) || (1 && 1)
input notification changed 1
Connected bc:83:85:0c:f3:ba (public)
advertise stopped
input notification changed 0
Disconnected from bc:83:85:0c:f3:ba (public) (reason 61)
advertise restarted
[00:00:04.199,035] <wrn> bt_ctlr_llsw_ctrl: -------- else (0 && 1) || (0 && 1)
[00:00:04.199,523] <err> bt_ctlr_llsw_ctrl: -------- BT_HCI_ERR_TERM_DUE_TO_MIC_FAIL (0 && 1) || (1 && 1)
input notification changed 1
Connected bc:83:85:0c:f3:ba (public)
advertise stopped
input notification changed 0

I used the Nordic sniffer to try to get some insights. Here are the few Wireshark traces I manage to get:

Windows10TabletIssue-Wireshark.zip

carlescufi commented 5 years ago

@pdunaj @Qbicz @jhn-nordic have you seen MIC failure when using HID as well?

Qbicz commented 5 years ago

Hi @Olivier-ProGlove, we recently had some interoperability testing and there was indeed a problem with Surface Pro. Updating the Bluetooth drivers on Surface Pro helped a lot, and instead of failing most of the time it failed very rarely. I can later post an update what version of Surface it was.

wiminordic commented 5 years ago

@Olivier-ProGlove I had similar and other problems with Surface Pro (2017 version). Drivers causing issues:

Marvell AVASTAR Bluetooth Radio Adapter Driver Version: v15.68.9120.47

Updating Bluetooth drivers solved them. Working drivers:

Driver Date: 18/05/2018 Driver Version: 15.68.9125.57

Also updating Windows 10 from v10.0.16299 to v10.0.17134 helped with Bluetooth stability on this platform.

Olivier-ProGlove commented 5 years ago

Thanks a lot @Qbicz and @wiminordic for your valuable comments.

I found a way to force update on Windows 10 Surface Pro 4 tablet. Until now, I was relying on Microsoft update mechanism. I noticed I could update Surface Pro drivers manually using this page: https://www.microsoft.com/en-us/download/details.aspx?id=49498

The tablet now uses OS Build 17763.195 (19/12/2018) and the Marvell AVASTAR driver 15.68.9125.57.

I am still testing Zephyr commit 41dfc4d478992fc83c998. And it still does not work with the same MIC issue Disconnected from bc:83:85:0c:f3:ba (public) (reason 61).

Which version of Zephyr have you tested?

wiminordic commented 5 years ago

The fw-nrfconnect-zephyr v1.13.99-ncs2 release: https://github.com/NordicPlayground/fw-nrfconnect-zephyr/releases/tag/v1.13.99-ncs2

wiminordic commented 5 years ago

I made a quick retest with latest fw-nrfconnect-zephyr. Resetting the HID device while connected to Surface Pro does indeed lead to Surface being unable to connect to the zephyr HID device even with updated Windows 10 and drivers.

carlescufi commented 5 years ago

@wiminordic @Qbicz so is this a Surface Pro issue?

wiminordic commented 5 years ago

@carlescufi It is hard to say without detailed debugging. I have witnessed a lot of issues with Windows 10 platforms. Many of them were solved by updating drivers but not all of them (like this one). Windows 8 Surface 2 was getting BSOD on Bluetooth reconnection to Zephyr based HID device. So in general there is some instability there, but it could be caused by incorrect operation on Zephyr side.

wiminordic commented 5 years ago

I made another retest with the zephyr from the fw-nrfconnect-zephyr v1.13.99-ncs2 release. It seems to work better than latest fw-nrfconnect-zephyr. Issue was reproduced few times but at least several runs were successful (I was able to reconnect to Zephyr HID after it was reset).

Edit: Zephyr version seems to have no impact after all, latest is also sometimes working.

jhn-nordic commented 5 years ago

Could it be the default 2 Mbit phy that causes issues? Just a wild guess, but that is a "new" feature

Få Outlook for Androidhttps://aka.ms/ghei36


From: wiminordic notifications@github.com Sent: Wednesday, January 9, 2019 4:15:47 PM To: zephyrproject-rtos/zephyr Cc: Nistad, Jon Helge; Mention Subject: Re: [zephyrproject-rtos/zephyr] BLE HID sample fails to reconnect on Windows 10 tablets (#12362)

I made another retest with the zephyr from the fw-nrfconnect-zephyr v1.13.99-ncs2 release. It seems to work better than latest fw-nrfconnect-zephyr. Issue was reproduced few times but at least several runs were successful (I was able to reconnect to Zephyr HID after it was reset).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/zephyrproject-rtos/zephyr/issues/12362#issuecomment-452732034, or mute the threadhttps://github.com/notifications/unsubscribe-auth/Ai0HXAxW8hkp2o-eLgYvdu6gxYqGAX40ks5vBgejgaJpZM4Z0BUr.

Olivier-ProGlove commented 5 years ago

We have a BLE HID third-party commercial product that works well with the same Windows 10 Surface 4 Pro (using any version of the driver). The device can reconnect to Windows 10 Tablet without any issue.

So it could also be a (legacy?) feature not supported by Zephyr or an advanced feature supported by Zephyr but not well supported by the Windows 10 driver.

Olivier-ProGlove commented 5 years ago

@carlescufi @wiminordic @jhn-nordic I have just tested examples/ble_peripheral/ble_app_hids_keyboard/pca10056/s140/armgcc from Nordic SDK 15.2 on Nordic nRF52840 dev kit (PCA10056) and the reconnection works fine.

So the issue is likely to be Zephyr specific.

carlescufi commented 5 years ago

@Olivier-ProGlove maybe it would be helpful if you could reproduce with BT_DEBUG_SETTINGS and BT_DEBUG_SMP so that we can see what key is being loaded. Based on that I think @jhedberg and @sjanc might be able to help debug the issue. If that doesn't help then @joerchan may be able to take a look once he's done with what he's doing now.

Olivier-ProGlove commented 5 years ago

Thanks @carlescufi I have just done the logs as you asked. I am using the latest Zephyr:

commit b72fb4e9e80750ec520c1628b59713ca5c0cb245
Author: Sebastian Bøe <sebastian.boe@nordicsemi.no>
Date:   Wed Dec 19 16:56:57 2018 +0100

    cmake: Remove dead code 'FindRegex.cmake'

I added the two configuration settings BT_DEBUG_SETTINGS and BT_DEBUG_SMP. I removed the passkey to make it faster to duplicate the issue:

$ git diff
diff --git a/samples/bluetooth/peripheral_hids/prj.conf b/samples/bluetooth/peripheral_hids/prj.conf
index bf907c5..38f6126 100644
--- a/samples/bluetooth/peripheral_hids/prj.conf
+++ b/samples/bluetooth/peripheral_hids/prj.conf
@@ -16,3 +16,7 @@ CONFIG_FLASH_MAP=y
 CONFIG_FCB=y
 CONFIG_SETTINGS=y
 CONFIG_SETTINGS_FCB=y
+
+CONFIG_BT_DEBUG=y
+CONFIG_BT_DEBUG_SETTINGS=y
+CONFIG_BT_DEBUG_SMP=y
diff --git a/samples/bluetooth/peripheral_hids/src/main.c b/samples/bluetooth/peripheral_hids/src/main.c
index 194a071..84cc0b0 100644
--- a/samples/bluetooth/peripheral_hids/src/main.c
+++ b/samples/bluetooth/peripheral_hids/src/main.c
@@ -118,7 +118,7 @@ static void auth_cancel(struct bt_conn *conn)
 }

 static struct bt_conn_auth_cb auth_cb_display = {
-       .passkey_display = auth_passkey_display,
+       .passkey_display = NULL, //auth_passkey_display,
        .passkey_entry = NULL,
        .cancel = auth_cancel,
 };

Here are the logs for Android Samsung J5 (no problem on this Android Phone) and Microsoft Windows Surface Pro 4 (reconnection problem on this platform)

carlescufi commented 5 years ago

@Olivier-ProGlove can you also enable BT_DEBUG_KEYS and re-run on both Android and the Windows 10 tablet? I have a feeling there must be a failed allocation or match somewhere.

Olivier-ProGlove commented 5 years ago

@carlescufi Here are the log with BT_DEBUG_KEYS:

... Still nothing obvious.

Last week I tried to dump more BT_DEBUG_* logs but I could not highlight any obvious differences that could help me (... with my limited bluetooth expertise).

I am working on dumping the HCI commands/events on UART1 to see if I can see something (and avoid packet losing due to logging subsystem)...

carlescufi commented 5 years ago

@Olivier-ProGlove what would be good is to print the LTK that is being stored in flash and then loaded and used with LTK_REQUEST_REPLY, maybe using printk to avoid log allocation failures. I hope that maybe @jhedberg can give you a pointer on where to add the printk statements, I am not that familiar with the settings code. @jhedberg could you take a quick look at the 2 last logs here? MIC failure means that we loaded an invalid LTK. I wonder if Android is using privacy and the Windows 10 tablet is not.

jhedberg commented 5 years ago

@carlescufi I don't see anything obvious in the logs. It'd be super helpful to get the HCI logs to see the exact content of the exchanged SMP PDUs. Is is possible Windows10 lacks the bonding flag, i.e. does no-bonding, but Soft Device ignores it and holds on to the LTK?

carlescufi commented 5 years ago

@jhedberg but Windows 10 is the central, so why would it initiate encryption if it didn't have keys?

jhedberg commented 5 years ago

@carlescufi my speculation/theory is that it does have the keys. It's just that it didn't set the bonding flag, which causes Zephyr to discard the keys.

jhedberg commented 5 years ago

@Olivier-ProGlove could you re-run this against the Windows 10 tablet with the following patch applied:

--- a/subsys/bluetooth/host/smp.c
+++ b/subsys/bluetooth/host/smp.c
@@ -2436,6 +2436,9 @@ static u8_t smp_pairing_req(struct bt_smp *smp, struct net_buf *buf)
        if ((rsp->auth_req & BT_SMP_AUTH_BONDING) &&
            (req->auth_req & BT_SMP_AUTH_BONDING)) {
                atomic_set_bit(smp->flags, SMP_FLAG_BOND);
+               BT_INFO("Bonding");
+       } else {
+               BT_INFO("No-bonding");
        }

        atomic_set_bit(smp->flags, SMP_FLAG_PAIRING);

That should then help prove/disprove my bonding flag theory.

Olivier-ProGlove commented 5 years ago

No changes between Android and Window 10 Surface. I also printed the value of req->auth_req and they use the same value: Bonding auth_req:0x2d

Full logs: Android.log Windows10Tablet.log

I continue to work on my HCI dump. I will post it later today... Feel free to request me additional logs in the meantime.

jhedberg commented 5 years ago

I also printed the value of req->auth_req and they use the same value: Bonding auth_req:0x2d

Ok, so we can disregard my no-bonding theory then. Next step would be the HCI logs. Anyone know the link to the documentation page the explains how to enable this? On a high-level you need to disable the stdout console and CONFIG_BT_DEBUG_LOG, enable CONFIG_BT_DEBUG_MONITOR, and then on the Linux PC side you need to use "btmon --tty /dev/ttyACM0" (or whatever is your Zephyr console) to capture the logs.

Olivier-ProGlove commented 5 years ago

What I am doing for HCi dump is to redirect HCI sent/receive/acl commands (printed in hexadecimal) on the second UART of Nordic Dev Kit. It works fine. I am doing a basic formatting to help you and I am done to send you the logs.

What I have at the moment is something like that:

RCV:LE Ext Advertising Report 3e0d050000f830b13e5c315a7d32f4
SENT:LE LTK REQ NEG Reply 1b20020000
RCV:Disconnection Complete 05040000003d
SENT:LE Set Adv Enabled 0a200101
RCV:LE Scan Req Received 3e13010000000101a5818b42f04f18000000480001
SENT:LE Set Adv Enabled 0a200101
SENT:opcode:0x2016 1620020000
RCV:LE PHY Update Complete 3e0c040000005d00000000000000
RCV:LE Ext Advertising Report 3e0d050000f830b13e5c315a7d32f4
SENT:LE LTK REQ NEG Reply 1b20020000
RCV:Disconnection Complete 05040000003d
SENT:LE Set Adv Enabled 0a200101
RCV:LE Scan Req Received 3e13010000000101a5818b42f04f18000000480001
SENT:LE Set Adv Enabled 0a200101
SENT:opcode:0x2016 1620020000
RCV:LE PHY Update Complete 3e0c040000005d00000000000000
RCV:LE Ext Advertising Report 3e0d050000f830b13e5c315a7d32f4
SENT:LE LTK REQ NEG Reply 1b20020000
RCV:Disconnection Complete 05040000003d
SENT:LE Set Adv Enabled 0a200101

Is it good enough?

Olivier-ProGlove commented 5 years ago

@jhedberg Here are the logs and HCI logs for Android and Windows 10 Surface Pro Tablet:

Android.log Android-HCI.log Windows10Tablet.log Windows10Tablet-HCI.log

Note: The hexadecimal after the opcode is the data (without the size).

Olivier-ProGlove commented 5 years ago

@jhedberg I had a look to BT_DEBUG_MONITOR and got it working. Usual logs:

Android-btmon.log Windows10Tablet-btmon.log

jhedberg commented 5 years ago

The latest logs do confirm the MIC failure:

< HCI Command: LE Long Term Key Request Reply (0x08|0x001a) plen 18
        Handle: 0
        Long term key: bb4a893db7561aebd12748868b164a32
> HCI Event: Command Complete (0x0e) plen 6
      LE Long Term Key Request Reply (0x08|0x001a) ncmd 1
        Status: Success (0x00)
        Handle: 0
> HCI Event: Disconnect Complete (0x05) plen 4
        Status: Success (0x00)
        Handle: 0
        Reason: Connection Terminated due to MIC Failure (0x3d)

However, I don't see anything wrong with this on our side - the LTK value is the same that was successfully used for the first connection:

< HCI Command: LE Long Term Key Request Reply (0x08|0x001a) plen 18
        Handle: 0
        Long term key: bb4a893db7561aebd12748868b164a32
> HCI Event: Command Complete (0x0e) plen 6
      LE Long Term Key Request Reply (0x08|0x001a) ncmd 1
        Status: Success (0x00)
        Handle: 0
> HCI Event: Encryption Change (0x08) plen 4
        Status: Success (0x00)
        Handle: 0
        Encryption: Enabled with AES-CCM (0x01)
Olivier-ProGlove commented 5 years ago

We have another device that does dual boot Android and Windows 10 (CHUWI Hi10 Pro) that has a similar issue (ie: Works fine for Android but MIC Failure on reconnection with Windows 10).

Here are the logs for this platform: Chuwi-Android.log Chuwi-Android-btmon.log Chuwi-Windows.log Chuwi-Windows-btmon.log

Something I already noticed for the previous case (Samsung Android phone and Windows 10 Tablet) and I noticed again with the Chuwi tablet:

On Android it uses a Random Peer Address type:

> HCI Event: LE Meta Event (0x3e) plen 19
      LE Connection Complete (0x01)
        Status: Success (0x00)
        Handle: 0
        Role: Slave (0x01)
        Peer address type: Random (0x01)
        Peer address: 7A:21:46:B7:D7:30 (Resolvable)
          Identity type: Public (0x00)
          Identity: 0B:AE:4B:5D:83:88 (OUI 0B-AE-4B)
        Connection interval: 45.00 msec (0x0024)
        Connection latency: 0 (0x0000)
        Supervision timeout: 20000 msec (0x07d0)
        Master clock accuracy: 0x00

While on Windows 10, it always uses a Public Peer Address type:

> HCI Event: LE Meta Event (0x3e) plen 19
      LE Connection Complete (0x01)
        Status: Success (0x00)
        Handle: 0
        Role: Slave (0x01)
        Peer address type: Public (0x00)
        Peer address: 88:83:5D:D4:AE:00 (OUI 88-83-5D)
        Connection interval: 15.00 msec (0x000c)
        Connection latency: 0 (0x0000)
        Supervision timeout: 2000 msec (0x00c8)
        Master clock accuracy: 0x00
Olivier-ProGlove commented 5 years ago

@jhedberg more logs. As I mentioned in my initial issue message when using the USB ASUS BT-400 controller on the Windows Surface Pro 4 (the Windows10 Tablet that fails the reconnection test) I can successfully connect.

Here are the btmon log for the USB ASUS BT-400 controller on the Windows Surface Pro 4: Windows10Tablet-ASUS-btmon.log

The observation I made in my previous message is not valid anymore. The ASUS BT-400 also uses a public address:

> HCI Event: LE Meta Event (0x3e) plen 19                         #43 42.970300
      LE Connection Complete (0x01)
        Status: Success (0x00)
        Handle: 0
        Role: Slave (0x01)
        Peer address type: Public (0x00)
        Peer address: 5C:F3:70:8A:A6:97 (CC&C Technologies, Inc)
        Connection interval: 60.00 msec (0x0030)
        Connection latency: 0 (0x0000)
        Supervision timeout: 9600 msec (0x03c0)
        Master clock accuracy: 0x05

When comparing the btmon log between the default bluetooth controller and the ASUS one on the same tablet, I can see these two commands being sent during pairing with the default controller while there are not sent for the ASUS controller:

Olivier-ProGlove commented 5 years ago

Something else > HCI Event: Encryption Change (0x08) is not received on the reconnection on the broken Windows 10 Bluetooth controllers (Windows Surface Pro and Chuwi controllers).

Could it be Zephyr do not update the encryption algorithms on the reconnection?

Olivier-ProGlove commented 5 years ago

@carlescufi @jhedberg I am still looking at this issue and pushing towards this issue is due to the 'missing' > HCI Event: Encryption Change on reconnection with Windows 10 (while this HCI event is present on Android devices).

I printed the values of:

struct bt_conn {
    ...
#if defined(CONFIG_BT_SMP) || defined(CONFIG_BT_BREDR)
    bt_security_t       sec_level;
    bt_security_t       required_sec_level;
    u8_t            encrypt;
#endif /* CONFIG_BT_SMP || CONFIG_BT_BREDR */
    ...
};

after the first pairing and the failing reconnection.

Could the fact Zephyr do not not 'restore' the previous security level and encryption status be the reason of the BT_HCI_ERR_TERM_DUE_TO_MIC_FAIL?

I tried to add a dirty hack to workaround this 'missing' > HCI Event: Encryption Change but it still did not work. But it might due to the way I did it...

I added this code to le_ltk_request() to simulate the call to hci_encrypt_change():

--- a/subsys/bluetooth/host/hci_core.c
+++ b/subsys/bluetooth/host/hci_core.c
@@ -2959,9 +2959,26 @@ static void le_ltk_request(struct net_buf *buf)

        le_ltk_neg_reply(evt->handle);

-done:
+done: {
+
+#if ON_RECONNECTION
+       struct bt_hci_evt_encrypt_change evt = {
+               .status = 0,
+               .handle = handle,
+               .encrypt = 1,
+       };
+       struct net_buf hack_buf = { 0 };
+       hack_buf.data = &evt;
+
+       hci_encrypt_change(&hack_buf);
+#endif /* CONFIG_BT_SMP || CONFIG_BT_BREDR */
+
        bt_conn_unref(conn);
 }
+}
 #endif /* CONFIG_BT_SMP */

I booted a first time without this hack for pairing the device and then enable this hack and flash the firmware to see if it fixed my issue. I could confirm it switched to sec_level:2 required_sec_level:2 encrypt:1 but I still had BT_HCI_ERR_TERM_DUE_TO_MIC_FAIL. It could be a secret need to be negociated while calling > HCI Event: Encryption Change.

Any comment on the 'missing' > HCI Event: Encryption Change or the hack I did? Any suggestion for me to try?

FrankGorgenyi commented 5 years ago

@Olivier-ProGlove can you provide a trace of it working as expected and one failing. Your air traces collected while Windows traces are also recording would help us analyze and provide recommendations to you.

oliviermartin commented 5 years ago

Thanks a lot @FrankGorgenyi for jumping in this thread :smiley: :smiley: :smiley:

I actually think I found the issue this afternoon (German time). I will try to get more information tomorrow to know exactly if it a Zephyr issue or a Surface Pro Bluetooth controller issue. But it is likely that the issue (if my theory is confirmed) could be mitigated by the Surface Pro Bluetooth controller.

Here is the Start Encryption flow from Bluetooth Specification v5.

screenshot from 2019-01-30 22-33-44

From what I observed, Zephyr Link Layer (LL B on the diagram) sends LL_ENC_RSP too late for Surface Pro Bluetooth controller (LL A on the diagram) as I received the MIC Failure termination disconnection event with from Host A before Zephyr had time to send LL_ENC_RSP.

I noticed Zephyr has a Fast Encryption Setup - BT_CTLR_FAST_ENC option (added by @cvinayak) I guess to solve this issue. Turning on this option allows to reconnect to Windows host after few reconnection tentative. The few tests I did with this option, I needed more than 10 reconnection tentative to successfully reconnect (30+ seconds after the lost of connection). So the 'fast encryption setup' is still too slow.

My next step is too check the timing to know if it is Zephyr who cannot cope with the controller BLE intervals or if it is the Surface Pro controller that does not respect the timings (ie: timeout too early).

A mitigation would be to negociate the interval before Starting the encryption. I do not know if it is feasible per Bluetooth specification or to increase the Connection interval on reconnection. Or maybe a better approach in Windows 10 would be to take the latest negociated Connection interval on reconnection.

Few connection parameters on reconnection (before being renegociated):

> HCI Event: LE Meta Event (0x3e) plen 19
      LE Connection Complete (0x01)
        Status: Success (0x00)
        Handle: 0
        Role: Slave (0x01)
        Peer address type: Random (0x01)
        Peer address: 67:F9:07:56:A0:D5 (Resolvable)
          Identity type: Public (0x00)
          Identity: 48:27:EA:7D:26:7B (OUI 48-27-EA)
        Connection interval: 48.75 msec (0x0027)
        Connection latency: 0 (0x0000)
        Supervision timeout: 5000 msec (0x01f4)
        Master clock accuracy: 0x01
> HCI Event: LE Meta Event (0x3e) plen 19
      LE Connection Complete (0x01)
        Status: Success (0x00)
        Handle: 0
        Role: Slave (0x01)
        Peer address type: Public (0x00)
        Peer address: BC:83:85:0C:F3:BA (OUI BC-83-85)
        Connection interval: 15.00 msec (0x000c)
        Connection latency: 0 (0x0000)
        Supervision timeout: 2000 msec (0x00c8)
        Master clock accuracy: 0x01
> HCI Event: LE Meta Event (0x3e) plen 19
      LE Connection Complete (0x01)
        Status: Success (0x00)
        Handle: 0
        Role: Slave (0x01)
        Peer address type: Random (0x01)
        Peer address: 7A:21:46:B7:D7:30 (Resolvable)
          Identity type: Public (0x00)
          Identity: 0B:AE:4B:5D:83:88 (OUI 0B-AE-4B)
        Connection interval: 45.00 msec (0x0024)
        Connection latency: 0 (0x0000)
        Supervision timeout: 20000 msec (0x07d0)
        Master clock accuracy: 0x00
> HCI Event: LE Meta Event (0x3e) plen 19                         #322 6.027500
      LE Connection Complete (0x01)
        Status: Success (0x00)
        Handle: 0
        Role: Slave (0x01)
        Peer address type: Public (0x00)
        Peer address: 88:83:5D:D4:AE:00 (OUI 88-83-5D)
        Connection interval: 15.00 msec (0x000c)
        Connection latency: 0 (0x0000)
        Supervision timeout: 2000 msec (0x00c8)
        Master clock accuracy: 0x00

You could notice Chuwi in Windows mode (dual boot table) uses the same parameters as the MS Surface Pro bluetooth controller. I also noticed when I use the ASUS USB BT-400 controller on the MS Surface Pro I also have the same connection interval. But Nordic's nRF52840 dev kit with Zephyr can successfully connect on the Surface Pro with the ASUS USB BT-400 controller while it cannot with SurfacePro controller and Chuwi controller in Windows mode.

I am quite keen to try the Windows Bluetooth trace tool (first time I hear about it). And from what I understand I can capture radio info too :+1:

Olivier-ProGlove commented 5 years ago

More data/timing on latest Zephyr (33dae59a5 - Sat Jan 26 - pre West mess...). I added two LOG_INF() statements - that deferred printing - so less invasive:

--- a/subsys/bluetooth/controller/ll_sw/ctrl.c
+++ b/subsys/bluetooth/controller/ll_sw/ctrl.c
@@ -1866,6 +1866,7 @@ static inline u32_t isr_rx_conn_pkt_ack(struct pdu_data *pdu_data_tx,
                break;

        case PDU_DATA_LLCTRL_TYPE_ENC_RSP:
+               LOG_INF("ENC_RSP");
                /* pause data packet tx */
                _radio.conn_curr->pause_tx = 1U;
                break;
@@ -2575,6 +2576,7 @@ isr_rx_conn_pkt_ctrl(struct radio_pdu_node_rx *node_rx, u8_t *rx_enqueue)

 #if defined(CONFIG_BT_CTLR_LE_ENC)
        case PDU_DATA_LLCTRL_TYPE_ENC_REQ:
+               LOG_INF("ENC_REQ");
                if (!_radio.conn_curr->role ||
                    !pdu_len_cmp(PDU_DATA_LLCTRL_TYPE_ENC_REQ,
                                 pdu_data_rx->len)) {

Here are the results of my testing on few devices:

WindowsSurfacePro

[00:02:01.028,228] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:02:01.088,592] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 60ms between both packet - DELTA with ENC_REQ
[00:00:12.760,620] <inf> main: Connected  
[00:00:12.762,084] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:12.793,060] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:12.799,530] <inf> main: Disconnected from  (reason 61) // 6ms - DELTA with ENC_REQ
[00:00:16.065,826] <inf> main: Connected 
[00:00:16.067,474] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:16.092,895] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:16.094,055] <inf> main: Disconnected from  (reason 61) // 2ms
[00:00:18.413,818] <inf> main: Connected  
[00:00:18.415,222] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:18.433,013] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:18.448,547] <inf> main: Disconnected from  (reason 61) // 15ms
[00:00:25.130,920] <inf> main: Connected  
[00:00:25.132,446] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:25.169,342] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:25.170,532] <inf> main: Disconnected from  (reason 61) // 1ms
[00:00:28.023,437] <inf> main: Connected 
[00:00:28.025,299] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:28.049,591] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:28.050,781] <inf> main: Disconnected from  (reason 61) // 1ms

WindowsSurfacePro + ASUS BT-400

[00:01:48.401,977] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:01:48.521,789] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 120ms
[00:00:12.422,637] <inf> main: Connected  
[00:00:12.424,377] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:12.445,129] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:12.489,929] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 45ms
[00:00:12.505,187] <inf> main: Security changed: tW level 2

[00:00:01.936,584] <inf> main: Connected 
[00:00:01.938,629] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:01.960,601] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:02.005,401] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 45ms
[00:00:02.020,660] <inf> main: Security changed: tW level 2

[00:00:05.929,077] <inf> main: Connected 
[00:00:05.930,633] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:05.956,481] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:06.001,312] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 45ms
[00:00:06.016,540] <inf> main: Security changed: tW level 2

Samsung J5 Android Phone

[00:00:59.164,794] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:59.179,626] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 15ms
[00:00:01.506,317] <inf> main: Connected  
[00:00:01.508,178] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:01.747,619] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:01.844,940] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 100ms
[00:00:01.893,920] <inf> main: Security changed: tW level 2

[00:00:05.254,547] <inf> main: Connected  
[00:00:05.256,134] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:05.477,111] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:05.574,432] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 100ms
[00:00:05.623,413] <inf> main: Security changed: tW level 2

[00:00:03.912,506] <inf> main: Connected  
[00:00:03.913,909] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:04.121,429] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:04.218,750] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 100ms
[00:00:04.267,761] <inf> main: Security changed: tW level 2

Chuwi Android

[00:01:12.933,227] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:01:12.941,131] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 8ms
[00:00:06.842,956] <inf> main: Connected  
[00:00:06.844,390] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:06.912,994] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:06.957,916] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 45ms
[00:00:07.002,593] <inf> main: Security changed: tW level 2

[00:00:02.991,241] <inf> main: Connected 
[00:00:02.993,103] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:03.061,096] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:03.106,018] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 45ms
[00:00:03.195,678] <inf> main: Security changed: tW level 2

[00:00:09.367,492] <inf> main: Connected  
[00:00:09.368,988] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:09.438,293] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:09.483,184] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 45ms
[00:00:09.527,862] <inf> main: Security changed: tW level 2

[00:00:04.249,267] <inf> main: Connected 
[00:00:04.250,854] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:04.323,822] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:04.368,713] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 45ms
[00:00:04.413,391] <inf> main: Security changed: tW level 2

Chuwi Windows

[00:00:26.104,980] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:26.165,344] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 60ms
[00:00:26.225,036] <inf> main: Security changed: tW level 2
[00:00:07.797,149] <inf> main: Connected  
[00:00:07.799,011] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:07.821,533] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:07.836,425] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 15ms
[00:00:07.851,104] <inf> main: Security changed: tW level 2

[00:00:03.200,805] <inf> main: Connected  
[00:00:03.202,575] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:03.227,142] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:03.242,034] <inf> bt_ctlr_llsw_ctrl: ENC_RSP // 15ms
[00:00:03.256,713] <inf> main: Security changed: tW level 2
[00:00:04.058,319] <inf> main: Connected 
[00:00:04.060,028] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:04.099,029] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:04.118,133] <inf> main: Disconnected from  (reason 61) // 19ms
[00:00:07.465,789] <inf> main: Connected 
[00:00:07.467,041] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:07.482,330] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:07.497,863] <inf> main: Disconnected from  (reason 61) // 15ms
[00:00:16.855,926] <inf> main: Connected  
[00:00:16.857,666] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:16.872,497] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:16.903,045] <inf> main: Disconnected from  (reason 61) // 31ms
[00:00:20.048,919] <inf> main: Connected  
[00:00:20.050,415] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 0.
[00:00:20.060,089] <inf> bt_ctlr_llsw_ctrl: ENC_REQ
[00:00:20.105,621] <inf> main: Disconnected from  (reason 61) // 45ms

Windows Trace

Here are the Windows Trace you asked me:

carlescufi commented 5 years ago

@cvinayak @joerchan

Command disallowed to an LTK Request Response. Any clues?

> HCI Event: LE Meta Event (0x3e) plen 19                         #322 6.027500
      LE Connection Complete (0x01)
        Status: Success (0x00)
        Handle: 0
        Role: Slave (0x01)
        Peer address type: Public (0x00)
        Peer address: 88:83:5D:D4:AE:00 (OUI 88-83-5D)
        Connection interval: 15.00 msec (0x000c)
        Connection latency: 0 (0x0000)
        Supervision timeout: 2000 msec (0x00c8)
        Master clock accuracy: 0x00
< HCI Command: LE Set Advertise Enable (0x08|0x000a) plen 1       #323 6.030600
        Advertising: Enabled (0x01)
> HCI Event: Command Complete (0x0e) plen 4                       #324 6.032000
      LE Set Advertise Enable (0x08|0x000a) ncmd 1
        Status: Memory Capacity Exceeded (0x07)
< HCI Command: LE Read Remote Used Features (0x08|0x0016) plen 2  #325 6.039200
        Handle: 0
> HCI Event: Command Status (0x0f) plen 4                         #326 6.040700
      LE Read Remote Used Features (0x08|0x0016) ncmd 1
        Status: Success (0x00)
< ACL Data TX: Handle 0 flags 0x00 dlen 6                         #327 6.042300
      SMP: Security Request (0x0b) len 1
        Authentication requirement: Bonding, No MITM, SC, No Keypresses (0x09)
> HCI Event: LE Meta Event (0x3e) plen 13                         #328 6.044400
      LE Long Term Key Request (0x05)
        Handle: 0
        Random number: 0x0000000000000000
        Encrypted diversifier: 0x0000
< HCI Command: LE Long Term Key Request... (0x08|0x001a) plen 18  #329 6.046900
        Handle: 0
        Long term key: 79b3099dc804728dfc3a77ef6317f11d
> HCI Event: Command Complete (0x0e) plen 6                       #330 6.049900
      LE Long Term Key Request Reply (0x08|0x001a) ncmd 1
        Status: Command Disallowed (0x0c)
        Handle: 0
> HCI Event: Disconnect Complete (0x05) plen 4                    #331 6.071400
        Status: Success (0x00)
        Handle: 0
        Reason: Connection Terminated due to MIC Failure (0x3d)
carlescufi commented 5 years ago

@cvinayak @joerchan could you take a look at the Wireshark traces? https://github.com/zephyrproject-rtos/zephyr/files/2733871/Windows10TabletIssue-Wireshark.zip

carlescufi commented 5 years ago

@joerchan suggests the issue might be that:

  1. Slave send LL_SLAVE_FEATURE_REQ on packet 24
  2. Master sends LL_ENC_REQ on packet 29
  3. Master replies to the LL_SLAVE_FEATURE_REQ with LL_UNKNOWN_RSP on packet 31
  4. Slave disconnects (i.e. stops listening to master packets)
  5. Master keeps re-sending the LL_UNKNOWN_RSP

The issue seems to be that the master should respond to the feature request first and then send the encryption request, so it looks like a bug in the Windows tablet controller.

image

joerchan commented 5 years ago

"If at any time during the encryption start procedure, the Link Layer of the master or the slave receives an unexpected Data Channel PDU from the peer Link Layer, it shall immediately exit the Connection State, and shall transition to the Standby State. The Host shall be notified that the link has been disconnected with the error code Connection Terminated Due to MIC Failure (0x3D)"

You can see in the HCI log that there is a disconnect event with this error code. So that confirms the behaviour of the slave. I do however think that the slave should not respond with command disallowed before the disconnect event., i.e until the disconnect event is delivered, the command should be able to complete with success.

Olivier-ProGlove commented 5 years ago

I took me a while to get the right trace. To be short, while it seems to work much better there might still be an issue.

Environment of test:

Here are the log files: 2ndReconnectionFail.zip

Here are part of the logs for the respective reconnection:

First reconnection (successful)

< HCI Command: LE Set Advertise Enable (0x08|0x000a) plen 1       #416 9.306600
        Advertising: Enabled (0x01)
> HCI Event: Command Complete (0x0e) plen 4                       #417 9.308000
      LE Set Advertise Enable (0x08|0x000a) ncmd 1
        Status: Connection Limit Exceeded (0x09)
< HCI Command: LE Read Remote Used Features (0x08|0x0016) plen 2  #418 9.314900
        Handle: 0
> HCI Event: Command Status (0x0f) plen 4                         #419 9.316400
      LE Read Remote Used Features (0x08|0x0016) ncmd 1
        Status: Success (0x00)
< ACL Data TX: Handle 0 flags 0x00 dlen 6                         #420 9.318100
      SMP: Security Request (0x0b) len 1
        Authentication requirement: Bonding, MITM, SC, No Keypresses (0x0d)
> HCI Event: LE Meta Event (0x3e) plen 13                         #421 9.337900
      LE Long Term Key Request (0x05)
        Handle: 0
        Random number: 0x0000000000000000
        Encrypted diversifier: 0x0000
> HCI Event: Number of Completed Packets (0x13) plen 5            #422 9.340400
        Num handles: 1
        Handle: 0
        Count: 1
< HCI Command: LE Long Term Key Request... (0x08|0x001a) plen 18  #423 9.342200
        Handle: 0
        Long term key: 63c0ac41ed7816a2b7f1135a378d3849
> HCI Event: Command Complete (0x0e) plen 6                       #424 9.345200
      LE Long Term Key Request Reply (0x08|0x001a) ncmd 1
        Status: Success (0x00)
        Handle: 0
> HCI Event: LE Meta Event (0x3e) plen 12                         #425 9.347100
      LE Read Remote Used Features (0x04)
        Status: Unsupported Remote Feature / Unsupported LMP Feature (0x1a)
        Handle: 0
        Features: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> HCI Event: Encryption Change (0x08) plen 4                      #426 9.370500
        Status: Success (0x00)
        Handle: 0
        Encryption: Enabled with AES-CCM (0x01)

1streconnection

Second reconnection (fail with MIC Failure)

> HCI Event: LE Meta Event (0x3e) plen 19                         #477 4.208800
      LE Connection Complete (0x01)
        Status: Success (0x00)
        Handle: 0
        Role: Slave (0x01)
        Peer address type: Public (0x00)
        Peer address: BC:83:85:0C:F3:BA (OUI BC-83-85)
        Connection interval: 15.00 msec (0x000c)
        Connection latency: 0 (0x0000)
        Supervision timeout: 2000 msec (0x00c8)
        Master clock accuracy: 0x01
< HCI Command: LE Set Advertise Enable (0x08|0x000a) plen 1       #478 4.211900
        Advertising: Enabled (0x01)
> HCI Event: Command Complete (0x0e) plen 4                       #479 4.213300
      LE Set Advertise Enable (0x08|0x000a) ncmd 1
        Status: Connection Limit Exceeded (0x09)
< HCI Command: LE Read Remote Used Features (0x08|0x0016) plen 2  #480 4.220800
        Handle: 0
> HCI Event: Command Status (0x0f) plen 4                         #481 4.222300
      LE Read Remote Used Features (0x08|0x0016) ncmd 1
        Status: Success (0x00)
< ACL Data TX: Handle 0 flags 0x00 dlen 6                         #482 4.224000
      SMP: Security Request (0x0b) len 1
        Authentication requirement: Bonding, MITM, SC, No Keypresses (0x0d)
> HCI Event: LE Meta Event (0x3e) plen 13                         #483 4.240200
      LE Long Term Key Request (0x05)
        Handle: 0
        Random number: 0x0000000000000000
        Encrypted diversifier: 0x0000
< HCI Command: LE Long Term Key Request... (0x08|0x001a) plen 18  #484 4.242700
        Handle: 0
        Long term key: 63c0ac41ed7816a2b7f1135a378d3849
> HCI Event: Command Complete (0x0e) plen 6                       #485 4.245700
      LE Long Term Key Request Reply (0x08|0x001a) ncmd 1
        Status: Success (0x00)
        Handle: 0
> HCI Event: Disconnect Complete (0x05) plen 4                    #486 4.256400
        Status: Success (0x00)
        Handle: 0
        Reason: Connection Terminated due to MIC Failure (0x3d)

2ndreconnection

I will redo btmon+sniffer traces with CONFIG_BT_CTLR_SLAVE_FEAT_REQ=n. I have just tested quickly and them issue.

Olivier-ProGlove commented 5 years ago

Here are the traces (logs+btmon+sniffer) of the same settings + CONFIG_BT_CTLR_SLAVE_FEAT_REQ=n: WithoutFeatReq.zip

> HCI Event: LE Meta Event (0x3e) plen 19                         #68 14.858700
      LE Connection Complete (0x01)
        Status: Success (0x00)
        Handle: 0
        Role: Slave (0x01)
        Peer address type: Public (0x00)
        Peer address: BC:83:85:0C:F3:BA (OUI BC-83-85)
        Connection interval: 15.00 msec (0x000c)
        Connection latency: 0 (0x0000)
        Supervision timeout: 2000 msec (0x00c8)
        Master clock accuracy: 0x01
< HCI Command: LE Set Advertise Enable (0x08|0x000a) plen 1       #69 14.861800
        Advertising: Enabled (0x01)
> HCI Event: Command Complete (0x0e) plen 4                       #70 14.863200
      LE Set Advertise Enable (0x08|0x000a) ncmd 1
        Status: Connection Limit Exceeded (0x09)
< HCI Command: LE Set PHY (0x08|0x0032) plen 7                    #71 14.870100
        Handle: 0
        All PHYs preference: 0x00
        TX PHYs preference: 0x02
          LE 2M
        RX PHYs preference: 0x02
          LE 2M
        PHY options preference: Reserved (0x0000)
> HCI Event: Command Status (0x0f) plen 4                         #72 14.872200
      LE Set PHY (0x08|0x0032) ncmd 1
        Status: Success (0x00)
< ACL Data TX: Handle 0 flags 0x00 dlen 6                         #73 14.873800
      SMP: Security Request (0x0b) len 1
        Authentication requirement: Bonding, MITM, SC, No Keypresses (0x0d)
> HCI Event: LE Meta Event (0x3e) plen 13                         #74 14.888300
      LE Long Term Key Request (0x05)
        Handle: 0
        Random number: 0x0000000000000000
        Encrypted diversifier: 0x0000
< HCI Command: LE Long Term Key Request... (0x08|0x001a) plen 18  #75 14.890800
        Handle: 0
        Long term key: 63c0ac41ed7816a2b7f1135a378d3849
> HCI Event: Command Complete (0x0e) plen 6                       #76 14.893800
      LE Long Term Key Request Reply (0x08|0x001a) ncmd 1
        Status: Success (0x00)
        Handle: 0
> HCI Event: Disconnect Complete (0x05) plen 4                    #77 14.904500
        Status: Success (0x00)
        Handle: 0
        Reason: Connection Terminated due to MIC Failure (0x3d)

trace

Olivier-ProGlove commented 5 years ago

Follow up on this still existing issue in this new ticket with more information: https://github.com/zephyrproject-rtos/zephyr/issues/14044