zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.87k stars 6.63k forks source link

Bluetooth: Conn: Bluetooth stack becomes unusable when communicating with both centrals and peripherals #50438

Closed Klievan closed 1 year ago

Klievan commented 2 years ago

Describe the bug There is a specific case in which the Bluetooth stack seems to become stuck in a state where it will not accept/initiate any new connections. The following warning is printed in the console: <wrn> bt_hci_core: opcode 0x2005 status 0x0c

I'm using an nRF52840 and NCS 2.1. The only way to get out of this state seems to be resetting the device. Below the steps I found which reproduces this state consistently using an example application.

To Reproduce

  1. Grab two nRF52840 devkits and an iPhone that has nRFConnect for mobile installed.
  2. Flash the "central_and_peripheral_hr" example on one devkit. Flash the "peripheral_hr" example on the other devkit.
  3. Turn on the central_and_peripheral_hr devkit and connect to its console.
  4. Get your phone and connect to "Nordic_HR". Make sure that the connection stays on for the following steps.
  5. Turn on the "peripheral_hr" devkit. Make sure that it connects to the "central_and_peripheral_hr" devkit.
  6. Once the connection has been established between the two devkits, turn off the "peripheral_hr" (important)
  7. Close the connection with the phone through nRF Connect.
  8. The Bluetooth stack is now in a state where it wont accept new connections (incoming/outgoing). Observe the console output in the section below.

Impact This is not limited to the heartrate example. We initially discovered this bug in our own application and reproduced it using a sample so it would be easier to fix. Our users have to restart their gateways when configuring them to make sure that they still work properly. This feels like a pretty severe bug.

Logs and console output

*** Booting Zephyr OS build v3.1.99-ncs1-rc1  ***
Starting Bluetooth Central and Peripheral Heart Rate relay example
I: 8 Sectors of 4096 bytes
I: alloc wra: 1, fb8
I: data wra: 1, 164
I: SoftDevice Controller build revision: 
I: b3 30 26 8d c5 f3 4a bd |.0&...J.
I: f4 21 f5 83 2d de 9c f2 |.!..-...
I: 5e da 68 ec             |^.h.    
I: No ID address. App must call settings_load()
Scanning started
Advertising started
W: opcode 0x200a status 0x09
Connected: 48:79:82:C7:36:F1 (random).  <--- MOBILE PHONE
W: Ignoring data for unknown channel ID 0x003a
Filters matched. Address: F1:CE:C4:C1:92:BA (random) connectable: 1
Connected: F1:CE:C4:C1:92:BA (random) <--- HR PERIPHERAL
Security failed: F1:CE:C4:C1:92:BA (random) level 1 err 2
The discovery procedure succeeded
Heart Rate Sensor body location: Chest
Heart Rate Measurement notification received:

        Heart Rate Measurement Value Format: 8 - bit
        Sensor Contact detected: 1
        Sensor Contact supported: 1
        Energy Expended present: 0
        RR-Intervals present: 0

        Heart Rate Measurement Value: 92 bpm

Disconnected: F1:CE:C4:C1:92:BA (random) (reason 8) <--- HR PERIPHERAL
Disconnected: 48:79:82:C7:36:F1 (random) (reason 19). <--- MOBILE PHONE
Filters matched. Address: F1:CE:C4:C1:92:BA (random) connectable: 1 <--- HR PERIPHERAL
W: opcode 0x2005 status 0x0c
Connecting failed

Environment (please complete the following information):

jori-nordic commented 2 years ago

can confirm I am able to reproduce it (nrf connect on android phone), will look into it asap. edit: it looks like it's only the dk-to-dk connection that fails (ie w/zephyr as central). I am able to reconnect thru nrfconnect, I just have to re-scan and hit the connect button.

Klievan commented 2 years ago

it looks like it's only the dk-to-dk connection that fails (ie w/zephyr as central). I am able to reconnect thru nrfconnect, I just have to re-scan and hit the connect button.

Can confirm, my apologies. In my application it takes a while but is not impossible to reconnect from the iPhone (as far as I can tell because the device is busy trying to reconnect to other devices).

jori-nordic commented 2 years ago

I've narrowed it down, I think it's an issue with the controller rather than the host, as bt_conn_le_create returns -EIO which is an HCI error. Still digging..

Klievan commented 2 years ago

I've narrowed it down, I think it's an issue with the controller rather than the host, as bt_conn_le_create returns -EIO which is an HCI error. Still digging..

I've looked up the HCI warning in the BLE spec and it gave me these 'translations': bt_hci_core: opcode 0x2005 status 0x0c 0x2005 for the BT_HCI_OP_LE_CREATE_CONN command, and 0x0C should mean "NOT ALLOWED". Not sure if that helps you out in any way?

jori-nordic commented 2 years ago

Thanks, yeah I know it's not allowed (you can check out hci_err.h in zephyr it has the list).

I think this requires more digging in the softdevice controller, it looks like a bug there. I tried to make it work with the zephyr LL but there were some issues, I'll try again to double-check.

The same opcodes are sent, in the same order for both cases (scanning+connecting after boot, scanning+connecting after both devices disconnected), but in the last case, the BT_HCI_OP_LE_CREATE_CONN command is the only one returning DISALLOWED, probably a bug in their state machine.

jori-nordic commented 2 years ago

okay, it works with the zephyr link layer (no bug) with these changes in the prj.conf:

CONFIG_BT_LL_SOFTDEVICE=n
CONFIG_BT_LL_SW_SPLIT=y
CONFIG_BT_CTLR_RX_PRIO_STACK_SIZE=1024
CONFIG_BT_MAX_CONN=10

So this seems to be a bug in the Softdevice Controller. could you open a bug in the devzone and link it here? I'll close this issue. I have already pinged the controller team, but they'll need the ticket number that's why.

Klievan commented 2 years ago

Just as a last comment, I've used the Zephyr LL as well and the bug also seems to disappear on my end.

Klievan commented 2 years ago

could you open a bug in the devzone and link it here? I'll close this issue.

https://devzone.nordicsemi.com/f/nordic-q-a/92177/bluetooth-conn-bluetooth-stack-becomes-unusable-when-communicating-with-both-centrals-and-peripherals

ppryga-nordic commented 2 years ago

I've found that there actually are two issues in Host that make the incorrect behavior as described above.

  1. Host automatically resumes advertising when peripheral role is enabled. It does not check role stored in connection object hence may starve central role. In described scenario, Host resumes advertising while an application resumes scanning. I've prepared a PR for this case: https://github.com/zephyrproject-rtos/zephyr/pull/51256.
  2. Advertising resume in Host is limited operation in comparison to regular advertising enable operation. Limited in a way, there is no device address check/update before the advertising is resumed. When scanner is resumed and identity address is not used, Host uses NRPA address for a device when scanning. This is address is different than static random address. When connection in peripheral role is disconnected, Host resumes advertising and continues use of address set by scan resume. Now, assume that an application starts a connection in central role. Both functions bt_le_create_conn_ext and bt_le_create_conn_legacy call bt_id_set_create_conn_own_addr. The last one mentioned function attempts to change an address to static random address: https://github.com/zephyrproject-rtos/zephyr/blob/e104e031b7d3ddacd2b90527fff438fc4056b72a/subsys/bluetooth/host/id.c#L1519-L1528. Unfortunately that will not happen, because there is an advertising running and Controller returns error BT_HCI_ERR_CMD_DISALLOWED for Host command: BT_HCI_OP_LE_SET_RANDOM_ADDRESS.

It looks to me that, to fix second issue we need to run Bluetooth address setup before advertising is resumed. Unfortunately that would require storage of some information that are provided to Host when advertising is started by bt_le_adv_start.

ppryga-nordic commented 2 years ago

@jori-nordic @Thalley @alwa-nordic @sjanc what is your opinion?

Thalley commented 2 years ago

@jori-nordic @Thalley @alwa-nordic @sjanc what is your opinion?

Sounds like you've successfully fixed the first issue.

For the second issue, this is only an issue if BT_EXT_ADV=n, right? For BT_EXT_ADV=y BT_HCI_OP_LE_SET_RANDOM_ADDRESS can be called, even when advertising:

If the Host issues this command when any of advertising (created using legacy advertising commands), scanning, or initiating are enabled, the Controller shall return the error code Command Disallowed (0x0C).

We could for BT_EXT_ADV=n simply reject calls to bt_conn_le_create if we are advertising or scanning, or vice versa reject calls to bt_le_adv_start if we are currently scanning or connecting.

ppryga-nordic commented 2 years ago

@Thalley yes, first issue is fixed. Referring to second one and you comment. In my humble opinion rejection of bt_conn_le_create is last option. The issue happens only when a peripheral connection is closed when scanning is in progress. In other words, when Host resumes advertising for a closed peripheral role connection. If user re-enables advertising, there should be no problem because an address should be changed to static random before host requests Controller to start advertising again. I would prefer to extend advertising resume operation instead rejecting anything. An application should be able to use the API in the same way as it was after BT stack enable.

Thalley commented 2 years ago

Right. Rejecting API calls won't help in the resume case either, as that is done internally in the stack, and the application won't be able to handle any issues there anyways.

ppryga-nordic commented 2 years ago

Can anyone else share his thoughts about it?

ppryga-nordic commented 2 years ago

@carlescufi re-openning this is not completed. There is still an issue with BT adresses when resume advertising. I've described it here: https://github.com/zephyrproject-rtos/zephyr/issues/50438#issuecomment-1277621789

alwa-nordic commented 2 years ago

It looks to me that, to fix second issue we need to run Bluetooth address setup before advertising is resumed. Unfortunately that would require storage of some information that are provided to Host when advertising is started by bt_le_adv_start.

I agree.

carlescufi commented 2 years ago

Related to https://github.com/zephyrproject-rtos/zephyr/issues/30699

carlescufi commented 2 years ago

I've found that there actually are two issues in Host that make the incorrect behavior as described above.

  1. Host automatically resumes advertising when peripheral role is enabled. It does not check role stored in connection object hence may starve central role. In described scenario, Host resumes advertising while an application resumes scanning. I've prepared a PR for this case: Bluetooh: Host: conn: Resume adv after disconn for periph role only #51256.
  2. Advertising resume in Host is limited operation in comparison to regular advertising enable operation. Limited in a way, there is no device address check/update before the advertising is resumed. When scanner is resumed and identity address is not used, Host uses NRPA address for a device when scanning. This is address is different than static random address. When connection in peripheral role is disconnected, Host resumes advertising and continues use of address set by scan resume. Now, assume that an application starts a connection in central role. Both functions bt_le_create_conn_ext and bt_le_create_conn_legacy call bt_id_set_create_conn_own_addr. The last one mentioned function attempts to change an address to static random address: https://github.com/zephyrproject-rtos/zephyr/blob/e104e031b7d3ddacd2b90527fff438fc4056b72a/subsys/bluetooth/host/id.c#L1519-L1528 . Unfortunately that will not happen, because there is an advertising running and Controller returns error BT_HCI_ERR_CMD_DISALLOWED for Host command: BT_HCI_OP_LE_SET_RANDOM_ADDRESS.

It looks to me that, to fix second issue we need to run Bluetooth address setup before advertising is resumed. Unfortunately that would require storage of some information that are provided to Host when advertising is started by bt_le_adv_start.

@ppryga-nordic could you please open a new issue for this point 2., and then close this one? This present one was about something else, and it is confusing to reuse it.

ppryga-nordic commented 2 years ago

@carlescufi @jori-nordic New issue created: #52059.

jori-nordic commented 1 year ago

will close this as #52059 has been opened as follow-up.