zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.55k stars 6.46k forks source link

Bluetooth scanning frequent resetting #14741

Closed nwsetec closed 5 years ago

nwsetec commented 5 years ago

Describe the bug We have about 12 boards running in our RnD area with nRF52832 SoCs running as a peripherals and also scanning as Bluetooth observers. Somehow a condition occurs where all the boards begin resetting. Once the boards start resetting they don't stop. We are currently experiencing the second occurrence of this issue. The first time it happened we didn't know what was happening and eventually we turned off all the boards and it went away. Now that it's happened a second time we're keeping it in this state until we figure out how to fix it, or how to trigger it back into this state so we can fix.

If one of the boards from this set is programmed with the scan start function removed, this board will not reset. If we delay the point after power up that the scan start function is called the reseting will start once scan start is called.

We have an RF chamber and we moved some of the boards in there and with the door open they will continue resetting. Close the door they continue resetting*. Turn them all off with the door still shut and power them up and they are back to normal. Open the door and they start resetting again (it seems they are being re-triggered into this resetting behaviour from the Bluetooth advertisements coming from the still resetting boards outside the chamber).

We attempted fixing it by turning off the duplicate removal filter when calling scan start and there was no effect. We tried limiting the scanning window to 5ms of 60ms and the reseting continued.

In this environment we programmed up a pca10040 dev board with the scan_adv sample and found that it would fall over either with a checkout of Zephyr 1.13 or Zephyr 1.14-rc1.

The console output when the reset occurred for 1.13 looked like this:

***** Booting Zephyr OS zephyr-v1.13.0-4-g470b09f082 *****
Starting Scanner/Advertiser Demo
[bt] [INF] hci_vs_init: HW Platform: Nordic Semiconductor (0x0002)
[bt] [INF] hci_vs_init: HW Variant: nRF52x (0x0002)
[bt] [INF] hci_vs_init: Firmware: Standard Bluetooth controller (0x00) Version 1.13 Build 0
[bt] [INF] bt_dev_show_info: Identity: fd:47:8b:0c:d9:f1 (random)
[bt] [INF] bt_dev_show_info: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1
[bt] [INF] bt_dev_show_info: LMP: version 5.0 (0x09) subver 0xffff
Bluetooth initialized
***** BUS FAULT *****
  Precise data bus error
  BFAR Address: 0x5300078c
***** Hardware exception *****
Current thread ID = 0x20001150
Faulting instruction address = 0x8918
Fatal fault in thread 0x20001150! Aborting.

Thread ID 0x20001150 was the idle_thread_s and address 0x8918 took as to: zephyr/subsys/bluetooth/controller/util/memq.c:95

The console output for 1.14 looked like this:

***** Booting Zephyr OS v1.14.0-rc1-1294-g3dea408405fd *****
Starting Scanner/Advertiser Demo
[00:00:00.000,457] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.000,488] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.000,488] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 1.14 Build 0
[00:00:00.001,007] <inf> bt_hci_core: Identity: fd:47:8b:0c:d9:f1 (random)
[00:00:00.001,007] <inf> bt_hci_core: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1
[00:00:00.001,007] <inf> bt_hci_core: LMP: version 5.0 (0x09) subver 0xffff
Bluetooth initialized
***** BUS FAULT *****
  Precise data bus error
  BFAR Address: 0x4f000718
***** Hardware exception *****
Current thread ID = 0x20000398
                              Faulting instruction address = 0x7ea4
Fatal fault in thread 0x20000398! Aborting.
[00:00:52.571,533] <err> bt_ctlr_llsw_ctrl: assert: '!_radio.ticker_id_prepare' failed
***** Kernel OOPS! *****
Current thread ID = 0x20000398
                              Faulting instruction address = 0x6f28
Fatal fault in ISR! Spinning...

address = 0x7ea4 took us to: /zephyr/subsys/bluetooth/controller/ll_sw/ctrl.c:11764

Impact showstopper

Environment (please complete the following information):

I'll add more information tomorrow as we look into it further.

nwsetec commented 5 years ago

@carlescufi

jhedberg commented 5 years ago

This sounds very similar to what we saw at the Junction 2018 hackathon (we had a Bluetooth Mesh task for the teams there). Something happened in the environment that made any Zephyr image start crashing very soon after it started scanning. @laperie @Vudentz FYI

jhedberg commented 5 years ago

@deadprogram FYI - this reminds a lot of what we saw at Junction

cvinayak commented 5 years ago

@nwsetec Does the boards fault just after power up or running boards too fault?

nwsetec commented 5 years ago

@cvinayak running boards fault as well.

It seems like there's a critical threshold of packets being received that triggers a fault in the stack that somehow causes a lot of advertising packets to be created that triggers the issue in other boards.

nwsetec commented 5 years ago

@cvinayak @joerchan Another thing we noticed (that might help reproduce the issue) during the first and the latest occurrence of this issue we used the nRF connect app to scan what was being advertised and there were 1 to 3 Windows laptops Bluetooth advertising (Microsoft Advertising Beacon) at rates of 1ms. Which we thought seemed an unusual rate and strangely coincidental. The first time it happened a colleague was pairing a phone to their laptop at the time.

DeclanTraill commented 5 years ago

@cvinayak @joerchan @nwsetec I'm not sure if this is a problem, but it looks like it could be: In hci_core.c when scanning is being enabled, in the function start_le_scan(), the scanning parameters are being set using a BT_HCI_OP_LE_SET_SCAN_PARAM command sent by the bt_hci_cmd_send function rather than the bt_hci_cmd_send_sync function; so doesn't this mean that it is not guaranteed that the parameters have been set prior to the scanning function being enabled (via the call to set_le_scan_enable() on the next line)? The actual enable of the scanning via a BT_HCI_OP_LE_SET_SCAN_ENABLE command in that function IS using a bt_hci_cmd_send_sync() call.

What is more, if the parameter function does get executed after the scanning has been enabled (race condition), it would have no effect, as the ll_scan_params_set() function called just returns an error code but doesn't set the parameters if scanning is already enabled when it is called.

What would happen if scanning is started with no parameters set?

cvinayak commented 5 years ago

@jhedberg Is it ok to use bt_hci_cmd_send to set scan params?

jhedberg commented 5 years ago

@DeclanTraill @cvinayak yes, that's fine. Both methods use the same command queue, i.e. the parameters will be sent to the controller no matter what, and the scan_enable will only be sent once the command complete for the parameters comes. So once send_sync() returns for scan_enable it means that both commands have completed, and in the right order.

jhedberg commented 5 years ago

A bit more about cmd_send() vs cmd_send_sync(): send_sync() can be useful when:

None of these is strictly true for the parameters, however it could be argued if we should be checking for their success as well. It doesn't seem like it's related to this issue however, and it was reported that the issue occurs for any existing scanners as well.

cvinayak commented 5 years ago

@nwsetec Yes, I have seen such behavior by Windows 10 Laptops, 1 ms beacon advertising on 'single advertising channel' . This should only cause denial of scanning legitimate advertising packets, and not crash the controller.

nwsetec commented 5 years ago

@cvinayak the resetting condition in our RnD area has now stopped unfortunately. I've attempted to retrigger it by modifying the Zephyr peripheral sample to advertise at a 1ms interval and running this firmware on 5 dev boards. The nRF Connect app showed these devices advertising at between 3ms to 6ms but it didn't cause the resetting condition to trigger.

cvinayak commented 5 years ago

@nwsetec This is what happened the last time too with @jhe, the problem just vanished. When I encounter such bugs, there are usually more than 2 random timing bugs. We need to carefully analyze the pre-condition and have controlled reproduction to nail down the bugs. The good thing is, atleast more than one person has encountered the issue, and it has been there for quite some months now.

Do you still have the laptops that advertise at 1ms around? and analyse its packet for any illegal values in it, like incorrect length fields that Zephyr could parse incorrectly and cause a bus fault. The controller assert is a consequence of bus fault as @joerchan pointed out to me.

cvinayak commented 5 years ago

With immense help from @nwsetec the issue has been narrowed down to suspected buffer overrun by the Radio H/w receiving an illegal packet length BLE advertising PDU. Review of the controller implementation with a controlled reproduction is in progress, myself and @joerchan will post updates in the coming days.