Bluetooth: Controller: Connection Parameter Update indication timeout

ChristianAmannKT commented 4 years ago

Note This sample is just to provoke the issue. This is obviously not a real scenario but it's the easiest way to show it.

Describe the bug When a central device updates connection parameters a few times in a row, the central sends one of the update indications too late. That way, once it is finally sent out, the instant for which the the new parameters shall be valid lies then in the past. The connection now gets disconnect either with reason 0x08 (timeout) or 0x28 (instant passed).

Devices involved for this sample:

Central:    nRF52840-USB Dongle (nrf52840dongle_nrf52840)     https://docs.zephyrproject.org/latest/boards/arm/nrf52840dongle_nrf52840/doc/index.html
Peripheral: nRF52 DK            (nrf52dk_nrf52832)        https://docs.zephyrproject.org/latest/boards/arm/nrf52_pca10040/doc/nrf52_pca10040.html

To Reproduce 1) Build and flash the provided sample to the usb dongle (Scripts for building and viewing RTT output are included) 2) Hook up the nRF52 DK to a computer and start nRF Connect (v3.6.0) 3) Load the Bluetooth Low Energy App, select the device and let it start advertising. 4) Keep the DK close to the nRF dongle, it will now connect/pair and update connection parameters 5) Both sides will notify about the connection being terminated because of a timeout

Expected behavior The connection update indication should either be sent in time, or if not possible, the connection update procedure should fail entirely.

Impact Need to come up with weird workarounds (timeouts etc.) in order to proceed

Logs and console output RTT Log from central:

Setting new connection params... on_le_param_updated (0x200009c8): e8:34:60:2b:56:15 (random) interval 16 latency 0 timeout 100 on_le_param_updated (0x200009c8): e8:34:60:2b:56:15 (random) interval 17 latency 0 timeout 500 on_le_param_updated (0x200009c8): e8:34:60:2b:56:15 (random) interval 18 latency 0 timeout 1000 on_le_param_updated (0x200009c8): e8:34:60:2b:56:15 (random) interval 19 latency 0 timeout 2000 Done! Disconnected: e8:34:60:2b:56:15 (random) (reason 0x08)

Ellisys trace: Ellisys_trace.zip

nRF52 USB dongle sample code: nRF52840_Dongle sample.zip

joerchan commented 4 years ago

Slave dissappeared for a number events, while master was trying to send an empty packet. When slave re-appeared the central sends connection update with instant in the past. Looks like an issue with setting the instant when the packet is queued, not when it is first actually sent.

carlescufi commented 4 years ago

@cvinayak this looks like an issue in the Controller as per @joerchan's analysis.

cvinayak commented 4 years ago

From the sniffer log, the slave has not sync-ed at event counter 33 for the UPDATE_IND sent at event counter 27. Please use a zephyr peripheral sample and let me know if its reproducible.

Yes, the instant is set when the UPDATE_IND PDU is enqueued from ULL to LLL. LLL operating as packet controller, will not parse the PDU and update it. This will have to be vendor specific.

Even if LLL would set the instant on first transmission, if the transmissions are not ack-ed (like consecutive CRC errors in every connecction event), the instant will pass. But, as procedure timeouts are not used for connection update, instant passed could be the correct outcome (and hence, the rationale behind setting the instant when connection update procedure is enqueued).

ChristianAmannKT commented 4 years ago

I was able to reproduce it using the default peripheral sample (see attachment). However, I had to explicitly set CONFIG_BT_CTLR_CONN_PARAM_REQ=n in the prj.conf. Otherwise the peripheral would just send a connection parameter update request after the central was done with it's connection update indications. This behavior keeps the connection alive.

nrf52dk_nrf52832_peripheral.zip

cvinayak commented 4 years ago

Do you have a sniffer log that you can attach here?

ChristianAmannKT commented 4 years ago

Unfortunatly not, my sniffer setup is unable to log the communication. The last thing I was able to see was the connection request. I'm not sure what I am doing wrong here.

smijovic-si commented 4 years ago

Slave dissappeared for a number events, while master was trying to send an empty packet. When slave re-appeared the central sends connection update with instant in the past. Looks like an issue with setting the instant when the packet is queued, not when it is first actually sent.

Slave disappearing for a number of events seems to be caused by a bad connection event timing by the master. There is a connection parameter update indication in connection event 27. The new connection parameters are scheduled for connection event 33 and this is exactly when the slave disappears. Connection interval at that time is 20 ms, transmit window size is 1.25 ms and transmit window offset is 0 ms. The distance between connection event 32 and 33 is 21.418 ms (please see attached screenshot), which is bigger than the connection event + transmit window offset + transmit window size. The slave probably emerges again when the receiving window gets big enough (due to window widening). My question is: what is causing the bad timing by the master 2020-11-17 11_48_57-Window ?

carlescufi commented 4 years ago

@cvinayak could you please comment on this?

cvinayak commented 4 years ago

It appears like the Zephyr controller has sent the conn_update_ind PDU without filling the window offset value of 1.

Could you please recheck with the following in your prj.conf: CONFIG_BT_CTLR_ADVANCED_FEATURES=y CONFIG_BT_CTLR_SCHED_ADVANCED=n

cvinayak commented 4 years ago

Tested the attached sample on a nRF52840DK modified prj file to use UART console (my WIP branch rebased over latest upstream master 27529e0c0d) and do not see any issues (makes me believe this is not a bug in controller source code). Console output: nRF sniffer trace (wireshark) file, gh_29636_success_nrf52840dk_nrfconnect.pcapng: gh_29636_success_nrf52840dk_nrfconnect.zip

Tested the attached sample on nRF52840DK unmodified prj file: RTT output: nRF sniffer trace (wireshark) file, gh_29636_success_nrf52840dk_nrfconnect_rtt.pcapng gh_29636_success_nrf52840dk_nrfconnect_rtt.zip

Tested the attached sample on nRF52840Dongle unmodified prj file: (no console, as I dont have SWD pinheads soldered, I flashed the hex using nrfconnect desktop's programmer) nRF sniffer trace (wireshark) file, gh_29636_success_nrf52840dongle_nrfconnect_rtt.pcapng gh_29636_success_nrf52840dongle_nrfconnect_rtt.zip

I am not able to reproduce the issue on the latest upstream master with my work in progress on top. Please provide details of your zephyr commit hash that reproduces the issue. Also, the dongle revision details (small white sticker on the dongle, mine is PCA10059 1.2.0, 2020.7 ).

ChristianAmannKT commented 4 years ago

Hi,

thanks for the update! I am on commit https://github.com/zephyrproject-rtos/zephyr/commit/cc4708e3170bfedcc2be33d8dc64c48534a76b0f and the my dongle is PCA10059 1.0.0 2018.34. I flash it using a J-Link programmer. Because I update my master frequently I just checked again that the issue still occurs and it does: 2020-11-27 08_01_00-Window

That being said, when i edit my prj.conf to contain your suggested parameters (see below), I will never run into a connection timeout.

CONFIG_BT_CTLR_ADVANCED_FEATURES=y CONFIG_BT_CTLR_SCHED_ADVANCED=n

For me this is an acceptable workaround. If you need anything else from me to find the root cause, just let me know.

Thanks a lot!

cvinayak commented 3 years ago

@Guayusa please verify the fix: https://github.com/zephyrproject-rtos/zephyr/pull/31477

ChristianAmannKT commented 3 years ago

I checked out your branch and let it run without the workaround mentioned earlier. I did not face any disconnects anymore. LGTM!

cvinayak commented 3 years ago

I checked out your branch and let it run without the workaround mentioned earlier. I did not face any disconnects anymore. LGTM!

@Guayusa Thank you for the testing and fast response.

zephyrproject-rtos / zephyr

Bluetooth: Controller: Connection Parameter Update indication timeout #29636