zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.55k stars 6.46k forks source link

Bluetooth: bt_conn: Unable to allocate buffer within timeout #43718

Closed aolowin closed 2 years ago

aolowin commented 2 years ago

Describe the bug A BLE peripheral can sometimes get into a state where it will no longer send notifications. It seems to occur if connections are repeatedly dropped due to range or antenna issues. It may happen if the connection is lost during discovery but that's a guess.

The following errors are generated: [00:02:21.702,667] bt_conn: Disconnected while allocating context [00:02:29.443,542] bt_conn: Unable to allocate buffer within timeout [00:02:29.443,572] bt_l2cap: Unable to allocate buffer for op 0x12 [00:02:52.495,300] bt_conn: Unable to allocate buffer within timeout [00:02:52.495,300] bt_att: Unable to allocate buffer for op 0x07

Once it gets into this state only a reboot will fix it. A reset on the central side has no effect.

To Reproduce Use the central_hr and peripheral_hr sample apps on the nrf52840dk_nrf52840 boards west build samples/bluetooth/peripheral_hr --build-dir=./build/peripheral_hr -b nrf52840dk_nrf52840 west build samples/bluetooth/central_hr --build-dir=./build/central_hr -b nrf52840dk_nrf52840

For ease of desktop testing it's convenient to use: CONFIG_BT_CTLR_TX_PWR_MINUS_40=y on the central_hr device. This allows a disconnect by moving the boards a short distance apart.

Move the boards closer and farther from each other to trigger disconnect/reconnect events and eventually generate the erroneous state.

Expected behavior A peripheral can cleanly reconnect after a disconnect.

Impact Serious impact since the peripheral will be unable to communicate without a reboot.

Logs and console output central_hr:

<snip>
<this is what is normal>

Scanning successfully started
[DEVICE]: E6:56:45:48:5D:4A (random), AD evt type 0, AD data len 11, RSSI -41
[AD]: 1 data_len 1
[AD]: 3 data_len 6
Connected: E6:56:45:48:5D:4A (random)
[ATTRIBUTE] handle 25
[ATTRIBUTE] handle 26
[ATTRIBUTE] handle 28
[SUBSCRIBED]
[NOTIFICATION] data 0x20005b9d length 2
[NOTIFICATION] data 0x20005b9d length 2
[NOTIFICATION] data 0x20005b9d length 2
[NOTIFICATION] data 0x20005b9d length 2
[NOTIFICATION] data 0x20005b9d length 2
[NOTIFICATION] data 0x20005b9d length 2
[NOTIFICATION] data 0x20005b9d length 2
[UNSUBSCRIBED]
Disconnected: E6:56:45:48:5D:4A (random) (reason 0x08)
Scanning successfully started

<snip>
<this is what eventually happens>

[DEVICE]: E6:56:45:48:5D:4A (random), AD evt type 0, AD data len 11, RSSI -18
[AD]: 1 data_len 1
[AD]: 3 data_len 6
Connected: E6:56:45:48:5D:4A (random)
Discover complete
[00:04:51.470,947] <err> bt_att: ATT Timeout
Disconnected: E6:56:45:48:5D:4A (random) (reason 0x22)
Scanning successfully started

peripheral_hr:

*** Booting Zephyr OS build zephyr-v3.0.0-1041-g3ab10f2d52a5  ***
Bluetooth initialized
Advertising successfully started
[00:00:00.257,202] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.257,232] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.257,263] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 3.0 Build 99
[00:00:00.257,995] <inf> bt_hci_core: Identity: E6:56:45:48:5D:4A (random)
[00:00:00.258,026] <inf> bt_hci_core: HCI: version 5.3 (0x0c) revision 0x0000, manufacturer 0x05f1
[00:00:00.258,056] <inf> bt_hci_core: LMP: version 5.3 (0x0c) subver 0xffff
Connected
[00:00:03.341,705] <inf> hrs: HRS notifications enabled

<snip>

Disconnected (reason 0x08)
[00:02:10.680,572] <inf> hrs: HRS notifications disabled
Connected
[00:02:12.349,761] <inf> hrs: HRS notifications enabled
Disconnected (reason 0x08)
Connected
[00:02:21.700,195] <inf> hrs: HRS notifications disabled
[00:02:21.702,667] <wrn> bt_conn: Disconnected while allocating context
[00:02:29.443,542] <wrn> bt_conn: Unable to allocate buffer within timeout
[00:02:29.443,572] <err> bt_l2cap: Unable to allocate buffer for op 0x12
[00:02:52.495,300] <wrn> bt_conn: Unable to allocate buffer within timeout
[00:02:52.495,300] <err> bt_att: Unable to allocate buffer for op 0x07
Disconnected (reason 0x22)
Connected
[00:03:22.495,391] <wrn> bt_conn: Unable to allocate buffer within timeout
[00:03:22.495,422] <err> bt_att: Unable to allocate buffer for op 0x01
[00:03:29.817,321] <wrn> bt_conn: Unable to allocate buffer within timeout
[00:03:29.817,352] <err> bt_l2cap: Unable to allocate buffer for op 0x12
[00:03:52.869,079] <wrn> bt_conn: Unable to allocate buffer within timeout
[00:03:52.869,110] <err> bt_att: Unable to allocate buffer for op 0x07

Environment (please complete the following information):

cvinayak commented 2 years ago

@jori-nordic I will assume you will resume discussion based on the work you are doing here https://github.com/jori-nordic/zephyr/commit/2215bf0a945bc1f84ebb2a2c9b5b2d6754877b16

aolowin commented 2 years ago

@jori-nordic Thank you for tracking this down. Is the fix in your branch likely to be the final solution or will there be other changes before it goes into the main branch?

jori-nordic commented 2 years ago

I was on vacation, sorry for the delay.

@aolowin the fix I have might have to change a bit, because while the bug still exists in latest upstream, it has different symptoms due to the l2cap work that has been done.

@hermabe has a PR (https://github.com/zephyrproject-rtos/zephyr/pull/45682/files) to fix some issues that were introduced there, and I'll be trying to reproduce this particular bug today with his fixes applied locally.

EDIT: so I haven't been able to reproduce the issue with herman's PR applied locally. I think we can close this when that PR is merged. Could you try to reproduce it after that PR is merged @aolowin ?

aolowin commented 2 years ago

@jori-nordic I'll try and reproduce it after that PR is in. Thanks.

carlescufi commented 2 years ago

@jori-nordic I'll try and reproduce it after that PR is in. Thanks.

@aolowin would you mind trying to reproduce the issue with the PR applied before it's merged? This way we would know that this is indeed a fix for this issue.

You can check out this branch: https://github.com/hermabe/zephyr/tree/fix/meta_free from the PR.

aolowin commented 2 years ago

I've tested https://github.com/zephyrproject-rtos/zephyr/pull/45682 and it does seem to fix the issue. By keeping the peripheral at the edge of the RF range I was able to force numerous disconnect/reconnect cycles and the peripheral was always able to recover.

There were a few warnings:

Connected
[00:09:34.331,970] <inf> hrs: HRS notifications enabled
Disconnected (reason 0x08)
[00:09:38.373,413] <wrn> bt_att: Unable to allocate ATT TX meta
[00:09:38.373,443] <wrn> bt_gatt: No buffer available to send notification
[00:09:38.482,421] <inf> hrs: HRS notifications disabled
Connected
[00:09:40.669,647] <inf> hrs: HRS notifications enabled

but no errors.

carlescufi commented 2 years ago

I've tested #45682 and it does seem to fix the issue. By keeping the peripheral at the edge of the RF range I was able to force numerous disconnect/reconnect cycles and the peripheral was always able to recover.

There were a few warnings:

Connected
[00:09:34.331,970] <inf> hrs: HRS notifications enabled
Disconnected (reason 0x08)
[00:09:38.373,413] <wrn> bt_att: Unable to allocate ATT TX meta
[00:09:38.373,443] <wrn> bt_gatt: No buffer available to send notification
[00:09:38.482,421] <inf> hrs: HRS notifications disabled
Connected
[00:09:40.669,647] <inf> hrs: HRS notifications enabled

but no errors.

Thanks for testing. Could you make sure that these warnings do not prevent the stack from sending data once buffers are available again? i.e. this is a recoverable warning that doesn't require rebooting any of the Zephyr-based devices.

hermabe commented 2 years ago

There were a few warnings

I think these warnings are fine. The buffer itself is freed after it is sent to the controller, but the metadata is not freed until the callbacks are called after receiving the num_complete event from the controller. The allocation of the buffer blocks, but the allocation of the metadata does not, so if no metadata could be allocated the warning is printed and -ENOMEM is returned by the GATT API functions.

aolowin commented 2 years ago

I can confirm that the central_hr was able to receive notifications from the peripheral_hr once connected - regardless of whether the warnings occurred. No reboots required.

danielstuart14 commented 1 year ago

@jori-nordic @hermabe any chance of the fixes for this issue being backported to 2.7? https://github.com/zephyrproject-rtos/zephyr/pull/46165 https://github.com/zephyrproject-rtos/zephyr/pull/48395