zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.49k stars 6.42k forks source link

Serial: Bluetooth: Unable to allocate buffer when sending over other Characteristic #77435

Closed dathpo closed 1 week ago

dathpo commented 3 weeks ago

Describe the bug

I am using a custom board that uses the nRF52840 SoC.

My device acts as a BLE Peripheral and has BLE logging enabled over NUS using the UART_BT module. A Central is attempting to connect and pair to it and receive the logs over the NUS.

The uart_bt module seems to prevent the Central from pairing due to a failure in allocating a buffer. This causes a disconnection and on reattempting a connection the same issue occurs. The error seems to happen after the discovery phase, after which the Central subscribes to the NUS and another Characteristic. While the Peripheral manages to send several log messages over the NUS, when it tries to send data over the other Characteristic, it fails and throws the error.

In the uart_bt.c module, increasing the ring_buf_get_claim size from 20, which is the default MTU size, to 492 fixed the issue. I imagine reducing the logging on connection would also fix it, as well as subscribing to the NUS after Char B.

Increasing CONFIG_BT_CONN_TX_MAX=10 and CONFIG_BT_L2CAP_TX_BUF_COUNT=10 didn't fix the problem.

To Reproduce

  1. Configure a Zephyr BLE Peripheral
  2. Configure the UART_BT module:
CONFIG_LOG=y
CONFIG_PRINTK=y
CONFIG_SERIAL=y
CONFIG_UART_CONSOLE=y
CONFIG_UART_BT=y
CONFIG_BT_ZEPHYR_NUS=y
CONFIG_BT_NUS=y
  1. Configure the Peripheral to send data over another Service Characteristic (B) on its subscription
  2. Configure the Peripheral to log most events on a BLE connection
  3. From a Central, connect to the Peripheral
  4. Discover the Services
  5. Subscribe to the NUS
  6. Subscribe to Char B
  7. Pair

Expected behavior

The Central should be able to subscribe and communicate across both Service Characteristics without triggering disconnections.

Impact

The two devices will fail to connect and remain in this connection-disconnection loop.

Logs and console output

Zephyr BLE peripheral: image

External BLE Central that is attempting to pair with the Peripheral: image

Environment (please complete the following information):

Additional context

ubieda commented 3 weeks ago

@dathpo I'm unable to replicate this on upstream Zephyr. Here's what I've used:

I see no disconnects, however.

Can you try replicating using an upstream sample-code? That way I could attempt to follow the repro steps on my end.

Also, are you able to replicate with your exact same setup + the fix from #77430?

dathpo commented 3 weeks ago

@ubieda to answer your question at the end: I was running with the fix and still had the same issue.

I tried to reproduce the issue using the samples with added modifications but I wasn't able to, no matter how many logs I would send and the order of the sends to the Characteristics.

With the CONFIG_NET_BUF_LOG option enabled we get a bit more information:

[14:36:23.649][INF] DISCOVERED
[14:36:23.651][DBG] Unsubscribed from notifications from GATT service 0x0001 characteristic 0x0003
[14:36:23.958][DBG] Subscribed to notifications from GATT service 0x0001 characteristic 0x0003
[14:36:24.267][DBG] Unsubscribed from notifications from GATT service 0x0101 characteristic 0x0102
[14:36:24.574][DBG] Subscribed to notifications from GATT service 0x0101 characteristic 0x0102
[14:36:24.574][DBG] LOG - 16.872,375] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 blocked for 0 secs
[14:36:24.574][DBG] LOG - [01:02:16.872,894] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 low on buffers.
[14:36:24.574][DBG] LOG - [01:02:16.973,937] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 blocked for 0 secs
[14:36:24.574][DBG] LOG - [01:02:16.974,792] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 low on buffers.
[14:36:24.752][DBG] LOG - [01:02:17.073,913] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 blocked for 0 secs
[14:36:24.757][DBG] LOG - [01:02:17.074,798] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 low on buffers.
[14:36:24.952][DBG] LOG - [01:02:17.173,919] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 blocked for 0 secs
[14:36:25.050][DBG] LOG - [01:02:17.174,774] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 low on buffers.
[14:36:25.249][DBG] LOG - [01:02:17.273,925] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 blocked for 0 secs
[14:36:25.256][DBG] LOG - [01:05:39.758,117] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 blocked for 0 secs
[14:36:25.358][DBG] LOG - [01:05:39.758,850] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 low on buffers.
[14:36:25.454][DBG] LOG - [01:05:39.858,123] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 blocked for 0 secs
[14:36:25.553][DBG] LOG - [01:05:39.859,008] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 low on buffers.
[14:36:25.652][DBG] LOG - [01:05:39.955,963] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 blocked for 0 secs
[14:36:25.750][DBG] LOG - [01:05:39.956,115] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 low on buffers.
[14:36:25.758][DBG] LOG - [01:05:40.057,586] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 blocked for 0 secs
[14:36:25.952][DBG] LOG - [01:05:40.058,349] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 low on buffers.
[14:36:26.051][DBG] LOG - [01:05:40.157,592] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 blocked for 0 secs
[14:36:26.057][DBG] LOG - [01:05:40.158,355] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 low on buffers.
[14:36:26.154][DBG] LOG - [01:05:40.255,889] <inf> bt_auth: Notifications enabled 
[14:36:26.252][DBG] LOG - [01:05:40.256,011] <inf> bt_auth: Sending 0x03 over characteristic...
[14:36:26.349][DBG] LOG - [01:05:40.256,164] <err> net_buf: bt_l2cap_create_pdu_timeout():635: Failed to get free buffer
[14:36:26.356][DBG] LOG - [01:05:40.256,164] <wrn> bt_conn: Unable to allocate buffer within timeout
[14:36:26.451][DBG] LOG - [01:05:40.256,164] <err> bt_att: Unable to allocate buffer for op 0x1b
[14:36:26.456][DBG] LOG - [01:05:40.256,195] <wrn> bt_gatt: No buffer available to send notification
[14:36:26.649][DBG] LOG - [01:05:40.256,195] <wrn> bt_auth: Failed to send data over characteristic, error -12
[14:36:26.659][DBG] LOG - [01:05:40.256,286] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 blocked for 0 secs
[14:36:26.756][DBG] LOG - [01:05:40.256,439] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 low on buffers.
[14:36:26.854][DBG] LOG - [01:05:40.358,154] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 blocked for 0 secs
[14:36:26.955][DBG] LOG - [01:05:40.359,039] <wrn> net_buf: bt_l2cap_create_pdu_timeout():635: Pool 0x200016c8 low on buffers.

I ended up modifying the code on my Central application side so that it only subscribes to the NUS after pairing, instead of on discovery. This allows the communication with the Char B to be isolated until pairing. This seems to have solved the problem for me, even with the default chunk size of 20.

It is hard to say what the issue is and what system is responsible for it. So far I have only observed it from my custom Central application. Mobile and desktop nRF Connect apps weren't displaying it, however they are not as "efficient" in performing all the connectivity steps as my Central app. Feel free to close the issue if you feel so anyways.

I have a few other points about the uart_bt module I wanted to raise with you, however:

  1. This one is actually a bit of a dealbreaker: the module is very power-hungry. My device draws 600 uA more (700 uA total) than a build that uses the LOG_BACKEND_BLE module, while doing the same task. I imagine a lot of extra power is due to uart_bt requiring CONFIG_SERIAL, which itself adds at least 400 uA AFAIK. And that current offset is fixed as long as the system is on. This means that my battery-powered devices would only last a few days. Is there any way this can be addressed?
  2. Some of the log messages seem to be truncated when a lot of data is logged. This for me happens consistently during my connection phase. There are no warnings coming from the uart_bt module when this happens and I can see the missing logs on the RTT stream side so that makes me think the issue isn't LOG_BUFFER.

Here is a log from my Central application. You can see the missing logs on the line at 22:44:13.351. It jumps from uptime 0 to uptime 10 seconds.

[22:44:11.846][INF] PAIRED
[22:44:11.848][DBG] Unsubscribed from notifications from GATT service 0x0001 characteristic 0x0003
[22:44:12.157][DBG] Subscribed to notifications from GATT service 0x0001 characteristic 0x0003
[22:44:12.466][DBG] Unsubscribed from notifications from GATT service 0x0001 characteristic 0x0002
[22:44:12.775][DBG] Unsubscribed from notifications from GATT service 0x0101 characteristic 0x0102
[22:44:12.775][DBG] LOG - [00:00:00.003,906] <inf> LSM6DSO: Initialize regulator reg_ic201
[22:44:12.775][DBG] LOG - [00:00:00.005,432] <inf> LSM6DSO: Regulator reg_ic201 powered on
[22:44:12.775][DBG] LOG - [00:00:00.005,493] <inf> LSM6DSO: Initialize device lsm6dso@0
[22:44:12.775][DBG] LOG - [00:00:00.005,554] <inf> LSM6DSO: chip id 0x6c
[22:44:12.775][DBG] LOG - *** Booting My Application v1.2.3-beta-19307655a822 ***
[22:44:12.775][DBG] LOG - *** Using nRF Connect SDK v2.7.0-5cb85570ca43 ***
[22:44:12.775][DBG] LOG - *** Using Zephyr OS v3.6.99-100befc70c74 ***
[22:44:12.946][DBG] LOG - [00:00:00.011,260] <inf> main: Build time: Aug 23 2024 22:42:31
[22:44:13.246][DBG] LOG - [00:00:00.019,989] <inf> fs_nvs: 2 Sectors of 4096 bytes
[22:44:13.252][DBG] LOG - [00:00:00.019,989] <inf> fs_nvs: alloc wra: 0, fe8
[22:44:13.351][DBG] LOG - [00:00:00.020,0[00:00:09.853,088] <inf> bt_dual: Security changed to level 2 with E4:61:B0:65:6D:F0 (random)
[22:44:13.446][DBG] LOG - [00:00:10.165,374] <inf> bt_dual: Pairing completed with E4:61:B0:65:6D:F0 (random), level 2, bonded: 1
[22:44:13.541][DBG] LOG - [00:00:10.165,496] <inf> bt_dual: Found bond stored for E6:28:44:09:05:AB (random)
[22:44:13.550][DBG] LOG - [00:00:10.165,649] <inf> bt_dual: Found bond stored for E4:61:B0:65:6D:F0 (random)
ubieda commented 1 week ago

I ended up modifying the code on my Central application side so that it only subscribes to the NUS after pairing, instead of on discovery. This allows the communication with the Char B to be isolated until pairing. This seems to have solved the problem for me, even with the default chunk size of 20.

I agree on waiting until both pairing and discovery are done, before subscribing to NUS notifications.

WRT the additional issues you're experiencing:

This one is actually a bit of a dealbreaker: the module is very power-hungry. My device draws 600 uA more (700 uA total) than a build that uses the LOG_BACKEND_BLE module, while doing the same task. I imagine a lot of extra power is due to uart_bt requiring CONFIG_SERIAL, which itself adds at least 400 uA AFAIK. And that current offset is fixed as long as the system is on. This means that my battery-powered devices would only last a few days. Is there any way this can be addressed?

This may be related to other UART peripherals enabled when CONFIG_SERIAL is enabled. I suggest you look in the generated device-tree and make sure there aren't unintended devices enabled.

Some of the log messages seem to be truncated when a lot of data is logged. This for me happens consistently during my connection phase. There are no warnings coming from the uart_bt module when this happens and I can see the missing logs on the RTT stream side so that makes me think the issue isn't LOG_BUFFER.

This has to do with the NUS TX buffer size specified in the device-tree. The way this works is that UART BT will buffer generated traffic and dumps it once you connect and subscribe. This buffer may be filled since the time to connect (or the logs generated) may be significant. When this buffer fills up, it just discards extra data instead of overwriting it. To expand this buffer, adjust this to your requirements:

    bt_nus_console_uart: bt_nus_console_uart {
        compatible = "zephyr,nus-uart";
        rx-fifo-size = <1024>;
        tx-fifo-size = <1024>; /** Increase this to your needs */
    };

@dathpo I'll close this issue as it appears the original issue is addressed. If you're still experiencing issues, feel free to open a separate tickets and I'll take a look.