zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.61k stars 6.5k forks source link

Bluetooth: Host: ATT: Zephyr host is deadlocked due to the use of the low-priority system work queue to block the high-priority BT RX thread. #78761

Open pin-zephyr opened 2 weeks ago

pin-zephyr commented 2 weeks ago

precondition:

  1. No use of EATT on the device Peripheral_P1
  2. CONFIG_BT_ATT_TX_COUNT=8 on Peripheral_P1 (req_slab available slot = 8)

steps of reproduction:

  1. Peripheral_P1 sends out a "READ request" to Phone_A, Phone_A does not reply the request right away. Therefore there is an ongoing transaction on the attribute channel, Peripheral_P1 can not send out any gatt indication before receiving "READ response" from the Phone_A. (req_slab available slot = 7)

  2. Phone_A writes to a characteristics on the Peripheral_P1, which cause Peripheral_P1 trys to send 8 gatt indications to Phone_A.

  3. The first 7 outgoing gatt indications go well and are put into a system list since Phone_A does not reply the "READ request" yet. (req_slab available slot = 0)

  4. The 8th outgoing gatt indication will block the system workqueue by the function call gatt_req_alloc() for 30 seconds (BT_ATT_TIMEOUT) due to no more free slots in req_slab.

  5. The next incoming ATT read/write request is received by BT RX thread. then BT RX thread will notify any pending TX before processing any new data for this connection. Notifying any pending TX will commit a work to system workqueue.

However, system workqueue is already blocked in step 4 for 30 seconds. Therefore zephyr host will be deadlocked here. The deadlock will be resolved after the ATT_TIMEOUT of 30 seconds, but then the link would anyway be unusable because the BT spec allows no further ATT communication on it. This would cause one or more disconnects after 30 seconds of unresponsiveness each time on the device Peripheral_P1.

henrikbrixandersen commented 2 weeks ago

Please use our bug template when reporting bugs. You need to edit this issue to include the information requested in https://github.com/zephyrproject-rtos/zephyr/blob/main/.github/ISSUE_TEMPLATE/001_bug_report.md

jori-nordic commented 1 week ago

Notifying any pending TX will commit a work to system workqueue

So this is the bug. It's tricky to do a quick-fix due to how the host's synchronous nature.

What would be nice is getting an "out of resources" error for the last (and any future) indications. @alwa-nordic and @jhedberg feel free to chime in on how we could do that.

Notifying tx from the RX thread may seem weird, but it's done to free resources if possible for the current RX (and its associated app callback). We have changed the arch a bit, so maybe that bit can be removed as well (ping @alwa-nordic).

I'll look into the blocking on the syswq, we are already on a crusade to stop bluetooth from blocking it.

jori-nordic commented 1 week ago

@pin-zephyr could you post the zephyr hash and the exact build/zephyr/.config used for the bug?

alwa-nordic commented 1 week ago
  1. Peripheral_P1 sends out a "READ request" to Phone_A, Phone_A does not reply the request right away. Therefore there is an ongoing transaction on the attribute channel, Peripheral_P1 can not send out any gatt indication before receiving "READ response" from the Phone_A.

Note to implementers: According to spec (3.F.3.3.2), the GATT server and client have independent flow control. It is not mandated that P1 server waits with the indication until the P1 client has obtained a response.

  1. Phone_A writes to a characteristics on the Peripheral_P1, which cause Peripheral_P1 trys to send 8 gatt indications to Phone_A.

  2. The first 7 outgoing gatt indications go well and are put into a system list since Phone_A does not reply the "READ request" yet. (req_slab available slot = 0)

    1. The 8th outgoing gatt indication will block the system workqueue by the function call gatt_req_alloc() for 30 seconds (BT_ATT_TIMEOUT) due to no more free slots in req_slab.

This is an API misuse unfortunately. The Zephyr GATT API is blocking. That makes it unsuitable for use on the system work queue. The application has to invoke the GATT API on a different thread. This cannot be fixed without changing API.

We would like to introduce new non-blocking GATT APIs. This is a to-do.

alwa-nordic commented 1 week ago

Notifying any pending TX will commit a work to system workqueue

So this is the bug. It's tricky to do a quick-fix due to how the host's synchronous nature.

Not a bug. The bug is that the system work queue is blocked. But it is possible for OP to work around by adding a second "blocking definitely not allowed here" system work queue and modify the Host to post the TX work there.

jori-nordic commented 1 week ago

Not a bug

Doesn't seem to be reproducible on latest main: the only configuration that passes without errors is sending from main-thread https://github.com/jori-nordic/zephyr/blob/4b3fb9c4c5132f32d01c185ff9fdc16393f9c706/tests/bsim/bluetooth/host/att/sequential/dut/src/main.c#L34-L37

When indicating from RX thread or syswq (whether from a cb using the BT_RECV_WORKQ_SYS option or a dedicated work item), an -ENOMEM is immediately raised by bt_att_req_alloc() https://github.com/zephyrproject-rtos/zephyr/blob/ac1cc17ae90a9ede2a2b2c5d8311bddf93ea3eca/subsys/bluetooth/host/att.c#L3902-L3909

alwa-nordic commented 3 days ago

This will be fixed in https://github.com/zephyrproject-rtos/zephyr/pull/79258. The resolution is that this bug is accepted, we cannot enforce no blocking on the system work queue, so we will add a no-blocking-allowed work queue which will handle the TX work and potentially other work that does not block.