Closed cvinayak closed 1 year ago
Analysis: Symmetrical deadlock between the TX and RX threads on both devices in the connection. The TX thread has filled up all the buffers in the controller and has blocked waiting for num complete events. The controller has filled up all its receive buffers and the RX thread is delivering ACL packets as fast as it can. The RX thread receives an HCI event (phy update complete) and blocks on sending an HCI command, waiting for the TX thread to send the command. The TX thread will not be able to proceed until a num complete arrives, on both sides. The RX thread will not be able to proceed until the TX thread is unblocked, on both sides. The receive buffers are full on both sides, so no new packets can be transmitted, meaning no new completes will happen.
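The circular wait described in the analysis can be written down as a small state model. This is only a sketch of the reasoning, not code from the Zephyr host or controller: `struct dev_model` and all function names below are hypothetical, and each device is reduced to three facts (TX credits left, free RX buffers, and whether the RX thread is blocked in a synchronous HCI command).

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified model of one device in the connection. */
struct dev_model {
    int tx_credits;        /* free controller TX buffers; 0 means the TX thread is blocked */
    int rx_free;           /* free controller RX buffers; 0 means the peer cannot send to us */
    bool rx_waits_for_cmd; /* RX thread is blocked inside a synchronous HCI command */
};

/* The TX thread can only make progress while it holds a buffer credit. */
static bool tx_thread_can_run(const struct dev_model *d)
{
    return d->tx_credits > 0;
}

/* The RX thread is stuck while its synchronous command waits on a blocked TX thread. */
static bool rx_thread_can_run(const struct dev_model *d)
{
    return !d->rx_waits_for_cmd || tx_thread_can_run(d);
}

/*
 * A num complete for `self` needs the peer to accept one of our packets.
 * That requires a free RX buffer on the peer, or a peer RX thread that can
 * still drain its buffers.
 */
static bool can_get_num_complete(const struct dev_model *self,
                                 const struct dev_model *peer)
{
    (void)self;
    return peer->rx_free > 0 || rx_thread_can_run(peer);
}

/* One side is stuck when neither thread can run and no credit can come back. */
static bool side_is_stuck(const struct dev_model *self,
                          const struct dev_model *peer)
{
    return !tx_thread_can_run(self) && !rx_thread_can_run(self) &&
           !can_get_num_complete(self, peer);
}

/* The deadlock is symmetrical: both sides must be stuck at the same time. */
static bool is_deadlocked(const struct dev_model *a, const struct dev_model *b)
{
    return side_is_stuck(a, b) && side_is_stuck(b, a);
}
```

In this model, two devices that both have `tx_credits == 0`, `rx_free == 0` and `rx_waits_for_cmd == true` are deadlocked, and unblocking the RX thread on either side is enough to break the cycle.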
is this fixed?
@galak No.
I ran into this issue and my workaround was to do the sending from a workqueue thread with a lower priority.
@xudongzheng Could you elaborate on your workaround? What exactly did you move to the workqueue? Sending of data, or sending of the command?
@joerchan
I was occasionally getting the "Unable to allocate TX context" error when calling bt_gatt_notify() from a work item running on the system workqueue. However, moving my bt_gatt_notify() call to a work item running on my custom workqueue (with a lower priority) did not give the same "Unable to allocate TX context" error.
I haven't looked into the details of this specific bug but it's possible that https://github.com/cvinayak/zephyr/blob/a2289565363a210e33b8956eac46df55e19a790a/samples/bluetooth/peripheral/src/main.c#L476 or one of the other calls is running into the same issue.
@xudongzheng That behavior is documented here: https://github.com/zephyrproject-rtos/zephyr/blob/main/include/bluetooth/gatt.h#L927L930
That is, however, not the same as what is happening in this case; you haven't encountered this issue.
I'm seeing the same thing when I spam bt_gatt_notify. This seems like a dangerous bug since there's no way to really know if you'll exceed the threshold for overflow. Much better would be to simply return an error that could optionally be ignored for non-critical notifications. @galak can you comment on why you feel this is low priority?
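A minimal sketch of the "return an error instead of blocking" idea, assuming a fixed-size pool of TX contexts. The pool, its size and every function name here are hypothetical and are not Zephyr APIs; the sketch only shows how a caller could choose to ignore exhaustion for non-critical notifications.

```c
#include <errno.h>

#define TX_CTX_COUNT 3 /* hypothetical pool size, not a Zephyr Kconfig value */

/* Hypothetical fixed-size pool of TX contexts. */
struct tx_ctx_pool {
    int free_count;
};

static void tx_ctx_pool_init(struct tx_ctx_pool *pool)
{
    pool->free_count = TX_CTX_COUNT;
}

/* Take one context without blocking: 0 on success, -ENOMEM when exhausted. */
static int tx_ctx_try_alloc(struct tx_ctx_pool *pool)
{
    if (pool->free_count == 0) {
        return -ENOMEM;
    }
    pool->free_count--;
    return 0;
}

/* Give a context back, e.g. once the packet is reported as completed. */
static void tx_ctx_free(struct tx_ctx_pool *pool)
{
    if (pool->free_count < TX_CTX_COUNT) {
        pool->free_count++;
    }
}

/*
 * Best-effort send for non-critical notifications: a full pool is counted
 * as a drop instead of being treated as a failure. Other errors propagate.
 */
static int notify_best_effort(struct tx_ctx_pool *pool, int *dropped)
{
    int err = tx_ctx_try_alloc(pool);

    if (err == -ENOMEM) {
        (*dropped)++;
        return 0;
    }
    return err;
}
```

A caller sending critical data would check the return value of `tx_ctx_try_alloc()` directly and retry later, while a caller streaming sensor values could use `notify_best_effort()` and simply skip a sample.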
The RX thread receives an HCI event (phy update complete) and blocks on sending an HCI command waiting for the TX thread to send the command.
I can't understand this, can you point out the relevant code?
@joerchan Why does hci_le_read_max_data_len not use bt_hci_cmd_send to avoid the race condition?
I'm seeing the same thing when I spam bt_gatt_notify. This seems like a dangerous bug since there's no way to really know if you'll exceed the threshold for overflow. Much better would be to simply return an error that could optionally be ignored for non-critical notifications. @galak can you comment on why you feel this is low priority?
I also see this. Whenever I see "Unable to allocate TX context", the nRF52 I control over HCI completely freezes. My bt_gatt_notify is now in a workqueue, but if I decrease the notification interval to anything below 500 ms the system still deadlocks. It's totally bizarre.
@cvinayak and I could generate the same assertion error using the sample in #37577
@joerchan Why does hci_le_read_max_data_len not use bt_hci_cmd_send to avoid the race condition?
@LingaoM Because the problem would still be there for all the other places where we use bt_hci_cmd_send_sync. Also you can't simply replace it, because you want to get the output of the command complete, which you wouldn't if you don't use send_sync.
So I think I could isolate causes of the mentioned assertion fail (at least in my case). First, I am running BabbleSim and probably need to enable flow control manually:
CONFIG_BT_HCI_ACL_FLOW_CONTROL=y
Second (and more important), I was sending whole SDU segments over an L2CAP channel which I assumed would not be fragmented as I was using the channel's MTU as the maximum payload size. As it turns out, the packets did in fact not fit into the ACL buffers and were thus also getting fragmented again. With a restricted amount of buffers available for fragmentation (see https://docs.zephyrproject.org/latest/reference/kconfig/CONFIG_BT_L2CAP_TX_FRAG_COUNT.html) this seems to have favored the sem_take timeout.
I am currently using the following config:
CONFIG_BT_L2CAP_TX_MTU=247
CONFIG_BT_BUF_ACL_RX_SIZE=256
CONFIG_BT_BUF_ACL_TX_SIZE=251
CONFIG_BT_HCI_ACL_FLOW_CONTROL=y
And changed my mtu calculation to actually incorporate the BT_L2CAP_SDU_TX_MTU:
uint32_t mtu = MIN(le_chan.tx.mtu, BT_L2CAP_SDU_TX_MTU);
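To illustrate why the clamp matters, here is a rough sketch of the fragment arithmetic. It assumes a 4-byte basic L2CAP header and a 2-byte SDU length field, and that the whole SDU goes out as a single PDU; the helper names are hypothetical and the real stack may segment differently.

```c
#include <stddef.h>

#define L2CAP_HDR_SIZE     4 /* basic L2CAP header: length + channel ID */
#define L2CAP_SDU_HDR_SIZE 2 /* SDU length field in the first PDU of an SDU */

/* Mirror of the MIN() clamp above: never send more than the local limit. */
static size_t clamp_sdu_len(size_t chan_tx_mtu, size_t local_sdu_tx_mtu)
{
    return chan_tx_mtu < local_sdu_tx_mtu ? chan_tx_mtu : local_sdu_tx_mtu;
}

/*
 * Number of ACL buffers needed for one SDU of sdu_len bytes, assuming it is
 * sent as a single PDU (simplification) into buffers of acl_tx_size bytes.
 */
static size_t acl_fragments_for_sdu(size_t sdu_len, size_t acl_tx_size)
{
    size_t total = L2CAP_HDR_SIZE + L2CAP_SDU_HDR_SIZE + sdu_len;

    return (total + acl_tx_size - 1) / acl_tx_size; /* ceiling division */
}
```

With the config above (TX MTU 247, so an SDU limit of 245 under these assumptions, and ACL TX size 251), a 245-byte SDU fits in one ACL buffer, whereas 246 bytes already needs two and an unclamped 512-byte SDU would need three.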
Another reason might be mentioned in https://github.com/zephyrproject-rtos/zephyr/issues/34600: if many transmissions are scheduled rapidly, the number of TX buffers could get exhausted (have not yet confirmed this).
CONFIG_BT_L2CAP_TX_BUF_COUNT=128
Related Nordic Devzone thread: https://devzone.nordicsemi.com/f/nordic-q-a/89888/ncs-nordic-uart-service-bt-rx-hci-timeout
Is this still being worked on? I'm still having the same issue. I'm on Zephyr 3.2.0.
@klapro This is supposed to be fixed. Could you open a new bug report with all the relevant details (how to reproduce, etc.)? Maybe it's something different.
Describe the bug Performing GATT write commands from a peripheral and a central as fast as possible from the main loop, while the peer central is performing the security procedure to bond, causes the following assertion failure on both the peripheral and the central device:
In peripheral:
In central_hr:
To Reproduce Use the branch: https://github.com/cvinayak/zephyr/commit/a2289565363a210e33b8956eac46df55e19a790a
Steps to reproduce the behavior: Build and flash peripheral
Build and flash central_hr
mkdir -p build/central_hr; cd build/central_hr
cmake -GNinja -DBOARD=nrf52dk_nrf52832 ../../samples/bluetooth/central_hr
ninja
ninja flash
Open two terminals,
minicom -D /dev/ttyACMx
Observe both peripheral and central_hr fail after connection
Expected behavior peripheral and central_hr should connect, transfer write commands, perform SMP pairing, be encrypted and continue to transfer write commands.
Impact showstopper
Environment (please complete the following information):