Closed arbrauns closed 4 months ago
@arbrauns what platform is this?
I'll take a look. Bluetooth should not be blocking the syswq at any time, so it is a pretty bad bug indeed.
@arbrauns what platform is this?
It's running on an stm32f767, with the following bluetooth-related Kconfigs:
CONFIG_BT=y
CONFIG_BT_SPI=y
CONFIG_BT_SPI_BLUENRG=y
CONFIG_BT_HCI_TX_STACK_SIZE=1024
CONFIG_BT_HCI_TX_STACK_SIZE_WITH_PROMPT=y
CONFIG_BT_HCI_ACL_FLOW_CONTROL=n # not supported by bluetooth module
CONFIG_BT_PERIPHERAL=y
CONFIG_BT_DEVICE_NAME_DYNAMIC=y
@arbrauns The Bluetooth subsystem can be configured to use either the system workqueue or a dedicated one for RX (although it will still use the system workqueue for other operations). Can you check that CONFIG_BT_RECV_CONTEXT is set to CONFIG_BT_RECV_WORKQ_BT?
I don't think this will fix your problem, but let's continue the investigation.
Yes, CONFIG_BT_RECV_WORKQ_BT is default-set. I had already experimented with simply moving some of the blocking operations to the RX workqueue, which helped with the specific problem, but creates problems in the receive path, of course.
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.
I don't think it has magically fixed itself, the bluetooth subsystem still submits work to the system work queue.
@arbrauns the stale label just means that the issue hasn't seen activity, not that it's gone. I'll remove it. It's still low-prio though, as the workaround is reasonable: use a memory-mapped GPIO instead of an extender.
@arbrauns the stale label just means that the issue hasn't seen activity, not that it's gone.
It does mean that the issue will effectively be closed as wontfix.
the workaround reasonable: use a memory-mapped GPIO instead of an extender.
Can't really do that if the hardware already exists.
@arbrauns is it only bt_le_adv_resume that's blocking you right now? I can make a quick PR to move that out.
The deadlock thing will be addressed by making the stack runnable with pre-emptible prios, which I've just started work on, but that will take time.
@arbrauns do you still have this issue?
@arbrauns is it only bt_le_adv_resume that's blocking you right now? I can make a quick PR to move that out.
See also #52364
Sorry, I'm currently busy with other things, it might take me another week or two to test this again. 3.4.0 has also changed a couple things related to HCI-SPI interrupt handling, so this exact issue might not even be relevant anymore.
Closing this since there haven't been any reports for the latest tree. @arbrauns feel free to reopen if you still have this issue.
Still encountering this on 3.6.0, can someone reopen?
FYI, the following patch is a functioning workaround in my setup:
From f427ec6882fdc28d0b9d09f45cda53e70ce9f13f Mon Sep 17 00:00:00 2001
From: Armin Brauns <armin.brauns@embedded-solutions.at>
Date: Fri, 3 Mar 2023 11:28:30 +0100
Subject: [PATCH] bluetooth: put workqueue items in local workqueue instead of
syswq
See https://github.com/zephyrproject-rtos/zephyr/issues/55279
Signed-off-by: Armin Brauns <armin.brauns@embedded-solutions.at>
---
subsys/bluetooth/host/conn.c | 8 +++++---
subsys/bluetooth/host/hci_core.c | 25 ++++++++++++++++---------
2 files changed, 21 insertions(+), 12 deletions(-)
diff --git a/subsys/bluetooth/host/conn.c b/subsys/bluetooth/host/conn.c
index 344b10de5b8..63474a8db31 100644
--- a/subsys/bluetooth/host/conn.c
+++ b/subsys/bluetooth/host/conn.c
@@ -64,6 +64,8 @@ BUILD_ASSERT(sizeof(struct tx_meta) == CONFIG_BT_CONN_TX_USER_DATA_SIZE,
#define tx_data(buf) ((struct tx_meta *)net_buf_user_data(buf))
K_FIFO_DEFINE(free_tx);
+extern struct k_work_q bt_workq;
+
static void tx_free(struct bt_conn_tx *tx);
static void conn_tx_destroy(struct bt_conn *conn, struct bt_conn_tx *tx)
@@ -812,7 +814,7 @@ static void conn_cleanup(struct bt_conn *conn)
bt_conn_reset_rx_state(conn);
- k_work_reschedule(&conn->deferred_work, K_NO_WAIT);
+ k_work_reschedule_for_queue(&bt_workq, &conn->deferred_work, K_NO_WAIT);
}
static void conn_destroy(struct bt_conn *conn, void *data)
@@ -1099,7 +1101,7 @@ void bt_conn_set_state(struct bt_conn *conn, bt_conn_state_t state)
}
#endif /* CONFIG_BT_GAP_AUTO_UPDATE_CONN_PARAMS */
- k_work_schedule(&conn->deferred_work,
+ k_work_schedule_for_queue(&bt_workq, &conn->deferred_work,
CONN_UPDATE_TIMEOUT);
}
#endif /* CONFIG_BT_CONN */
@@ -1203,7 +1205,7 @@ void bt_conn_set_state(struct bt_conn *conn, bt_conn_state_t state)
if (IS_ENABLED(CONFIG_BT_CENTRAL) &&
conn->type == BT_CONN_TYPE_LE &&
bt_dev.create_param.timeout != 0) {
- k_work_schedule(&conn->deferred_work,
+ k_work_schedule_for_queue(&bt_workq, &conn->deferred_work,
K_MSEC(10 * bt_dev.create_param.timeout));
}
diff --git a/subsys/bluetooth/host/hci_core.c b/subsys/bluetooth/host/hci_core.c
index f04d817fd37..71dfde97020 100644
--- a/subsys/bluetooth/host/hci_core.c
+++ b/subsys/bluetooth/host/hci_core.c
@@ -70,10 +70,12 @@ LOG_MODULE_REGISTER(bt_hci_core);
static void rx_work_handler(struct k_work *work);
static K_WORK_DEFINE(rx_work, rx_work_handler);
#if defined(CONFIG_BT_RECV_WORKQ_BT)
-static struct k_work_q bt_workq;
+static struct k_work_q rx_workq;
static K_KERNEL_STACK_DEFINE(rx_thread_stack, CONFIG_BT_RX_STACK_SIZE);
#endif /* CONFIG_BT_RECV_WORKQ_BT */
#endif /* !CONFIG_BT_RECV_BLOCKING */
+struct k_work_q bt_workq;
+static K_KERNEL_STACK_DEFINE(workq_thread_stack, 2048);
static struct k_thread tx_thread_data;
static K_KERNEL_STACK_DEFINE(tx_thread_stack, CONFIG_BT_HCI_TX_STACK_SIZE);
@@ -501,7 +503,7 @@ static void hci_num_completed_packets(struct net_buf *buf)
sys_slist_append(&conn->tx_complete, &tx->node);
irq_unlock(key);
- k_work_submit(&conn->tx_complete_work);
+ k_work_submit_to_queue(&bt_workq, &conn->tx_complete_work);
k_sem_give(bt_conn_get_pkts(conn));
}
@@ -3840,7 +3842,7 @@ static void rx_queue_put(struct net_buf *buf)
#if defined(CONFIG_BT_RECV_WORKQ_SYS)
const int err = k_work_submit(&rx_work);
#elif defined(CONFIG_BT_RECV_WORKQ_BT)
- const int err = k_work_submit_to_queue(&bt_workq, &rx_work);
+ const int err = k_work_submit_to_queue(&rx_workq, &rx_work);
#endif /* CONFIG_BT_RECV_WORKQ_SYS */
if (err < 0) {
LOG_ERR("Could not submit rx_work: %d", err);
@@ -4032,7 +4034,7 @@ static void rx_work_handler(struct k_work *work)
#if defined(CONFIG_BT_RECV_WORKQ_SYS)
err = k_work_submit(&rx_work);
#elif defined(CONFIG_BT_RECV_WORKQ_BT)
- err = k_work_submit_to_queue(&bt_workq, &rx_work);
+ err = k_work_submit_to_queue(&rx_workq, &rx_work);
#endif
if (err < 0) {
LOG_ERR("Could not submit rx_work: %d", err);
@@ -4097,12 +4099,17 @@ int bt_enable(bt_ready_cb_t cb)
#if defined(CONFIG_BT_RECV_WORKQ_BT)
/* RX thread */
- k_work_queue_init(&bt_workq);
- k_work_queue_start(&bt_workq, rx_thread_stack,
+ k_work_queue_init(&rx_workq);
+ k_work_queue_start(&rx_workq, rx_thread_stack,
CONFIG_BT_RX_STACK_SIZE,
K_PRIO_COOP(CONFIG_BT_RX_PRIO), NULL);
- k_thread_name_set(&bt_workq.thread, "BT RX");
+ k_thread_name_set(&rx_workq.thread, "BT RX");
#endif
+ k_work_queue_init(&bt_workq);
+ k_work_queue_start(&bt_workq, workq_thread_stack,
+ K_THREAD_STACK_SIZEOF(workq_thread_stack),
+ K_PRIO_COOP(CONFIG_BT_RX_PRIO), NULL);
+ k_thread_name_set(&bt_workq.thread, "BT WorkQ");
err = bt_dev.drv->open();
if (err) {
@@ -4116,7 +4123,7 @@ int bt_enable(bt_ready_cb_t cb)
return bt_init();
}
- k_work_submit(&bt_dev.init);
+ k_work_submit_to_queue(&bt_workq, &bt_dev.init);
return 0;
}
@@ -4181,7 +4188,7 @@ int bt_disable(void)
#if defined(CONFIG_BT_RECV_WORKQ_BT)
/* Abort RX thread */
- k_thread_abort(&bt_workq.thread);
+ k_thread_abort(&rx_workq.thread);
#endif
bt_monitor_send(BT_MONITOR_CLOSE_INDEX, NULL, 0);
--
2.34.1
FYI, the following patch is a functioning workaround in my setup:
Could you please open a PR so we can discuss this approach in it?
I don't really think it's a viable approach, it's just my dirty workaround. I don't currently have the bandwidth to come up with something good.
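To make the idea behind the workaround easier to follow, here is a minimal model in plain C. It is not Zephyr code and it is not taken from the patch: the names `work_queue`, `submit`, `step` and `run_model` are hypothetical, "blocking" is modelled as a head item that cannot complete yet, and the whole path from the expander interrupt through `bt_spi_rx_thread` to giving `sync_sem` is collapsed into a single function. Under those assumptions it only shows why one shared queue gets stuck while a dedicated Bluetooth queue does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model: each queue is one thread that runs its items strictly
 * in order. An item returns true when it finished, false if it is still
 * blocked. A blocked head item stalls every item queued behind it. */
enum { MAX_ITEMS = 4 };

typedef bool (*work_fn)(void);

struct work_queue {
    work_fn items[MAX_ITEMS];
    size_t head;
    size_t count;
};

/* Stands in for sync_sem (assumption: a plain flag instead of a semaphore). */
static bool sync_sem_given;

/* Models the Bluetooth work item that calls bt_hci_cmd_send_sync():
 * it cannot finish until sync_sem has been given. */
static bool bt_work(void)
{
    return sync_sem_given;
}

/* Models the port expander's interrupt work item. In the real system this
 * triggers the HCI RX path, which eventually gives sync_sem; the model
 * collapses that chain into one step. */
static bool expander_irq_work(void)
{
    sync_sem_given = true;
    return true;
}

static void submit(struct work_queue *q, work_fn fn)
{
    assert(q->count < MAX_ITEMS);
    q->items[q->count++] = fn;
}

/* Try the head item once; advance only if it completes.
 * Returns true if the queue made progress. */
static bool step(struct work_queue *q)
{
    if (q->head == q->count) {
        return false; /* queue is empty */
    }
    if (!q->items[q->head]()) {
        return false; /* head item is blocked, so the queue is blocked */
    }
    q->head++;
    return true;
}

/* Returns true if all work items complete, false if the system is stuck. */
bool run_model(bool dedicated_bt_queue)
{
    struct work_queue sysq = {0};
    struct work_queue btq = {0};

    sync_sem_given = false;
    submit(dedicated_bt_queue ? &btq : &sysq, bt_work);
    submit(&sysq, expander_irq_work);

    /* Keep going for as long as either queue makes progress. */
    while (step(&sysq) || step(&btq)) {
    }

    return sysq.head == sysq.count && btq.head == btq.count;
}
```

In this model `run_model(false)` returns `false`, because the Bluetooth item sits at the head of the shared queue and the expander item behind it never runs. `run_model(true)` returns `true`, because the expander item runs on its own queue and unblocks the Bluetooth item. The real stack also times out and errors rather than waiting forever, which the model leaves out.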
Describe the bug
When using an SPI-attached Bluetooth controller (e.g. BlueNRG-MS) whose IRQ pin is attached to a port expander, deadlocks and subsequent crashes occur. This is because both the Bluetooth stack and the port expander driver rely on the system work queue.
The sequence of events is as follows:
1. The Bluetooth stack submits a work item to the system work queue (see grep -Rw k_work_submit subsys/bluetooth/).
2. This work item calls bt_hci_cmd_send_sync, which blocks on sync_sem: https://github.com/zephyrproject-rtos/zephyr/blob/9f46db90c832e56363c5d7bb42892651b39b271d/subsys/bluetooth/host/hci_core.c#L329-L330
3. sync_sem is given through bt_spi_rx_thread -> bt_recv -> ... -> hci_cmd_complete -> hci_cmd_done when an interrupt is received on the SPI interface.
4. The port expander driver handles its interrupt in a work item that is also submitted to the system work queue (see grep k_work_submit drivers/gpio/*.c).
5. The system work queue is still blocked in the first work item (bt_hci_cmd_send_sync), waiting for the next item in the system work queue (the interrupt callback) to be run.
Example backtrace of the blocking work item (which eventually errors):
Example backtrace of a successful run of the interrupt callback, triggering bt_spi_rx_thread (in the deadlock case, this work item is stuck in the work queue):
Example backtrace of the bt_spi_rx_thread unblocking the first work item:
To Reproduce
Steps to reproduce the behavior: None so far, but I can try to create a reproducer if necessary (though it would still require specific hardware)
Expected behavior
No deadlocks occur, Bluetooth works as expected.
Impact
Bluetooth is unusable on boards where a BlueNRG-MS's IRQ pin is connected through a port expander.
Logs and console output
Eventually crashes with an oops:
Environment (please complete the following information):
Additional context
I have found some past issues related to the use of the system work queue by the bluetooth stack:
#49661
#53455