zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.84k stars 6.6k forks source link

Crash when using SPI bluetooth controller with IRQ pin attached to port expander #55279

Closed arbrauns closed 4 months ago

arbrauns commented 1 year ago

Describe the bug When using an SPI-attached bluetooth controller (e.g. BlueNRG-MS) whose IRQ pin is attached to a port expander, deadlocks and subsequent crashes occur. This is because both the blutetooth stack and the port expander driver rely on the system work queue.

The sequence of events is as follows:

  1. Bluetooth stack submits a work item to the system work queue (grep -Rw k_work_submit subsys/bluetooth/)
  2. This work item calls bt_hci_cmd_send_sync, which blocks on sync_sem: https://github.com/zephyrproject-rtos/zephyr/blob/9f46db90c832e56363c5d7bb42892651b39b271d/subsys/bluetooth/host/hci_core.c#L329-L330
  3. sync_sem is given through bt_spi_rx_thread -> bt_recv -> ... -> hci_cmd_complete -> hci_cmd_done when an interrupt is received on the SPI interface
  4. In order to dispatch pin interrupt callbacks, the port expander driver schedules a work item in the system work queue (grep k_work_submit drivers/gpio/*.c).
  5. Now, the system work queue is blocked (in bt_hci_cmd_send_sync), waiting for the next item in the system work queue (the interrupt callback) to be run.

Example backtrace of the blocking work item (which eventually errors):

...
#5  0x0803031e in _oops () at $ZEPHYR/arch/arm/core/aarch32/swap_helper.S:570
#6  <signal handler called>
#7  0x08034c60 in bt_hci_cmd_send_sync (opcode=opcode@entry=8202, buf=buf@entry=0x2004ec08 <_net_buf_hci_cmd_pool+32>, rsp=rsp@entry=0x0) at $ZEPHYR/subsys/bluetooth/host/hci_core.c:330
#8  0x08037fd2 in bt_le_adv_set_enable_legacy (adv=adv@entry=0x20020a48 <bt_dev+24>, enable=<optimized out>) at $ZEPHYR/subsys/bluetooth/host/adv.c:336
#9  0x0803800e in bt_le_adv_set_enable (adv=adv@entry=0x20020a48 <bt_dev+24>, enable=enable@entry=true) at $ZEPHYR/subsys/bluetooth/host/adv.c:386
#10 0x080386ec in bt_le_adv_resume () at $ZEPHYR/subsys/bluetooth/host/adv.c:1513
#11 0x08039b6a in bt_conn_unref (conn=conn@entry=0x20027b28 <acl_conns>) at $ZEPHYR/subsys/bluetooth/host/conn.c:1267
#12 0x0803aaa2 in deferred_work (work=0x20027b88 <acl_conns+96>) at $ZEPHYR/subsys/bluetooth/host/conn.c:1734
#13 0x080aaec4 in work_queue_main (workq_ptr=0x2002c6e0 <k_sys_work_q>, p2=<optimized out>, p3=<optimized out>) at $ZEPHYR/kernel/work.c:668
#14 0x0801c38a in z_thread_entry (entry=0x80aadf5 <work_queue_main>, p1=0x2002c6e0 <k_sys_work_q>, p2=0x0, p3=0x0) at $ZEPHYR/lib/os/thread_entry.c:36
#15 0xaaaaaaaa in ?? ()

Example backtrace of a successful run of the interrupt callback, triggering bt_spi_rx_thread (in the deadlock case, this work item is stuck in the work queue):

#0  bt_spi_isr (unused1=0x80b5f44 <__device_dts_ord_35>, unused2=0x20038970 <gpio_cb>, unused3=1) at $ZEPHYR/drivers/bluetooth/hci/spi.c:165
#1  0x0806cff4 in gpio_fire_callbacks (list=list@entry=0x2002204c <mcp230xx_0_drvdata+28>, port=port@entry=0x80b5f44 <__device_dts_ord_35>, pins=1) at $ZEPHYR/include/zephyr/drivers/gpio/gpio_utils.h:78
#2  0x0806db6a in mcp23xxx_work_handler (work=0x20022064 <mcp230xx_0_drvdata+52>) at $ZEPHYR/drivers/gpio/gpio_mcp23xxx.c:446
#3  0x080aaec4 in work_queue_main (workq_ptr=0x2002c6e0 <k_sys_work_q>, p2=<optimized out>, p3=<optimized out>) at $ZEPHYR/kernel/work.c:668
#4  0x0801c38a in z_thread_entry (entry=0x80aadf5 <work_queue_main>, p1=0x2002c6e0 <k_sys_work_q>, p2=0x0, p3=0x0) at $ZEPHYR/lib/os/thread_entry.c:36
#5  0xaaaaaaaa in ?? ()

Example backtrace of the bt_spi_rx_thread unblocking the first work item:

#0  hci_cmd_complete (buf=0x2004ebe8 <_net_buf_hci_cmd_pool>) at $ZEPHYR/subsys/bluetooth/host/hci_core.c:2101
#1  0x0803335c in handle_event_common (event=event@entry=14 '\016', buf=buf@entry=0x2004ebe8 <_net_buf_hci_cmd_pool>, handlers=handlers@entry=0x80c42b8 <prio_events>, num_handlers=num_handlers@entry=5) at $ZEPHYR/subsys/bluetooth/host/hci_core.c:164
#2  0x08034148 in handle_event (event=<optimized out>, buf=buf@entry=0x2004ebe8 <_net_buf_hci_cmd_pool>, handlers=handlers@entry=0x80c42b8 <prio_events>, num_handlers=num_handlers@entry=5) at $ZEPHYR/subsys/bluetooth/host/hci_core.c:176
#3  0x08036284 in hci_event_prio (buf=buf@entry=0x2004ebe8 <_net_buf_hci_cmd_pool>) at $ZEPHYR/subsys/bluetooth/host/hci_core.c:3475
#4  0x0803635e in bt_recv (buf=buf@entry=0x2004ebe8 <_net_buf_hci_cmd_pool>) at $ZEPHYR/subsys/bluetooth/host/hci_core.c:3525
#5  0x08076760 in bt_spi_rx_thread () at $ZEPHYR/drivers/bluetooth/hci/spi.c:381
#6  0x0801c38a in z_thread_entry (entry=0x80764fd <bt_spi_rx_thread>, p1=0x0, p2=0x0, p3=0x0) at $ZEPHYR/lib/os/thread_entry.c:36
#7  0xaaaaaaaa in ?? ()

To Reproduce Steps to reproduce the behavior: None so far, but I can try to create a reproducer if necessary (though it would still require specific hardware)

Expected behavior No deadlocks occur, bluetooth works as expected.

Impact Bluetooth is unusable on boards where a BlueNRG-MS's IRQ pin is connected through a port expander.

Logs and console output Eventually crashes with an oops:

ASSERTION FAIL [err == 0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/host/hci_core.c:330
    k_sem_take failed with err -11
[00:01:10.319,000] <err> os: r0/a1:  0x00000003  r1/a2:  0x00000002  r2/a3:  0x00000000
[00:01:10.319,000] <err> os: r3/a4:  0x00000000 r12/ip:  0x000abadb r14/lr:  0x08034c55
[00:01:10.319,000] <err> os:  xpsr:  0x41000000
[00:01:10.319,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:01:10.319,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:01:10.320,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:01:10.320,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:01:10.320,000] <err> os: fpscr:  0x00000000
[00:01:10.320,000] <err> os: Faulting instruction address (r15/pc): 0x08034c60
[00:01:10.320,000] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:01:10.320,000] <err> os: Current thread: 0x2002c6e0 (sysworkq)
[00:01:10.320,000] <err> os: Halting system

Environment (please complete the following information):

Additional context

I have found some past issues related to the use of the system work queue by the bluetooth stack:

nashif commented 1 year ago

@arbrauns what platform is this?

jori-nordic commented 1 year ago

I'll take a look. Bluetooth should not be blocking the syswq at any time, so it is a pretty bad bug indeed.

arbrauns commented 1 year ago

@arbrauns what platform is this?

It's running on an stm32f767, with the following bluetooth-related Kconfigs:

CONFIG_BT=y
CONFIG_BT_SPI=y
CONFIG_BT_SPI_BLUENRG=y
CONFIG_BT_HCI_TX_STACK_SIZE=1024
CONFIG_BT_HCI_TX_STACK_SIZE_WITH_PROMPT=y
CONFIG_BT_HCI_ACL_FLOW_CONTROL=n # not supported by bluetooth module
CONFIG_BT_PERIPHERAL=y
CONFIG_BT_DEVICE_NAME_DYNAMIC=y
carlescufi commented 1 year ago

@arbrauns The Bluetooth subsystem can be configured to use either the system workqueue or a dedicated one for RX (although it will still use the system workqueue for other operations). Can you check that CONFIG_BT_RECV_CONTEXT is set to CONFIG_BT_RECV_WORKQ_BT?

I don't think this will fix your problem, but let's continue the investigation.

arbrauns commented 1 year ago

Yes, CONFIG_BT_RECV_WORKQ_BT is default-set. I had already experimented with simply moving some of the blocking operations to the RX workqueue, which helped with the specific problem, but creates problems in the receive path, of course.

github-actions[bot] commented 1 year ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

arbrauns commented 1 year ago

I don't think it has magically fixed itself, the bluetooth subsystem still submits work to the system work queue.

jori-nordic commented 1 year ago

@arbrauns the stale label just means that the issue hasn't seen activity, not that it's gone. I'll remove it. It's still low-prio though as the workaround reasonable: use a memory-mapped GPIO instead of an extender.

arbrauns commented 1 year ago

@arbrauns the stale label just means that the issue hasn't seen activity, not that it's gone.

It does mean that the issue will effectively be closed as wontfix.

the workaround reasonable: use a memory-mapped GPIO instead of an extender.

Can't really do that if the hardware already exists.

jori-nordic commented 1 year ago

@arbrauns is it only bt_le_adv_resume that's blocking you right now? I can make a quick PR to move that out.

The deadlock thing will be addressed by making the stack runnable with pre-emptible prios, which I've just started work on, but that will take time.

github-actions[bot] commented 1 year ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

kruithofa commented 1 year ago

@arbrauns do you still have this issue?

@arbrauns is it only bt_le_adv_resume that's blocking you right now? I can make a quick PR to move that out.

See also #52364

arbrauns commented 1 year ago

Sorry, I'm currently busy with other things, it might take me another week or two to test this again. 3.4.0 has also changed a couple things related to HCI-SPI interrupt handling, so this exact issue might not even be relevant anymore.

github-actions[bot] commented 1 year ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

jhedberg commented 1 year ago

Closing this since there hasn't been reports for the latest tree. @arbrauns feel free to reopen if you still have this issue.

arbrauns commented 7 months ago

Still encountering this on 3.6.0, can someone reopen?

arbrauns commented 7 months ago

FYI, the following patch is a functioning workaround in my setup:

From f427ec6882fdc28d0b9d09f45cda53e70ce9f13f Mon Sep 17 00:00:00 2001
From: Armin Brauns <armin.brauns@embedded-solutions.at>
Date: Fri, 3 Mar 2023 11:28:30 +0100
Subject: [PATCH] bluetooth: put workqueue items in local workqueue instead of
 syswq

See https://github.com/zephyrproject-rtos/zephyr/issues/55279

Signed-off-by: Armin Brauns <armin.brauns@embedded-solutions.at>
---
 subsys/bluetooth/host/conn.c     |  8 +++++---
 subsys/bluetooth/host/hci_core.c | 25 ++++++++++++++++---------
 2 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/subsys/bluetooth/host/conn.c b/subsys/bluetooth/host/conn.c
index 344b10de5b8..63474a8db31 100644
--- a/subsys/bluetooth/host/conn.c
+++ b/subsys/bluetooth/host/conn.c
@@ -64,6 +64,8 @@ BUILD_ASSERT(sizeof(struct tx_meta) == CONFIG_BT_CONN_TX_USER_DATA_SIZE,
 #define tx_data(buf) ((struct tx_meta *)net_buf_user_data(buf))
 K_FIFO_DEFINE(free_tx);

+extern struct k_work_q bt_workq;
+
 static void tx_free(struct bt_conn_tx *tx);

 static void conn_tx_destroy(struct bt_conn *conn, struct bt_conn_tx *tx)
@@ -812,7 +814,7 @@ static void conn_cleanup(struct bt_conn *conn)

    bt_conn_reset_rx_state(conn);

-   k_work_reschedule(&conn->deferred_work, K_NO_WAIT);
+   k_work_reschedule_for_queue(&bt_workq, &conn->deferred_work, K_NO_WAIT);
 }

 static void conn_destroy(struct bt_conn *conn, void *data)
@@ -1099,7 +1101,7 @@ void bt_conn_set_state(struct bt_conn *conn, bt_conn_state_t state)
            }
 #endif /* CONFIG_BT_GAP_AUTO_UPDATE_CONN_PARAMS */

-           k_work_schedule(&conn->deferred_work,
+           k_work_schedule_for_queue(&bt_workq, &conn->deferred_work,
                    CONN_UPDATE_TIMEOUT);
        }
 #endif /* CONFIG_BT_CONN */
@@ -1203,7 +1205,7 @@ void bt_conn_set_state(struct bt_conn *conn, bt_conn_state_t state)
        if (IS_ENABLED(CONFIG_BT_CENTRAL) &&
            conn->type == BT_CONN_TYPE_LE &&
            bt_dev.create_param.timeout != 0) {
-           k_work_schedule(&conn->deferred_work,
+           k_work_schedule_for_queue(&bt_workq, &conn->deferred_work,
                    K_MSEC(10 * bt_dev.create_param.timeout));
        }

diff --git a/subsys/bluetooth/host/hci_core.c b/subsys/bluetooth/host/hci_core.c
index f04d817fd37..71dfde97020 100644
--- a/subsys/bluetooth/host/hci_core.c
+++ b/subsys/bluetooth/host/hci_core.c
@@ -70,10 +70,12 @@ LOG_MODULE_REGISTER(bt_hci_core);
 static void rx_work_handler(struct k_work *work);
 static K_WORK_DEFINE(rx_work, rx_work_handler);
 #if defined(CONFIG_BT_RECV_WORKQ_BT)
-static struct k_work_q bt_workq;
+static struct k_work_q rx_workq;
 static K_KERNEL_STACK_DEFINE(rx_thread_stack, CONFIG_BT_RX_STACK_SIZE);
 #endif /* CONFIG_BT_RECV_WORKQ_BT */
 #endif /* !CONFIG_BT_RECV_BLOCKING */
+struct k_work_q bt_workq;
+static K_KERNEL_STACK_DEFINE(workq_thread_stack, 2048);
 static struct k_thread tx_thread_data;
 static K_KERNEL_STACK_DEFINE(tx_thread_stack, CONFIG_BT_HCI_TX_STACK_SIZE);

@@ -501,7 +503,7 @@ static void hci_num_completed_packets(struct net_buf *buf)
            sys_slist_append(&conn->tx_complete, &tx->node);
            irq_unlock(key);

-           k_work_submit(&conn->tx_complete_work);
+           k_work_submit_to_queue(&bt_workq, &conn->tx_complete_work);
            k_sem_give(bt_conn_get_pkts(conn));
        }

@@ -3840,7 +3842,7 @@ static void rx_queue_put(struct net_buf *buf)
 #if defined(CONFIG_BT_RECV_WORKQ_SYS)
    const int err = k_work_submit(&rx_work);
 #elif defined(CONFIG_BT_RECV_WORKQ_BT)
-   const int err = k_work_submit_to_queue(&bt_workq, &rx_work);
+   const int err = k_work_submit_to_queue(&rx_workq, &rx_work);
 #endif /* CONFIG_BT_RECV_WORKQ_SYS */
    if (err < 0) {
        LOG_ERR("Could not submit rx_work: %d", err);
@@ -4032,7 +4034,7 @@ static void rx_work_handler(struct k_work *work)
 #if defined(CONFIG_BT_RECV_WORKQ_SYS)
        err = k_work_submit(&rx_work);
 #elif defined(CONFIG_BT_RECV_WORKQ_BT)
-       err = k_work_submit_to_queue(&bt_workq, &rx_work);
+       err = k_work_submit_to_queue(&rx_workq, &rx_work);
 #endif
        if (err < 0) {
            LOG_ERR("Could not submit rx_work: %d", err);
@@ -4097,12 +4099,17 @@ int bt_enable(bt_ready_cb_t cb)

 #if defined(CONFIG_BT_RECV_WORKQ_BT)
    /* RX thread */
-   k_work_queue_init(&bt_workq);
-   k_work_queue_start(&bt_workq, rx_thread_stack,
+   k_work_queue_init(&rx_workq);
+   k_work_queue_start(&rx_workq, rx_thread_stack,
               CONFIG_BT_RX_STACK_SIZE,
               K_PRIO_COOP(CONFIG_BT_RX_PRIO), NULL);
-   k_thread_name_set(&bt_workq.thread, "BT RX");
+   k_thread_name_set(&rx_workq.thread, "BT RX");
 #endif
+   k_work_queue_init(&bt_workq);
+   k_work_queue_start(&bt_workq, workq_thread_stack,
+              K_THREAD_STACK_SIZEOF(workq_thread_stack),
+              K_PRIO_COOP(CONFIG_BT_RX_PRIO), NULL);
+   k_thread_name_set(&bt_workq.thread, "BT WorkQ");

    err = bt_dev.drv->open();
    if (err) {
@@ -4116,7 +4123,7 @@ int bt_enable(bt_ready_cb_t cb)
        return bt_init();
    }

-   k_work_submit(&bt_dev.init);
+   k_work_submit_to_queue(&bt_workq, &bt_dev.init);
    return 0;
 }

@@ -4181,7 +4188,7 @@ int bt_disable(void)

 #if defined(CONFIG_BT_RECV_WORKQ_BT)
    /* Abort RX thread */
-   k_thread_abort(&bt_workq.thread);
+   k_thread_abort(&rx_workq.thread);
 #endif

    bt_monitor_send(BT_MONITOR_CLOSE_INDEX, NULL, 0);
--
2.34.1
carlescufi commented 7 months ago

FYI, the following patch is a functioning workaround in my setup:

Could you please open a PR so we can discuss this approach in it?

arbrauns commented 7 months ago

I don't really think it's a viable approach, it's just my dirty workaround. I don't currently have the bandwidth to come up with something good.

github-actions[bot] commented 5 months ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.