zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.86k stars 6.62k forks source link

Deferred bt_conn_tx causes sysworkq deadlock #16803

Closed MariuszSkamra closed 5 years ago

MariuszSkamra commented 5 years ago

Mesh pb-gatt sends several GATT notifications in a row. It runs in system workqueue thread context, so that it is possible to cause deadlock after recent changes that deferred bt_conn_tx callback to system workqueue.

While sending several notifications in a row, bt_att_send takes tx_sem

int bt_att_send(struct bt_conn *conn, struct net_buf *buf, bt_conn_tx_cb_t cb,
        void *user_data)
{
...
    /* Don't use tx_sem if caller has set it own callback */
    if (!cb) {
        k_sem_take(&att->tx_sem, K_FOREVER);
        if (!att_is_connected(att)) {
            BT_WARN("Disconnected");
            k_sem_give(&att->tx_sem);
            return -ENOTCONN;
        }
    }
...
}

that is later given by bt_conn_tx.

static void att_pdu_sent(struct bt_conn *conn, void *user_data)
{
    struct bt_att *att = att_get(conn);

    BT_DBG("conn %p att %p", conn, att);

    k_sem_give(&att->tx_sem);
}

As a result running bt_att_send and bt_conn_tx from the same thread causes deadlock in ATT.

This regression was introduced with: 362497235df575586a049dd92216910c99e79094

MariuszSkamra commented 5 years ago

RTT_Terminal_201906131359.log

Vudentz commented 5 years ago

So we been blocking on system wq since btgatt* do block when BT_ATT_ENFORCE_FLOW is set, we could perhaps make att.c queue if the semaphore cannot be taken and we don't really need to save the callback as the semaphore is only used when it is set to NULL, then we resume when the previous packet completes.

Vudentz commented 5 years ago

Actually that still likely to block if we hold on the buffer into a queue it will likely block on bt_conn_create_pdu waiting for buffer to become available, so this one maybe more complicated to deal with, perhaps we will have to have the bt_att_req for notification as well so we can queue them up without requiring a buffer though the bt_att_req is something Id like to make internal to the stack so we can queue things up on its own without requiring the client to manage the pool of request.

mbolivar commented 5 years ago

I'm trying to use smp_svr on zephyr 802718a773 with nrf52840_pca10056 and have problems I'm told are related to this issue.

When I run sudo mcumgr --conntype ble --connstring ctlr_name=hci0,peer_name='Zephyr' echo hello, I eventually see Error: can't discover services: ATT request timeout: req timeout.

Here is a btmon trace: https://gist.github.com/mbolivar/d0ea16d55ffaf9d05482eea36175a967

The problem persists even if I add # CONFIG_BT_ATT_ENFORCE_FLOW is not set to my prj.conf, as suggested by @carlescufi. Enabling bluetooth debugging with CONFIG_BT_DEBUG_LOG=y doesn't produce much difference either:

***** Booting Zephyr OS zephyr-v1.14.0-1473-g6e3083fe310f *****
Bluetooth initialized
Advertising successfully started

[00:00:00.008,850] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.008,850] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.008,880] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 1.14 Build 99
[00:00:00.010,742] <inf> bt_hci_core: Identity: e4:ce:3c:20:bb:b2 (random)
[00:00:00.010,772] <inf> bt_hci_core: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1
[00:00:00.010,772] <inf> bt_hci_core: LMP: version 5.0 (0x09) subver 0xffff
uart:~$ Connected
Disconnected (reason 19)
Advertising successfully started

Edit: note that the reason the boot banner says a different zephyr sha is because i have some unrelated west changes on top to fix the west sign command, which I'm using to produce the signed mcuboot binary:

https://github.com/mbolivar/zephyr/tree/west-sign-fix-improve

They don't touch any BT code, and if they weren't producing a signed binary correctly, Zephyr never would have booted.

jhedberg commented 5 years ago

@mbolivar could you try reverting 8a7615f4a3d8932481277dffd41a4a72febd0a61 as well?

tedd-an commented 5 years ago

@mbolivar could you try reverting 8a7615f as well?

You may want to revert 2699d05e45f58897edd7d6d3718c254224390667 as well.

jhedberg commented 5 years ago

@mbolivar could you try reverting 8a7615f as well? You may want to revert 2699d05 as well.

If the issue is flow enforcement related I'd say leave commit 2699d05 as-is and revert 8a7615f

That way you're making TX callbacks happen sooner while keeping the RX callbacks later, which is essential for avoiding the race condition where we get new incoming data before we've been notified that some related piece of outgoing data has been transmitted over the air.

Vudentz commented 5 years ago

Btw, should there be a k_yield after this line:

https://github.com/zephyrproject-rtos/zephyr/blob/master/subsys/mgmt/smp.c#L277

mbolivar commented 5 years ago

could you try reverting 8a7615f as well?

same timeout result.

You may want to revert 2699d05 as well.

Same result as well (with both reverted).

Edit: I should note the following diff is applied to smp_svr/prj.conf in both cases:

+
+# CONFIG_BT_ATT_ENFORCE_FLOW is not set
+CONFIG_BT_DEBUG_LOG=y
+CONFIG_BT_DEBUG_ATT=y
Vudentz commented 5 years ago

could you try reverting 8a7615f as well?

same timeout result.

You may want to revert 2699d05 as well.

Same result as well (with both reverted).

I guess it is a different problem then, do you actually get any traffic? Or perhaps there is something with the user_data that has been changed, thought it had been increased from 7 to 8 bytes.

mbolivar commented 5 years ago

I guess it is a different problem then, do you actually get any traffic? Or perhaps there is something with the user_data that has been changed, thought it had been increased from 7 to 8 bytes.

I assume so; the device does say that it's connected and disconnected. Does the btmon trace I posted earlier help answer the question?

Here is another one for 8a7615f reverted, with the same diff on smp_svr/prj.conf: https://gist.github.com/mbolivar/c5d36baf581fc47a6d51257164adfe78

The device logs:

***** Booting Zephyr OS zephyr-v1.14.0-1467-g8cb8b2553045 *****
Bluetooth initialized
Advertising successfully started

[00:00:00.008,880] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.008,880] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.008,911] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 1.14 Build 99
[00:00:00.010,803] <inf> bt_hci_core: Identity: e4:ce:3c:20:bb:b2 (random)
[00:00:00.010,803] <inf> bt_hci_core: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1
[00:00:00.010,803] <inf> bt_hci_core: LMP: version 5.0 (0x09) subver 0xffff
uart:~$ Connected
[00:00:57.810,913] <dbg> bt_att.bt_att_accept: conn 0x20000630 handle 0
[00:00:57.810,913] <dbg> bt_att.bt_att_connected: chan 0x2000077c cid 0x0004
[00:00:57.818,695] <dbg> bt_att.bt_att_recv: Received ATT code 0x02 len 2
[00:00:57.818,695] <dbg> bt_att.att_mtu_req: Client MTU 512
[00:00:57.818,725] <dbg> bt_att.att_mtu_req: Server MTU 252
[00:00:57.818,725] <dbg> bt_att.att_mtu_req: Negotiated MTU 252
[00:00:57.833,099] <dbg> bt_att.att_rsp_sent: conn 0x20000630 att 0x2000077c
[00:00:57.833,221] <dbg> bt_att.bt_att_recv: Received ATT code 0x10 len 6
[00:00:57.833,282] <dbg> bt_att.att_read_group_req: start_handle 0x0001 end_handle 0xffff type 2800
[00:00:57.833,312] <dbg> bt_att.read_group_cb: handle 0x0001
[00:00:57.833,343] <dbg> bt_att.read_group_cb: handle 0x0009
[00:00:57.833,404] <dbg> bt_att.read_group_cb: handle 0x0010
[00:00:57.833,709] <err> log: argument 3 in log message "%s: start_handle 0x%04x end_handle 0x%04x type %s" missing log_strdup().
uart:~$ Disconnected (reason 19)
Advertising successfully started
[00:01:27.938,323] <dbg> bt_att.bt_att_disconnected: chan 0x2000077c cid 0x0004
jhedberg commented 5 years ago

It's a bit unclear from your logs, @mbolivar do I understand correctly that just reverting 8a7615f didn't help?

As far as I see there are two issues:

The reason I suggested reverting 8a7615f is that it should improve on both issues: with both RX and TX callbacks being on sys wq means you can't currently send more data from the callback without risking a deadlock. Reverting the patch also makes the TX callbacks happen sooner, since there's one less deferral to another thread, so the race condition should be less likely.

As for the log_strdup warning, I've opened a PR that should hopefully fix it: #16819

mbolivar commented 5 years ago

It's a bit unclear from your logs, @mbolivar do I understand correctly that just reverting 8a7615f didn't help?

Yes, that is correct.

The btmon trace in https://github.com/zephyrproject-rtos/zephyr/issues/16803#issuecomment-501808742 is with just 8a7615f reverted (edit: and the other diffs on prj.conf posted earlier disabling flow control and turning on debug logs). The timeout result is the same. The device logs in that comment are from the same test run.

jhedberg commented 5 years ago

@mbolivar ok, then I agree with @Vudentz that further studying is needed since this sounds like something else than we've originally been looking at.

mbolivar commented 5 years ago

Should I open a separate bug?

Vudentz commented 5 years ago

@mbolivar try with https://github.com/zephyrproject-rtos/zephyr/pull/16820 maybe it help with you issue as well.

mbolivar commented 5 years ago

@mbolivar try with #16820 maybe it help with you issue as well.

@Vudentz I tried with c01c68ab48 (and a clean tree, no prj.conf changes) and got similar results.