zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.49k stars 6.42k forks source link

bluetooth: audio: race hazard in bt_bap_unicast_client_discover() #77130

Open Chris-StJohn-Bose opened 1 month ago

Chris-StJohn-Bose commented 1 month ago

Describe the bug This snippet of code assumes that unicast_client_pac_discover_cb() isn't called before bt_bap_unicast_client_discover() returns.

client->disc_params.func = unicast_client_pac_discover_cb;
client->disc_params.type = BT_GATT_DISCOVER_CHARACTERISTIC;
client->disc_params.start_handle = BT_ATT_FIRST_ATTRIBUTE_HANDLE;
client->disc_params.end_handle = BT_ATT_LAST_ATTRIBUTE_HANDLE;

err = bt_gatt_discover(conn, &client->disc_params);
if (err != 0) {
    return err;
}

client->dir = dir;
client->busy = true;

https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/bluetooth/audio/bap_unicast_client.c#L4194C1-L4210C22

but since unicast_client_pac_discover_cb() is called from a different thread a pre-emptive context switch before setting client->dir or client->busy will cause the discovery state to get very confused.

Potentially, for instance, the discovery might complete before client->busy = true is reached in which case there's no pending callback to set it false again.

I suggest instead that there is some "unwinding" of the state on error:

client->disc_params.func = unicast_client_pac_discover_cb;
client->disc_params.type = BT_GATT_DISCOVER_CHARACTERISTIC;
client->disc_params.start_handle = BT_ATT_FIRST_ATTRIBUTE_HANDLE;
client->disc_params.end_handle = BT_ATT_LAST_ATTRIBUTE_HANDLE;
client->dir = dir;
client->busy = true;

err = bt_gatt_discover(conn, &client->disc_params);
if (err != 0) {
    client->busy = false;
    return err;
}

I raise this as a bug rather than a PR since there may be other places where this anti-pattern has been used?

To Reproduce Insert an intentional context switch before client->dir = dir such as k_sleep(K_SECONDS(5))

Expected behavior The implementation shouldn't expect to execute logic out-of-order on the assumption that "it's fast enough".

Impact Minor. Realistically this probably won't happen - until it does.

Logs and console output N/A - found whilst reviewing code

Environment (please complete the following information): N/A - found whilst reviewing code

Additional context None

Thalley commented 3 weeks ago

Thanks @Chris-StJohn-Bose - This is indeed a problematic pattern that is used many places in LE Audio. It is unlikely to cause actual problems, but it is something we actually hit often when implementing unit tests.

Chris-StJohn-Bose commented 3 weeks ago

@Thalley you're welcome. I have been toying with the idea of implementing (in my spare time...) a synchronous GATT API for Zephyr so that we could eventually push all the multi-threaded stuff "under the hood".

I don't suppose you're aware of any other efforts in this direction? For Audio LE in particular the "application" thread is mainly idling whilst waiting for its callback; several example applications I've seen even implement a synchronous layer to make the example more readable e.g. https://github.com/zephyrproject-rtos/zephyr/blob/main/samples/bluetooth/bap_unicast_client/src/main.c

Thalley commented 3 weeks ago

@Thalley you're welcome. I have been toying with the idea of implementing (in my spare time...) a synchronous GATT API for Zephyr so that we could eventually push all the multi-threaded stuff "under the hood".

I don't suppose you're aware of any other efforts in this direction? For Audio LE in particular the "application" thread is mainly idling whilst waiting for its callback; several example applications I've seen even implement a synchronous layer to make the example more readable e.g. https://github.com/zephyrproject-rtos/zephyr/blob/main/samples/bluetooth/bap_unicast_client/src/main.c

I am not aware of any plans to make a synchronous GATT API (and doubt that Zephyr will ever go in that direction). There are plans to reduce the number of threads that the stack uses for Bluetooth (we recently removed the TX thread).

Ideally in the future we will be using a event based API so that an application could do something like

bt_gatt_discover(...);
get_event(BT_GATT_DISCOVER_COMPLETED);

or something like that (effectively an Async/await pattern (https://en.wikipedia.org/wiki/Async/await)

Chris-StJohn-Bose commented 3 weeks ago

The direction of my thinking is that currently the discovery callback function, for instance, may be called any time after bt_gatt_discover() is invoked - even before it returns (hence this bug...)

What if there were an option to set a GATT connection to be synchronous so that it is guaranteed that all callbacks would be called before return and the params pointer never retained? It would require minimal code changes to upper layer stacks such as BAP etc (and might expose a few bugs like this one) but an application setting this option would have much better control over callbacks, memory lifetime and error handling.

Anyway, just a thought. Pretty similar to your get_event() but implicit with an option rather than explicit after each API call.

Thalley commented 3 weeks ago

That is a fair point. It would be similar to bt_hci_cmd_send_sync (https://github.com/zephyrproject-rtos/zephyr/blob/main/include/zephyr/bluetooth/hci.h#L62-L105).

From a simplicity perspective (in the application) it could make sense. There is, however, nothing stopping applications from doing this themselves, and is in fact how many of our babblesim tests are designed.

Whether it belongs in the stack is debatable. It adds complexity (if both sync and async can be used at the same time), it worsens performance as you wouldn't do multiple outstanding requests on multiple connections, adds memory and computational overhead, etc.

To me it seems like a patch for this callback-oriented architecture Bluetooth in Zephyr has, rather than a full proper solution (which I would argue that an event-based implementation would be).