zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.08k stars 6.19k forks source link

Bluetooth: Host: `bt_conn_cb_register` is not thread-safe #74173

Open alwa-nordic opened 3 weeks ago

alwa-nordic commented 3 weeks ago

bt_conn_cb_register and bt_conn_cb_unregister are not thread safe.

Fixing this is non-trivial because of bt_conn_cb_unregister. The following must be taken into consideration:

It may be necessary to alter the API. Either by limiting in documentation when bt_conn_cb_unregister may be called, or better by changing the API to naturally synchronize. Alternatively, consider removing bt_conn_cb_unregister, as the use-case it was added for is poorly motivated[^1].

[^1]: The simpler fix would have been to call bt_conn_cb_register once on boot from SYS_INIT.

rugeGerritsen commented 3 weeks ago

The most common pattern is to register the callback before starting Bluetooth, right? Are there use cases where this is not the case? If not, I suggest documenting it.

Thalley commented 3 weeks ago

Is this an issue specifically for the conn callbacks, or for all callbacks?

jhedberg commented 3 weeks ago

I'm not opposed to removing the unregister() API, however at least the concern where linked list nodes may get removed in the middle of iterating the list should be covered by the existence of SYS_SLIST_FOR_EACH_NODE_SAFE() (which is specifically designed for such a use case).

alwa-nordic commented 3 weeks ago

I'm not opposed to removing the unregister() API, however at least the concern where linked list nodes may get removed in the middle of iterating the list should be covered by the existence of SYS_SLIST_FOR_EACH_NODE_SAFE() (which is specifically designed for such a use case).

Unfortunately SYS_SLIST_FOR_EACH_NODE_SAFE only makes it safe to delete the one item that the iteration is currently at. It's not safe to perform an arbitrary amount of deletions.

It is possible to make a custom iterator-aware list implementation that is safe.

alwa-nordic commented 3 weeks ago

Is this an issue specifically for the conn callbacks, or for all callbacks?

This is probably a general issue. But I think it's best to consider cases one by one, as the situation may be slightly different each time, and the problem at hand is more tangible.

I think it's best to limit the scope of a single GitHub ticket to focus and make our product better a little at a time.

alwa-nordic commented 3 weeks ago

The most common pattern is to register the callback before starting Bluetooth, right? Are there use cases where this is not the case? If not, I suggest documenting it.

Yes, can't think of a good case. This amount of flexibility is only truly needed for exotic stuff like for dynamically loading an arbitrary number of libraries runtime that need Bluetooth callbacks.

How should we phrase this in the documentation? As a recommendation? Or as a requirement?

Thalley commented 3 weeks ago

Is this an issue specifically for the conn callbacks, or for all callbacks?

This is probably a general issue. But I think it's best to consider cases one by one, as the situation may be slightly different each time, and the problem at hand is more tangible.

I think it's best to limit the scope of a single GitHub ticket to focus and make our product better a little at a time.

The reason why I asked is that we have multiple cases where we register/unregister callbacks in the LE audio stack and samples.

We may want to scan for something specific and do

  1. register scan callbacks
  2. filter on adv reports for what we are looking for
  3. unregister scan callbacks once it's found

Step 3 can of course be replaced by a local boolean where we just return early in the scan receive callback.

Another use case we have for unregister (and the reason why we recently added some unregister in the LE audio API), is that when running unittests, we want to register callbacks at the beginning of the test, and remove it again at the end. The reason for this is that memory isn't cleared between each run, so we need to clear it manually. There may be some workarounds that can make this work, but there certainly are some use cases for unregister.

aescolar commented 2 weeks ago

@alwa-nordic Even though this could be classified as a bug, it seems to be more of a general "we need to improve this area" kind of thing. So I will retag it as enhancement. So it follows a different triaging path. If you guys disagree, you are welcome to tag it back to bug and set a priority.

alwa-nordic commented 2 weeks ago

Ok. At least enhancement seems consistent with other issues related to thread-safety: https://github.com/zephyrproject-rtos/zephyr/issues/52364

We should evaluate whether these issues should be counted as bugs or not in the Bluetooth WG meeting. I'll add the tag.

Thalley commented 2 weeks ago

Another thing to keep in mind is that an application can also at any point set/unset the function pointers in struct bt_conn_cb. This is another way for an application to avoid getting callbacks when it's not interested (similar to the bool approach mentioned above). In this case the list itself is not modified, but the pointers inside each list node may be