zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
9.99k stars 6.15k forks source link

Mark bt_hci_cmd_send_sync API __deprecated #74642

Open LingaoM opened 1 week ago

LingaoM commented 1 week ago

Is your enhancement proposal related to a problem? Please describe.

  1. Current bt_hci_cmd_send_sync borrow cmd buffer as response, maybe cause some ram overhead: https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/bluetooth/host/hci_core.c#L162
  2. Not check response buffer length, maybe buffer overflow.
  3. The caller must be free response, otherwise will cause buffer leak.

Describe the solution you'd like Response should alloc by caller, not by API internal.

Describe alternatives you've considered See: https://github.com/zephyrproject-rtos/zephyr/pull/74600 For Zephyr API lifecycle, to avoiding cause breaking API changes, add new API: bt_hci_cmd_send_sync_v2, and old bt_hci_cmd_send_sync mark __deprecated.

Additional context None

jori-nordic commented 5 days ago

if we mark it as deprecated, I would go one step further: make a bt_hci_cmd_send_async() (with maybe a bt_hci_cmd_wait_rsp() too) and let the application manage synchronisation itself.

LingaoM commented 5 days ago

if we mark it as deprecated, I would go one step further: make a bt_hci_cmd_send_async() (with maybe a bt_hci_cmd_wait_rsp() too) and let the application manage synchronisation itself.

What are the benefits of changing to asynchronous?

jori-nordic commented 5 days ago

It opens new possibilities. The application can define a wrapper that restores the synchronous behavior, but it can also work in a more event-loop or cooperative manner.

I.e. you can fire off a VS command that starts a special advertiser or something and immediately start acquisition of a sensor value. You don't have to choose between more memory (for two threads) or slowness due to having to wait for the response all the time. This is especially useful for slow HCI links I suppose.

For usage in the stack, it allows making proper state machines instead of holding hostage the thread that called bt_init() or the syswq. There are not many places where we have a bunch of synchronous calls in the host, but that's where the bugs have been in the past. Defining proper state machines instead of a nest of if () {send-sync-cmd()}s will give us more readability and robustness.

LingaoM commented 5 days ago

A few doubts with asynchronous:

If change from sync to async, how to confirm whether the command is actually executed successfully, asynchronous notification by defining callback? Does such a design require the reconstruction of all designs of the current bluetooth host protocol stack also Mesh and LE Audio?

alwa-nordic commented 4 days ago

If change from sync to async, how to confirm whether the command is actually executed successfully, asynchronous notification by defining callback?

Yes, an async API has to give some kind of signal when the result is ready. This can be a callback, semaphore or other event.

Does such a design require the reconstruction of all designs of the current bluetooth host protocol stack also Mesh and LE Audio?

No, you can always transform the async version to sync by simply waiting for the result signal after starting the async operation. The async version is more flexible.

We have a sketch of a possible bt_hci_cmd_send_async. There you can see that bt_hci_cmd_send_sync is implemented using the async version.

LingaoM commented 4 days ago

if we mark it as deprecated, I would go one step further: make a bt_hci_cmd_send_async() (with maybe a bt_hci_cmd_wait_rsp() too) and let the application manage synchronisation itself.

Looks great, maybe we can do these two things together, which will make the code more readable.

LingaoM commented 3 days ago

We have a sketch of a possible bt_hci_cmd_send_async. There you can see that bt_hci_cmd_send_sync is implemented using the async version.

Converting to asynchronous may cause some behavioral inconsistencies:

If we call bt_hci_cmd_send_async in sysworkq thread, but not call k_sem_take(&sync_sem, HCI_CMD_TIMEOUT), the sysworkq will be continue to running, maybe in other place, some code also call bt_hci_cmd_send_async, then will be ASSERT there(https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/bluetooth/host/hci_core.c#L404).

In this condition, look like we must be bound to this sysworkq and wait must be called, but adding this constraint is strange for user development even for host & mesh, also reduce code readability.

So it looks like we can only remove this ASSERT, so there no constraint for upper developer.

But then, there will be a new problem. We assume that k_sem_take(&sync_sem, HCI_CMD_TIMEOUT) is not used after the second call bt_hci_cmd_send_async, and then the sysworkq continues to running, maybe in other place, some code also call bt_hci_cmd_send_async again, this will enter K_FOREVER for allocate insufficient command buffer, so sysworkq will be blocked(even not blocked by call bt_hci_cmd_send_async, other code also maybe enter block state).

In this case, the only way to avoid deadlock is not to call net_buf_unref(cmd_buf) on the sysworkq thread, but such a constraint is very strange.