Bluetooth: BLE stack notifications blocks host side for too long (`drivers/bluetooth/hci/spi.c` and `hci_spi`)

flo-carrere commented 1 year ago

First, I have no pretention in stating it is a bug, it may be the normal way of working of the BLE stack architecture at this moment. In this case, it could be reclassified as an enhancement request if relevant. I may be expecting too much.

Preliminary context

I am using a classic dual-chip configuration:

Host-side : custom board (stm32h7, similar to a nucleo-h743zi), using a host-build only
- 3 services (HID, BAS, DIS)
- HID reporting rate @133Hz (7.5ms)
Controller-side : standard hci_spi sample project flashed on nrf52840 addon board connected through SPI@1MHz

Describe the bug

From the host, using a simple bt_gatt_notify_cb to update a BLE service characteristic (let's say a 17-bytes long HID report update). I have noticed this call is blocking (OK) for a unexpectedly long period of time (NOK) of approximately 1.5ms degrading others thread "performances" in a non-negligible way.

It's basically a 1.5ms penalty every 7.5ms that is spent in a busy-wait style fashion starving other actions in the system. More details below on my investigations.

On the other hand, a similar system reporting through USB-HID at the same rate, do not suffer from the same deficiency and a reasonable 0.1 <-> 0.7ms is spent.

What have you tried to diagnose or workaround this issue?

1) Independently of my full system I have been able to reproduce this blocking behavior by setting up a sample test application for the host side, reporting a classic BAS notification (battery level %) periodically (every 50ms down to every 7.5ms (target)) and measuring by software the time spent for a bt_gatt_notify_cb operation.

// Sample test - pseudo code
period = 50ms;
while(1) {
  time;
  bt_gatt_notify(); // reporting BAS notification
  elapsed = now - time;
  printf("%d", elapsed); // connected:~750us <--> ~980us, not connected: ~5us
  wait( period - elapsed );
}

The host-side prj.conf, I use for the sample test is provided below. prj.conf.txt

2) I have also tried to pimp my BLE stack configuration using the throughput maximization sample project but with no effect.

3) When debugging, I noticed that the vast majority of time was spent in "waiting" for a rescheduling operation to complete in the implementation of the BLE stack, for some reason.

In a nutshell my call stack (host):

bt_gatt_notify_cb // app waiting here
  // GATT
    // ATT
     // ACL
       // KERNEL QUEUE - k_queue_append_list
        z_reschedule(&queue->lock, key);    <<<< Waiting here for ~80% of the time
     // HCI_SPI transfer happening at some point

   // ... returns and continues normally up to the next notification

  // Transfer OTA happening on controller -> at next connection interval -> OK   
  // callback notification sent received -> OK

4) I have not tried yet but I envisioned moving to a single-chip configuration (Host + Controller using HCI RAM) running on nrf52840 and connecting my current "Host" MCU with a custom protocol over SPI to receive raw notifications.

This way I could have a fully externalized BLE stack not impacting the rest of my system (hopefully) With the big cons of maintaining two FWs instead of a single one....

Do you think it's a valid way of working around the issue?

Is this a regression? If yes, have you been able to "git bisect" it to a specific commit? I don't think it's a regression, probably the way it has always worked

Expected behavior

From a system architecture point of view, I was expecting the dual-chip configuration to work in tandem:

Host-side -> serializing and publishing notification very fast over HCI -> achieving a minimal time waiting
Controller-side -> offloading the full BLE-OTA transfer
- Tuned with sufficiently large HCI events queues according to the report rate/size/types
It's a pity to have two powerful MCUs (STM32H7 on host + nrf52840 on controller) and having such a poor behavior.

Impact

As stated this "poor notification performances" are very annoying as it considerably degrades the full system performance, and sometimes delays time-critical operations (acquisition threads...).

Environment (please complete the following information):

Building on Windows 10, connecting my device (BLE) on the same laptop.
Toolchain
- Zephyr SDK 0.14.1
- Zephyr v3.0 > Similar behavior on Zephyr v3.1

Thalley commented 1 year ago

I won't take a look at this, but from the issue text it sounds like it's a simple matter of the host running out of buffers, and have to wait for previous buffers to be free'd.

Also, as this seems to be a simple request for enhancing the performance (if possible), then I suggest to change the type from bug to enhancement.

flo-carrere commented 1 year ago

@Thalley No, I don't think so as it is exactly what I tried to check first with no success.

Let me explain further.

I have also tried to pimp my BLE stack configuration using the throughput maximization sample project but with no effect.

This basically proposes to increase both Host/Controller configuration with some more buffer/payload/MTUs/queue/stacks and so on up to their maximum configurable value, creating huge space at every level throughout the full stack.

Commented in the prj.conf.txt provided are the options I checked (with values) and results associated.

... reporting periodically (every 50ms down to every 7.5ms (target))

IMO, what you point as a "simple buffer problem" would be mitigated if we both:

Increase the configuration as stated above
Decrease the reporting rate, up to a point where we enqueue/dequeue at a similar rate (50ms sounds very reasonable to me but again I may be expecting too much).

I may re-run with very low reporting rate @1sec and let you know the outcome, what do you think?

Thalley commented 1 year ago

@flo-carrere Keep in mind that you can only send 1 GATT notification per connection interval. Have you tried increasing that as well? Since you are sending in a while(1) loop, it won't do any good to increase the number of buffers, if the buffers are queued faster than they can be sent.

flo-carrere commented 1 year ago

Keep in mind that you can only send 1 GATT notification per connection interval. Have you tried increasing that as well?

No I haven't, is there a kconfig option available to do so? Can you point me to a way to do so?

BTW I didn't modify any BLE stack code so far only went through during debugging.

Since you are sending in a while(1) loop, it won't do any good to increase the number of buffers, if the buffers are queued faster than they can be sent.

I agree with this, that's why I emulated a reporting rate of 50ms (see pseudo-code), that I believe is in the same order of magnitude as the connection interval (~15 <-> 40ms) based on the time I receive GATT notification callback.

Thalley commented 1 year ago

No I haven't, is there a kconfig option available to do so? Can you point me to a way to do so?

The connection interval is done at runtime, and is initially controller by the central. The peripheral may however ask to modify it by calling bt_conn_le_param_update.

If you also control the central, you can just modify the conn_param when calling bt_conn_le_create

The default values should be around 30-50ms interval, but can be reduced to 7.5ms

flo-carrere commented 1 year ago

I may re-run with very low reporting rate at 1sec and let you know the outcome, what do you think?

As I expected, even when diminishing the reporting rate to 1 Hz it does not change the time spent in GATT notify.

I measure a constant 1.482ms to "notify" a 17 bytes long HID report update, and I receive the callback from the controller 40ms after (using the default connection interval).

My understanding is that this time should be spent for 3 actions on host-side:

Packetizing the GATT notification according to the LL-stack layers -> I expect this to run very fast as it's barely copying and wrapping a few bytes at every layer.
Waiting for a buffer available -> According to my tests I think we can discard this being the root cause of my issue
Pushing the packetized HCI event (containing my report and stuff) to the controller through SPI -> I expect this to be the limiting factor but still at 1MHz, I would expect the transfer being << 1.5ms

The connection interval is done at runtime, and is initially controller by the central. The peripheral may however ask to modify it by calling bt_conn_le_param_update

This I set but with no improvements either

    struct bt_le_conn_param bleconn_params = {
        .interval_min = CONFIG_BT_PERIPHERAL_PREF_MIN_INT, // 6*1.25
        .interval_max = CONFIG_BT_PERIPHERAL_PREF_MAX_INT, // 6*1.25
        .latency      = 0,
        .timeout      = 4,
    };

    bt_conn_le_param_update(conn, &bleconn_params);

carlescufi commented 1 year ago

(STM32H7 on host + STM32M4 on controller)

Can you please clarify if you are using an nRF52840 or an STM32M4 as the controller?

carlescufi commented 1 year ago

When debugging, I noticed that the vast majority of time was spent in "waiting" for a rescheduling operation to complete in the LL implementation of the BLE stack, for some reason.

You mean you were looking at the controller chip here? Because the LL would run on the controller chip.

carlescufi commented 1 year ago

z_reschedule(&queue->lock, key); <<<< Waiting here for ~80% of the time

How did you measure that ~80%?

carlescufi commented 1 year ago

Also, could you try using hci_uart instead of hci_spi? This might be the SPI driver not being properly optimized, given how seldom it is used when compared to UART.

flo-carrere commented 1 year ago

@carlescufi

Can you please clarify if you are using an nRF52840 or an STM32M4 as the controller?

I use an nRF52840, I was referring to the STM32M4 inside
Edited the issue description

You mean you were looking at the controller chip here? Because the LL would run on the controller chip.

No, I only debugged the host side.
My bad if the LL is misleading, I meant the layers traversed on host side down to the HCI event creation (GATT/ATT/ACL/...).
Edited the issue description

How did you measure that ~80%?

Same techniques as for the bt_gatt_notify call.
Getting the time before and after (us), getting the difference (dt), comparing with the overall time (bt_gatt_notify).

I agree it's not as precise as a GPIO pin would be, ... but well I did this carefully for every area of the stack that is traversed and was very surprised that this part was blocking that long.

As this belongs to the kernel API part, I don't expect it to be the root cause of my issue, just a symptom of course.

Is there any profiling studies on the BLE during development?
With some reasonable expected performances?

Could you try using hci_uart instead of hci_spi?

This is something, I can try for sure, by chance it's available on my board.
I'll let you know.

jori-nordic commented 1 year ago

@flo-carrere

I was referring to the STM32M4 inside

There is no STM32M4 (STmicro SoC) in nordic chips. I suppose you're rather referring to the Cortext-M4 CPU core.

In any case >1ms processing is excessive I totally agree. We have tests with real HW downstream (in nRF Connect SDK) that measure all kinds of things, including performance, so that comes as a surprise to me. The performance tests are measuring throughput so it's possible that we are missing a test case though, I'll double-check.

I'm definitely suspecting either a misconfiguration or the HCI driver itself, I will try to reproduce it but right now #25917 is my priority.

flo-carrere commented 1 year ago

@jori-nordic

There is no STM32M4 (STmicro SoC) in nordic chips. I suppose you're rather referring to the Cortext-M4 CPU core

Yes indeed, my bad

flo-carrere commented 1 year ago

Hi again,

As you had a concern on the HCI_SPI driver implementation, I investigated a bit further and realized that adding a delay (randomly placed and tuned at 200us) in the bt_spi_send function was totally solving my issue.

Making the overall bt_gatt_notify call duration -> ~15us (instead of 1.5ms as stated in the issue) which is in line with what I expected.

// In zephyr/drivers/bluetooth/hci/spi.c
// static int bt_spi_send(struct net_buf *buf)
// line 418
    /* Allow time for the read thread to handle interrupt */
    while (true) {
        pending = gpio_pin_get(irq_dev, GPIO_IRQ_PIN);
        if (pending <= 0) {
            break;
        }
        k_sleep(K_MSEC(1));
    }
    k_sleep(K_USEC(200)); // <<<<< Here
    k_sem_take(&sem_busy, K_FOREVER);
// line 430

It seems that there is some kind of data race in there, what's interesting is that a simple k_yield is not sufficient to solve the issue.

jori-nordic commented 1 year ago

Yeah there might be an issue with thread priorities then. Because with k_yield, the kernel will only allow threads that have >= prio of the current thread to run. But with a k_sleep or any other blocking call on a kernel object, lower priority threads will be allowed to run.

flo-carrere commented 1 year ago

@jori-nordic

Do you mean there is a way to solve the issue by changing the priorities associated to BT threads according to my "app threads" (by the use of already existing config options)? If so, can you point to relevant guidelines to achieve such behavior?

At this moment, I can reproduce the problem with only a single thread (main thread with the default priority) calling the bt_gatt_notify function. This could be already a simple use-case to solve IMO.

On my app on the other hand, indeed I have multiple threads assigned to preemptive priorities. One of them, with relatively high priority, calls bt_gatt_notify regularly. I would expect it to be unlocked as fast as possible. But if another low-priority thread, could execute in the mean time, it could be also an acceptable solution.

jori-nordic commented 1 year ago

I was not suggesting the problem be in your app thread priorities (though it could be if they are cooperative), but rather in the relative priorities of the BT stack threads (and the system workqueue, if that was altered somehow).

You can try to play with CONFIG_BT_DRIVER_RX_HIGH_PRIO, but I don't think that's the root cause either. General documentation (other than API) is a bit light for Bluetooth at the moment, so we don't have any guides on priority settings for it.

We have some work in the pipeline to get the stack to work with SMP (e.g. on dual-core systems), so as part of that work we will probably figure out a way to run the stack with only pre-emptive threads. But it'll probably be a while.

If you could reproduce the issue with a sample and push a branch that I can just build from, that would help me find the issue.

flo-carrere commented 1 year ago

@jori-nordic

My code is very similar to existing zephyr samples:

Flash the host side with samples/bluetooth/peripheral_hids. Default configuration is fine.
Flash the controller side with samples/bluetooth/hci_spi. Default configuration is fine.
Monitor the duration of bt_gatt_notify call (in peripheral_hids/src/hog.c).

You'll notice > ~1ms duration which is way too excessive, as discussed. (5us when not connected).

With my "workaround" (sleep time in hci_spi driver), it will decrease to ~15/50us, which is closer to expectations for a dual configuration.

theob-pro commented 1 year ago

We can't reproduce it with the Zephyr controller on nrf52840. You can see bt_gatt_notify completing in ~140us.

flo-carrere commented 1 year ago

@theob-pro

OK thanks for the effort but I'm not sure comparing time is sufficient to state you can't reproduce the issue. Still your result is 3-10 times longer than my results after "fix" (~15/50us).

After all, 140us to serialize 3 bytes is a bit long isn't it?

Can you provide additional details?

What is the Zephyr version you used? (Was using v3.0.0)
What is your host controller? and frequency? (Was using STM32-H7/Nucleo-H7 @480Mhz)
Have you tried enforcing a preemptive priority for the main thread running the sample? (having a chance to reproduce the issue between preemptive main thread and cooperative BT threads).
What is your SPI frequency? (Was using 1MHz)
Have you tried applying the workaround/fix pointed and comparing the outputs?

Despite this, it's also possible that this sample (peripheral_hids) is not the best to reproduce the issue. In this case, I may provide another custom sample.

jori-nordic commented 1 year ago

@flo-carrere while I agree with your point that serializing 3 bytes shouldn't take that long, you should keep in mind that your chip is running at almost 8x the speed as ours (480MHz vs 64 on an nRF53840)

that said, to answer your questions:

zephyr was latest main from a week ago
host and controller are running off two nRF52840s
main thread is prio 0 (default in zephyr) -> pre-emptive
spi freq is 1M also
workaround only seems to reduce processing time to ~120us

We will try more extensive profiling, likely with a trace probe this week, I also have an STM32F4 that runs ~200MHz that we will try to use as host.

But yeah, it'd be useful if we got a commit/branch with a sample that we can just run without needing to modify anything.

flo-carrere commented 1 year ago

@jori-nordic

OK thanks for the clarifications, indeed it is strange.

almost 8x the speed as ours (480MHz vs 64 on an nRF53840)

Does this mean you are using two nRF53840? (1 as host and 1 as controller).

But yeah, it'd be useful if we got a commit/branch with a sample that we can just run without needing to modify anything

OK I'll try to provide that.

At this moment, I found myself not able to check whether the performances are similar using hci_uart rather than hci_spi driver as @carlescufi asked for at some point (expectation is that hci_uart does not suffer from the same issue).

Are you able to quickly assess performances when using HCI over UART? It could also help pointing the performance issue if the two are diverging too much independently of the hardware used.

jori-nordic commented 1 year ago

Hi, yeah we'll try with UART. Yeah we were using two 840s, 1 host 1 ctlr. We've successfully reproduced the issue on the stm32f303re nucleo board (only fast stm32 I had). Unfortunately it doesn't have trace pins so we'll have to make do without a trace probe, the investigation will be a bit longer.

edit: tried with UART, we get a ~170us processing time (which is still high), at 1Mbits. We will try to investigate what's happening.

theob-pro commented 1 year ago

Hi @flo-carrere, after further investigations with @jori-nordic we found that what you reported was actually not a bug but the "normal" behavior.

Let me explain what's happening. When the main thread will execute gatt_notify, it will add data to the TX queue of data to be send. The scheduler will find out that the TX thread is waiting on the TX queue and that there is data on it. The scheduler will then pre-empt the main thread and run the TX thread (TX thread is higher-prio than main). The TX thread will then wait until the SPI driver gives back control (ie when it blocks on a kernel object, eg. a semaphore, or sleeps). When that happens, control will be returned to the main thread, allowing gatt_notify to return.

So now, what your fix does is that it will allow the gatt_notify to return before the SPI transfer has even started. This is because using k_sleep will allow the scheduler to switch to another waiting thread even if they have a lower priority, which is the case here.

Adding the sleep just delays the inevitable, that is the main thread will get pre-empted by the TX thread at some later point, and how much that thread blocks is highly dependent on the SPI driver implementation. That explains the massive difference between the host running on a nRF vs an STM32.

You can see that illustrated in the two screenshots, the first (w/o sleep) has the TX thread (hci_tx_thread) yielding control after starting the SPI transfer (spi_nrfx_transceive), after which bt_gatt_notify_cb returns. The second one has the TX thread sleeping, then the scheduler returning control to the main thread, allowing bt_gatt_notify_cb to return sooner. We can still see the main thread being pre-empted later by the TX thread, to execute the spi_nrfx_transceive function until the whole data has been sent.

Switching to a higher SPI clock speed might mitigate the issue somewhat.

Trace without the workaround Trace with the workaround

jori-nordic commented 1 year ago

I have opened another issue https://github.com/zephyrproject-rtos/zephyr/issues/52364 to track progress on this, but will close this one since there is no easy fix, it's an architecture issue in the stack.

flo-carrere commented 1 year ago

@theob-pro @jori-nordic

Thank you very much for the detailed explanation and investigation, everything makes sense.

For those interested in a "workaround", I have currently forked and patched Zephyr v3.0 in my private repo to overcome this issue -not perfect but ok-ish.

I'm eager to switch to a proper solution, once everything is fixed properly in a further Zephyr release.

yajunzhi commented 1 year ago

hi @theob-pro , Could let me know which software used for the stack timing analysis as show in the screenshots above? We are also do some timing optimization with zephyr stack.Thank you

theob-pro commented 1 year ago

The software used for the flame graphs is Ozone from Segger.

jori-nordic commented 1 year ago

hi @theob-pro , Could let me know which software used for the stack timing analysis as show in the screenshots above? We are also do some timing optimization with zephyr stack.Thank you

The software used for the flame graphs is Ozone from Segger.

with a segger J-trace (uses ARM's ETM instruction tracing)

zephyrproject-rtos / zephyr