zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

High bluetooth ISR latency with CONFIG_BT_MAX_CONN=2 #29994

Closed · jmgao closed this issue 3 years ago

jmgao commented 3 years ago

Describe the bug

I'm seeing high ISR latencies with CONFIG_BT_MAX_CONN set to 2, which I suspect might be resulting in a !radio_is_ready() assertion firing after multiple hours (perhaps due to unlucky collision with another ISR).

To Reproduce

Apply the following patch to samples/bluetooth/peripheral:

diff --git a/samples/bluetooth/peripheral/prj.conf b/samples/bluetooth/peripheral/prj.conf
index 9a56b2c122..fc7eacf030 100644
--- a/samples/bluetooth/peripheral/prj.conf
+++ b/samples/bluetooth/peripheral/prj.conf
@@ -1,6 +1,9 @@
 # Incresed stack due to settings API usage
 CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=2048

+CONFIG_BT_CTLR_PROFILE_ISR=y
+CONFIG_BT_MAX_CONN=2
+
 CONFIG_BT=y
 CONFIG_BT_DEBUG_LOG=y
 CONFIG_BT_SMP=y

Build and flash (tested on particle_xenon). Use nRF Connect to connect to the device, open the vendor-defined service, and wait. I'm seeing maximum ISR latencies of >50us within a minute or so, and >70us after a few minutes of waiting.

Expected behavior

Lower peak ISR latencies

Impact

!radio_is_ready assertions firing

Logs and console output

[00:01:09.077,667] <inf> bt_ctlr_hci: l: 19, 1, 46; t: 7, 7, 23.
[00:01:09.078,002] <inf> bt_ctlr_hci: l: 13, 1, 46; t: 15, 7, 23.
[00:01:09.078,918] <inf> bt_ctlr_hci: l: 8, 1, 46; t: 8, 7, 23.
[00:01:09.084,136] <inf> bt_ctlr_hci: l: 1, 1, 46; t: 18, 7, 23.
[00:01:09.174,163] <inf> bt_ctlr_hci: l: 14, 1, 46; t: 17, 7, 23.
[00:01:09.185,363] <inf> bt_ctlr_hci: l: 19, 1, 46; t: 8, 7, 23.

<snip>

[00:03:39.603,515] <inf> bt_ctlr_hci: l: 13, 1, 75; t: 14, 7, 23.
[00:03:39.604,461] <inf> bt_ctlr_hci: l: 19, 1, 75; t: 7, 7, 23.
[00:03:39.620,849] <inf> bt_ctlr_hci: l: 1, 1, 75; t: 18, 7, 23.
[00:03:39.705,871] <inf> bt_ctlr_hci: l: 19, 1, 75; t: 8, 7, 23.
[00:03:39.706,207] <inf> bt_ctlr_hci: l: 13, 1, 75; t: 16, 7, 23.
[00:03:39.707,153] <inf> bt_ctlr_hci: l: 19, 1, 75; t: 7, 7, 23.
[00:03:39.707,489] <inf> bt_ctlr_hci: l: 13, 1, 75; t: 14, 7, 23.
[00:03:39.708,435] <inf> bt_ctlr_hci: l: 19, 1, 75; t: 7, 7, 23.

Environment (please complete the following information):

Additional context

@cvinayak

joerchan commented 3 years ago

It could perhaps be this?

static void hci_num_completed_packets(struct net_buf *buf)
{
        /* ... */
        key = irq_lock();

        conn = bt_conn_lookup_handle(handle);
        if (!conn) {
            irq_unlock(key);
            BT_ERR("No connection for handle %u", handle);
            continue;
        }

        irq_unlock(key);

@jhedberg Is this irq_lock really needed? We don't lock IRQs anywhere else where we look up the conn from the handle. Not much can change with regard to the handles here, I think.

cvinayak commented 3 years ago

The reporter sees increasing latencies over time, which has no explanation unless there is some sort of O(n) traversal.

carlescufi commented 3 years ago

Setting this as high since ISR latency is critical for most users.

carlescufi commented 3 years ago

The reporter sees increasing latencies over time, which has no explanation unless there is some sort of O(n) traversal.

Well, you do display the worst case recorded, right? So it could be that over time we hit the worst-case scenario?

jmgao commented 3 years ago

Yeah, I don't think it's actually linear growth, just more time to get unlucky. I've seen assertions firing anywhere between 2 and 8 hours of uptime.

(edit for clarification: I saw the assertions with my own project, not the sample)

joerchan commented 3 years ago

@jmgao Can you try commenting out the irq_lock/unlock in hci_num_completed_packets around bt_conn_lookup_handle? Source file: subsys/bluetooth/host/hci_core.c.
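
For reference, a minimal sketch of the experiment being requested, based on the snippet quoted earlier in the thread (the exact surrounding code in hci_core.c may differ):

static void hci_num_completed_packets(struct net_buf *buf)
{
        /* ... */
        /* Experiment: skip the irq_lock()/irq_unlock() pair around the
         * connection lookup to see whether it contributes to the measured
         * ISR latency.
         */
        /* key = irq_lock(); */

        conn = bt_conn_lookup_handle(handle);
        if (!conn) {
            /* irq_unlock(key); */
            BT_ERR("No connection for handle %u", handle);
            continue;
        }

        /* irq_unlock(key); */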

jmgao commented 3 years ago

Commenting out the first set of irq_lock/unlocks doesn't seem to help.

jmgao commented 3 years ago

Sidenote: increasing BT_CTLR_ULL_HIGH_PRIO and BT_CTLR_ULL_LOW_PRIO to 1 appears to make the symptoms go away. It's strange to me that the critical radio ISR and the low priority upper level ISR have the same priority by default. Is ULL_LOW_PRIO actually on a hard realtime deadline? If not, that seems like it's just asking for priority inversion...

joerchan commented 3 years ago

I agree, those look like strange IRQ priority defaults. @jmgao You should probably only increase BT_CTLR_ULL_LOW_PRIO, though. Leave BT_CTLR_ULL_HIGH_PRIO at the default value (0).
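
In prj.conf terms, that suggestion would look roughly like the following (a sketch; it assumes CONFIG_BT_CTLR_ADVANCED_FEATURES is needed to expose the priority options, as listed later in this thread):

CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_ULL_LOW_PRIO=1
# CONFIG_BT_CTLR_ULL_HIGH_PRIO left at its default (0)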

jmgao commented 3 years ago

Oh, whoops, I think I tested with zero latency interrupts in my latest comment: with ZLI off, I'm still seeing elevated latencies with BT_CTLR_ULL_LOW_PRIO increased (and with HIGH_PRIO increased as well), although maybe somewhat better (max of 65 after 5 minutes).

cvinayak commented 3 years ago

appears to make the symptoms go away.

Using ZLI, controller interrupts are at the highest priority and hence other interrupts or irq locks do not introduce latency.

I'm still seeing elevated latencies with BT_CTLR_ULL_LOW_PRIO increased (and with HIGH_PRIO increased as well)

Since latencies are observed with ULL_LOW and ULL_HIGH increased, this indicates that there is some other interrupt at the same priority (ARM interrupt priority value 0) as the Radio interrupt. As this is your own project, what other interrupts/peripherals are you using?

jmgao commented 3 years ago

As this is your own project, what other interrupts/peripherals are you using?

This is with the peripheral sample.

Since latencies are observed with ULL_LOW and ULL_HIGH increased, this indicates that there is some other interrupt at the same priority (ARM interrupt priority value 0) as the Radio interrupt.

Or an irq_lock in a lower-priority interrupt, right?

cvinayak commented 3 years ago

During my work-in-progress tasks, I have noticed a similar gradual increase in the latency reported by the controller. I suspect logging; hence, please disable the logging subsystem, use a printk instead to display the ISR latencies, and let us know if you see a similar gradual increase in latency.
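
A rough sketch of the kind of configuration change being asked for here (CONFIG_LOG and CONFIG_PRINTK are standard Zephyr options; CONFIG_BT_DEBUG_LOG comes from the sample's prj.conf; printing the profile via printk would additionally need a small local edit where the controller logs the ISR latencies):

# Sketch only: disable the logging subsystem for this experiment
CONFIG_LOG=n
CONFIG_BT_DEBUG_LOG=n
# printk remains available for ad-hoc output
CONFIG_PRINTK=y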

cvinayak commented 3 years ago

[image]

~Is the increase in latency in some way related to log buffer overflow? @nordic-krch I see that log bytes are truncated, and then the next log message from the controller has an increased measured ISR latency. Does logging disable interrupts when log messages are truncated or rotated?~

cvinayak commented 3 years ago

[image]

Ok, without the logging subsystem too, I see high latencies... will investigate further.

cvinayak commented 3 years ago

[image]

No added latency seen for > 15 mins of connection with phone as central and simultaneous advertising when using the below:

CONFIG_ZERO_LATENCY_IRQS=y
CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_ZLI=y

~This rules out the controller's own implementation as introducing the latencies.~ ~The latencies should be coming from outside the Bluetooth subsystem (and it's not logging either).~ See https://github.com/zephyrproject-rtos/zephyr/issues/29994#issuecomment-739623736

@koffes you may have interest here.

koffes commented 3 years ago

I can confirm that I am seeing similar issues: random firings of !radio_is_ready, especially in https://github.com/zephyrproject-rtos/zephyr/blob/master/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_conn.c#L316. This issue is most prominent if I activate CPU load reports with logging active:

CONFIG_TRACING=y
CONFIG_TRACING_CPU_STATS=y
CONFIG_TRACING_CPU_STATS_LOG=y
CONFIG_TRACING_CPU_STATS_INTERVAL=5000

I tried with CONFIG_ZERO_LATENCY_IRQS=y + CONFIG_BT_CTLR_ZLI=n and CONFIG_ZERO_LATENCY_IRQS=n + CONFIG_BT_CTLR_ZLI=n. In both cases the assert happens. I know the first combo of settings is not recommended.

cvinayak commented 3 years ago

@koffes What was the result for the following in your application:

CONFIG_ZERO_LATENCY_IRQS=y
CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_ZLI=y

cvinayak commented 3 years ago

The controller currently has O(n) execution for scheduling conflict resolution and request processing in the ULL_HIGH and ULL_LOW execution contexts.

When the Radio ISR, LLL, ULL_HIGH, and ULL_LOW execution are configured at the same IRQ priority, these introduce ISR latencies amongst themselves.

There exist three Kconfig options to mitigate the controller's self-induced ISR latencies, plus a fourth proposal:

  1. Use of Zero Latency IRQs (which inherently splits out the Radio ISR to a high priority):
    CONFIG_ZERO_LATENCY_IRQS=y
    CONFIG_BT_CTLR_ADVANCED_FEATURES=y
    CONFIG_BT_CTLR_ZLI=y
  2. Explicitly use a higher IRQ priority for the Radio ISR (and LLL):
    CONFIG_BT_CTLR_ADVANCED_FEATURES=y
    CONFIG_BT_CTLR_LLL_PRIO=0
    CONFIG_BT_CTLR_ULL_HIGH_PRIO=1
    CONFIG_BT_CTLR_ULL_LOW_PRIO=1
  3. Use of low latency controller design (applicable for slower CPUs like in nRF51 series, Cortex-M0 at 16 MHz):
    CONFIG_BT_CTLR_LOW_LAT=y
    CONFIG_BT_CTLR_LOW_LAT_ULL=y
  4. Proposal to limit the controller's processing to a maximum CPU use per iteration.

I would appreciate it if options 1 to 3 are tested and any observations are noted here in this issue.
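
For example, option 1 applied on top of the reproduction patch from the issue description would look roughly like this in samples/bluetooth/peripheral/prj.conf (a sketch, not a tested configuration):

CONFIG_BT_CTLR_PROFILE_ISR=y
CONFIG_BT_MAX_CONN=2

# Option 1: run the controller's radio ISR as a zero-latency IRQ
CONFIG_ZERO_LATENCY_IRQS=y
CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_ZLI=y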

jmgao commented 3 years ago

Option 1 appeared to fix things for me, option 2 still had elevated latencies.

koffes commented 3 years ago

@cvinayak : I tested with the combination you suggested. It then asserts almost right away.

The call stack looks like: lll_hfclock_off() -> z_nrf_clock_bt_ctlr_hf_release() -> hfclk_stop(); then something strange happens with the call stack, as the actual assert seems to be triggered in int64_t z_impl_k_uptime_ticks(void), then k_spin_lock(struct k_spinlock *l) -> __ASSERT(z_spin_lock_valid(l), "Recursive spinlock %p", l);

Let me know and I can show you my debug session.
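
For context, the quoted assert is Zephyr's spinlock validation (active when assertions are enabled); a minimal illustration of the failure mode it guards against, with hypothetical function names:

#include <zephyr/kernel.h>   /* include path in older trees may differ */

static struct k_spinlock lock;

static void inner(void)
{
        /* Taking `lock` again while outer() already holds it triggers
         * __ASSERT(z_spin_lock_valid(l), "Recursive spinlock %p", l).
         */
        k_spinlock_key_t key = k_spin_lock(&lock);
        k_spin_unlock(&lock, key);
}

void outer(void)
{
        k_spinlock_key_t key = k_spin_lock(&lock);

        inner();   /* re-acquires the same spinlock on the same CPU */

        k_spin_unlock(&lock, key);
}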

carlescufi commented 3 years ago

Option 1 appeared to fix things for me, option 2 still had elevated latencies.

@jmgao Have you tried option 3 as well?

carlescufi commented 3 years ago

@cvinayak : I tested with the combination you suggested. It then asserts almost right away.

The call stack looks like: lll_hfclock_off() -> z_nrf_clock_bt_ctlr_hf_release() -> hfclk_stop(); then something strange happens with the call stack, as the actual assert seems to be triggered in int64_t z_impl_k_uptime_ticks(void), then k_spin_lock(struct k_spinlock *l) -> __ASSERT(z_spin_lock_valid(l), "Recursive spinlock %p", l);

Let me know and I can show you my debug session.

@cvinayak is it expected that lll_hfclock_off() disables interrupts at some point in the call tree? EDIT: @nordic-krch states that there should be no interrupt disabling except if CONFIG_CLOCK_CONTROL_NRF_SHELL is enabled (enabled by default if you enable the shell on nRF platforms). @jmgao @koffes do you have this option enabled?
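
A quick way to rule that option out in a given build would be to disable it explicitly (a sketch):

# Keep the shell if needed, but drop the nRF clock control shell commands
CONFIG_CLOCK_CONTROL_NRF_SHELL=n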

cvinayak commented 3 years ago

is it expected that lll_hfclock_off() disables interrupts at some point in the call tree?

@carlescufi No, there should not be any disabling of interrupts. This function is called in the LLL context; any CPU overhead here will show up as Radio asserts.

nashif commented 3 years ago

any updates on that?

nashif commented 3 years ago

lowering to medium because this only happens with shell enabled.

jmgao commented 3 years ago

Sorry, missed the questions in my email inbox. I vaguely recall trying option 3 and it increasing the time to failure to a day or two, but it's been a while.

I don't believe this requires shell being enabled to happen, although that might increase the chances of a fatal assert happening.

carlescufi commented 3 years ago

I don't believe this requires shell being enabled to happen, although that might increase the chances of a fatal assert happening.

@koffes and @jmgao Could you please verify this statement? It's important that we figure out if enabling shell is a factor here.

koffes commented 3 years ago

@carlescufi I am unable to reproduce on recent master.

Attempted with CONFIG_BT_MAX_CONN=2 in samples/bluetooth/peripheral/prj.conf and hci_rpmsg with CONFIG_SHELL=y and CONFIG_TRACING=y on the NET core of the nRF5340.

Also tried in our custom application with TRACING and SHELL active. My observed issue with !radio_is_ready is only related to this issue though.

jmgao commented 3 years ago

Sorry for the delay, I also had some problems reproducing, but that was because I tested without connecting to the device: I was using a different board on which connections were failing, and I spent the last few days trying to figure out why. (I was running it on a board that doesn't have a 32kHz crystal, and apparently CONFIG_CLOCK_CONTROL_NRF_K32SRC_SYNTH is good enough for my project but not the peripheral sample? I'm still not quite sure what's going on...)

I tested with particle_xenon on master as of 5 minutes ago: it reproduces without shell enabled. Cherry-picking #30829 does not help.

(I was wondering if the problem could be related to RC oscillator calibration, because I vaguely recall skimming across something in an NRF datasheet saying that it would do regular calibration. I checked the board I used in the original report and particle_xenon: both of them are definitely using CLOCK_CONTROL_NRF_K32SRC_XTAL, and regardless, the problem seems to occur with both K32SRC_RC and K32SRC_XTAL.)

carlescufi commented 3 years ago

@jmgao

Sorry for the delay, I also had some problems reproducing, but that was because I tested without connecting to the device: I was using a different board on which connections were failing, and I spent the last few days trying to figure out why. (I was running it on a board that doesn't have a 32kHz crystal, and apparently CONFIG_CLOCK_CONTROL_NRF_K32SRC_SYNTH is good enough for my project but not the peripheral sample? I'm still not quite sure what's going on...)

You should not use SYNTH; that will not work reliably with Bluetooth. Use either XTAL or RC.
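
For reference, the 32 kHz source is selected via Kconfig; a sketch of the two workable settings (enable exactly one):

# Crystal oscillator (the board must actually have a 32.768 kHz crystal)
CONFIG_CLOCK_CONTROL_NRF_K32SRC_XTAL=y

# or: internal RC oscillator (periodically recalibrated)
# CONFIG_CLOCK_CONTROL_NRF_K32SRC_RC=y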

I tested with particle_xenon on master as of 5 minutes ago: it reproduces without shell enabled. Cherry-picking #30829 does not help.

Which application did you use, your own or a zephyr sample?

jmgao commented 3 years ago

Which application did you use, your own or a zephyr sample?

The Zephyr sample in samples/bluetooth/peripheral

carlescufi commented 3 years ago

I tested with particle_xenon on master as of 5 minutes ago: it reproduces without shell enabled. Cherry-picking #30829 does not help.

Do you have a Nordic nRF52840 DK that you could try to reproduce this on? If not, could you please confirm that the instructions in the PR description are correct to reproduce the issue?

@pabigot you have a Particle Xenon, right? Could you please follow the instructions above and see if you can reproduce? By reproduce I mean latencies displayed above 60us and any assert triggered.

pabigot commented 3 years ago

On xenon I do get a latency spike around the time service discovery is done. It seems to be a one-time thing, and I don't see any asserts.

[00:01:08.277,313] <inf> bt_ctlr_hci: l: 2, 1, 19; t: 13, 7, 23.
[00:01:08.291,290] <inf> bt_ctlr_hci: l: 2, 1, 19; t: 15, 7, 23.
[00:01:08.321,289] <inf> bt_ctlr_hci: l: 2, 1, 19; t: 17, 7, 23.
[00:01:08.378,112] <inf> bt_ctlr_hci: l: 19, 1, 19; t: 8, 7, 23.
[00:01:08.378,753] <inf> bt_ctlr_hci: l: 19, 1, 19; t: 7, 7, 23.
[00:01:08.379,455] <inf> bt_ctlr_hci: l: 77, 1, 77; t: 8, 7, 23.
[00:01:08.381,317] <inf> bt_ctlr_hci: l: 13, 1, 77; t: 16, 7, 23.
[00:01:08.481,994] <inf> bt_ctlr_hci: l: 19, 1, 77; t: 8, 7, 23.
[00:01:08.482,635] <inf> bt_ctlr_hci: l: 19, 1, 77; t: 7, 7, 23.

edit: I see the same thing on nRF52840 DK, spikes there of 74 then 79.

cvinayak commented 3 years ago

Thank you @pabigot. I am not able to reproduce the assertion either, but I do see similar latencies on nrf52840dk.

[image]

cvinayak commented 3 years ago

I have been adding implementation to in-system profile the controller's other ISRs, and have noticed an implementation flaw that executes both ULL_HIGH and ULL_LOW in the same ISR when their priorities are the same (instead of tail-chaining/yielding so as to permit the naturally higher-priority radio ISR execution, which is at the same IRQ priority level as ULL_HIGH and ULL_LOW).
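
A generic illustration of the tail-chaining idea described above, not the controller's actual code (function names, the IRQ number, and any include are placeholders): instead of running ULL_LOW directly from the ULL_HIGH ISR, pending a separate same-priority IRQ and returning lets the NVIC service whichever pending interrupt has the highest priority next; at equal priority the radio IRQ, having the lower exception number, wins the tie.

void ull_high_process(void);   /* placeholders for the controller's work */
void ull_low_process(void);

void ull_high_isr(const void *arg)
{
        ull_high_process();

        /* Flawed pattern: also calling ull_low_process() right here keeps
         * the CPU inside this ISR and delays a pending radio ISR at the
         * same IRQ priority.
         */

        /* Yielding pattern: pend a separate same-priority IRQ and return,
         * so the NVIC tail-chains to the highest-priority pending
         * interrupt first.
         */
        NVIC_SetPendingIRQ(ULL_LOW_IRQn);   /* placeholder SWI line */
}

void ull_low_isr(const void *arg)
{
        ull_low_process();
}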

~This is a regression in nRF52 and nRF53 introduced when nRF53 support was added. (https://github.com/zephyrproject-rtos/zephyr/commit/78b461ae3ecab8c89c2f4d682c6e22421f3f3d73)~

~I am working on a solution to decouple the ULL_LOW and allow an early vectoring to Radio ISR.~

With the new ticker implementation in the split controller, it is not feasible to have LLL and ULL_HIGH/ULL_LOW at the same IRQ priority level; hence I have submitted https://github.com/zephyrproject-rtos/zephyr/pull/31753

@jmgao please check that you get reduced Radio ISR latency using this PR: #31753. @pabigot could you also check on the xenon board you have that, with #31753, the latency is around ~20us.