zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

double spin lock in z_impl_k_sem_give() #72023

Closed TorstenRobitzki closed 3 months ago

TorstenRobitzki commented 3 months ago

Very similar to #30074, I have a setup where the z_spin_lock_valid() assertion in spinlock.h fails reproducibly.

Target is an nRF5340, where I'm running a QSPI-driver test program that does console input/output via USB-CDC. The QSPI driver has functions that start transactions; the end of a transaction causes an interrupt, and in the interrupt service routine new transactions are started until a certain state has been reached. Once that state has been reached, a global semaphore is posted (k_sem_give()) while the main thread is waiting on that semaphore (k_sem_take()). No other threads were started by the application.

I can set a breakpoint on the return false statement in z_spin_lock_valid(), and when the breakpoint hits, I find thread_cpu always equal to _kernel.cpus[0].current | _kernel.cpus[0].id (where _kernel.cpus[0].id is always 0), which would be the correct value if the spin lock were held by the calling thread.

The lock in question is the static variable lock in sem.c.

I instrumented the code to toggle a pin a different number of times to signal different events. Filtering only on the calls to z_spin_lock_valid() and z_spin_unlock_valid() (both in thread.c) that belong to lock, I can see two consecutive calls to z_spin_lock_valid() without a call to z_spin_unlock_valid() in between.

Further instrumenting the code to mark the entry and exit of z_impl_k_sem_give(), z_impl_k_sem_take() and the call of the QSPI ISR, the following sequence leads to the assert:

...

As some of the pin-toggle traces were a little unclear, I decided to add a k_spinlock to protect the pin toggling from getting interrupted:

static struct k_spinlock tlock;
extern struct k_spinlock *plock; // set to &lock in sem.c

extern "C" void d_toggle_impl(int t, void *p)
{
    if (p == plock)
    {
        k_spinlock_key_t key = k_spin_lock(&tlock);
        drivers::debug_pulse1(t);
        k_spin_unlock(&tlock, key);
    }
}

This worked as expected, until the QSPI ISR called k_sem_give(). From there on, the z_spin_lock_valid() check on tlock failed! :-/ Again, the thread_cpu field was set to the correct value for the one and only thread / CPU.

My understanding of the spin lock is that it uses arch_irq_lock() and arch_irq_unlock() to disable / enable all interrupts and thus prevents any other execution context from entering:

static ALWAYS_INLINE unsigned int arch_irq_lock(void)
{
    unsigned int key;
    unsigned int tmp;

    __asm__ volatile(
        "mov %1, %2;"
        "mrs %0, BASEPRI;"
        "msr BASEPRI_MAX, %1;"
        "isb;"
        : "=r"(key), "=r"(tmp)
        : "i"(_EXC_IRQ_DEFAULT_PRIO)
        : "memory");

    return key;
}

A quick static_assert( _EXC_IRQ_DEFAULT_PRIO == 0 ) test revealed that _EXC_IRQ_DEFAULT_PRIO is set to 64 (note: the comparison reduces to '(64 == 0)'). Given that the nRF53 implements 3 priority bits, that 64 would then map to priority 2, right? Which then means that not all interrupts are blocked, only those with priorities 2 to 7, right?

The QSPI ISR was installed as a direct ISR (IRQ_DIRECT_CONNECT()) with priority 2 (but even with priority 7, the asserts in the spinlock implementation fail).

I have not seen anything in the documentation of IRQ_DIRECT_CONNECT() nor in the documentation on the spinlock that would forbid locking the spinlock from a direct interrupt context.

I would expect both spinlock and semaphore to be objects that are safe to use from any execution context.

thank you

Torsten

peter-mitsis commented 3 months ago

@TorstenRobitzki - Just to clarify ... what version of Zephyr are you using?

peter-mitsis commented 3 months ago

Also, can you confirm whether zero latency interrupts are enabled and in use?

TorstenRobitzki commented 3 months ago

@peter-mitsis The version I'm using is commit 272786b0742 on https://github.com/nrfconnect/sdk-zephyr, with hal_nordic overridden with fix_nrf53_qspi from https://github.com/TorstenRobitzki/sdk-hal_nordic

Yes, CONFIG_ZERO_LATENCY_IRQS is enabled.

andyross commented 3 months ago

Something is wrong with the interrupt priority management. That code path through k_spin_lock() takes an arch_irq_lock() before doing the validation: https://github.com/zephyrproject-rtos/zephyr/blob/main/include/zephyr/spinlock.h#L191

The QSPI ISR absolutely must not be firing during that critical section, and the fact that it does is your root cause. You say you're using it as a direct IRQ, which is fine[1]. But my guess is that you've set something else that is allowing it to preempt code holding an irq_lock, which is not fine. I'd be tempted to blame the arch code, but nRF and Cortex-M are some of our most mature areas.

[1] But mostly useless. Direct ISRs aren't, in fact, particularly fast or efficient in ARM Zephyr, as their boilerplate needs to do the same stuff regular ISR entry does. At best an expert app could offload some of that to the end of the interrupt and improve latency. Just a pet peeve; IMHO this isn't our finest feature.

TorstenRobitzki commented 3 months ago

@andyross That's how I understood the spin lock as well. My understanding of the observed behavior is that somehow the critical section is entered more than once. I also tried changing arch_irq_lock() to fully disable interrupts instead of just adjusting BASEPRI. That led to more crashes (sorry, I forgot the details of that test).

Not using a direct ISR works for me. But as the asserts only appeared when increasing the USB/CDC load, I'm not fully convinced that is a reliable solution to my problem.

peter-mitsis commented 3 months ago

If zero latency IRQs are being used with this particular ISR, then I would draw attention to the following lines in the documentation about them ...

The kernel addresses such use-cases by allowing interrupts with critical latency constraints to execute at a priority level that cannot be blocked by interrupt locking.

and

Zero-latency interrupts are expected to be used to manage hardware events directly, and not to interoperate with the kernel code at all. They should treat all kernel APIs as undefined behavior.

If they are being used in this case, then arch_irq_lock() may not be blocking them, resulting in the observed behaviour.

andyross commented 3 months ago

Right, ZLIs would absolutely create this situation: those interrupts aren't "Zephyr" code, they can interrupt the system in any state and thus can't call any Zephyr APIs[1]. But that seems not to be the claim above?

Let's drag @ithinuel in here. If this is really happening it's an arch problem with Cortex M interrupt masking.

[1] Or... you can if you 100% know what you're doing, but you absolutely can't do anything that will try to take a spinlock. Basically the rule is that NMIs can talk to the OS through lockless protocols only.

TorstenRobitzki commented 3 months ago

@peter-mitsis The second part of the documentation most probably applies here. My mistake, then, was to use a semaphore from that zero-latency IRQ context. I simply "forgot" about the IRQ_ZERO_LATENCY argument to IRQ_DIRECT_CONNECT() and only looked up the documentation of IRQ_DIRECT_CONNECT().
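For reference, the difference presumably comes down to the flags argument of IRQ_DIRECT_CONNECT(); a sketch (IRQ number, priority, and handler name are illustrative, not from the original code):

```c
/* With IRQ_ZERO_LATENCY the handler runs above BASEPRI masking and
 * must not call kernel APIs such as k_sem_give(): */
IRQ_DIRECT_CONNECT(QSPI_IRQn, 2, qspi_isr, IRQ_ZERO_LATENCY);

/* Without the flag, the handler is masked by spinlocks/irq_lock()
 * and may interact with kernel objects: */
IRQ_DIRECT_CONNECT(QSPI_IRQn, 2, qspi_isr, 0);
```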

Thank you all for your time and for the clarification!

peter-mitsis commented 3 months ago

I am closing this, as I think we are agreed that this is due to the misuse of kernel APIs with zero-latency interrupts.