zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.58k stars 6.47k forks source link

IRQ locking during exceptions isn't consistent #21923

Open andrewboie opened 4 years ago

andrewboie commented 4 years ago

Describe the bug We should have well-defined behavior for whether interrupts are enabled on the current CPU when servicing an exception, like a fatal error, page fault, MPU fault, etc.

The 32-bit x86 case is the most interesting, it has a check:

    /*
     * restore interrupt enable state, then call the handler
     *
     * interrupts are enabled only if they were allowed at the time
     * the exception was triggered -- this protects kernel level code
     * that mustn't be interrupted
     *
     * Test IF bit of saved EFLAGS and re-enable interrupts if IF=1.
     */

    /* ESP is still pointing to the ESF at this point */

    testl   $0x200, __z_arch_esf_t_eflags_OFFSET(%esp)
    je  allDone
    sti

allDone:

To Reproduce Apply this patch to tests/kernel/fatal and run it:

diff --git a/tests/kernel/fatal/src/main.c b/tests/kernel/fatal/src/main.c
index 14ae937a05..acfd114808 100644
--- a/tests/kernel/fatal/src/main.c
+++ b/tests/kernel/fatal/src/main.c
@@ -46,8 +46,17 @@ static ZTEST_DMEM volatile int expected_reason = -1;

 void k_sys_fatal_error_handler(unsigned int reason, const z_arch_esf_t *pEsf)
 {
+       int key = irq_lock();
+
+       irq_unlock(key);
+
        TC_PRINT("Caught system error -- reason %d\n", reason);

+       if (!arch_irq_unlocked(key)) {
+               printk("irqs locked during exception handling\n");
+               k_fatal_halt(reason);
+       }
+
        if (expected_reason == -1) {
                printk("Was not expecting a crash\n");
                k_fatal_halt(reason);

Expected behavior The expected behavior seems to be currently undefined. I certainly think we do not want them unconditionally locked.

The question is, should they be unlocked all the time, or conditional on whether the faulting context was locked?

I have some doubts on whether the x86-32 policy is really protecting anything, but I need to think about it some more.

Any fix for this should include the equivalent of the above test reflecting the policy and update our fusa requirements database.

Impact System latency servicing interrupts when handling an exception. This is a minor issue if all exceptions are fatal errors that are not expected to occur. But if we get into a realm where exceptions are recoverable (FPU management, page faults, etc) and part of normal kernel operation, this probably breaks our real-time guarantees. We already have one-use case for MMU/MPU fault handling with arch_user_string_nlen()

andrewboie commented 4 years ago

Related issue: some arches let you get preempted in an exception handler, some don't (x86_64 due to its per-cpu dedicated exception stacks). I had trouble on x86_64 tests where a custom k_sys_fatal_error_handler() would give a semaphore and immediately get preempted, with all their context scrambled if another exception was taken before it was swapped back in. This test code could all disappear with a k_thread_join() API but we should still document the policy on this.

andrewboie commented 4 years ago

Summary of my current thinking:

stephanosio commented 4 years ago

For ARM (other than Cortex-M), exceptions execute in separate modes: UND mode for undefined instructions and ABT mode for prefetch and data aborts (e.g. MPU fault); in addition, there are two types of interrupts: IRQ and FIQ (fast IRQ) -- where, by default upon exception entry, FIQs can preempt exceptions whereas IRQs cannot.

So in this case, "IRQs" will be locked during exceptions, but "FIQs" will not be; and real-time interrupts can be routed to the "FIQ channels" instead of the "IRQ channels" in the interrupt controller.

I wonder how that would fit into this picture, especially given that IRQ and FIQ can be masked separately.

stephanosio commented 4 years ago
  • I think we should adopt a global policy of having IRQs unlocked when servicing exceptions, unless the calling context also had IRQs locked. Currently, only x86-32 gets this right.

Agreed. At least one form of external interruption mechanism (in case of ARM, that would be FIQ; on the other platforms where only one type of interruption mechanism is available, that would be the good old IRQ) must be available during exception processing in order to maintain real-time guarantees.

  • Still unclear on what k_is_in_isr() should return during an exception, but perhaps it could depend on whether exceptions are preemptible on the current arch.

In case of ARM, masking of exceptions can be controlled independent of interrupts; I think the specification should state that "k_is_in_isr() shall return true during an exception if the exception is not preemptible by interrupts."

  • I don't see use-cases for nested exceptions.

In case of ARM, nested exceptions can occur and must be handled if, for instance, a data abort (basically, memory fault) exception occurs inside an undefined instruction exception handler (which can implement complex operations like instruction emulation and extension-specific lazy context switching), but this is all arch-specific and will not necessarily be visible to the kernel or the application.

github-actions[bot] commented 4 years ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

andrewboie commented 4 years ago

I'm going to convert this as an enhancement to fight the stale bug GH action as I don't expect to work on this for some time but I don't want this lost either.

andrewboie commented 4 years ago

The following emulator platforms return true for arch_is_in_isr() in an exception that is triggered by some thread context event, like bad memory access, undefined instruction, etc:

ARM: All but Cortex-A: qemu_cortex_m0, qemu_cortex_m3, qemu_cortex_r5, mps2_an385, mps2_an521 ARC: All Arc (nsim_hs, nsim_em, nsim_sem, qemu_arc_em) Xtensa: qemu_xtensa

The following platforms do not:

ARM: cortex-A X86: all x86 Risc-V: all riscv nios II: all nios II native_posix

Note that this does NOT correspond to the set of platforms which have IRQs locked when entering an exception handler -- more inconsistency.

I think this should be false in all cases, except if we were in an IRQ and triggered an exception there. I think the distinguishing factor should be whether the fault happened due to some external event or was synchronous with thread code.

Tested this with this simple test case change:

diff --git a/tests/kernel/fatal/src/main.c b/tests/kernel/fatal/src/main.c
index 55c1768ebb..cb823f3590 100644
--- a/tests/kernel/fatal/src/main.c
+++ b/tests/kernel/fatal/src/main.c
@@ -48,6 +48,8 @@ void k_sys_fatal_error_handler(unsigned int reason, const z_arch_esf_t *pEsf)
 {
        TC_PRINT("Caught system error -- reason %d\n", reason);

+       zassert_false(arch_is_in_isr(), "we are not in IRQ context");
+
        if (expected_reason == -1) {
                printk("Was not expecting a crash\n");
                k_fatal_halt(reason);
andrewboie commented 4 years ago

I don't expect to have time to really dive into this anytime soon unfortunately. I don't believe this represents functional regressions, just a potential real-time latency concern. At the moment, recoverable exceptions are uncommon. They will be less common if we implement either https://github.com/zephyrproject-rtos/zephyr/issues/13074 or https://github.com/zephyrproject-rtos/zephyr/issues/26088. I am busy working on the latter, but the platforms of interest for that (ARM Cortex-A and x86 32-bit) currently appear to do the right thing.

andrewboie commented 4 years ago

To sum up:

We need tests to enforce all of this.