Open andrewboie opened 4 years ago
Related issue: some arches let you get preempted in an exception handler, some don't (x86_64 due to its per-cpu dedicated exception stacks). I had trouble on x86_64 tests where a custom k_sys_fatal_error_handler() would give a semaphore and immediately get preempted, with all their context scrambled if another exception was taken before it was swapped back in. This test code could all disappear with a k_thread_join() API but we should still document the policy on this.
Summary of my current thinking:
For ARM (other than Cortex-M), exceptions execute in separate modes: UND mode for undefined instructions and ABT mode for prefetch and data aborts (e.g. MPU fault); in addition, there are two types of interrupts: IRQ and FIQ (fast IRQ) -- where, by default upon exception entry, FIQs can preempt exceptions whereas IRQs cannot.
So in this case, "IRQs" will be locked during exceptions, but "FIQs" will not be; and real-time interrupts can be routed to the "FIQ channels" instead of the "IRQ channels" in the interrupt controller.
I wonder how that would fit into this picture, especially given that IRQ and FIQ can be masked separately.
- I think we should adopt a global policy of having IRQs unlocked when servicing exceptions, unless the calling context also had IRQs locked. Currently, only x86-32 gets this right.
Agreed. At least one form of external interruption mechanism (in case of ARM, that would be FIQ; on the other platforms where only one type of interruption mechanism is available, that would be the good old IRQ) must be available during exception processing in order to maintain real-time guarantees.
- Still unclear on what k_is_in_isr() should return during an exception, but perhaps it could depend on whether exceptions are preemptible on the current arch.
In case of ARM, masking of exceptions can be controlled independent of interrupts; I think the specification should state that "k_is_in_isr() shall return true during an exception if the exception is not preemptible by interrupts."
- I don't see use-cases for nested exceptions.
In case of ARM, nested exceptions can occur and must be handled if, for instance, a data abort (basically, memory fault) exception occurs inside an undefined instruction exception handler (which can implement complex operations like instruction emulation and extension-specific lazy context switching), but this is all arch-specific and will not necessarily be visible to the kernel or the application.
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.
I'm going to convert this as an enhancement to fight the stale bug GH action as I don't expect to work on this for some time but I don't want this lost either.
The following emulator platforms return true for arch_is_in_isr()
in an exception that is triggered by some thread context event, like bad memory access, undefined instruction, etc:
ARM: All but Cortex-A: qemu_cortex_m0, qemu_cortex_m3, qemu_cortex_r5, mps2_an385, mps2_an521 ARC: All Arc (nsim_hs, nsim_em, nsim_sem, qemu_arc_em) Xtensa: qemu_xtensa
The following platforms do not:
ARM: cortex-A X86: all x86 Risc-V: all riscv nios II: all nios II native_posix
Note that this does NOT correspond to the set of platforms which have IRQs locked when entering an exception handler -- more inconsistency.
I think this should be false in all cases, except if we were in an IRQ and triggered an exception there. I think the distinguishing factor should be whether the fault happened due to some external event or was synchronous with thread code.
Tested this with this simple test case change:
diff --git a/tests/kernel/fatal/src/main.c b/tests/kernel/fatal/src/main.c
index 55c1768ebb..cb823f3590 100644
--- a/tests/kernel/fatal/src/main.c
+++ b/tests/kernel/fatal/src/main.c
@@ -48,6 +48,8 @@ void k_sys_fatal_error_handler(unsigned int reason, const z_arch_esf_t *pEsf)
{
TC_PRINT("Caught system error -- reason %d\n", reason);
+ zassert_false(arch_is_in_isr(), "we are not in IRQ context");
+
if (expected_reason == -1) {
printk("Was not expecting a crash\n");
k_fatal_halt(reason);
I don't expect to have time to really dive into this anytime soon unfortunately. I don't believe this represents functional regressions, just a potential real-time latency concern. At the moment, recoverable exceptions are uncommon. They will be less common if we implement either https://github.com/zephyrproject-rtos/zephyr/issues/13074 or https://github.com/zephyrproject-rtos/zephyr/issues/26088. I am busy working on the latter, but the platforms of interest for that (ARM Cortex-A and x86 32-bit) currently appear to do the right thing.
To sum up:
arch_is_in_isr()
return value with respect to both interrupts and exceptions (proposal: interrupts yes, exceptions no)We need tests to enforce all of this.
Describe the bug We should have well-defined behavior for whether interrupts are enabled on the current CPU when servicing an exception, like a fatal error, page fault, MPU fault, etc.
The 32-bit x86 case is the most interesting, it has a check:
To Reproduce Apply this patch to tests/kernel/fatal and run it:
Expected behavior The expected behavior seems to be currently undefined. I certainly think we do not want them unconditionally locked.
The question is, should they be unlocked all the time, or conditional on whether the faulting context was locked?
I have some doubts on whether the x86-32 policy is really protecting anything, but I need to think about it some more.
Any fix for this should include the equivalent of the above test reflecting the policy and update our fusa requirements database.
Impact System latency servicing interrupts when handling an exception. This is a minor issue if all exceptions are fatal errors that are not expected to occur. But if we get into a realm where exceptions are recoverable (FPU management, page faults, etc) and part of normal kernel operation, this probably breaks our real-time guarantees. We already have one-use case for MMU/MPU fault handling with arch_user_string_nlen()