ARM M4 MPU backed userspace livelocks on stack overflow when FPU enabled

Describe the bug When using the MPU-backed userspace feature on the NRF52840 with the FPU enabled, a correctly-constructed stack overflow of the user thread can lead to an unrecoverable livelock where the overflowed thread never successfully aborts.

I am using the userspace feature to sandbox some 3rd-party code that involves a lot of parsing, etc, and want recover in the case of a crash in userspace. I have implemented a k_sys_fatal_error_handler that returns to the kernel to abort userspace threads that have crashed, and normally (oops, divide-by-zero, etc) this works fine. However, when the FPU is enabled (CONFIG_FPU=y) some instances of stack overflow instead result the system looping hard faults and never aborting the user thread. Note that the FPU doesn't need to actually be in use for this to happen. I couldn't reproduce this issue with a kernel thread and CONFIG_HW_STACK_PROTECTION.

Whether or not this occurs depends on by how much the overflowing stack frame exceeds the MPU region. It seems that small overflows behave correctly (or at least are recoverable) where large overflows lead to the livelock behavior.

The exact behavior depends also on the state of the scheduler. Having a second thread runnable at abort-time seems to change the behavior, and exposes another edge case where it double-faults from the overflowing thread but can recover (there are still overflow amounts where it livelocks in this case.)

It's possible to detect this livelock in k_sys_fatal_error_handler by seeing if k_current_get()->base.thread_state is _THREAD_ABORTING or _THREAD_DEAD, and in that case halt the system.

To Reproduce Clone https://github.com/602p/zephyr-livelock-repro and run on an NRF52840DK. I'd suspect it would happen on other M4 devices too.

Uncomment #define ENABLE_MITIGATION in main.c to detect and halt on livelock condition. Uncomment #define MODE_JOIN to cause the main thread to join on the user thread instead of polling; changes exact livelock behaviour. See char to_try[] = { for the exact different cases of failure depending on the amount by which the user thread overflows it's stack.

Expected behavior All amounts of overflow should be recoverable or, at minimum, should bugcheck the kernel instead of livelocking.

Impact Mitigable, but still problematic because it changes the security model of using userspace to sandbox code since it could still crash the whole device.

Logs and console output See https://github.com/602p/zephyr-livelock-repro/blob/main/log.txt . The script tries overflowing a user task by increasingly large stack frames. The first few recover normally, the last gets livelocked.

Environment (please complete the following information):

OS: Linux
Toolchain: This repo's zephyr, gcc-arm-none-eabi-9-2020-q2-update
Zephyr commit f86f6e00251066815b0a7a7ad45047b525b3fab9 (originally identified on Nordic NCS Zephyr)

Cause conjecture Stack overflow is handled at zephyr/arch/arm/core/aarch32/cortex_m/fault.c:325, which resets the stack pointer to a valid address before returning to that task with PendSv set to execute context switching from the user task. Is Zephyr forgetting to also reset some part of the [lazy] FP save/restore info (Floating Point Context Address Register?) and that remains invalid, triggering another MPU fault while trying to ret into the user task to swap it out?

I have run the sample code on sam_e70_xplained board and can confirm that the problem can be observed also on SAM E70 SoC (Cortex-M7). The exact behavior in regards to how likely the problem is to manifest itself

Whether or not this occurs depends on by how much the overflowing stack frame exceeds the MPU region. It seems that small overflows behave correctly (or at least are recoverable) where large overflows lead to the livelock behavior.

is different. In case of SAM E70 device it is much easier to trigger the invalid behavior. However, the failure mode is exactly the same.

I have analyzed the issue a bit and it does not look good. It seems we have a larger problem in handling stack overflow condition in user mode on ARM platform. As issue's author has written

Stack overflow is handled at zephyr/arch/arm/core/aarch32/cortex_m/fault.c:325, which resets the stack pointer to a valid address before returning to that task with PendSv set to execute context switching from the user task.

That's the line of code and relevant excerpt from the comments

    * [...] Therefore,
    * we manually force the stack pointer to the
    * lowest allowed position, inside the thread's
    * stack.
    *
    * Note:
    * [...]
    * The manual adjustment of PSP is safe, as we
    * will not be re-scheduling this thread again
    * for execution; thread stack corruption is a
    * fatal error and a thread that corrupted its
    * stack needs to be aborted.
    */
__set_PSP(min_stack_ptr);

The stack is reset to the lowest possible stack address, i.e. stack full. That works well unless the stack overflow is triggered while entering SVCall exception. SVCall has higher priority than PendSV exception. Unlike the comment suggests manual adjustment of PSP is not safe in this case since after PSP is adjusted and MemManage exception exits the processor will continue executing SVCall handler. Relevant excerpt from z_arm_svc handler in arch/arm/core/aarch32/swap_helper.S:

SECTION_FUNC(TEXT, z_arm_svc)
    tst lr, #_EXC_RETURN_SPSEL_Msk /* did we come from thread mode ? */
    ite eq  /* if zero (equal), came from handler mode */
        mrseq r0, MSP   /* handler mode, stack frame is on MSP */
        mrsne r0, PSP   /* thread mode, stack frame is on PSP */

    /* Figure out what SVC call number was invoked */

    ldr r1, [r0, #24]   /* grab address of PC from stack frame */
    /* SVC is a two-byte instruction, point to it and read the
     * SVC number (lower byte of SCV instruction)
     */
    ldrb r1, [r1, #-2]

The hard fault seen in the logs is triggered by the last line of the above code. We've entered from the thread mode so the routine is using PSP to retrieve SVC number from the corrupted stack. Again, the stack is corrupted since upon SVCall exception entry the processor tried to save the current state (exception frame) on the PSP (application stack). This failure mode happens always when the stack overflow is caused by SVC exception entry. It was observed when FPU is enabled since exception frame with floating-point storage is much larger (26 4-byte words) than exception frame without floating-point storage (8 4-byte words). Additionally the exception frame with floating-point storage is mostly 0s. That's relevant since when executing

    ldrb r1, [r1, #-2]

the value of r1 register is taken from the stack. If the value of r1 is 0 we get a hard fault. It is a "precise data bus error" which means that upon exception exit we are going to execute exactly the same line of code triggering the same hard fault. Thus the locking behavior. If the value of r1 - 2 happens to point to the location that processor can read then it will retrieve some bogus SVC call number. This will either cause kernel oops if the SVC call number is detected as invalid or we are going to execute some random service call.

Looking at the sample application code

void stack_overflow(int stride) {
    char arr[stride];
    LOG_DBG("Overflowing, &arr=%p", &arr);
    stack_overflow(stride);
}

The invalid behavior was observed when calling

    LOG_DBG("Overflowing, &arr=%p", &arr);

and not when calling stack_overflow(stride);. Behind the scenes LOG_DBG macro is executing SVC instruction.

Since SVCall exception cannot be executed due to corrupted stack MemManage fault handler needs to clear SVCALLPENDED bit in SHCSR register. E.g. by executing

    if (SCB->SHCSR & SCB_SHCSR_SVCALLPENDED_Msk) {
        SCB->SHCSR &= ~SCB_SHCSR_SVCALLPENDED_Msk;
    }

When working on the fix we should take into account that:

MemManage fault can be triggered during exception entry caused by any other exception, not necessary SVCall. However, most likely only SVCall is a problem.
SysTick exception can be generated while we are handling MemManage fault. Again, not necessary a problem but it has higher priority than PendSV exception and will be executed first.

@theotherjimmy Can you try to reproduce this?

I have reproduced this on an mps3_an547

This looks a bit worse than I had anticipated: It affects not just SVC, but all irqs with a higher priority than PendSV.

In particular, the _isr_wrapper touches/pushes to the process stack here: https://github.com/zephyrproject-rtos/zephyr/blob/main/arch/arm/core/aarch32/isr_wrapper.S#L80-L89 which could cause a stack overflow and mem manage fault. I think the only practical solution in the face of this issue, is to force the context switch before entering anything of lower priority than the mem manage fault to have a working stack.

Edit: I thought github would be better and show the snippet inline. It's below


    /*
     * Save away r0-r3, r12 and lr_irq for the previous context to the
     * process stack since they are clobbered here.  Also, save away lr
     * and spsr_irq since we may swap processes and return to a different
     * thread.
     */
    sub lr, lr, #4
    srsdb #MODE_SYS!
    cps #MODE_SYS
    push {r0-r3, r12, lr}

Edit2: So as far as I can tell, I read this assembly incorrectly and this does not push to the process stack.

@carlescufi @MaureenHelm I think we should discuss this on the dev review call. Flagging it for review/discussion, and @theotherjimmy and myself can attend.

In particular, the _isr_wrapper touches/pushes to the process stack here: https://github.com/zephyrproject-rtos/zephyr/blob/main/arch/arm/core/aarch32/isr_wrapper.S#L80-L89 which could cause a stack overflow and mem manage fault.

Just to make it explicit: the code mentioned above is executed on Cortex-R, Cortex-A architectures only (which we also need to review) and not on Cortex-M devices the issue was originally observed on.

I'm not so sure. I think that's added with the config CONFIG_GEN_SW_ISR_TABLE (here: https://github.com/zephyrproject-rtos/zephyr/blob/main/arch/arm/core/aarch32/CMakeLists.txt#L17)

zephyr_library_sources_ifdef(CONFIG_GEN_SW_ISR_TABLE isr_wrapper.S)

The same file switches between Cortex A/R and Cortex M later on (here: https://github.com/zephyrproject-rtos/zephyr/blob/main/arch/arm/core/aarch32/CMakeLists.txt#L24 and here: https://github.com/zephyrproject-rtos/zephyr/blob/main/arch/arm/core/aarch32/CMakeLists.txt#L32-L33

add_subdirectory_ifdef(CONFIG_CPU_CORTEX_M cortex_m)
...
add_subdirectory_ifdef(CONFIG_CPU_AARCH32_CORTEX_R cortex_a_r)
add_subdirectory_ifdef(CONFIG_CPU_AARCH32_CORTEX_A cortex_a_r)

Further, the Kconfig in the cortex_m subdir has the following:

config GEN_ISR_TABLES
    default y

I think that file applies.

It seems that this is related to k_sys_fatal_error_handler, as documented (emphasis mine) :

The policy for what to do when encountering a fatal error is determined by the implementation of the k_sys_fatal_error_handler() function. This function has a default implementation with weak linkage that calls LOG_PANIC() to dump all pending logging messages and then unconditionally halts the system with k_fatal_halt().

Applications are free to implement their own error handling policy by overriding the implementation of k_sys_fatal_error_handler(). If the implementation returns, the faulting thread will be aborted and the system will otherwise continue to function. See the documentation for this function for additional details and constraints

It seems to me like we might need an additional check before blindly trusting that the user implementation of k_sys_fatal_error_handler() that we can actually abort the thread as promised.

The check would be something like:

                /* If we had a memory manage fault during another exception or
                 * interrupt, other than PendSV, we will return to that handler
                 * before PendSV with a corrupt user stack pointer. We can't be
                 * sure that the handler will not attempt something such as read
                 * the instruction pointed to by the pc in the exception frame,
                 * which is now corrupt, like the SVCall handler does.
                 */
                if ((esf->basic.xpsr & IPSR_ISR_Msk) != 14
                     && (esf->basic.xpsr & IPSR_ISR_Msk) != 0x0) {
                    k_fatal_halt(reason);
                }

This is somewhat unsatisfying. However, this satisfies the minimum in the OP:

at minimum, should bugcheck the kernel instead of livelocking.

So I'll probably submit a PR that does this, and work on not halting on this problem.

It seems to me like we might need an additional check before blindly trusting that the user implementation of k_sys_fatal_error_handler() that we can actually abort the thread as promised.

We should be careful about introducing a check like this since it seems to violate the behavior described viz. "If this function returns, then the currently executing thread will be aborted." If returning from k_sys_fatal_error_handler() is going to sometimes result directly in a halt, that's something that probably needs to be exposed to the user beforehand so they can do whatever hardware shutdown actions, make a policy decision about rebooting, etc.

That having been said, the check proposed here is a lot cleaner than detecting a double fault and should avoid the "execute some random service call" case @mnkp noted. So maybe it should be merged for the meantime with a warning somewhere about "on ARM returning may lock the CPU; check esf->basic.xpsr if you have actions you need to preform before halt."

Actually, probably all that needs to happen is that the check calls k_panic instead of k_fatal_halt; then it should reenter k_sys_fatal_error_handler with reason == K_ERR_KERNEL_PANIC and the user code should then know it has to do whatever shutdown handling.

That having been said, it'd still be good to get the underlying issue fixed so that these kinds of user task crashes are recoverable.

It looks like they are (warning big log):

[00:00:04.150,000] <inf> app: Starting task, stride=50...
[00:00:04.150,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:04.150,000] <dbg> app: main: Spinning in main thread...
[00:00:05.160,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:05.160,000] <dbg> app: main: Spinning in main thread...
[00:00:06.170,000] <inf> app: Stackoverflowing...
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b80
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b48
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b10
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ad8
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001aa0
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a68
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a30
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019f8
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019c0
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001988
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001950
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001918
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e0
[00:00:06.180,000] <err> os: ***** MPU FAULT *****
[00:00:06.180,000] <err> os:   Stacking error (context area might be not valid)
[00:00:06.180,000] <err> os: r0/a1:  0x30001bb8  r1/a2:  0x1000c5d4  r2/a3:  0x00000000
[00:00:06.180,000] <err> os: r3/a4:  0x10003bad r12/ip:  0x30001820 r14/lr:  0xffffffed
[00:00:06.190,000] <err> os:  xpsr:  0x00000000
[00:00:06.190,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:06.190,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:06.190,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:06.190,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:06.190,000] <err> os: fpscr:  0x00040000
[00:00:06.190,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:06.190,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:06.200,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:06.200,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:06.200,000] <inf> app: Thread is normal
[00:00:06.200,000] <wrn> app: Userspace thread; aborting it.
[00:00:06.200,000] <inf> app: Starting task, stride=100...
[00:00:06.200,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:06.200,000] <dbg> app: main: Spinning in main thread...
[00:00:07.210,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:07.210,000] <dbg> app: main: Spinning in main thread...
[00:00:08.220,000] <inf> app: Stackoverflowing...
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b50
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ae8
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a80
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a18
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019b0
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001948
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e0
[00:00:08.220,000] <err> os: ***** MPU FAULT *****
[00:00:08.220,000] <err> os:   Stacking error (context area might be not valid)
[00:00:08.230,000] <err> os:   Floating-point lazy state preservation error
[00:00:08.230,000] <err> os: r0/a1:  0x30000ab0  r1/a2:  0x30000944  r2/a3:  0x00000000
[00:00:08.230,000] <err> os: r3/a4:  0x10004089 r12/ip:  0x00000000 r14/lr:  0x30000af0
[00:00:08.230,000] <err> os:  xpsr:  0x10007c00
[00:00:08.230,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:08.230,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:08.230,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:08.240,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:08.240,000] <err> os: fpscr:  0x00000000
[00:00:08.240,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:08.240,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:08.240,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:08.240,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:08.240,000] <inf> app: Thread is normal
[00:00:08.240,000] <wrn> app: Userspace thread; aborting it.
[00:00:08.240,000] <err> os: ***** MPU FAULT *****
[00:00:08.240,000] <err> os: r0/a1:  0x00000000  r1/a2:  0x00000000  r2/a3:  0x00000000
[00:00:08.250,000] <err> os: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00000000
[00:00:08.250,000] <err> os:  xpsr:  0x00000000
[00:00:08.250,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:08.250,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0xffffffff
[00:00:08.250,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:08.250,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:08.250,000] <err> os: fpscr:  0x00000000
[00:00:08.260,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:08.260,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:08.260,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:08.260,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:08.260,000] <inf> app: Thread is dead
[00:00:08.260,000] <wrn> app: Userspace thread; aborting it.
[00:00:08.260,000] <inf> app: Starting task, stride=120...
[00:00:08.260,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:08.260,000] <dbg> app: main: Spinning in main thread...
[00:00:09.270,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:09.270,000] <dbg> app: main: Spinning in main thread...
[00:00:10.280,000] <inf> app: Stackoverflowing...
[00:00:10.280,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b40
[00:00:10.280,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ac8
[00:00:10.280,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a50
[00:00:10.280,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019d8
[00:00:10.280,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001960
[00:00:10.280,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e8
[00:00:10.280,000] <err> os: ***** MPU FAULT *****
[00:00:10.280,000] <err> os:   Stacking error (context area might be not valid)
[00:00:10.280,000] <err> os:   Floating-point lazy state preservation error
[00:00:10.290,000] <err> os: r0/a1:  0x30000623  r1/a2:  0x30001c00  r2/a3:  0x30000ab0
[00:00:10.290,000] <err> os: r3/a4:  0x30000944 r12/ip:  0x00000000 r14/lr:  0x10004089
[00:00:10.290,000] <err> os:  xpsr:  0x30000a00
[00:00:10.290,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:10.290,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:10.290,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:10.290,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:10.300,000] <err> os: fpscr:  0x00000000
[00:00:10.300,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:10.300,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:10.300,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:10.300,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:10.300,000] <inf> app: Thread is normal
[00:00:10.300,000] <wrn> app: Userspace thread; aborting it.
[00:00:10.300,000] <err> os: ***** MPU FAULT *****
[00:00:10.300,000] <err> os: r0/a1:  0x00000000  r1/a2:  0x00000000  r2/a3:  0x00000000
[00:00:10.300,000] <err> os: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00000000
[00:00:10.310,000] <err> os:  xpsr:  0x00002000
[00:00:10.310,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:10.310,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:10.310,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:10.310,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:10.310,000] <err> os: fpscr:  0x00000000
[00:00:10.310,000] <err> os: Faulting instruction address (r15/pc): 0x3000185c
[00:00:10.320,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:10.320,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:10.320,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:10.320,000] <inf> app: Thread is dead
[00:00:10.320,000] <wrn> app: Userspace thread; aborting it.

The only difference is that I'm clearing the pending SVCall during the thread abort when the current thread == the thread to abort:

diff --git a/arch/arm/core/aarch32/cortex_m/thread_abort.c b/arch/arm/core/aarch32/cortex_m/thread_abort.c
index 028a62fed8..d9b563ba42 100644
--- a/arch/arm/core/aarch32/cortex_m/thread_abort.c
+++ b/arch/arm/core/aarch32/cortex_m/thread_abort.c
@@ -40,6 +40,12 @@ void z_impl_k_thread_abort(k_tid_t thread)
                         * is not an implicit scheduler invocation.
                         */
                        SCB->ICSR |= SCB_ICSR_PENDSVSET_Msk;
+                       /* Clear any system calls that may be pending
+                        * as they have a higher priority than the PendSV
+                        * handler and will check the stack of the thread
+                        * being aborted.
+                        */
+                       SCB->SHCSR &= ~SCB_SHCSR_SVCALLPENDED_Msk;
                }
        }

@602p That check actually halts before the call to k_sys_fatal_error_handler(), so we could reasonably call k_panic there as you suggested and the user would only see the panic.

The above log is exactly the same for both MODE_JOIN and !MODE_JOIN, leading me to think that this might actually be solvable by clearing the SVCall pending flag as I wrote.

I wonder if we should still have the check in case we manage to have an (vanishingly small chance) interrupt that's stacking causes this problem. We could clear all interrupts, but that seems... bad.

That's definitely promising! But it still seems something is not totally right that we see double faulting and

[00:00:10.320,000] <inf> app: Thread is dead

on some strides and not others. I'm not familiar enough with the internals to know if this is a problem, but it seems like it might confuse user code if nothing else to call k_sys_fatal_error_handler for the same task twice.

I'll poke around with gdb to figure out what's going on there.

I wonder if we should still have the check in case we manage to have an (vanishingly small chance) interrupt that's stacking causes this problem. We could clear all interrupts, but that seems... bad.

May be wise, but introduces an unfortunate random lockup case. @mnkp noted that SysTick may also fire/stack during handling of a MemManage.

Ha! I caught a systick after one of these. No problem, it seems to not mess with the user stack.

z_arm_fatal_error (reason=reason@entry=2, esf=esf@entry=0x30001790 <z_interrupt_stacks+1904>) at /home/jimbri01/src/c/zephyr/zephyr/arch/arm/core/aarch32/fatal.c:63
63              z_fatal_error(reason, esf);
(gdb) n
z_arm_usage_fault () at /home/jimbri01/src/c/zephyr/zephyr/arch/arm/core/aarch32/cortex_m/fault_s.S:111
111             pop {r0, pc}
(gdb) bt
#0  z_arm_usage_fault ()
    at /home/jimbri01/src/c/zephyr/zephyr/arch/arm/core/aarch32/cortex_m/fault_s.S:111
#1  <signal handler called>
#2  0x00000000 in ?? ()
(gdb) si
sys_clock_isr (arg=0x0) at /home/jimbri01/src/c/zephyr/zephyr/drivers/timer/cortex_m_systick.c:123
123             elapsed();
(gdb) bt
#0  sys_clock_isr (arg=0x0)
    at /home/jimbri01/src/c/zephyr/zephyr/drivers/timer/cortex_m_systick.c:123
#1  <signal handler called>
#2  0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

The double-exception is a bit wonky, and is related to Lazy FP stacking (I think).

So I set up a display for the value of the MEMFAULTPEND of the System Control Block's System Handler Control and State Register, and held down enter in the offending exception until it went high:

mpsc_pbuf_alloc (buffer=buffer@entry=0x300008ac <log_buffer>, wlen=wlen@entry=6, timeout=...) at /home/jimbri01/src/c/zephyr/zephyr/lib/os/mpsc_pbuf.c:244
244             if (wlen > (buffer->size - 1)) {
1: *0xe000ed24 >> 13 & 1 = 0
(gdb)
0x100097a4      244             if (wlen > (buffer->size - 1)) {
1: *0xe000ed24 >> 13 & 1 = 0
(gdb)
0x100097a8      244             if (wlen > (buffer->size - 1)) {
1: *0xe000ed24 >> 13 & 1 = 1

Well! What instruction is that?

Dump of assembler code for function mpsc_pbuf_alloc:
   0x100097a0 <+0>:     stmdb   sp!, {r4, r5, r6, r7, r8, r9, r10, r11, lr}
=> 0x100097a4 <+4>:     vpush   {d8}

Ah, it's a floating point push. Considering that the rest of the function does not use floating point registers, I wonder why it's there? Anyway, the Floating Point Context Control register indicates that the Stack Pointer limit is violated by lazy stacking:

(gdb) p/t *0xe000ef34
$29 = 11000000000000000000010011111111
(gdb) si
0x100097a8      244             if (wlen > (buffer->size - 1)) {
1: *0xe000ed24 >> 13 & 1 = 1
(gdb) p/t *0xe000ef34
$30 = 11000000000000000000010011111110

I think this is probably harmless, as the arm arm says it's ignored in this situation, but still I'll try to find a way to not have it happen.

Edit: Read register incorrectly.

Amusingly, this all happens in the mem_manage_fault, on the first print.

Breakpoint 2, 0x100097a4 in mpsc_pbuf_alloc (buffer=buffer@entry=0x300008ac <log_buffer>, wlen=wlen@entry=6, timeout=...) at /home/jimbri01/src/c/zephyr/zephyr/lib/os/mpsc_pbuf.c:244
244             if (wlen > (buffer->size - 1)) {
(gdb) display *0xe000ed24 >> 13 & 1
1: *0xe000ed24 >> 13 & 1 = 0
(gdb) si
0x100097a8      244             if (wlen > (buffer->size - 1)) {
1: *0xe000ed24 >> 13 & 1 = 1
(gdb) bt
#0  0x100097a8 in mpsc_pbuf_alloc (buffer=buffer@entry=0x300008ac <log_buffer>, wlen=wlen@entry=6,
    timeout=...) at /home/jimbri01/src/c/zephyr/zephyr/lib/os/mpsc_pbuf.c:244
#1  0x10001fa8 in z_log_msg2_alloc (wlen=wlen@entry=6)
    at /home/jimbri01/src/c/zephyr/zephyr/subsys/logging/log_core.c:1216
#2  0x10009dd6 in z_impl_z_log_msg2_static_create (source=0x1000bbdc <log_const_os>, desc=...,
    package=package@entry=0x30001754 <z_interrupt_stacks+1844> "\002", data=data@entry=0x0)
    at /home/jimbri01/src/c/zephyr/zephyr/subsys/logging/log_msg2.c:36
#3  0x10009e60 in z_log_msg2_static_create (source=<optimized out>, desc=..., desc@entry=...,
    package=package@entry=0x30001754 <z_interrupt_stacks+1844> "\002", data=0x0)
    at /home/jimbri01/src/c/zephyr/build/repro/livelock/zephyr/include/generated/syscalls/log_msg2.h:37
#4  0x100037a6 in mem_manage_fault (esf=esf@entry=0x300017d8 <z_interrupt_stacks+1976>,
    from_hard_fault=0, recoverable=0x3000178b <z_interrupt_stacks+1899>)
    at /home/jimbri01/src/c/zephyr/zephyr/arch/arm/core/aarch32/cortex_m/fault.c:222

So the first exception prints that it's a floating point lazy stacking problem, and the second one prints no reason. That part should be simple enough to fix, by checking for that condition and clearing the pending when it happens:

diff --git a/arch/arm/core/aarch32/cortex_m/fault.c b/arch/arm/core/aarch32/cortex_m/fault.c
index 15d1006941..c53439b059 100644
--- a/arch/arm/core/aarch32/cortex_m/fault.c
+++ b/arch/arm/core/aarch32/cortex_m/fault.c
@@ -338,6 +338,17 @@ static uint32_t mem_manage_fault(z_arch_esf_t *esf, int from_hard_fault,
 #endif /* CONFIG_MPU_STACK_GUARD || CONFIG_USERSPACE */
        }

+       /* When we were handling this fault, we may have triggered a fp
+        * lazy stacking Memory Manage fault. At the time of writing, this
+        * can happen when printing.  If that's true, we should clear the
+        * pending flag in addition to the clearing the reason for the fault
+        */
+#if defined(CONFIG_ARMV7_M_ARMV8_M_FP)
+       if ((SCB->CFSR & SCB_CFSR_MLSPERR_Msk) != 0) {
+               SCB->SHCSR &= ~SCB_SHCSR_MEMFAULTPENDED_Msk;
+       }
+#endif /* CONFIG_ARMV7_M_ARMV8_M_FP */
+

With both patches, the following log is produced:

../tf-m/qemu-arm/build/qemu-system-arm -s -cpu cortex-m55 -machine mps3-an547 -nographic -vga none -net none -pidfile qemu.pid  -chardev stdio,id=con,mux=on -d guest_errors,unimp -serial chardev:con -mon chardev=con,mode=readline -icount shift=6,align=off,sleep=off -rtc clock=vm -device loader,file=build/repro/livelock/zephyr/zephyr.elf
Invalid read at addr 0x0, size 4, region '(null)', reason: rejected
Invalid read at addr 0x4, size 4, region '(null)', reason: rejected
qemu-system-arm: warning: nic lan9118.0 has no peer
*** Booting Zephyr OS build zephyr-v3.0.0-945-g1070d3041059  ***
[00:00:00.000,000] <inf> app: Starting task, stride=1...
[00:00:00.000,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:00.000,000] <dbg> app: main: Spinning in main thread...
[00:00:01.010,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:01.010,000] <dbg> app: main: Spinning in main thread...
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001938
--- 80 messages dropped ---
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001930
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001928
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001920
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001918
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001910
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001908
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001900
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018f8
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018f0
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e8
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e0
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018d8
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018d0
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018c8
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018c0
[00:00:02.020,000] <err> os: ***** MPU FAULT *****
[00:00:02.020,000] <err> os:   Stacking error (context area might be not valid)
[00:00:02.020,000] <err> os: r0/a1:  0x30001820  r1/a2:  0xffffffed  r2/a3:  0x1000bbc0
[00:00:02.020,000] <err> os: r3/a4:  0x00002100 r12/ip:  0x300018ac r14/lr:  0x00000000
[00:00:02.020,000] <err> os:  xpsr:  0x10008c00
[00:00:02.020,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:02.020,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:02.020,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:02.020,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:02.020,000] <err> os: fpscr:  0x00040000
[00:00:02.020,000] <err> os: Faulting instruction address (r15/pc): 0x10008d11
[00:00:02.020,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:02.020,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:02.050,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:02.050,000] <inf> app: Thread is normal
[00:00:02.050,000] <wrn> app: Userspace thread; aborting it.
[00:00:02.060,000] <inf> app: Starting task, stride=10...
[00:00:02.060,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:02.060,000] <dbg> app: main: Spinning in main thread...
[00:00:03.070,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:03.070,000] <dbg> app: main: Spinning in main thread...
[00:00:04.080,000] <inf> app: Stackoverflowing...
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ba8
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b98
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b88
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b78
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b68
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b58
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b48
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b38
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b28
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b18
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b08
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001af8
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ae8
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ad8
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ac8
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ab8
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001aa8
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a98
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a88
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a78
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a68
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a58
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a48
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a38
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a28
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a18
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a08
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019f8
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019e8
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019d8
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019c8
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019b8
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019a8
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001998
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001988
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001978
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001968
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001958
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001948
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001938
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001928
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001918
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001908
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018f8
[00:00:04.130,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e8
[00:00:04.130,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018d8
[00:00:04.130,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018c8
[00:00:04.130,000] <err> os: ***** MPU FAULT *****
[00:00:04.130,000] <err> os:   Stacking error (context area might be not valid)
[00:00:04.130,000] <err> os: r0/a1:  0x30001820  r1/a2:  0xffffffed  r2/a3:  0x00000000
[00:00:04.130,000] <err> os: r3/a4:  0x00000000 r12/ip:  0x1000bbc0 r14/lr:  0x00002100
[00:00:04.130,000] <err> os:  xpsr:  0x00000000
[00:00:04.130,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:04.140,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:04.140,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:04.140,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:04.140,000] <err> os: fpscr:  0x00040000
[00:00:04.140,000] <err> os: Faulting instruction address (r15/pc): 0x300018b4
[00:00:04.140,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:04.140,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:04.150,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:04.150,000] <inf> app: Thread is normal
[00:00:04.150,000] <wrn> app: Userspace thread; aborting it.
[00:00:04.150,000] <inf> app: Starting task, stride=50...
[00:00:04.150,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:04.150,000] <dbg> app: main: Spinning in main thread...
[00:00:05.160,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:05.160,000] <dbg> app: main: Spinning in main thread...
[00:00:06.170,000] <inf> app: Stackoverflowing...
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b80
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b48
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b10
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ad8
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001aa0
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a68
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a30
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019f8
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019c0
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001988
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001950
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001918
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e0
[00:00:06.180,000] <err> os: ***** MPU FAULT *****
[00:00:06.180,000] <err> os:   Stacking error (context area might be not valid)
[00:00:06.180,000] <err> os: r0/a1:  0x30001bb8  r1/a2:  0x1000c5e0  r2/a3:  0x00000000
[00:00:06.180,000] <err> os: r3/a4:  0x10003bb9 r12/ip:  0x30001820 r14/lr:  0xffffffed
[00:00:06.190,000] <err> os:  xpsr:  0x00000000
[00:00:06.190,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:06.190,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:06.190,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:06.190,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:06.190,000] <err> os: fpscr:  0x00040000
[00:00:06.190,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:06.190,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:06.200,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:06.200,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:06.200,000] <inf> app: Thread is normal
[00:00:06.200,000] <wrn> app: Userspace thread; aborting it.
[00:00:06.200,000] <inf> app: Starting task, stride=100...
[00:00:06.200,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:06.200,000] <dbg> app: main: Spinning in main thread...
[00:00:07.210,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:07.210,000] <dbg> app: main: Spinning in main thread...
[00:00:08.220,000] <inf> app: Stackoverflowing...
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b50
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ae8
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a80
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a18
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019b0
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001948
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e0
[00:00:08.220,000] <err> os: ***** MPU FAULT *****
[00:00:08.220,000] <err> os:   Stacking error (context area might be not valid)
[00:00:08.230,000] <err> os:   Floating-point lazy state preservation error
[00:00:08.230,000] <err> os: r0/a1:  0x30000ab0  r1/a2:  0x30000944  r2/a3:  0x00000000
[00:00:08.230,000] <err> os: r3/a4:  0x10004095 r12/ip:  0x00000000 r14/lr:  0x30000af0
[00:00:08.230,000] <err> os:  xpsr:  0x10007c00
[00:00:08.230,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:08.230,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:08.230,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:08.240,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:08.240,000] <err> os: fpscr:  0x00000000
[00:00:08.240,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:08.240,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:08.240,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:08.240,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:08.240,000] <inf> app: Thread is normal
[00:00:08.240,000] <wrn> app: Userspace thread; aborting it.
[00:00:08.240,000] <inf> app: Starting task, stride=120...
[00:00:08.250,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:08.250,000] <dbg> app: main: Spinning in main thread...
[00:00:09.260,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:09.260,000] <dbg> app: main: Spinning in main thread...
[00:00:10.270,000] <inf> app: Stackoverflowing...
[00:00:10.270,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b40
[00:00:10.270,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ac8
[00:00:10.270,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a50
[00:00:10.270,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019d8
[00:00:10.270,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001960
[00:00:10.270,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e8
[00:00:10.270,000] <err> os: ***** MPU FAULT *****
[00:00:10.270,000] <err> os:   Stacking error (context area might be not valid)
[00:00:10.270,000] <err> os:   Floating-point lazy state preservation error
[00:00:10.280,000] <err> os: r0/a1:  0x30000623  r1/a2:  0x30001c00  r2/a3:  0x30000ab0
[00:00:10.280,000] <err> os: r3/a4:  0x30000944 r12/ip:  0x00000000 r14/lr:  0x10004095
[00:00:10.280,000] <err> os:  xpsr:  0x30000a00
[00:00:10.280,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:10.280,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:10.280,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:10.280,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:10.290,000] <err> os: fpscr:  0x00000000
[00:00:10.290,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:10.290,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:10.290,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:10.290,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:10.290,000] <inf> app: Thread is normal
[00:00:10.290,000] <wrn> app: Userspace thread; aborting it.

Hi @theotherjimmy, glad to see you have solved the issue. BTW, do you know whether cortex a/r has this issue?

Hi @povergoing I have not looked into the code in Cortex-A/R, so I can't comment on that ATM.

Hi @povergoing I have not looked into the code in Cortex-A/R, so I can't comment on that ATM.

Ok, thanks anyway

@602p Can you let us know if #44321 fixes this issue for you or not?

Sorry for the great delay!

I can't seem to reproduce this fix on my nrf52840 .

With latest zephyr I get the following:

[00:00:00.254,943] <inf> app: Starting task, stride=1...
[00:00:00.255,126] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:00.255,157] <dbg> app: main: Spinning in main thread...
[00:00:02.257,049] <dbg> app: stack_overflow: Overflowing, &arr=0x20002500
...
[00:00:02.257,324] <dbg> app: stack_overflow: Overflowing, &arr=0x200024a0
[00:00:02.257,324] <err> os: ***** MPU FAULT *****
[00:00:02.257,354] <err> os:   Stacking error (context area might be not valid)
[00:00:02.257,385] <err> os: r0/a1:  0x00000000  r1/a2:  0x00000000  r2/a3:  0x0000ce34
[00:00:02.257,415] <err> os: r3/a4:  0x00002100 r12/ip:  0x20002490 r14/lr:  0x00000000
[00:00:02.257,415] <err> os:  xpsr:  0x61000000
[00:00:02.257,446] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:02.257,476] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:02.257,507] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:02.257,507] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:02.257,537] <err> os: fpscr:  0x00000000
[00:00:02.257,537] <err> os: Faulting instruction address (r15/pc): 0x00003b81
[00:00:02.257,568] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:02.257,598] <err> os: Current thread: 0x20000000 (user_thread)
[00:00:03.308,868] <err> app: k_sys_fatal_error_handler invoked.
[00:00:03.340,423] <inf> app: Thread is normal
[00:00:03.365,203] <wrn> app: Userspace thread; aborting it.
[00:00:03.396,453] <inf> app: Starting task, stride=10...
[00:00:03.427,307] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:03.464,294] <dbg> app: main: Spinning in main thread...
[00:00:01.255,187] <dbg> app: my_entry_point: Spinning in user thread...
...
[00:00:04.495,452] <dbg> app: main: Spinning in main thread...
[00:00:05.482,574] <inf> app: Stackoverflowing...
[00:00:05.482,604] <dbg> app: stack_overflow: Overflowing, &arr=0x20002790
...
[00:00:06.625,518] <dbg> app: stack_overflow: Overflowing, &arr=0x200024a0
[00:00:06.662,658] <err> os: ***** MPU FAULT *****
[00:00:06.699,829] <err> os:   Stacking error (context area might be not valid)
[00:00:06.736,968] <err> os: r0/a1:  0x00000000  r1/a2:  0x00000000  r2/a3:  0x00000000
[00:00:06.774,139] <err> os: r3/a4:  0x00000000 r12/ip:  0x0000ce34 r14/lr:  0x00002100
[00:00:06.811,279] <err> os:  xpsr:  0x61000000
[00:00:06.848,419] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:06.903,686] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:06.940,826] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:06.996,124] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:07.051,361] <err> os: fpscr:  0x00000000
[00:00:07.088,500] <err> os: Faulting instruction address (r15/pc): 0x00003b81
[00:00:07.125,640] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:07.162,780] <err> os: Current thread: 0x20000000 (user_thread)
[00:00:07.218,078] <err> app: k_sys_fatal_error_handler invoked.
[00:00:07.255,218] <inf> app: Thread is normal
[00:00:07.292,358] <wrn> app: Userspace thread; aborting it.
[00:00:07.329,528] <inf> app: Starting task, stride=50...
[00:00:07.366,851] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:07.404,022] <dbg> app: main: Spinning in main thread...
[00:00:08.404,052] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:08.441,253] <dbg> app: main: Spinning in main thread...
[00:00:09.441,284] <inf> app: Stackoverflowing...
[00:00:09.478,454] <dbg> app: stack_overflow: Overflowing, &arr=0x20002768
...
[00:00:09.966,278] <dbg> app: stack_overflow: Overflowing, &arr=0x200024c8
[00:00:09.991,424] <err> os: ***** MPU FAULT *****
[00:00:10.029,174] <err> os:   Stacking error (context area might be not valid)
[00:00:10.066,955] <err> os: r0/a1:  0x00000000  r1/a2:  0x00000000  r2/a3:  0x00000000
[00:00:10.103,912] <err> os: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00000000
[00:00:10.135,467] <err> os:  xpsr:  0x61000000
[00:00:10.160,247] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:10.191,436] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:10.222,106] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:10.259,063] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:10.290,191] <err> os: fpscr:  0x00000000
[00:00:10.327,117] <err> os: Faulting instruction address (r15/pc): 0x00003b81
[00:00:10.358,245] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:10.383,270] <err> os: Current thread: 0x20000000 (user_thread)
[00:00:10.420,440] <err> app: k_sys_fatal_error_handler invoked.
[00:00:10.457,580] <inf> app: Thread is normal
[00:00:10.494,720] <wrn> app: Userspace thread; aborting it.
[00:00:10.531,921] <inf> app: Starting task, stride=100...
[00:00:10.569,213] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:10.606,384] <dbg> app: main: Spinning in main thread...
[00:00:11.606,414] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:11.643,615] <dbg> app: main: Spinning in main thread...
[00:00:12.643,646] <inf> app: Stackoverflowing...
[00:00:12.680,816] <dbg> app: stack_overflow: Overflowing, &arr=0x20002738
...
[00:00:12.898,895] <dbg> app: stack_overflow: Overflowing, &arr=0x200024c8
[00:00:12.942,382] <err> os: ***** HARD FAULT *****
[00:00:12.967,529] <err> os:   Fault escalation (see below)
[00:00:13.017,761] <err> os: ***** MPU FAULT *****
[00:00:13.067,993] <err> os:   Stacking error (context area might be not valid)
[00:00:13.118,225] <err> os:   Floating-point lazy state preservation error
[00:00:13.168,487] <err> os: r0/a1:  0x200023c0  r1/a2:  0x00000000  r2/a3:  0x20002450
[00:00:13.231,353] <err> os:  xpsr:  0x2100000b
--- 1 messages dropped ---
[00:00:13.269,104] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:13.306,091] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:13.337,615] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:13.362,396] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:13.393,585] <err> os: fpscr:  0x00000000
[00:00:13.424,346] <err> os: Faulting instruction address (r15/pc): 0x000037bc
[00:00:13.461,303] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:13.492,431] <err> os: Fault during interrupt handling

[00:00:13.529,357] <err> os: Current thread: 0x20000000 (user_thread)
[00:00:13.560,485] <err> app: k_sys_fatal_error_handler invoked.
[00:00:13.585,510] <inf> app: Thread is normal
[00:00:13.622,650] <wrn> app: Userspace thread; aborting it.
[00:00:13.659,790] <err> os: ***** HARD FAULT *****
[00:00:13.696,929] <err> os:   Fault escalation (see below)
[00:00:13.734,039] <err> os: ***** BUS FAULT *****
[00:00:13.771,179] <err> os:   Precise data bus error
[00:00:13.808,319] <err> os:   BFAR Address: 0xfffffffe
[00:00:13.845,458] <err> os: r0/a1:  0x200023c0  r1/a2:  0x00000000  r2/a3:  0x20002450
[00:00:13.870,910] <err> os: r3/a4:  0x00000000 r12/ip:  0x0000a8d1 r14/lr:  0xffffffed
[00:00:13.902,038] <err> os:  xpsr:  0x2100000b
[00:00:13.927,398] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:13.965,209] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:14.002,655] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:14.046,173] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:14.089,477] <err> os: fpscr:  0x00000000
[00:00:14.139,709] <err> os: Faulting instruction address (r15/pc): 0x000037bc
[00:00:14.189,971] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:14.240,203] <err> os: Fault during interrupt handling

[00:00:14.290,466] <err> os: Current thread: 0x20000000 (user_thread)
[00:00:14.315,582] <err> app: k_sys_fatal_error_handler invoked.
[00:00:14.353,332] <inf> app: Thread is dead
[00:00:14.390,991] <wrn> app: Userspace thread; aborting it.
[00:00:14.422,271] <err> os: ***** HARD FAULT *****
[00:00:14.459,228] <err> os:   Fault escalation (see below)
[00:00:14.490,783] <err> os: ***** BUS FAULT *****
[00:00:14.515,533] <err> os:   Precise data bus error
[00:00:14.546,722] <err> os:   BFAR Address: 0xfffffffe
[00:00:14.572,204] <err> os: r0/a1:  0x200023c0  r1/a2:  0x00000000  r2/a3:  0x20002450
[00:00:14.603,332] <err> os: r3/a4:  0x00000000 r12/ip:  0x0000a8d1 r14/lr:  0xffffffed
[00:00:14.628,692] <err> os:  xpsr:  0x2100000b
[00:00:14.659,271] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:14.690,063] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:14.733,581] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:14.777,099] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:14.802,215] <err> os: fpscr:  0x00000000
[00:00:14.852,478] <err> os: Faulting instruction address (r15/pc): 0x000037bc
[00:00:14.902,709] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:14.952,972] <err> os: Fault during interrupt handling

[00:00:15.003,234] <err> os: Current thread: 0x20000000 (user_thread)
[00:00:15.028,350] <err> app: k_sys_fatal_error_handler invoked.
[00:00:15.066,101] <inf> app: Thread is dead
[00:00:15.103,759] <wrn> app: Userspace thread; aborting it.
[00:00:15.135,070] <err> os: ***** HARD FAULT *****
[00:00:15.172,027] <err> os:   Fault escalation (see below)
[00:00:15.203,552] <err> os: ***** BUS FAULT *****
[00:00:15.228,179] <err> os:   Precise data bus error
[00:00:15.259,368] <err> os:   BFAR Address: 0xfffffffe
[00:00:15.284,820] <err> os: r0/a1:  0x200023c0  r1/a2:  0x00000000  r2/a3:  0x20002450
[00:00:15.315,917] <err> os: r3/a4:  0x00000000 r12/ip:  0x0000a8d1 r14/lr:  0xffffffed
[00:00:15.341,308] <err> os:  xpsr:  0x2100000b
[00:00:15.371,887] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:15.402,679] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:15.446,197] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:15.489,685] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:15.514,831] <err> os: fpscr:  0x00000000
[00:00:15.565,032] <err> os: Faulting instruction address (r15/pc): 0x000037bc
[00:00:15.615,264] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:15.665,496] <err> os: Fault during interrupt handling

zephyrproject-rtos / zephyr

ARM M4 MPU backed userspace livelocks on stack overflow when FPU enabled #42496