zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.72k stars 6.55k forks source link

ARM M4 MPU backed userspace livelocks on stack overflow when FPU enabled #42496

Closed 602p closed 2 years ago

602p commented 2 years ago

Describe the bug When using the MPU-backed userspace feature on the NRF52840 with the FPU enabled, a correctly-constructed stack overflow of the user thread can lead to an unrecoverable livelock where the overflowed thread never successfully aborts.

I am using the userspace feature to sandbox some 3rd-party code that involves a lot of parsing, etc, and want recover in the case of a crash in userspace. I have implemented a k_sys_fatal_error_handler that returns to the kernel to abort userspace threads that have crashed, and normally (oops, divide-by-zero, etc) this works fine. However, when the FPU is enabled (CONFIG_FPU=y) some instances of stack overflow instead result the system looping hard faults and never aborting the user thread. Note that the FPU doesn't need to actually be in use for this to happen. I couldn't reproduce this issue with a kernel thread and CONFIG_HW_STACK_PROTECTION.

Whether or not this occurs depends on by how much the overflowing stack frame exceeds the MPU region. It seems that small overflows behave correctly (or at least are recoverable) where large overflows lead to the livelock behavior.

The exact behavior depends also on the state of the scheduler. Having a second thread runnable at abort-time seems to change the behavior, and exposes another edge case where it double-faults from the overflowing thread but can recover (there are still overflow amounts where it livelocks in this case.)

It's possible to detect this livelock in k_sys_fatal_error_handler by seeing if k_current_get()->base.thread_state is _THREAD_ABORTING or _THREAD_DEAD, and in that case halt the system.

To Reproduce Clone https://github.com/602p/zephyr-livelock-repro and run on an NRF52840DK. I'd suspect it would happen on other M4 devices too.

Uncomment #define ENABLE_MITIGATION in main.c to detect and halt on livelock condition. Uncomment #define MODE_JOIN to cause the main thread to join on the user thread instead of polling; changes exact livelock behaviour. See char to_try[] = { for the exact different cases of failure depending on the amount by which the user thread overflows it's stack.

Expected behavior All amounts of overflow should be recoverable or, at minimum, should bugcheck the kernel instead of livelocking.

Impact Mitigable, but still problematic because it changes the security model of using userspace to sandbox code since it could still crash the whole device.

Logs and console output See https://github.com/602p/zephyr-livelock-repro/blob/main/log.txt . The script tries overflowing a user task by increasingly large stack frames. The first few recover normally, the last gets livelocked.

Environment (please complete the following information):

Cause conjecture Stack overflow is handled at zephyr/arch/arm/core/aarch32/cortex_m/fault.c:325, which resets the stack pointer to a valid address before returning to that task with PendSv set to execute context switching from the user task. Is Zephyr forgetting to also reset some part of the [lazy] FP save/restore info (Floating Point Context Address Register?) and that remains invalid, triggering another MPU fault while trying to ret into the user task to swap it out?

mnkp commented 2 years ago

I have run the sample code on sam_e70_xplained board and can confirm that the problem can be observed also on SAM E70 SoC (Cortex-M7). The exact behavior in regards to how likely the problem is to manifest itself

Whether or not this occurs depends on by how much the overflowing stack frame exceeds the MPU region. It seems that small overflows behave correctly (or at least are recoverable) where large overflows lead to the livelock behavior.

is different. In case of SAM E70 device it is much easier to trigger the invalid behavior. However, the failure mode is exactly the same.

I have analyzed the issue a bit and it does not look good. It seems we have a larger problem in handling stack overflow condition in user mode on ARM platform. As issue's author has written

Stack overflow is handled at zephyr/arch/arm/core/aarch32/cortex_m/fault.c:325, which resets the stack pointer to a valid address before returning to that task with PendSv set to execute context switching from the user task.

That's the line of code and relevant excerpt from the comments

    * [...] Therefore,
    * we manually force the stack pointer to the
    * lowest allowed position, inside the thread's
    * stack.
    *
    * Note:
    * [...]
    * The manual adjustment of PSP is safe, as we
    * will not be re-scheduling this thread again
    * for execution; thread stack corruption is a
    * fatal error and a thread that corrupted its
    * stack needs to be aborted.
    */
__set_PSP(min_stack_ptr);

The stack is reset to the lowest possible stack address, i.e. stack full. That works well unless the stack overflow is triggered while entering SVCall exception. SVCall has higher priority than PendSV exception. Unlike the comment suggests manual adjustment of PSP is not safe in this case since after PSP is adjusted and MemManage exception exits the processor will continue executing SVCall handler. Relevant excerpt from z_arm_svc handler in arch/arm/core/aarch32/swap_helper.S:

SECTION_FUNC(TEXT, z_arm_svc)
    tst lr, #_EXC_RETURN_SPSEL_Msk /* did we come from thread mode ? */
    ite eq  /* if zero (equal), came from handler mode */
        mrseq r0, MSP   /* handler mode, stack frame is on MSP */
        mrsne r0, PSP   /* thread mode, stack frame is on PSP */

    /* Figure out what SVC call number was invoked */

    ldr r1, [r0, #24]   /* grab address of PC from stack frame */
    /* SVC is a two-byte instruction, point to it and read the
     * SVC number (lower byte of SCV instruction)
     */
    ldrb r1, [r1, #-2]

The hard fault seen in the logs is triggered by the last line of the above code. We've entered from the thread mode so the routine is using PSP to retrieve SVC number from the corrupted stack. Again, the stack is corrupted since upon SVCall exception entry the processor tried to save the current state (exception frame) on the PSP (application stack). This failure mode happens always when the stack overflow is caused by SVC exception entry. It was observed when FPU is enabled since exception frame with floating-point storage is much larger (26 4-byte words) than exception frame without floating-point storage (8 4-byte words). Additionally the exception frame with floating-point storage is mostly 0s. That's relevant since when executing

    ldrb r1, [r1, #-2]

the value of r1 register is taken from the stack. If the value of r1 is 0 we get a hard fault. It is a "precise data bus error" which means that upon exception exit we are going to execute exactly the same line of code triggering the same hard fault. Thus the locking behavior. If the value of r1 - 2 happens to point to the location that processor can read then it will retrieve some bogus SVC call number. This will either cause kernel oops if the SVC call number is detected as invalid or we are going to execute some random service call.

Looking at the sample application code

void stack_overflow(int stride) {
    char arr[stride];
    LOG_DBG("Overflowing, &arr=%p", &arr);
    stack_overflow(stride);
}

The invalid behavior was observed when calling

    LOG_DBG("Overflowing, &arr=%p", &arr);

and not when calling stack_overflow(stride);. Behind the scenes LOG_DBG macro is executing SVC instruction.

Since SVCall exception cannot be executed due to corrupted stack MemManage fault handler needs to clear SVCALLPENDED bit in SHCSR register. E.g. by executing

    if (SCB->SHCSR & SCB_SHCSR_SVCALLPENDED_Msk) {
        SCB->SHCSR &= ~SCB_SHCSR_SVCALLPENDED_Msk;
    }

When working on the fix we should take into account that:

microbuilder commented 2 years ago

@theotherjimmy Can you try to reproduce this?

theotherjimmy commented 2 years ago

I have reproduced this on an mps3_an547

theotherjimmy commented 2 years ago

This looks a bit worse than I had anticipated: It affects not just SVC, but all irqs with a higher priority than PendSV.

In particular, the _isr_wrapper touches/pushes to the process stack here: https://github.com/zephyrproject-rtos/zephyr/blob/main/arch/arm/core/aarch32/isr_wrapper.S#L80-L89 which could cause a stack overflow and mem manage fault. I think the only practical solution in the face of this issue, is to force the context switch before entering anything of lower priority than the mem manage fault to have a working stack.

Edit: I thought github would be better and show the snippet inline. It's below


    /*
     * Save away r0-r3, r12 and lr_irq for the previous context to the
     * process stack since they are clobbered here.  Also, save away lr
     * and spsr_irq since we may swap processes and return to a different
     * thread.
     */
    sub lr, lr, #4
    srsdb #MODE_SYS!
    cps #MODE_SYS
    push {r0-r3, r12, lr}

Edit2: So as far as I can tell, I read this assembly incorrectly and this does not push to the process stack.

microbuilder commented 2 years ago

@carlescufi @MaureenHelm I think we should discuss this on the dev review call. Flagging it for review/discussion, and @theotherjimmy and myself can attend.

mnkp commented 2 years ago

In particular, the _isr_wrapper touches/pushes to the process stack here: https://github.com/zephyrproject-rtos/zephyr/blob/main/arch/arm/core/aarch32/isr_wrapper.S#L80-L89 which could cause a stack overflow and mem manage fault.

Just to make it explicit: the code mentioned above is executed on Cortex-R, Cortex-A architectures only (which we also need to review) and not on Cortex-M devices the issue was originally observed on.

theotherjimmy commented 2 years ago

I'm not so sure. I think that's added with the config CONFIG_GEN_SW_ISR_TABLE (here: https://github.com/zephyrproject-rtos/zephyr/blob/main/arch/arm/core/aarch32/CMakeLists.txt#L17)

zephyr_library_sources_ifdef(CONFIG_GEN_SW_ISR_TABLE isr_wrapper.S)

The same file switches between Cortex A/R and Cortex M later on (here: https://github.com/zephyrproject-rtos/zephyr/blob/main/arch/arm/core/aarch32/CMakeLists.txt#L24 and here: https://github.com/zephyrproject-rtos/zephyr/blob/main/arch/arm/core/aarch32/CMakeLists.txt#L32-L33

add_subdirectory_ifdef(CONFIG_CPU_CORTEX_M cortex_m)
...
add_subdirectory_ifdef(CONFIG_CPU_AARCH32_CORTEX_R cortex_a_r)
add_subdirectory_ifdef(CONFIG_CPU_AARCH32_CORTEX_A cortex_a_r)

Further, the Kconfig in the cortex_m subdir has the following:

config GEN_ISR_TABLES
    default y

I think that file applies.

theotherjimmy commented 2 years ago

It seems that this is related to k_sys_fatal_error_handler, as documented (emphasis mine) :

The policy for what to do when encountering a fatal error is determined by the implementation of the k_sys_fatal_error_handler() function. This function has a default implementation with weak linkage that calls LOG_PANIC() to dump all pending logging messages and then unconditionally halts the system with k_fatal_halt().

Applications are free to implement their own error handling policy by overriding the implementation of k_sys_fatal_error_handler(). If the implementation returns, the faulting thread will be aborted and the system will otherwise continue to function. See the documentation for this function for additional details and constraints

It seems to me like we might need an additional check before blindly trusting that the user implementation of k_sys_fatal_error_handler() that we can actually abort the thread as promised.

The check would be something like:

                /* If we had a memory manage fault during another exception or
                 * interrupt, other than PendSV, we will return to that handler
                 * before PendSV with a corrupt user stack pointer. We can't be
                 * sure that the handler will not attempt something such as read
                 * the instruction pointed to by the pc in the exception frame,
                 * which is now corrupt, like the SVCall handler does.
                 */
                if ((esf->basic.xpsr & IPSR_ISR_Msk) != 14
                     && (esf->basic.xpsr & IPSR_ISR_Msk) != 0x0) {
                    k_fatal_halt(reason);
                }

This is somewhat unsatisfying. However, this satisfies the minimum in the OP:

at minimum, should bugcheck the kernel instead of livelocking.

So I'll probably submit a PR that does this, and work on not halting on this problem.

602p commented 2 years ago

It seems to me like we might need an additional check before blindly trusting that the user implementation of k_sys_fatal_error_handler() that we can actually abort the thread as promised.

We should be careful about introducing a check like this since it seems to violate the behavior described viz. "If this function returns, then the currently executing thread will be aborted." If returning from k_sys_fatal_error_handler() is going to sometimes result directly in a halt, that's something that probably needs to be exposed to the user beforehand so they can do whatever hardware shutdown actions, make a policy decision about rebooting, etc.

That having been said, the check proposed here is a lot cleaner than detecting a double fault and should avoid the "execute some random service call" case @mnkp noted. So maybe it should be merged for the meantime with a warning somewhere about "on ARM returning may lock the CPU; check esf->basic.xpsr if you have actions you need to preform before halt."

602p commented 2 years ago

Actually, probably all that needs to happen is that the check calls k_panic instead of k_fatal_halt; then it should reenter k_sys_fatal_error_handler with reason == K_ERR_KERNEL_PANIC and the user code should then know it has to do whatever shutdown handling.

That having been said, it'd still be good to get the underlying issue fixed so that these kinds of user task crashes are recoverable.

theotherjimmy commented 2 years ago

It looks like they are (warning big log):

[00:00:04.150,000] <inf> app: Starting task, stride=50...
[00:00:04.150,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:04.150,000] <dbg> app: main: Spinning in main thread...
[00:00:05.160,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:05.160,000] <dbg> app: main: Spinning in main thread...
[00:00:06.170,000] <inf> app: Stackoverflowing...
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b80
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b48
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b10
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ad8
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001aa0
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a68
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a30
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019f8
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019c0
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001988
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001950
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001918
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e0
[00:00:06.180,000] <err> os: ***** MPU FAULT *****
[00:00:06.180,000] <err> os:   Stacking error (context area might be not valid)
[00:00:06.180,000] <err> os: r0/a1:  0x30001bb8  r1/a2:  0x1000c5d4  r2/a3:  0x00000000
[00:00:06.180,000] <err> os: r3/a4:  0x10003bad r12/ip:  0x30001820 r14/lr:  0xffffffed
[00:00:06.190,000] <err> os:  xpsr:  0x00000000
[00:00:06.190,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:06.190,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:06.190,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:06.190,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:06.190,000] <err> os: fpscr:  0x00040000
[00:00:06.190,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:06.190,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:06.200,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:06.200,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:06.200,000] <inf> app: Thread is normal
[00:00:06.200,000] <wrn> app: Userspace thread; aborting it.
[00:00:06.200,000] <inf> app: Starting task, stride=100...
[00:00:06.200,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:06.200,000] <dbg> app: main: Spinning in main thread...
[00:00:07.210,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:07.210,000] <dbg> app: main: Spinning in main thread...
[00:00:08.220,000] <inf> app: Stackoverflowing...
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b50
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ae8
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a80
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a18
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019b0
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001948
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e0
[00:00:08.220,000] <err> os: ***** MPU FAULT *****
[00:00:08.220,000] <err> os:   Stacking error (context area might be not valid)
[00:00:08.230,000] <err> os:   Floating-point lazy state preservation error
[00:00:08.230,000] <err> os: r0/a1:  0x30000ab0  r1/a2:  0x30000944  r2/a3:  0x00000000
[00:00:08.230,000] <err> os: r3/a4:  0x10004089 r12/ip:  0x00000000 r14/lr:  0x30000af0
[00:00:08.230,000] <err> os:  xpsr:  0x10007c00
[00:00:08.230,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:08.230,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:08.230,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:08.240,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:08.240,000] <err> os: fpscr:  0x00000000
[00:00:08.240,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:08.240,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:08.240,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:08.240,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:08.240,000] <inf> app: Thread is normal
[00:00:08.240,000] <wrn> app: Userspace thread; aborting it.
[00:00:08.240,000] <err> os: ***** MPU FAULT *****
[00:00:08.240,000] <err> os: r0/a1:  0x00000000  r1/a2:  0x00000000  r2/a3:  0x00000000
[00:00:08.250,000] <err> os: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00000000
[00:00:08.250,000] <err> os:  xpsr:  0x00000000
[00:00:08.250,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:08.250,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0xffffffff
[00:00:08.250,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:08.250,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:08.250,000] <err> os: fpscr:  0x00000000
[00:00:08.260,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:08.260,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:08.260,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:08.260,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:08.260,000] <inf> app: Thread is dead
[00:00:08.260,000] <wrn> app: Userspace thread; aborting it.
[00:00:08.260,000] <inf> app: Starting task, stride=120...
[00:00:08.260,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:08.260,000] <dbg> app: main: Spinning in main thread...
[00:00:09.270,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:09.270,000] <dbg> app: main: Spinning in main thread...
[00:00:10.280,000] <inf> app: Stackoverflowing...
[00:00:10.280,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b40
[00:00:10.280,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ac8
[00:00:10.280,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a50
[00:00:10.280,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019d8
[00:00:10.280,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001960
[00:00:10.280,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e8
[00:00:10.280,000] <err> os: ***** MPU FAULT *****
[00:00:10.280,000] <err> os:   Stacking error (context area might be not valid)
[00:00:10.280,000] <err> os:   Floating-point lazy state preservation error
[00:00:10.290,000] <err> os: r0/a1:  0x30000623  r1/a2:  0x30001c00  r2/a3:  0x30000ab0
[00:00:10.290,000] <err> os: r3/a4:  0x30000944 r12/ip:  0x00000000 r14/lr:  0x10004089
[00:00:10.290,000] <err> os:  xpsr:  0x30000a00
[00:00:10.290,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:10.290,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:10.290,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:10.290,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:10.300,000] <err> os: fpscr:  0x00000000
[00:00:10.300,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:10.300,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:10.300,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:10.300,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:10.300,000] <inf> app: Thread is normal
[00:00:10.300,000] <wrn> app: Userspace thread; aborting it.
[00:00:10.300,000] <err> os: ***** MPU FAULT *****
[00:00:10.300,000] <err> os: r0/a1:  0x00000000  r1/a2:  0x00000000  r2/a3:  0x00000000
[00:00:10.300,000] <err> os: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00000000
[00:00:10.310,000] <err> os:  xpsr:  0x00002000
[00:00:10.310,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:10.310,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:10.310,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:10.310,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:10.310,000] <err> os: fpscr:  0x00000000
[00:00:10.310,000] <err> os: Faulting instruction address (r15/pc): 0x3000185c
[00:00:10.320,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:10.320,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:10.320,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:10.320,000] <inf> app: Thread is dead
[00:00:10.320,000] <wrn> app: Userspace thread; aborting it.

The only difference is that I'm clearing the pending SVCall during the thread abort when the current thread == the thread to abort:

diff --git a/arch/arm/core/aarch32/cortex_m/thread_abort.c b/arch/arm/core/aarch32/cortex_m/thread_abort.c
index 028a62fed8..d9b563ba42 100644
--- a/arch/arm/core/aarch32/cortex_m/thread_abort.c
+++ b/arch/arm/core/aarch32/cortex_m/thread_abort.c
@@ -40,6 +40,12 @@ void z_impl_k_thread_abort(k_tid_t thread)
                         * is not an implicit scheduler invocation.
                         */
                        SCB->ICSR |= SCB_ICSR_PENDSVSET_Msk;
+                       /* Clear any system calls that may be pending
+                        * as they have a higher priority than the PendSV
+                        * handler and will check the stack of the thread
+                        * being aborted.
+                        */
+                       SCB->SHCSR &= ~SCB_SHCSR_SVCALLPENDED_Msk;
                }
        }
theotherjimmy commented 2 years ago

@602p That check actually halts before the call to k_sys_fatal_error_handler(), so we could reasonably call k_panic there as you suggested and the user would only see the panic.

theotherjimmy commented 2 years ago

The above log is exactly the same for both MODE_JOIN and !MODE_JOIN, leading me to think that this might actually be solvable by clearing the SVCall pending flag as I wrote.

theotherjimmy commented 2 years ago

I wonder if we should still have the check in case we manage to have an (vanishingly small chance) interrupt that's stacking causes this problem. We could clear all interrupts, but that seems... bad.

602p commented 2 years ago

That's definitely promising! But it still seems something is not totally right that we see double faulting and

[00:00:10.320,000] <inf> app: Thread is dead

on some strides and not others. I'm not familiar enough with the internals to know if this is a problem, but it seems like it might confuse user code if nothing else to call k_sys_fatal_error_handler for the same task twice.

theotherjimmy commented 2 years ago

I'll poke around with gdb to figure out what's going on there.

602p commented 2 years ago

I wonder if we should still have the check in case we manage to have an (vanishingly small chance) interrupt that's stacking causes this problem. We could clear all interrupts, but that seems... bad.

May be wise, but introduces an unfortunate random lockup case. @mnkp noted that SysTick may also fire/stack during handling of a MemManage.

theotherjimmy commented 2 years ago

Ha! I caught a systick after one of these. No problem, it seems to not mess with the user stack.

z_arm_fatal_error (reason=reason@entry=2, esf=esf@entry=0x30001790 <z_interrupt_stacks+1904>) at /home/jimbri01/src/c/zephyr/zephyr/arch/arm/core/aarch32/fatal.c:63
63              z_fatal_error(reason, esf);
(gdb) n
z_arm_usage_fault () at /home/jimbri01/src/c/zephyr/zephyr/arch/arm/core/aarch32/cortex_m/fault_s.S:111
111             pop {r0, pc}
(gdb) bt
#0  z_arm_usage_fault ()
    at /home/jimbri01/src/c/zephyr/zephyr/arch/arm/core/aarch32/cortex_m/fault_s.S:111
#1  <signal handler called>
#2  0x00000000 in ?? ()
(gdb) si
sys_clock_isr (arg=0x0) at /home/jimbri01/src/c/zephyr/zephyr/drivers/timer/cortex_m_systick.c:123
123             elapsed();
(gdb) bt
#0  sys_clock_isr (arg=0x0)
    at /home/jimbri01/src/c/zephyr/zephyr/drivers/timer/cortex_m_systick.c:123
#1  <signal handler called>
#2  0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
theotherjimmy commented 2 years ago

The double-exception is a bit wonky, and is related to Lazy FP stacking (I think).

So I set up a display for the value of the MEMFAULTPEND of the System Control Block's System Handler Control and State Register, and held down enter in the offending exception until it went high:

mpsc_pbuf_alloc (buffer=buffer@entry=0x300008ac <log_buffer>, wlen=wlen@entry=6, timeout=...) at /home/jimbri01/src/c/zephyr/zephyr/lib/os/mpsc_pbuf.c:244
244             if (wlen > (buffer->size - 1)) {
1: *0xe000ed24 >> 13 & 1 = 0
(gdb)
0x100097a4      244             if (wlen > (buffer->size - 1)) {
1: *0xe000ed24 >> 13 & 1 = 0
(gdb)
0x100097a8      244             if (wlen > (buffer->size - 1)) {
1: *0xe000ed24 >> 13 & 1 = 1

Well! What instruction is that?

Dump of assembler code for function mpsc_pbuf_alloc:
   0x100097a0 <+0>:     stmdb   sp!, {r4, r5, r6, r7, r8, r9, r10, r11, lr}
=> 0x100097a4 <+4>:     vpush   {d8}

Ah, it's a floating point push. Considering that the rest of the function does not use floating point registers, I wonder why it's there? Anyway, the Floating Point Context Control register indicates that the Stack Pointer limit is violated by lazy stacking:

(gdb) p/t *0xe000ef34
$29 = 11000000000000000000010011111111
(gdb) si
0x100097a8      244             if (wlen > (buffer->size - 1)) {
1: *0xe000ed24 >> 13 & 1 = 1
(gdb) p/t *0xe000ef34
$30 = 11000000000000000000010011111110

I think this is probably harmless, as the arm arm says it's ignored in this situation, but still I'll try to find a way to not have it happen.

Edit: Read register incorrectly.

theotherjimmy commented 2 years ago

Amusingly, this all happens in the mem_manage_fault, on the first print.

Breakpoint 2, 0x100097a4 in mpsc_pbuf_alloc (buffer=buffer@entry=0x300008ac <log_buffer>, wlen=wlen@entry=6, timeout=...) at /home/jimbri01/src/c/zephyr/zephyr/lib/os/mpsc_pbuf.c:244
244             if (wlen > (buffer->size - 1)) {
(gdb) display *0xe000ed24 >> 13 & 1
1: *0xe000ed24 >> 13 & 1 = 0
(gdb) si
0x100097a8      244             if (wlen > (buffer->size - 1)) {
1: *0xe000ed24 >> 13 & 1 = 1
(gdb) bt
#0  0x100097a8 in mpsc_pbuf_alloc (buffer=buffer@entry=0x300008ac <log_buffer>, wlen=wlen@entry=6,
    timeout=...) at /home/jimbri01/src/c/zephyr/zephyr/lib/os/mpsc_pbuf.c:244
#1  0x10001fa8 in z_log_msg2_alloc (wlen=wlen@entry=6)
    at /home/jimbri01/src/c/zephyr/zephyr/subsys/logging/log_core.c:1216
#2  0x10009dd6 in z_impl_z_log_msg2_static_create (source=0x1000bbdc <log_const_os>, desc=...,
    package=package@entry=0x30001754 <z_interrupt_stacks+1844> "\002", data=data@entry=0x0)
    at /home/jimbri01/src/c/zephyr/zephyr/subsys/logging/log_msg2.c:36
#3  0x10009e60 in z_log_msg2_static_create (source=<optimized out>, desc=..., desc@entry=...,
    package=package@entry=0x30001754 <z_interrupt_stacks+1844> "\002", data=0x0)
    at /home/jimbri01/src/c/zephyr/build/repro/livelock/zephyr/include/generated/syscalls/log_msg2.h:37
#4  0x100037a6 in mem_manage_fault (esf=esf@entry=0x300017d8 <z_interrupt_stacks+1976>,
    from_hard_fault=0, recoverable=0x3000178b <z_interrupt_stacks+1899>)
    at /home/jimbri01/src/c/zephyr/zephyr/arch/arm/core/aarch32/cortex_m/fault.c:222

So the first exception prints that it's a floating point lazy stacking problem, and the second one prints no reason. That part should be simple enough to fix, by checking for that condition and clearing the pending when it happens:

diff --git a/arch/arm/core/aarch32/cortex_m/fault.c b/arch/arm/core/aarch32/cortex_m/fault.c
index 15d1006941..c53439b059 100644
--- a/arch/arm/core/aarch32/cortex_m/fault.c
+++ b/arch/arm/core/aarch32/cortex_m/fault.c
@@ -338,6 +338,17 @@ static uint32_t mem_manage_fault(z_arch_esf_t *esf, int from_hard_fault,
 #endif /* CONFIG_MPU_STACK_GUARD || CONFIG_USERSPACE */
        }

+       /* When we were handling this fault, we may have triggered a fp
+        * lazy stacking Memory Manage fault. At the time of writing, this
+        * can happen when printing.  If that's true, we should clear the
+        * pending flag in addition to the clearing the reason for the fault
+        */
+#if defined(CONFIG_ARMV7_M_ARMV8_M_FP)
+       if ((SCB->CFSR & SCB_CFSR_MLSPERR_Msk) != 0) {
+               SCB->SHCSR &= ~SCB_SHCSR_MEMFAULTPENDED_Msk;
+       }
+#endif /* CONFIG_ARMV7_M_ARMV8_M_FP */
+

With both patches, the following log is produced:

../tf-m/qemu-arm/build/qemu-system-arm -s -cpu cortex-m55 -machine mps3-an547 -nographic -vga none -net none -pidfile qemu.pid  -chardev stdio,id=con,mux=on -d guest_errors,unimp -serial chardev:con -mon chardev=con,mode=readline -icount shift=6,align=off,sleep=off -rtc clock=vm -device loader,file=build/repro/livelock/zephyr/zephyr.elf
Invalid read at addr 0x0, size 4, region '(null)', reason: rejected
Invalid read at addr 0x4, size 4, region '(null)', reason: rejected
qemu-system-arm: warning: nic lan9118.0 has no peer
*** Booting Zephyr OS build zephyr-v3.0.0-945-g1070d3041059  ***
[00:00:00.000,000] <inf> app: Starting task, stride=1...
[00:00:00.000,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:00.000,000] <dbg> app: main: Spinning in main thread...
[00:00:01.010,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:01.010,000] <dbg> app: main: Spinning in main thread...
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001938
--- 80 messages dropped ---
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001930
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001928
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001920
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001918
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001910
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001908
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001900
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018f8
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018f0
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e8
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e0
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018d8
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018d0
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018c8
[00:00:02.020,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018c0
[00:00:02.020,000] <err> os: ***** MPU FAULT *****
[00:00:02.020,000] <err> os:   Stacking error (context area might be not valid)
[00:00:02.020,000] <err> os: r0/a1:  0x30001820  r1/a2:  0xffffffed  r2/a3:  0x1000bbc0
[00:00:02.020,000] <err> os: r3/a4:  0x00002100 r12/ip:  0x300018ac r14/lr:  0x00000000
[00:00:02.020,000] <err> os:  xpsr:  0x10008c00
[00:00:02.020,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:02.020,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:02.020,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:02.020,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:02.020,000] <err> os: fpscr:  0x00040000
[00:00:02.020,000] <err> os: Faulting instruction address (r15/pc): 0x10008d11
[00:00:02.020,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:02.020,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:02.050,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:02.050,000] <inf> app: Thread is normal
[00:00:02.050,000] <wrn> app: Userspace thread; aborting it.
[00:00:02.060,000] <inf> app: Starting task, stride=10...
[00:00:02.060,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:02.060,000] <dbg> app: main: Spinning in main thread...
[00:00:03.070,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:03.070,000] <dbg> app: main: Spinning in main thread...
[00:00:04.080,000] <inf> app: Stackoverflowing...
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ba8
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b98
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b88
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b78
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b68
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b58
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b48
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b38
[00:00:04.080,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b28
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b18
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b08
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001af8
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ae8
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ad8
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ac8
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ab8
[00:00:04.090,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001aa8
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a98
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a88
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a78
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a68
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a58
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a48
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a38
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a28
[00:00:04.100,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a18
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a08
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019f8
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019e8
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019d8
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019c8
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019b8
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019a8
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001998
[00:00:04.110,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001988
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001978
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001968
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001958
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001948
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001938
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001928
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001918
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001908
[00:00:04.120,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018f8
[00:00:04.130,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e8
[00:00:04.130,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018d8
[00:00:04.130,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018c8
[00:00:04.130,000] <err> os: ***** MPU FAULT *****
[00:00:04.130,000] <err> os:   Stacking error (context area might be not valid)
[00:00:04.130,000] <err> os: r0/a1:  0x30001820  r1/a2:  0xffffffed  r2/a3:  0x00000000
[00:00:04.130,000] <err> os: r3/a4:  0x00000000 r12/ip:  0x1000bbc0 r14/lr:  0x00002100
[00:00:04.130,000] <err> os:  xpsr:  0x00000000
[00:00:04.130,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:04.140,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:04.140,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:04.140,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:04.140,000] <err> os: fpscr:  0x00040000
[00:00:04.140,000] <err> os: Faulting instruction address (r15/pc): 0x300018b4
[00:00:04.140,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:04.140,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:04.150,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:04.150,000] <inf> app: Thread is normal
[00:00:04.150,000] <wrn> app: Userspace thread; aborting it.
[00:00:04.150,000] <inf> app: Starting task, stride=50...
[00:00:04.150,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:04.150,000] <dbg> app: main: Spinning in main thread...
[00:00:05.160,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:05.160,000] <dbg> app: main: Spinning in main thread...
[00:00:06.170,000] <inf> app: Stackoverflowing...
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b80
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b48
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b10
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ad8
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001aa0
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a68
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a30
[00:00:06.170,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019f8
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019c0
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001988
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001950
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001918
[00:00:06.180,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e0
[00:00:06.180,000] <err> os: ***** MPU FAULT *****
[00:00:06.180,000] <err> os:   Stacking error (context area might be not valid)
[00:00:06.180,000] <err> os: r0/a1:  0x30001bb8  r1/a2:  0x1000c5e0  r2/a3:  0x00000000
[00:00:06.180,000] <err> os: r3/a4:  0x10003bb9 r12/ip:  0x30001820 r14/lr:  0xffffffed
[00:00:06.190,000] <err> os:  xpsr:  0x00000000
[00:00:06.190,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:06.190,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:06.190,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:06.190,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:06.190,000] <err> os: fpscr:  0x00040000
[00:00:06.190,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:06.190,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:06.200,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:06.200,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:06.200,000] <inf> app: Thread is normal
[00:00:06.200,000] <wrn> app: Userspace thread; aborting it.
[00:00:06.200,000] <inf> app: Starting task, stride=100...
[00:00:06.200,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:06.200,000] <dbg> app: main: Spinning in main thread...
[00:00:07.210,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:07.210,000] <dbg> app: main: Spinning in main thread...
[00:00:08.220,000] <inf> app: Stackoverflowing...
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b50
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ae8
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a80
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a18
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019b0
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001948
[00:00:08.220,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e0
[00:00:08.220,000] <err> os: ***** MPU FAULT *****
[00:00:08.220,000] <err> os:   Stacking error (context area might be not valid)
[00:00:08.230,000] <err> os:   Floating-point lazy state preservation error
[00:00:08.230,000] <err> os: r0/a1:  0x30000ab0  r1/a2:  0x30000944  r2/a3:  0x00000000
[00:00:08.230,000] <err> os: r3/a4:  0x10004095 r12/ip:  0x00000000 r14/lr:  0x30000af0
[00:00:08.230,000] <err> os:  xpsr:  0x10007c00
[00:00:08.230,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:08.230,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:08.230,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:08.240,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:08.240,000] <err> os: fpscr:  0x00000000
[00:00:08.240,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:08.240,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:08.240,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:08.240,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:08.240,000] <inf> app: Thread is normal
[00:00:08.240,000] <wrn> app: Userspace thread; aborting it.
[00:00:08.240,000] <inf> app: Starting task, stride=120...
[00:00:08.250,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:08.250,000] <dbg> app: main: Spinning in main thread...
[00:00:09.260,000] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:09.260,000] <dbg> app: main: Spinning in main thread...
[00:00:10.270,000] <inf> app: Stackoverflowing...
[00:00:10.270,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001b40
[00:00:10.270,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001ac8
[00:00:10.270,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001a50
[00:00:10.270,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300019d8
[00:00:10.270,000] <dbg> app: stack_overflow: Overflowing, &arr=0x30001960
[00:00:10.270,000] <dbg> app: stack_overflow: Overflowing, &arr=0x300018e8
[00:00:10.270,000] <err> os: ***** MPU FAULT *****
[00:00:10.270,000] <err> os:   Stacking error (context area might be not valid)
[00:00:10.270,000] <err> os:   Floating-point lazy state preservation error
[00:00:10.280,000] <err> os: r0/a1:  0x30000623  r1/a2:  0x30001c00  r2/a3:  0x30000ab0
[00:00:10.280,000] <err> os: r3/a4:  0x30000944 r12/ip:  0x00000000 r14/lr:  0x10004095
[00:00:10.280,000] <err> os:  xpsr:  0x30000a00
[00:00:10.280,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:10.280,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:10.280,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:10.280,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:10.290,000] <err> os: fpscr:  0x00000000
[00:00:10.290,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:10.290,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:10.290,000] <err> os: Current thread: 0x30000000 (user_thread)
[00:00:10.290,000] <err> app: k_sys_fatal_error_handler invoked.
[00:00:10.290,000] <inf> app: Thread is normal
[00:00:10.290,000] <wrn> app: Userspace thread; aborting it.
povergoing commented 2 years ago

Hi @theotherjimmy, glad to see you have solved the issue. BTW, do you know whether cortex a/r has this issue?

theotherjimmy commented 2 years ago

Hi @povergoing I have not looked into the code in Cortex-A/R, so I can't comment on that ATM.

povergoing commented 2 years ago

Hi @povergoing I have not looked into the code in Cortex-A/R, so I can't comment on that ATM.

Ok, thanks anyway

microbuilder commented 2 years ago

@602p Can you let us know if #44321 fixes this issue for you or not?

602p commented 2 years ago

Sorry for the great delay!

I can't seem to reproduce this fix on my nrf52840 .

With latest zephyr I get the following:

[00:00:00.254,943] <inf> app: Starting task, stride=1...
[00:00:00.255,126] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:00.255,157] <dbg> app: main: Spinning in main thread...
[00:00:02.257,049] <dbg> app: stack_overflow: Overflowing, &arr=0x20002500
...
[00:00:02.257,324] <dbg> app: stack_overflow: Overflowing, &arr=0x200024a0
[00:00:02.257,324] <err> os: ***** MPU FAULT *****
[00:00:02.257,354] <err> os:   Stacking error (context area might be not valid)
[00:00:02.257,385] <err> os: r0/a1:  0x00000000  r1/a2:  0x00000000  r2/a3:  0x0000ce34
[00:00:02.257,415] <err> os: r3/a4:  0x00002100 r12/ip:  0x20002490 r14/lr:  0x00000000
[00:00:02.257,415] <err> os:  xpsr:  0x61000000
[00:00:02.257,446] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:02.257,476] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:02.257,507] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:02.257,507] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:02.257,537] <err> os: fpscr:  0x00000000
[00:00:02.257,537] <err> os: Faulting instruction address (r15/pc): 0x00003b81
[00:00:02.257,568] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:02.257,598] <err> os: Current thread: 0x20000000 (user_thread)
[00:00:03.308,868] <err> app: k_sys_fatal_error_handler invoked.
[00:00:03.340,423] <inf> app: Thread is normal
[00:00:03.365,203] <wrn> app: Userspace thread; aborting it.
[00:00:03.396,453] <inf> app: Starting task, stride=10...
[00:00:03.427,307] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:03.464,294] <dbg> app: main: Spinning in main thread...
[00:00:01.255,187] <dbg> app: my_entry_point: Spinning in user thread...
...
[00:00:04.495,452] <dbg> app: main: Spinning in main thread...
[00:00:05.482,574] <inf> app: Stackoverflowing...
[00:00:05.482,604] <dbg> app: stack_overflow: Overflowing, &arr=0x20002790
...
[00:00:06.625,518] <dbg> app: stack_overflow: Overflowing, &arr=0x200024a0
[00:00:06.662,658] <err> os: ***** MPU FAULT *****
[00:00:06.699,829] <err> os:   Stacking error (context area might be not valid)
[00:00:06.736,968] <err> os: r0/a1:  0x00000000  r1/a2:  0x00000000  r2/a3:  0x00000000
[00:00:06.774,139] <err> os: r3/a4:  0x00000000 r12/ip:  0x0000ce34 r14/lr:  0x00002100
[00:00:06.811,279] <err> os:  xpsr:  0x61000000
[00:00:06.848,419] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:06.903,686] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:06.940,826] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:06.996,124] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:07.051,361] <err> os: fpscr:  0x00000000
[00:00:07.088,500] <err> os: Faulting instruction address (r15/pc): 0x00003b81
[00:00:07.125,640] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:07.162,780] <err> os: Current thread: 0x20000000 (user_thread)
[00:00:07.218,078] <err> app: k_sys_fatal_error_handler invoked.
[00:00:07.255,218] <inf> app: Thread is normal
[00:00:07.292,358] <wrn> app: Userspace thread; aborting it.
[00:00:07.329,528] <inf> app: Starting task, stride=50...
[00:00:07.366,851] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:07.404,022] <dbg> app: main: Spinning in main thread...
[00:00:08.404,052] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:08.441,253] <dbg> app: main: Spinning in main thread...
[00:00:09.441,284] <inf> app: Stackoverflowing...
[00:00:09.478,454] <dbg> app: stack_overflow: Overflowing, &arr=0x20002768
...
[00:00:09.966,278] <dbg> app: stack_overflow: Overflowing, &arr=0x200024c8
[00:00:09.991,424] <err> os: ***** MPU FAULT *****
[00:00:10.029,174] <err> os:   Stacking error (context area might be not valid)
[00:00:10.066,955] <err> os: r0/a1:  0x00000000  r1/a2:  0x00000000  r2/a3:  0x00000000
[00:00:10.103,912] <err> os: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00000000
[00:00:10.135,467] <err> os:  xpsr:  0x61000000
[00:00:10.160,247] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:10.191,436] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:10.222,106] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:10.259,063] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:10.290,191] <err> os: fpscr:  0x00000000
[00:00:10.327,117] <err> os: Faulting instruction address (r15/pc): 0x00003b81
[00:00:10.358,245] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:10.383,270] <err> os: Current thread: 0x20000000 (user_thread)
[00:00:10.420,440] <err> app: k_sys_fatal_error_handler invoked.
[00:00:10.457,580] <inf> app: Thread is normal
[00:00:10.494,720] <wrn> app: Userspace thread; aborting it.
[00:00:10.531,921] <inf> app: Starting task, stride=100...
[00:00:10.569,213] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:10.606,384] <dbg> app: main: Spinning in main thread...
[00:00:11.606,414] <dbg> app: my_entry_point: Spinning in user thread...
[00:00:11.643,615] <dbg> app: main: Spinning in main thread...
[00:00:12.643,646] <inf> app: Stackoverflowing...
[00:00:12.680,816] <dbg> app: stack_overflow: Overflowing, &arr=0x20002738
...
[00:00:12.898,895] <dbg> app: stack_overflow: Overflowing, &arr=0x200024c8
[00:00:12.942,382] <err> os: ***** HARD FAULT *****
[00:00:12.967,529] <err> os:   Fault escalation (see below)
[00:00:13.017,761] <err> os: ***** MPU FAULT *****
[00:00:13.067,993] <err> os:   Stacking error (context area might be not valid)
[00:00:13.118,225] <err> os:   Floating-point lazy state preservation error
[00:00:13.168,487] <err> os: r0/a1:  0x200023c0  r1/a2:  0x00000000  r2/a3:  0x20002450
[00:00:13.231,353] <err> os:  xpsr:  0x2100000b
--- 1 messages dropped ---
[00:00:13.269,104] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:13.306,091] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:13.337,615] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:13.362,396] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:13.393,585] <err> os: fpscr:  0x00000000
[00:00:13.424,346] <err> os: Faulting instruction address (r15/pc): 0x000037bc
[00:00:13.461,303] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:13.492,431] <err> os: Fault during interrupt handling

[00:00:13.529,357] <err> os: Current thread: 0x20000000 (user_thread)
[00:00:13.560,485] <err> app: k_sys_fatal_error_handler invoked.
[00:00:13.585,510] <inf> app: Thread is normal
[00:00:13.622,650] <wrn> app: Userspace thread; aborting it.
[00:00:13.659,790] <err> os: ***** HARD FAULT *****
[00:00:13.696,929] <err> os:   Fault escalation (see below)
[00:00:13.734,039] <err> os: ***** BUS FAULT *****
[00:00:13.771,179] <err> os:   Precise data bus error
[00:00:13.808,319] <err> os:   BFAR Address: 0xfffffffe
[00:00:13.845,458] <err> os: r0/a1:  0x200023c0  r1/a2:  0x00000000  r2/a3:  0x20002450
[00:00:13.870,910] <err> os: r3/a4:  0x00000000 r12/ip:  0x0000a8d1 r14/lr:  0xffffffed
[00:00:13.902,038] <err> os:  xpsr:  0x2100000b
[00:00:13.927,398] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:13.965,209] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:14.002,655] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:14.046,173] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:14.089,477] <err> os: fpscr:  0x00000000
[00:00:14.139,709] <err> os: Faulting instruction address (r15/pc): 0x000037bc
[00:00:14.189,971] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:14.240,203] <err> os: Fault during interrupt handling

[00:00:14.290,466] <err> os: Current thread: 0x20000000 (user_thread)
[00:00:14.315,582] <err> app: k_sys_fatal_error_handler invoked.
[00:00:14.353,332] <inf> app: Thread is dead
[00:00:14.390,991] <wrn> app: Userspace thread; aborting it.
[00:00:14.422,271] <err> os: ***** HARD FAULT *****
[00:00:14.459,228] <err> os:   Fault escalation (see below)
[00:00:14.490,783] <err> os: ***** BUS FAULT *****
[00:00:14.515,533] <err> os:   Precise data bus error
[00:00:14.546,722] <err> os:   BFAR Address: 0xfffffffe
[00:00:14.572,204] <err> os: r0/a1:  0x200023c0  r1/a2:  0x00000000  r2/a3:  0x20002450
[00:00:14.603,332] <err> os: r3/a4:  0x00000000 r12/ip:  0x0000a8d1 r14/lr:  0xffffffed
[00:00:14.628,692] <err> os:  xpsr:  0x2100000b
[00:00:14.659,271] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:14.690,063] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:14.733,581] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:14.777,099] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:14.802,215] <err> os: fpscr:  0x00000000
[00:00:14.852,478] <err> os: Faulting instruction address (r15/pc): 0x000037bc
[00:00:14.902,709] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:14.952,972] <err> os: Fault during interrupt handling

[00:00:15.003,234] <err> os: Current thread: 0x20000000 (user_thread)
[00:00:15.028,350] <err> app: k_sys_fatal_error_handler invoked.
[00:00:15.066,101] <inf> app: Thread is dead
[00:00:15.103,759] <wrn> app: Userspace thread; aborting it.
[00:00:15.135,070] <err> os: ***** HARD FAULT *****
[00:00:15.172,027] <err> os:   Fault escalation (see below)
[00:00:15.203,552] <err> os: ***** BUS FAULT *****
[00:00:15.228,179] <err> os:   Precise data bus error
[00:00:15.259,368] <err> os:   BFAR Address: 0xfffffffe
[00:00:15.284,820] <err> os: r0/a1:  0x200023c0  r1/a2:  0x00000000  r2/a3:  0x20002450
[00:00:15.315,917] <err> os: r3/a4:  0x00000000 r12/ip:  0x0000a8d1 r14/lr:  0xffffffed
[00:00:15.341,308] <err> os:  xpsr:  0x2100000b
[00:00:15.371,887] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:15.402,679] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:15.446,197] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:15.489,685] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:15.514,831] <err> os: fpscr:  0x00000000
[00:00:15.565,032] <err> os: Faulting instruction address (r15/pc): 0x000037bc
[00:00:15.615,264] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:15.665,496] <err> os: Fault during interrupt handling