Hard fault (reason=25) when the thread is waiting for the message queue

lnhao commented 2 months ago

Describe the bug The fault occurs irregularly, and I cannot predict when it will happen. Some devices have been running for more than 4 days and are still functioning normally, while others hit the fault after a little more than a day. While debugging, the backtrace showed that the fault was entered from the k_msgq_get() function. There is very little other useful information. It does not look like a thread stack overflow.

What target platform are you using? -> STM32WB55
What have you tried to diagnose or workaround this issue? -> west debug / coredump.

To Reproduce Steps to reproduce the behavior:

Connect an STM32WB55 board to a PC.
cd ./zephyrproject/app_stm32wb55/application/app
west build -b board -p
west flash
west debug

Expected behavior I don't think the program should enter the Hard Fault handler.

Impact The customer's device has also encountered this bug.

Logs and console output

(gdb) info stack
#0  arch_system_halt (reason=25) at /home/alexliao/git/zephyrproject/app_stm32wb55/application/app/src/main.c:65
#1  0x080367ec in k_sys_fatal_error_handler (reason=reason@entry=25, esf=esf@entry=0x200101c0 <z_interrupt_stacks+2048>) at /home/alexliao/git/zephyrproject/zephyr/kernel/fatal.c:46
#2  0x0803681a in z_fatal_error (reason=reason@entry=25, esf=esf@entry=0x200101c0 <z_interrupt_stacks+2048>) at /home/alexliao/git/zephyrproject/zephyr/kernel/fatal.c:131
#3  0x0803347e in z_arm_fatal_error (reason=reason@entry=25, esf=esf@entry=0x200101c0 <z_interrupt_stacks+2048>) at /home/alexliao/git/zephyrproject/zephyr/arch/arm/core/aarch32/fatal.c:63
#4  0x08017b74 in z_arm_fault (msp=<optimized out>, psp=<optimized out>, exc_return=<optimized out>, callee_regs=<optimized out>) at /home/alexliao/git/zephyrproject/zephyr/arch/arm/core/aarch32/cortex_m/fault.c:1138
#5  0x08017c58 in z_arm_usage_fault () at /home/alexliao/git/zephyrproject/zephyr/arch/arm/core/aarch32/cortex_m/fault_s.S:102
#6  <signal handler called>
#7  arch_swap (key=0) at /home/alexliao/git/zephyrproject/zephyr/arch/arm/core/aarch32/swap.c:53
#8  0x08026afa in z_swap_irqlock (key=<optimized out>) at /home/alexliao/git/zephyrproject/zephyr/kernel/include/kswap.h:185
#9  0x080263de in z_impl_k_msgq_get (msgq=msgq@entry=0x20003078 <sys_event>, data=data@entry=0x2000f954 <app_main_thread_stack+5076>, timeout=...) at /home/alexliao/git/zephyrproject/zephyr/kernel/msg_q.c:260
#10 0x08015dca in k_msgq_get (timeout=..., data=0x2000f954 <app_main_thread_stack+5076>, msgq=0x20003078 <sys_event>) at /home/alexliao/git/zephyrproject/app_stm32wb55/application/app/build/zephyr/include/generated/syscalls/kernel.h:1176
#11 sys_event_message_receive (rmsg=rmsg@entry=0x2000f98c <app_main_thread_stack+5132>, msg_len=msg_len@entry=0x2000f98a <app_main_thread_stack+5130>, timeout=...) at /home/alexliao/git/zephyrproject/app_stm32wb55/application/app/src/msg.c:167
#12 0x080139d4 in app_main_thread () at /home/alexliao/git/zephyrproject/app_stm32wb55/application/app/src/main_thread.c:599
#13 0x08032c90 in z_thread_entry (entry=0x8013889 <app_main_thread>, p1=<optimized out>, p2=<optimized out>, p3=<optimized out>) at /home/alexliao/git/zephyrproject/zephyr/lib/os/thread_entry.c:36
#14 0x08032c90 in z_thread_entry (entry=0x8013889 <app_main_thread>, p1=<optimized out>, p2=<optimized out>, p3=<optimized out>) at /home/alexliao/git/zephyrproject/zephyr/lib/os/thread_entry.c:36
#15 0x94fb2920 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Environment (please complete the following information):

OS: Ubuntu
Toolchain: zephyr-sdk-0.16.0
Zephyr version: v3.3.0, v3.6.0

aescolar commented 2 months ago

@lnhao without a reproduction case it may be quite difficult for anybody to guess what the problem is. Are you able to provide one?

peter-mitsis commented 2 months ago

Based on the limited information so far, there is not much that I can add yet. That being said, it looks like the exception (which, according to the 'reason' in the debug dump, is a "precise data bus error") occurred at line 53 of swap.c in arch_swap(). This corresponds to:

return _current->arch.swap_return_value;

Not being an ARM expert, I am uncertain what causes this error. However, given the code location I am suspicious of the _current pointer. Perhaps this pointer became corrupt in some way?

One other thing of note as far as execution flow goes: slightly earlier in arch_swap(), we generated a PendSV interrupt (necessary for scheduling). If I remember things correctly, PendSV is the lowest priority interrupt, which makes it ideal for scheduling. And in arch_swap(), it does not get serviced until the interrupts are unlocked via irq_unlock(0).

Given the location, I do wonder if this is reproducible on other Cortex-M platforms, as this code is pretty foundational to it.
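One way to test the "precise data bus error" reading and the suspicion about a corrupt _current pointer is to look at the Cortex-M fault status registers. The following is only a minimal sketch, not Zephyr code: the enum and function names are hypothetical and exist only for this example, while the bit positions are those of the architectural CFSR register (0xE000ED28) and the address register is BFAR (0xE000ED38). It classifies a raw CFSR value and reports the faulting address only when the hardware marks it as valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* BusFault bits of the Cortex-M CFSR (Configurable Fault Status Register). */
#define CFSR_IBUSERR     (1u << 8)   /* instruction bus error */
#define CFSR_PRECISERR   (1u << 9)   /* precise data bus error */
#define CFSR_IMPRECISERR (1u << 10)  /* imprecise data bus error */
#define CFSR_UNSTKERR    (1u << 11)  /* fault while unstacking on exception return */
#define CFSR_STKERR      (1u << 12)  /* fault while stacking on exception entry */
#define CFSR_BFARVALID   (1u << 15)  /* BFAR holds the faulting address */

/* Hypothetical classification used only in this sketch. */
enum bus_fault_kind {
    BUS_FAULT_NONE,
    BUS_FAULT_PRECISE,
    BUS_FAULT_IMPRECISE,
    BUS_FAULT_INSTRUCTION,
    BUS_FAULT_STACKING,
    BUS_FAULT_UNSTACKING
};

/* Return the first BusFault cause found in cfsr (the check order is a choice
 * made for this sketch, not something mandated by the architecture). */
enum bus_fault_kind cfsr_bus_fault_kind(uint32_t cfsr)
{
    if (cfsr & CFSR_PRECISERR) {
        return BUS_FAULT_PRECISE;
    }
    if (cfsr & CFSR_IMPRECISERR) {
        return BUS_FAULT_IMPRECISE;
    }
    if (cfsr & CFSR_IBUSERR) {
        return BUS_FAULT_INSTRUCTION;
    }
    if (cfsr & CFSR_STKERR) {
        return BUS_FAULT_STACKING;
    }
    if (cfsr & CFSR_UNSTKERR) {
        return BUS_FAULT_UNSTACKING;
    }
    return BUS_FAULT_NONE;
}

/* Store the faulting data address in *addr_out and return true, but only for
 * a precise data bus error whose BFAR value is flagged as valid. */
bool cfsr_fault_address(uint32_t cfsr, uint32_t bfar, uint32_t *addr_out)
{
    assert(addr_out != NULL);

    if ((cfsr & CFSR_PRECISERR) && (cfsr & CFSR_BFARVALID)) {
        *addr_out = bfar;
        return true;
    }
    return false;
}
```

If the reported address turns out to be far outside RAM, that would be consistent with arch_swap() dereferencing a damaged pointer. Note that the kernel's fault handler may already have cleared the status bits by the time the system halts, so the raw values may need to be captured early in the fault path.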

lnhao commented 2 months ago

@aescolar @peter-mitsis thank you all. This problem is difficult to reproduce. On some devices it has not occurred yet, while on others it appears after running for about 24 hours. I cannot provide a sample at the moment; it will take some more time.

peter-mitsis commented 1 month ago

@lnhao - Figured I would check in and see if there is any additional information that can be added to this issue.

lnhao commented 1 month ago

Hi @peter-mitsis @nashif @aescolar, another issue was found while testing the same program: after the device had run for 20 hours, the I2C interrupt fired continuously, which prevented other threads from running properly. This problem is easy to reproduce. I then disabled the I2C interrupt with CONFIG_I2C_STM32_INTERRUPT=n, and after running for more than 10 days the device still works fine. The Hard Fault has also disappeared. I think Zephyr's STM32 I2C interrupt handling malfunctions after prolonged operation.
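For readers who want to try the same workaround, it amounts to a single Kconfig line. The fragment below is a sketch that assumes the option is set in the application's prj.conf; the thread does not say which configuration file was actually used.

```
# prj.conf (assumed location, any Kconfig fragment used by the build would do)
# Disable interrupt-driven transfers in the STM32 I2C driver.
CONFIG_I2C_STM32_INTERRUPT=n
```

A pristine rebuild (the -p option already used in the reproduction steps) makes sure the changed setting is picked up.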

peter-mitsis commented 4 weeks ago

Re-assigning to @erwango as this is looking like it is going in the direction of the STM32 I2C.

teburd commented 4 weeks ago

Maybe related #70077

zephyrproject-rtos / zephyr

Hard fault (reason=25) when the thread is waiting for the message queue #71743