tests: kernel tests hardfault on nucleo_l073rz

ABOSTM commented 3 years ago

Describe the bug Hardfault occurs on nucleo_l073rz, while executing some tests on automatic test bench. Mainly kernel tests, but not exclusively. HardFault is reproducible easily under some circumstances (see analysis below). List of faulty tests

tests/arch/arm/arm_runtime_nmi/arch.interrupt.arm.nmi, nucleo_l073rz tests/benchmarks/latency_measure/benchmark.kernel.latency.stm32, nucleo_l073rz tests/kernel/context/kernel.common, nucleo_l073rz tests/kernel/fifo/fifo_api/kernel.fifo, nucleo_l073rz tests/kernel/fifo/fifo_timeout/kernel.fifo.timeout, nucleo_l073rz tests/kernel/interrupt/arch.interrupt, nucleo_l073rz tests/kernel/lifo/lifo_usage/kernel.lifo.usage, nucleo_l073rz tests/kernel/mbox/mbox_api/kernel.mailbox.api, nucleo_l073rz tests/kernel/mem_heap/k_heap_api/kernel.k_heap_api, nucleo_l073rz tests/kernel/mem_protect/sys_sem/kernel.memory_protection.sys_sem.nouser, nucleo_l073rz tests/kernel/mem_protect/userspace/kernel.memory_protection.userspace, nucleo_l073rz tests/kernel/mem_slab/mslab_api/kernel.memory_slabs.api, nucleo_l073rz tests/kernel/mem_slab/mslab_concept/kernel.memory_slabs.concept, nucleo_l073rz tests/kernel/msgq/msgq_api/kernel.message_queue, nucleo_l073rz tests/kernel/msgq/msgq_usage/kernel.message_queue_usage, nucleo_l073rz tests/kernel/mutex/mutex_api/kernel.mutex, nucleo_l073rz tests/kernel/mutex/sys_mutex/system.mutex.nouser, nucleo_l073rz tests/kernel/pending/kernel.objects, nucleo_l073rz tests/kernel/pipe/pipe/kernel.pipe, nucleo_l073rz tests/kernel/pipe/pipe_api/kernel.pipe.api, nucleo_l073rz tests/kernel/profiling/profiling_api/kernel.common.profiling, nucleo_l073rz tests/kernel/sched/deadline/kernel.scheduler.deadline, nucleo_l073rz tests/kernel/sleep/kernel.common.timing, nucleo_l073rz tests/kernel/threads/dynamic_thread/kernel.threads.dynamic, nucleo_l073rz tests/kernel/threads/thread_init/kernel.threads.init, nucleo_l073rz tests/kernel/threads/tls/kernel.threads.tls, nucleo_l073rz tests/kernel/threads/tls/kernel.threads.tls.userspace, nucleo_l073rz tests/kernel/timer/timer_monotonic/kernel.timer.monotonic, nucleo_l073rz tests/kernel/workq/work/kernel.work.api, nucleo_l073rz tests/kernel/workq/work_queue/kernel.workqueue, nucleo_l073rz tests/lib/time/libraries.libc.time, nucleo_l073rz tests/subsys/logging/log_core_additional/logging.add.async, nucleo_l073rz tests/subsys/pm/power_mgmt/subsys.pm.device_pm, nucleo_l073rz

To Reproduce Steps to reproduce the behavior:

Enable debug and PowerManagement in prj.conf: CONFIG_DEBUG=y and CONFIG_PM=y (to force the bug systematically)
west build -p auto -b nucleo_l073rz tests/kernel/mutex/mutex_api/
west flash
See error

Logs and console output

*** Booting Zephyr OS build zephyr-v2.6.0-1131-ge287639479d7  ***
Running test suite mutex_api
===================================================================
START - test_mutex_lock_unlock
 PASS - test_mutex_lock_unlock in 0.2 seconds
===================================================================
START - test_mutex_reent_lock_forever
access resource from main thread
access resource from main thread
E: ***** HARD FAULT *****
E: r0/a1:  0x2000077c  r1/a2:  0x00000000  r2/a3:  0x00000000
E: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x08007575
E:  xpsr:  0x21000000
E: Faulting instruction address (r15/pc): 0x08003c32
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x200003e0 (idle 00)
E: Halting system

Environment (please complete the following information):

OS: Linux and Windows
Toolchain Zephyr SDK
Commit SHA 35011489229b5acb049f325fcca3a38ba012fdd4

ABOSTM commented 3 years ago

Analysis After some (difficult) analysis, I came to the conclusion that this Hardfault comes with a conjunction of circumstances: ALL the following conditions need to be true to reproduce this issue:

at least one bit of the DBGMCU_CR is set (DBG_STANDBY, DBG_STOP or DBG_SLEEP). This could happens when flashing with OpenOCD or when enabling both CONFIG_DEBUG and CONFIG_PM (in stm32_power_init(), call to LL_DBGMCU_EnableDBGStopMode() ) It is also possible to directly write those bitfields in soc init for test purpose. Note: Power Management is not a requirement to reproduce this issue. Those bits prevent to disable HCLK and FCLK when MCU is going to Standby, Stop or sleep. This is useful to use a debugger while using lowpower. When those bits are forces to 0, the problem vanished.
Following single commit should be merged: "kernel/idle: Replace stolen IRQ lock" sha1 39a8f3b4f957ed0e50c848414891ad5fab4500bb (from PR #32848) Thanks to git bisect, I found that this issue appears after merge of this commit. When reverting this commit on main branch, problem vanished.
CONFIG_ZTEST=y I am not sure this is absolutely necessary, but it has direct or indirect impact: If I test sample/basic/blinky, I don't reproduce the issue, but if I transform this blinky test with CONFIG_ZTEST=y (with thread, stack, test, ...) then I reproduce the HardFault

Hardfault analysis: Unwinding Hardfault call stack, I found that Program Counter pc=0x08003c32 (same address also provided by console log) is not aligned on an instruction, but in the middle of an instruction ... causing the HardFault. But I could not found why this pc is no aligned (corrputed stack, ) It is to be noticed that, this is always the middle of the same instruction whatever the test executed.

      /* Enter low power state */
      wfi
   8003c2c:   bf30        wfi

      /*
       * Clear PRIMASK and flush instruction buffer to immediately service
       * the wake-up interrupt.
       */
      cpsie   i
   8003c2e:   b662        cpsie   i
      isb
   8003c30:   f3bf 8f6f   isb sy

      bx  lr
   8003c34:   4770        bx  lr
   8003c36:   46c0        nop         ; (mov r8, r8)

It is asm function "arch_cpu_idle" (arch/arm/core/aarch32/cpu_idel.s) Note that pc address is very close to "cpsie i" instruction which will enable Interrupts.

I also found that adding asm instruction: cpsid i just after this comment (despite it is said not necessary), problem vanished.

    /*
     * For all the other ARM architectures that do not implement BASEPRI,
     * PRIMASK is used as the interrupt locking mechanism, and it is not
     * necessary to set PRIMASK here, as PRIMASK would have already been
     * set by the caller as part of interrupt locking if necessary
     * (i.e. if the caller sets _kernel.idle).
     */
     cpsid  i

Note when debugging step by step, I could not reproduce the Hardfault, so it is very difficult to get to the root cause of the issue. (maybe due to something linked to interrupt enabling ??)

@andyross, do you have any idea how this issue could be linked to your commit 39a8f3b4f957ed0e50c848414891ad5fab4500bb ? Your commit add an a instruction "cpsie i" (ARMV6-M), which clear PRIMASK and enable interrupts so I currently found 2 workaround, one is to remove cpsid (revert your commit), the other is to add cpsid (disabling interrupt). Thus either we are not disabling interrupt (revert your patch), or we are disabling interrupts, both are around disabling/enabling interrupts. Would that be possible that disabling/enabling interrupts should come by pair, but there is a path in which this is not respected, causing hardfault ?

@ioannisg, your Cortex M expertise is welcome too.

tagunil commented 3 years ago

All of that reminds me of #22078, which was fixed in #23511 for ARMv7-M, but I can't see where exactly 39a8f3b4f957ed0e50c848414891ad5fab4500bb enables interrupts. Is that the right commit?

tagunil commented 3 years ago

Looks like there is no arch_irq_lock() in some place where it should be present.

ABOSTM commented 3 years ago

@tagunil,

I can't see where exactly 39a8f3b enables interrupts. Is that the right commit?.

My bad, arch_irq_lock will disable interrupts (set primask). I updated my description

All of that reminds me of #22078, which was fixed in #23511 for ARMv7-M,

Yes #22078 looks very similar, thanks for point this. The fix #23511 lead to the comment I already mentioned for ARMV6 arch:

    /*
     * For all the other ARM architectures that do not implement BASEPRI,
     * PRIMASK is used as the interrupt locking mechanism, and it is not
     * necessary to set PRIMASK here, as PRIMASK would have already been
     * set by the caller as part of interrupt locking if necessary
     * (i.e. if the caller sets _kernel.idle).
     */

which I don't understand (I don't have enough zephyr kernel knowledge)

erwango commented 3 years ago

@ioannisg I've set the issue to medium, don't hesitate to raise to high if requested

tagunil commented 3 years ago

@ABOSTM What I can't understand is why your bisection points to the commit that disables interrupts, while your experiment shows that disabling interrupts by adding "cpsid i" helps.

tagunil commented 3 years ago

Also it could be related with idle API fragility discussed in #24255.

ABOSTM commented 3 years ago

^^ @stephanosio

ABOSTM commented 3 years ago

closed by mistake

erwango commented 3 years ago

@ioannisg, @andyross would you have time answering questions in this comment https://github.com/zephyrproject-rtos/zephyr/issues/37119#issuecomment-884338591 ?

FRASTM commented 3 years ago

Since commit e0bed3b989ef95952c8474fabb551c01e8d7ae16, a similar hardfault occurs when testing the stm32g071rb nucleo board with "test suite timer_api" :

*** Booting Zephyr OS build zephyr-v2.6.0-2072-ge0bed3b989ef  ***           
Running test suite timer_api                                                
===================================================================         
START - test_time_conversions                                               
 PASS - test_time_conversions in 0.189 seconds                              
===================================================================         
START - test_timer_duration_period                                          
E: ***** HARD FAULT *****

This hardfault is definitely linked to the USERSPACE and that PR "Cortex-R MPU support" #28231 applied on a cortex M0+ with MPU devices like stm32g071 or stm32l073 especially the first commit " arch: arm: cortex_r: Add MPU and USERSPACE support " When CONFIG_TEST_USERSPACE=n the testcase tests/kernel/timer/timer_api can run to its end.

ABOSTM commented 3 years ago

@FRASTM, Hardfault on stm32g071rb nucleo board is not link to the current issue (see issue 38421)

zephyrproject-rtos / zephyr

tests: kernel tests hardfault on nucleo_l073rz #37119