Closed ABOSTM closed 3 years ago
Analysis After some (difficult) analysis, I concluded that this HardFault results from a combination of circumstances. ALL of the following conditions must be true to reproduce the issue:
At least one bit of DBGMCU_CR is set (DBG_STANDBY, DBG_STOP or DBG_SLEEP). This can happen when flashing with OpenOCD, or when both CONFIG_DEBUG and CONFIG_PM are enabled (stm32_power_init() calls LL_DBGMCU_EnableDBGStopMode()). It is also possible to write those bitfields directly in SoC init for test purposes. Note: Power Management is not a requirement to reproduce this issue. Those bits prevent HCLK and FCLK from being disabled when the MCU enters Standby, Stop or Sleep mode, which is useful for keeping a debugger usable in low-power modes. When those bits are forced to 0, the problem vanishes.
The following single commit must be merged: "kernel/idle: Replace stolen IRQ lock"
sha1 39a8f3b4f957ed0e50c848414891ad5fab4500bb (from PR #32848)
Thanks to git bisect, I found that this issue appeared after this commit was merged.
When this commit is reverted on the main branch, the problem vanishes.
CONFIG_ZTEST=y
I am not sure this is absolutely necessary, but it has direct or indirect impact:
If I test samples/basic/blinky, I cannot reproduce the issue,
but if I convert this blinky sample to use CONFIG_ZTEST=y (with thread, stack, test, ...), then I reproduce the HardFault.
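To make the first condition concrete, here is a minimal host-side sketch of checking and clearing the three low-power debug bits in a DBGMCU_CR value. The bit positions (DBG_SLEEP = bit 0, DBG_STOP = bit 1, DBG_STANDBY = bit 2) are an assumption taken from the usual STM32 register layout and should be checked against the reference manual of the target. The `SKETCH_*` names and both helper functions are hypothetical and are not part of the STM32 LL API or of Zephyr.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed DBGMCU_CR bit positions; verify against the target's reference manual. */
#define SKETCH_DBG_SLEEP    (1u << 0)
#define SKETCH_DBG_STOP     (1u << 1)
#define SKETCH_DBG_STANDBY  (1u << 2)
#define SKETCH_DBG_LP_MASK  (SKETCH_DBG_SLEEP | SKETCH_DBG_STOP | SKETCH_DBG_STANDBY)

/* Hypothetical helper: true if any low-power debug bit is set, i.e. the
 * first condition listed in the analysis is met for this register value. */
static bool dbg_bits_keep_clocks_running(uint32_t dbgmcu_cr)
{
    return (dbgmcu_cr & SKETCH_DBG_LP_MASK) != 0u;
}

/* Hypothetical helper: the register value with the three bits forced to 0,
 * which is the configuration in which the problem was reported to vanish. */
static uint32_t dbg_bits_cleared(uint32_t dbgmcu_cr)
{
    return dbgmcu_cr & ~SKETCH_DBG_LP_MASK;
}
```

On real hardware the value would be read from, and written back to, the DBGMCU peripheral register; the sketch only manipulates a plain integer so that it can be compiled and checked on a host.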
Hardfault analysis: Unwinding the HardFault call stack, I found that the program counter pc=0x08003c32 (the same address is reported in the console log) is not aligned on an instruction boundary but points into the middle of an instruction, causing the HardFault. I could not find out why this pc is misaligned (corrupted stack?). Note that it is always the middle of the same instruction, whatever test is executed.
    /* Enter low power state */
    wfi
 8003c2c:   bf30        wfi
    /*
     * Clear PRIMASK and flush instruction buffer to immediately service
     * the wake-up interrupt.
     */
    cpsie i
 8003c2e:   b662        cpsie i
    isb
 8003c30:   f3bf 8f6f   isb sy
    bx lr
 8003c34:   4770        bx lr
 8003c36:   46c0        nop ; (mov r8, r8)
This is the asm function "arch_cpu_idle" (arch/arm/core/aarch32/cpu_idle.S). Note that the pc address is very close to the "cpsie i" instruction, which enables interrupts.
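The claim that pc=0x08003c32 falls in the middle of an instruction can be checked mechanically. The sketch below walks the halfwords copied from the disassembly above, starting at 0x08003c2c, and uses the Thumb encoding rule that a halfword whose bits [15:11] are 0b11101, 0b11110 or 0b11111 begins a 32-bit instruction. The function names and the `idle_code` array are hypothetical and introduced only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A Thumb halfword starts a 32-bit instruction when bits [15:11] are
 * 0b11101, 0b11110 or 0b11111; every other value is a 16-bit instruction. */
static bool thumb_is_32bit(uint16_t hw)
{
    return (hw >> 11) >= 0x1Du;
}

/* Walk the code from a known instruction start (base) and report whether
 * pc is the first byte of an instruction. */
static bool pc_on_instruction_boundary(uint32_t base, const uint16_t *code,
                                       size_t n_halfwords, uint32_t pc)
{
    size_t i = 0;

    while (i < n_halfwords) {
        uint32_t addr = base + (uint32_t)(2u * i);

        if (addr == pc) {
            return true;
        }
        if (addr > pc) {
            return false; /* stepped over pc: it lies inside an instruction */
        }
        i += thumb_is_32bit(code[i]) ? 2u : 1u;
    }
    return false;
}

/* Halfwords from the listing: wfi, cpsie i, isb sy (two halfwords), bx lr, nop. */
static const uint16_t idle_code[] = {
    0xbf30, 0xb662, 0xf3bf, 0x8f6f, 0x4770, 0x46c0
};
```

Calling `pc_on_instruction_boundary(0x08003c2c, idle_code, 6, 0x08003c32)` returns false because 0x08003c32 is the second halfword of the 4-byte "isb sy" at 0x08003c30, which matches the observation in the analysis.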
I also found that adding the asm instruction "cpsid i" just after the following comment (even though the comment says it is not necessary) makes the problem vanish:
/*
* For all the other ARM architectures that do not implement BASEPRI,
* PRIMASK is used as the interrupt locking mechanism, and it is not
* necessary to set PRIMASK here, as PRIMASK would have already been
* set by the caller as part of interrupt locking if necessary
* (i.e. if the caller sets _kernel.idle).
*/
cpsid i
Note: when debugging step by step, I could not reproduce the HardFault, so it is very difficult to get to the root cause of the issue (maybe this is linked to interrupt enabling?).
@andyross, do you have any idea how this issue could be linked to your commit 39a8f3b4f957ed0e50c848414891ad5fab4500bb? Your commit adds a "cpsie i" instruction (ARMv6-M), which clears PRIMASK and enables interrupts. So far I have found 2 workarounds: one is to remove cpsid (revert your commit), the other is to add cpsid (disabling interrupts). So either we do not disable interrupts (revert your patch) or we do disable them; both workarounds revolve around disabling/enabling interrupts. Could it be that disabling and enabling interrupts must come in pairs, but there is a path in which this is not respected, causing the HardFault?
@ioannisg, your Cortex M expertise is welcome too.
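The pairing question above can be illustrated with a small host-side model of PRIMASK-based locking. This is only a sketch under stated assumptions: `sim_primask` stands in for the real PRIMASK register, `sim_irq_lock()` models "read PRIMASK, then cpsid i", `sim_irq_unlock()` models "cpsie i only if the saved key was 0", and `sim_cpu_idle()` models the unconditional "cpsie i" at the end of the idle routine shown in the disassembly. All `sim_*` names are hypothetical; this is not the Zephyr implementation.

```c
#include <stdint.h>

/* Simulated PRIMASK: 1 = interrupts masked, 0 = interrupts enabled. */
static uint32_t sim_primask;

/* Model of a PRIMASK-based lock: remember the previous state, then mask. */
static uint32_t sim_irq_lock(void)
{
    uint32_t key = sim_primask;

    sim_primask = 1u; /* cpsid i */
    return key;
}

/* Model of the matching unlock: re-enable only if interrupts were enabled
 * when the corresponding lock was taken. */
static void sim_irq_unlock(uint32_t key)
{
    if (key == 0u) {
        sim_primask = 0u; /* cpsie i */
    }
}

/* Model of the idle routine's tail: interrupts are re-enabled regardless
 * of any outstanding lock key (wfi; cpsie i; isb). */
static void sim_cpu_idle(void)
{
    sim_primask = 0u;
}
```

In this model, nested lock/unlock pairs restore the original state, whereas the idle routine clears the mask without consuming a key. That is the kind of unpaired path the question asks about; the model does not show whether such a path is what triggers the HardFault.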
All of that reminds me of #22078, which was fixed in #23511 for ARMv7-M, but I can't see where exactly 39a8f3b4f957ed0e50c848414891ad5fab4500bb enables interrupts. Is that the right commit?
Looks like there is no arch_irq_lock() in some place where it should be present.
@tagunil,
I can't see where exactly 39a8f3b enables interrupts. Is that the right commit?
My bad, arch_irq_lock will disable interrupts (set PRIMASK). I updated my description.
All of that reminds me of #22078, which was fixed in #23511 for ARMv7-M,
Yes, #22078 looks very similar, thanks for pointing this out. The fix #23511 led to the comment I already mentioned for the ARMv6 arch:
/*
* For all the other ARM architectures that do not implement BASEPRI,
* PRIMASK is used as the interrupt locking mechanism, and it is not
* necessary to set PRIMASK here, as PRIMASK would have already been
* set by the caller as part of interrupt locking if necessary
* (i.e. if the caller sets _kernel.idle).
*/
which I don't understand (I don't have enough zephyr kernel knowledge)
@ioannisg I've set the issue to medium, don't hesitate to raise to high if requested
@ABOSTM What I can't understand is why your bisection points to the commit that disables interrupts, while your experiment shows that disabling interrupts by adding "cpsid i" helps.
It could also be related to the idle API fragility discussed in #24255.
^^ @stephanosio
closed by mistake
@ioannisg, @andyross would you have time to answer the questions in this comment: https://github.com/zephyrproject-rtos/zephyr/issues/37119#issuecomment-884338591 ?
Since commit e0bed3b989ef95952c8474fabb551c01e8d7ae16, a similar hardfault occurs when testing the stm32g071rb nucleo board with "test suite timer_api" :
*** Booting Zephyr OS build zephyr-v2.6.0-2072-ge0bed3b989ef ***
Running test suite timer_api
===================================================================
START - test_time_conversions
PASS - test_time_conversions in 0.189 seconds
===================================================================
START - test_timer_duration_period
E: ***** HARD FAULT *****
This hardfault is definitely linked to USERSPACE and to PR #28231 "Cortex-R MPU support" when applied on Cortex-M0+ devices with an MPU, such as stm32g071 or stm32l073, and especially to its first commit, "arch: arm: cortex_r: Add MPU and USERSPACE support". With CONFIG_TEST_USERSPACE=n, the test case tests/kernel/timer/timer_api runs to completion.
@FRASTM, the Hardfault on the stm32g071rb nucleo board is not linked to the current issue (see issue 38421).
Describe the bug A Hardfault occurs on nucleo_l073rz while executing some tests on the automatic test bench, mainly kernel tests but not exclusively. The HardFault is easily reproducible under some circumstances (see the analysis below). List of faulty tests