Closed cinlyooi-intel closed 5 years ago
@cinlyooi-intel Please always specify what tests failed exactly, this project has multiple tests, the reproduction steps are not correct.
Also, -DARCH=arm
are not needed.
more debug info:
DEVICE: [00:00:00.000,000] <dbg> mpu.arm_mpu_init: total region count: 16
DEVICE: [00:00:00.000,000] <dbg> mpu._region_init: [0] 0x00400000 0x07020028
DEVICE: [00:00:00.000,000] <dbg> mpu._region_init: [1] 0x20400000 0x110b0022
DEVICE: [00:00:00.000,000] <dbg> mpu._region_init: [2] 0x20440000 0x110b0020
DEVICE: [00:00:00.000,010] <dbg> mpu._arch_configure_dynamic_mpu_regions: configure thread 0x00000000's domain
DEVICE: [00:00:00.000,011] <dbg> mpu._arch_configure_dynamic_mpu_regions: configure domain: 0x00403441
DEVICE: [00:00:00.000,013] <dbg> mpu._arch_configure_dynamic_mpu_regions: configure domain: 0x00403441
DEVICE: ***** USAGE FAULT *****
DEVICE: Unaligned memory access
DEVICE: ***** Hardware exception *****
DEVICE: Current thread ID = 0x00000000
DEVICE: Faulting instruction address = 0x403692
DEVICE: Fatal fault in ISR! Spinning...
@ioannisg @andrewboie any hints?
I cannot reproduce this on nRF52X boards, need a sam e70, which i am getting soon, actually. Perhaps @aurel32 could help with sending a log from e70, as well?
I am unable to reproduce the issue. I tried both commit 9c2c115716700da3279fd4ab541f0f598d0c4d6d and the current master with both the SDK 0.10.0 and the Debian toolchain. in all cases the test pass.
While attempting to recheck this issue, I cannot boot if xoroshiro enabled (#14244) .
Oh trying harder I realized that the xoroshiro random generator is not the default one when using the above instructions. Manually selecting the option after configuring the test, I am able to reproduce the issue. The crash happens in the MPU code, more precisely there:
/* Populate internal ARM MPU region configuration structure. */
region_conf.base = new_region->start;
_get_region_attr_from_k_mem_partition_info(®ion_conf.attr,
40383a: e9d1 4300 ldrd r4, r3, [r1]
/* in ARMv7-M MPU the base address is not required
* to determine region attributes
*/
(void) base;
The compiler considers that *new_region
is 64-bit aligned, and try to fetch both start
and size
with a single instruction.
Could it be that we do not respect the 8 bytes stack alignment in an interrupt?
Could it be that we do not respect the 8 bytes stack alignment in an interrupt?
This is hard to believe, STKALIGN is set to 1 during initialization, I wrote this myself:
#if defined(CONFIG_STACK_ALIGN_DOUBLE_WORD)
/* Enforce double-word stack alignment on exception entry
* for Cortex-M3 and Cortex-M4 (ARMv7-M) MCUs. For the rest
* of ARM Cortex-M processors this setting is enforced by
* default and it is not configurable.
*/
#if defined(CONFIG_CPU_CORTEX_M3) || defined(CONFIG_CPU_CORTEX_M4)
SCB->CCR |= SCB_CCR_STKALIGN_Msk;
#endif
#endif /* CONFIG_STACK_ALIGN_DOUBLE_WORD */
ARM Note for CM7:
In the Cortex-M7 processor CCR.STKALIGN is read-only and has a value of 1. This means that the exception stack frame starting address is always 8-byte aligned.
Oh trying harder I realized that the xoroshiro random generator is not the default one when using the above instructions. Manually selecting the option after configuring the test, I am able to reproduce the issue. The crash happens in the MPU code, more precisely there:
/* Populate internal ARM MPU region configuration structure. */ region_conf.base = new_region->start; _get_region_attr_from_k_mem_partition_info(®ion_conf.attr, 40383a: e9d1 4300 ldrd r4, r3, [r1] /* in ARMv7-M MPU the base address is not required * to determine region attributes */ (void) base;
The compiler considers that
*new_region
is 64-bit aligned, and try to fetch bothstart
andsize
with a single instruction.
Might be worth of bringing this up; in ARMv7-M we allow unaligned access, @aurel32. That's why we also don't enable the corresponding UsageFault trap.
Interesting documentation here:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0646b/BABFAIGG.html
The Cortex-M7 processor supports unaligned access only for the following instructions:
LDR, LDRT.
LDRH, LDRHT.
LDRSH, LDRSHT.
STR, STRT.
STRH, STRHT.
Now, @aurel32 reports an LDRD here.
I wonder if it is time to enforce aligned access also for ARMv7-M.
The alignment constraints are the same for Cortex-M4. I wonder if GCC is right to believe that the structure is 64-bit aligned and we are doing something wrong somewhere or if it is a GCC bug.
The alignment constraints are the same for Cortex-M4. I wonder if GCC is right to believe that the structure is 64-bit aligned and we are doing something wrong somewhere or if it is a GCC bug.
This needs investigation, I agree.
However, I would like to stress once more that perhaps it is time to forbid unaligned access in ARMv7-M. This is a recommendation for having v7-M backwards compatible (binary compatibility) with ARMv8-M:
Un-aligned access is also forbidden in ARmv6-M.
What do you think?
The alignment constraints are the same for Cortex-M4. I wonder if GCC is right to believe that the structure is 64-bit aligned and we are doing something wrong somewhere or if it is a GCC bug.
Digging more into address alignment: ARMv7-M requires address in LDRD, STRD to be word-aligned. Now, I wonder why an address of a stack object (struct k_mem_partition) wouldn't be word-aligned...
@aurel32 in the UsageFault handler, could you print the contents of the stacked R1, to see if it is word-aligned or not?
It's not even half-word aligned: r1 = 0x403575. This can be seen in the GDB backtrace:
#0 _mpu_configure_region (index=3 '\003', new_region=0x403575 <_isr_wrapper>) at /home/aurel32/git/zephyrproject/zephyr/arch/arm/core/cortex_m/mpu/arm_mpu.c:91
This is an address on the flash, which corresponds to _isr_wrapper
with the lower bit set (thumb mode). I therefore guess it comes from a stack overflow somewhere.
Note that the problem doesn't happen on a nucleo l432kc board (i had to force the MPU to be enabled), despite having the same LDRD instruction.
Hm, interesting. Could you compile with --no_unaligned_access option? I am curious about the result.
Hm, stack overflows should be caught by mpu. This is not a synchronous MPU/USG fault. So if the memory is corrupted by earlier stack overflow, mpu fault should have triggered.
I guess you mean -mno-unaligned-access
. In that case I have no output at all, using gdb the code seems to loop in printk related functions. A simple hello world work though.
The issue might be located in the SAM entropy driver. The problem disappears if I remove the call to k_yield()
in entropy_sam_wait_ready()
. Now I am a bit lost, as this call happens from a thread and because the STM32 driver does the same and do not have the issue (or at least I can't reproduce it).
There are other fishy things even after disabling the k_yield()
. For example if I also enable the shell:
***** Booting Zephyr OS v1.14.0-rc1-1065-g0e7cb91b27b6 *****
Running test suite common_test
===================================================================
starting test - test_rand32
Generating random numbers
PASS - test_rand32
===================================================================
Test suite common_test succeeded
===================================================================
PROJECT EXECUTION SUCCESSFUL
uart:~$ kernel threads
***** USAGE FAULT *****
Division by zero
***** Hardware exception *****
Current thread ID = 0x204004f0
Faulting instruction address = 0x403272
Fatal fault in thread 0x204004f0! Aborting.
ASSERTION FAIL [z_spin_lock_valid(l)] @ /home/aurel32/git/zephyrproject/zephyr/include/spinlock.h:76
Recursive spinlock
***** HARD FAULT *****
Fault escalation (see below)
***** Hardware exception *****
Current thread ID = 0x204004f0
Faulting instruction address = 0x40153a
Fatal fault in ISR! Spinning...
The division by 0 happens in stack_unused_space_get()
because size is 0.
The issue might be located in the SAM entropy driver. The problem disappears if I remove the call to
k_yield()
inentropy_sam_wait_ready()
. Now I am a bit lost, as this call happens from a thread and because the STM32 driver does the same and do not have the issue (or at least I can't reproduce it).
This makes more sense to me. The driver might be corrupting the kernel memory inside ISR...
This makes more sense to me. The driver might be corrupting the kernel memory inside ISR...
I have disabled userspace and reduced the driver to a simple constant number generator and the shell problem is still there...
Skipping the problematic division, I got:
uart:~$ kernel threads
Threads:
*0x20400498 shell_uart
options: 0x0, priority: 14
stack size 0, unused 0, usage 0 / 0 (0 %)
0x20400034 logging
options: 0x0, priority: 14
stack size 0, unused 0, usage 0 / 0 (0 %)
0x204005ec idle
options: 0x1, priority: 15
stack size 0, unused 0, usage 0 / 0 (0 %)
It seems the stack size happened because I didn't run a make clean after disabling the userspace.
I am now convinced the issue is due to k_yield() call during the initialization of the Xoroshiro random number generator through the SAM driver. I have replaced the driver by a very simple version:
diff --git a/drivers/entropy/entropy_sam.c b/drivers/entropy/entropy_sam.c
index d4593f3a12..9ea2204888 100644
--- a/drivers/entropy/entropy_sam.c
+++ b/drivers/entropy/entropy_sam.c
@@ -47,6 +47,11 @@ static int entropy_sam_get_entropy(struct device *dev, u8_t *buffer,
{
Trng *const trng = DEV_CFG(dev)->regs;
+ //k_yield();
+
+ buffer[0] = 42;
+ return 1;
+
while (length > 0) {
size_t to_copy;
u32_t value;
@@ -72,6 +77,8 @@ static int entropy_sam_init(struct device *dev)
{
Trng *const trng = DEV_CFG(dev)->regs;
+ return 0;
+
/* Enable the user interface clock */
soc_pmc_peripheral_enable(DT_ENTROPY_SAM_TRNG_PERIPHERAL_ID);
Enabling the k_yield()
line causes a crash.
It looks to me that the issue is in the SAM driver. The driver should provide a get_entropy_isr
method which can be called from ISR and which doesn't call k_yield(). In addition it should respect the ENTROPY_BUSYWAIT
flag and not wait if it is not passed.
This makes me realize that the driver is not thread safe. It looks tricky to provide a thread/ISR safe version of the driver without disabling the interrupts between check that entropy is available and the read of that entropy with the current way it is done. I guess the easiest way is to read the entropy in a single place from the TRNG interrupt and putting it in a k_msgq based entropy pool. From there it should be trivial to provide the various versions.
@erwango I have noticed that the STM32 RNG driver is also using k_yield()
so it is likely affected by the same issue (although the condition to trigger it are probably different).
@aurel32 Is this issue root caused?
@nashif @aurel32 @cinlyooi-intel @wentongwu could we just initialize the xoroshiro device in POST_KERNEL? AFAIK, we can't properly use kernel APIs before the kernel is initialized....
If I set this to POST_KERNEL, the tests executes successfully.
To Reproduce Steps to reproduce the behavior:
Screenshots or console output
Environment (please complete the following information):