zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.86k stars 6.62k forks source link

tests/crypto/rand32/crypto.rand32.random_hw_xoroshiro: Usage fault "Fatal fault in ISR! Spinning..." #13935

Closed cinlyooi-intel closed 5 years ago

cinlyooi-intel commented 5 years ago

To Reproduce Steps to reproduce the behavior:

  1. mkdir build; cd build
  2. cmake -DBOARD=sam_e70_xplained -DARCH=arm
  3. make
  4. make run

Screenshots or console output

***** USAGE FAULT *****
Unaligned memory access
***** Hardware exception *****
Current thread ID = 0x00000000
Faulting instruction address = 0x403618
Fatal fault in ISR! Spinning...

Environment (please complete the following information):

nashif commented 5 years ago

@cinlyooi-intel Please always specify what tests failed exactly, this project has multiple tests, the reproduction steps are not correct.

Also, -DARCH=arm are not needed.

nashif commented 5 years ago

more debug info:

DEVICE: [00:00:00.000,000] <dbg> mpu.arm_mpu_init: total region count: 16
DEVICE: [00:00:00.000,000] <dbg> mpu._region_init: [0] 0x00400000 0x07020028
DEVICE: [00:00:00.000,000] <dbg> mpu._region_init: [1] 0x20400000 0x110b0022
DEVICE: [00:00:00.000,000] <dbg> mpu._region_init: [2] 0x20440000 0x110b0020
DEVICE: [00:00:00.000,010] <dbg> mpu._arch_configure_dynamic_mpu_regions: configure thread 0x00000000's domain
DEVICE: [00:00:00.000,011] <dbg> mpu._arch_configure_dynamic_mpu_regions: configure domain: 0x00403441
DEVICE: [00:00:00.000,013] <dbg> mpu._arch_configure_dynamic_mpu_regions: configure domain: 0x00403441
DEVICE: ***** USAGE FAULT *****
DEVICE:   Unaligned memory access
DEVICE: ***** Hardware exception *****
DEVICE: Current thread ID = 0x00000000
DEVICE: Faulting instruction address = 0x403692
DEVICE: Fatal fault in ISR! Spinning...
nashif commented 5 years ago

@ioannisg @andrewboie any hints?

ioannisg commented 5 years ago

I cannot reproduce this on nRF52X boards, need a sam e70, which i am getting soon, actually. Perhaps @aurel32 could help with sending a log from e70, as well?

aurel32 commented 5 years ago

I am unable to reproduce the issue. I tried both commit 9c2c115716700da3279fd4ab541f0f598d0c4d6d and the current master with both the SDK 0.10.0 and the Debian toolchain. in all cases the test pass.

cinlyooi-intel commented 5 years ago

While attempting to recheck this issue, I cannot boot if xoroshiro enabled (#14244) .

aurel32 commented 5 years ago

Oh trying harder I realized that the xoroshiro random generator is not the default one when using the above instructions. Manually selecting the option after configuring the test, I am able to reproduce the issue. The crash happens in the MPU code, more precisely there:

        /* Populate internal ARM MPU region configuration structure. */
        region_conf.base = new_region->start;
        _get_region_attr_from_k_mem_partition_info(&region_conf.attr,
  40383a:       e9d1 4300       ldrd    r4, r3, [r1]
        /* in ARMv7-M MPU the base address is not required
         * to determine region attributes
         */
        (void) base;

The compiler considers that *new_region is 64-bit aligned, and try to fetch both start and size with a single instruction.

aurel32 commented 5 years ago

Could it be that we do not respect the 8 bytes stack alignment in an interrupt?

ioannisg commented 5 years ago

Could it be that we do not respect the 8 bytes stack alignment in an interrupt?

This is hard to believe, STKALIGN is set to 1 during initialization, I wrote this myself:

#if defined(CONFIG_STACK_ALIGN_DOUBLE_WORD)
    /* Enforce double-word stack alignment on exception entry
     * for Cortex-M3 and Cortex-M4 (ARMv7-M) MCUs. For the rest
     * of ARM Cortex-M processors this setting is enforced by
     * default and it is not configurable.
     */
#if defined(CONFIG_CPU_CORTEX_M3) || defined(CONFIG_CPU_CORTEX_M4)
    SCB->CCR |= SCB_CCR_STKALIGN_Msk;
#endif
#endif /* CONFIG_STACK_ALIGN_DOUBLE_WORD */

ARM Note for CM7:

In the Cortex-M7 processor CCR.STKALIGN is read-only and has a value of 1. This means that the exception stack frame starting address is always 8-byte aligned.

ioannisg commented 5 years ago

Oh trying harder I realized that the xoroshiro random generator is not the default one when using the above instructions. Manually selecting the option after configuring the test, I am able to reproduce the issue. The crash happens in the MPU code, more precisely there:

        /* Populate internal ARM MPU region configuration structure. */
        region_conf.base = new_region->start;
        _get_region_attr_from_k_mem_partition_info(&region_conf.attr,
  40383a:       e9d1 4300       ldrd    r4, r3, [r1]
        /* in ARMv7-M MPU the base address is not required
         * to determine region attributes
         */
        (void) base;

The compiler considers that *new_region is 64-bit aligned, and try to fetch both start and size with a single instruction.

Might be worth of bringing this up; in ARMv7-M we allow unaligned access, @aurel32. That's why we also don't enable the corresponding UsageFault trap.

ioannisg commented 5 years ago

Interesting documentation here:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0646b/BABFAIGG.html

The Cortex-M7 processor supports unaligned access only for the following instructions:

LDR, LDRT.
LDRH, LDRHT.
LDRSH, LDRSHT.
STR, STRT.
STRH, STRHT.

Now, @aurel32 reports an LDRD here.

I wonder if it is time to enforce aligned access also for ARMv7-M.

aurel32 commented 5 years ago

The alignment constraints are the same for Cortex-M4. I wonder if GCC is right to believe that the structure is 64-bit aligned and we are doing something wrong somewhere or if it is a GCC bug.

ioannisg commented 5 years ago

The alignment constraints are the same for Cortex-M4. I wonder if GCC is right to believe that the structure is 64-bit aligned and we are doing something wrong somewhere or if it is a GCC bug.

This needs investigation, I agree.

However, I would like to stress once more that perhaps it is time to forbid unaligned access in ARMv7-M. This is a recommendation for having v7-M backwards compatible (binary compatibility) with ARMv8-M:

https://developer.arm.com/docs/100235/latest/the-cortexm33-instruction-set/cortexm33-instructions/binary-compatibility-with-other-cortex-processors

Un-aligned access is also forbidden in ARmv6-M.

What do you think?

ioannisg commented 5 years ago

The alignment constraints are the same for Cortex-M4. I wonder if GCC is right to believe that the structure is 64-bit aligned and we are doing something wrong somewhere or if it is a GCC bug.

Digging more into address alignment: ARMv7-M requires address in LDRD, STRD to be word-aligned. Now, I wonder why an address of a stack object (struct k_mem_partition) wouldn't be word-aligned...

ioannisg commented 5 years ago

@aurel32 in the UsageFault handler, could you print the contents of the stacked R1, to see if it is word-aligned or not?

aurel32 commented 5 years ago

It's not even half-word aligned: r1 = 0x403575. This can be seen in the GDB backtrace:

#0 _mpu_configure_region (index=3 '\003', new_region=0x403575 <_isr_wrapper>) at /home/aurel32/git/zephyrproject/zephyr/arch/arm/core/cortex_m/mpu/arm_mpu.c:91

This is an address on the flash, which corresponds to _isr_wrapper with the lower bit set (thumb mode). I therefore guess it comes from a stack overflow somewhere.

Note that the problem doesn't happen on a nucleo l432kc board (i had to force the MPU to be enabled), despite having the same LDRD instruction.

ioannisg commented 5 years ago

Hm, interesting. Could you compile with --no_unaligned_access option? I am curious about the result.

Hm, stack overflows should be caught by mpu. This is not a synchronous MPU/USG fault. So if the memory is corrupted by earlier stack overflow, mpu fault should have triggered.

aurel32 commented 5 years ago

I guess you mean -mno-unaligned-access. In that case I have no output at all, using gdb the code seems to loop in printk related functions. A simple hello world work though.

aurel32 commented 5 years ago

The issue might be located in the SAM entropy driver. The problem disappears if I remove the call to k_yield() in entropy_sam_wait_ready(). Now I am a bit lost, as this call happens from a thread and because the STM32 driver does the same and do not have the issue (or at least I can't reproduce it).

aurel32 commented 5 years ago

There are other fishy things even after disabling the k_yield(). For example if I also enable the shell:

***** Booting Zephyr OS v1.14.0-rc1-1065-g0e7cb91b27b6 *****
Running test suite common_test
===================================================================
starting test - test_rand32
Generating random numbers
PASS - test_rand32
===================================================================
Test suite common_test succeeded
===================================================================
PROJECT EXECUTION SUCCESSFUL

uart:~$ kernel threads 
***** USAGE FAULT *****
  Division by zero
***** Hardware exception *****
Current thread ID = 0x204004f0
Faulting instruction address = 0x403272
Fatal fault in thread 0x204004f0! Aborting.
ASSERTION FAIL [z_spin_lock_valid(l)] @ /home/aurel32/git/zephyrproject/zephyr/include/spinlock.h:76
        Recursive spinlock
***** HARD FAULT *****
  Fault escalation (see below)
***** Hardware exception *****
Current thread ID = 0x204004f0
Faulting instruction address = 0x40153a
Fatal fault in ISR! Spinning...

The division by 0 happens in stack_unused_space_get() because size is 0.

ioannisg commented 5 years ago

The issue might be located in the SAM entropy driver. The problem disappears if I remove the call to k_yield() in entropy_sam_wait_ready(). Now I am a bit lost, as this call happens from a thread and because the STM32 driver does the same and do not have the issue (or at least I can't reproduce it).

This makes more sense to me. The driver might be corrupting the kernel memory inside ISR...

aurel32 commented 5 years ago

This makes more sense to me. The driver might be corrupting the kernel memory inside ISR...

I have disabled userspace and reduced the driver to a simple constant number generator and the shell problem is still there...

aurel32 commented 5 years ago

Skipping the problematic division, I got:

uart:~$ kernel threads
Threads:
*0x20400498 shell_uart
        options: 0x0, priority: 14
        stack size 0, unused 0, usage 0 / 0 (0 %)

 0x20400034 logging   
        options: 0x0, priority: 14
        stack size 0, unused 0, usage 0 / 0 (0 %)

 0x204005ec idle      
        options: 0x1, priority: 15
        stack size 0, unused 0, usage 0 / 0 (0 %)
aurel32 commented 5 years ago

It seems the stack size happened because I didn't run a make clean after disabling the userspace.

aurel32 commented 5 years ago

I am now convinced the issue is due to k_yield() call during the initialization of the Xoroshiro random number generator through the SAM driver. I have replaced the driver by a very simple version:

diff --git a/drivers/entropy/entropy_sam.c b/drivers/entropy/entropy_sam.c
index d4593f3a12..9ea2204888 100644
--- a/drivers/entropy/entropy_sam.c
+++ b/drivers/entropy/entropy_sam.c
@@ -47,6 +47,11 @@ static int entropy_sam_get_entropy(struct device *dev, u8_t *buffer,
 {
        Trng *const trng = DEV_CFG(dev)->regs;

+       //k_yield();
+
+       buffer[0] = 42;
+       return 1;
+
        while (length > 0) {
                size_t to_copy;
                u32_t value;
@@ -72,6 +77,8 @@ static int entropy_sam_init(struct device *dev)
 {
        Trng *const trng = DEV_CFG(dev)->regs;

+       return 0;
+
        /* Enable the user interface clock */
        soc_pmc_peripheral_enable(DT_ENTROPY_SAM_TRNG_PERIPHERAL_ID);

Enabling the k_yield() line causes a crash.

aurel32 commented 5 years ago

It looks to me that the issue is in the SAM driver. The driver should provide a get_entropy_isr method which can be called from ISR and which doesn't call k_yield(). In addition it should respect the ENTROPY_BUSYWAIT flag and not wait if it is not passed.

This makes me realize that the driver is not thread safe. It looks tricky to provide a thread/ISR safe version of the driver without disabling the interrupts between check that entropy is available and the read of that entropy with the current way it is done. I guess the easiest way is to read the entropy in a single place from the TRNG interrupt and putting it in a k_msgq based entropy pool. From there it should be trivial to provide the various versions.

aurel32 commented 5 years ago

@erwango I have noticed that the STM32 RNG driver is also using k_yield() so it is likely affected by the same issue (although the condition to trigger it are probably different).

wentongwu commented 5 years ago

@aurel32 Is this issue root caused?

ioannisg commented 5 years ago

@nashif @aurel32 @cinlyooi-intel @wentongwu could we just initialize the xoroshiro device in POST_KERNEL? AFAIK, we can't properly use kernel APIs before the kernel is initialized....

If I set this to POST_KERNEL, the tests executes successfully.