zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.88k stars 6.63k forks source link

arm: cortex-R & M: CONFIG_USERSPACE: intermittent Memory region write access failures #22275

Closed pderwin closed 4 years ago

pderwin commented 4 years ago

Describe the bug

I am getting intermittent run-time failures such as:

 syscall z_hdlr_k_uptime_get failed check: Memory region 0x0002ff70 (size 8) write access denied
 ***** Hardware exception *****
 Current thread ID = 0x000221b0
 Faulting instruction address = 0x1a
 Fatal fault in thread 0x000221b0! Aborting.

In this example, k_uptime_get() was called, which is going to save its 64-bit return value into memory (due to the syscall macros). The syscall gear is checking to see if the memory address is valid. The routines wind down into mpu_buffer_validate(), which will call is_user_accessible_region() and finally into get_region_ap(). The code is failing because get_region_ap() has a critical code section which is not wrapped with an irq_lock()/irq_unlock(). In this failure, we write the MPU index register, then take an external interrupt, go off to another thread for a while ( which is going to re-load the MPU entries), return to the current thread (again re-loading the MPU entries), and read the attribute register (for the wrong index) and fail.

In reading the current Cortex-M code, I don't see it protecting these critical sections either, so I'm left wondering if the 'M' class will save the index register during interrupt processing. If so, I would like to fix the 'R' code in a similar fashion.

There are several of these utility subroutines with this same critical code sections.

To Reproduce

I saw this on a Cortex-R based system. Ioannis believes the problem exists in an 'M' based system as well.

This is an intermittent failure that will be hard to reproduce.

A unit test might be to have a thread which is doing nothing more than calling k_uptime_get(); This will be calling the Z_SYSCALL_MEMORY_WRITE macro to check the destination buffer address. Asynchronous to this, do something on a timer interrupt to send a message to a higher priority thread. In my exact case, I was sending a message to a user mode work Q. The higher priority thread could just read the message/work-item and wait again. All we need to do is force the context switch to occur. I think this test should fail fairly quickly.

Expected behavior The memory access should have passed. The location I was writing to was valid.

Impact memory access fault. System shuts down

Screenshots or console output n/a

Environment (please complete the following information):

Additional context discussed in devel mailing list #6695

jhedberg commented 4 years ago

Is this reproducible with the latest master branch? Cortex R was not supported by Zephyr 1.14.

andrewboie commented 4 years ago

Raising priority as this appears to affect Cortex M as well.

pderwin commented 4 years ago

Our code base was just updated to V2.1 today, so it will be a day or two before I can move up to that level.

I will go ahead and create the unit test that I was speaking of, and assuming that fails, post that for testing on your M systems.

andrewboie commented 4 years ago

Our code base was just updated to V2.1 today, so it will be a day or two before I can move up to that level.

I will go ahead and create the unit test that I was speaking of, and assuming that fails, post that for testing on your M systems.

Just posted a patch #22281 for Cortex-M.

pderwin commented 4 years ago

I was able to create a small contained testcase that shows a failure on R. I suspect this will fail on M as well. It is attached as phil.c.gz. Sample serial output below.


(15:03:40 169/ 0:00 0:00) uart:~$ idle 135 IDLE task pid: 0x00010fd4 priority: 15 (15:03:40 169/ 0:00 0:00) init_thread_user_mode 52 high priority: 1 (15:03:40 169/ 0:00 0:00) low priority thread in user mode, Calling k_uptime_get() forever my priority: 2 (15:03:40 169/ 0:00 0:00) LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLinit_thread_user_mode 59 begin real test (15:03:45 169/ 0:05 0:05) LLLLLsyscall z_hdlr_k_uptime_get failed check: Memory region 0x000163b8 (size 8) write access denied (15:03:46 169/ 0:06 0:01) Hardware exception (15:03:46 169/ 0:06 0:00) Current thread ID = 0x00010a4c (15:03:46 169/ 0:06 0:00) Faulting instruction address = 0x0 (15:03:46 169/ 0:06 0:00) Fatal fault in thread 0x00010a4c! Aborting. (15:03:46 169/ 0:06 0:00) H (15:03:4`

phil.c.gz

`

andrewboie commented 4 years ago

@pderwin thanks! I'll take this and adapt it to our user mode testing suite.

I talked to @ioannisg earlier, he's going to send a PR which will save/restore RNR when we take interrupts, I've marked my PR DNM meanwhile.

andrewboie commented 4 years ago

@pderwin one thing I noticed in your code:

   k_thread_create(&low_priority_thread,
           low_priority_thread_stack,
           sizeof(low_priority_thread_stack),
           low_priority_thread_entry,
           NULL, NULL, NULL,
           2,
           ( K_INHERIT_PERMS | K_USER ),
           K_NO_WAIT);

Use K_THREAD_STACK_SIZEOF() and not sizeof(). Depending on architecture, there can be extra data within the stack object, aside from the actual buffer.

andrewboie commented 4 years ago

confirmed this affects Cortex-M, I can reproduce with mps2_an385 under QEMU.

andrewboie commented 4 years ago

Also reproduces on mps2_an521

I implemented a torture test inspired by @pderwin see #22288. I found that ARC is broken as well, I opened #22290. x86/x86_64 passes.