Consider using MPU region register bank like a TLB

andrewboie commented 5 years ago

Is your enhancement proposal related to a problem? Please describe. Many MPUs have a small number of regions that can be programmed, such as 8. Given boot-time regions, gap-filling for MPUs that don't support region overlaps, and thread stacks, this can lead to a very limited number of free regions available for memory domain partitions.

It's easy to eat up memory domain partitions, for example to fully use newlib requires two dedicated partitions for its globals and malloc arena.

Describe the solution you'd like Consider a design where the MPU bank of regions is instead used like a TLB, with the full set of permissible memory regions managed by software. MPU faults would need to be trapped and the MPU reprogrammed on the fly to reflect the actual memory access policy.

Pros:

No more limits on number of memory regions for thread memory access policy

Cons:

Possibility for thrashing
Difficult to predict latency added to memory accesses which may be unsatisfactory from a hard real-time perspective. May have to pin some regions, at least those involved in memory access while IRQs are locked.

I think this may be worth investigating, although maintaining real-time latency guarantees would be a hard requirement.

Describe alternatives you've considered None that I can think of.

andrewboie commented 5 years ago

Adding this to 2.1 scope. There's a real need for this functionality on devices with only 8 MPU regions

andrewboie commented 5 years ago

Overall design:

Let N equal the number of regions available in the MPU hardware. Some regions will need to be pinned to always be active, as there is not much to be gained by making them ever not active:

The active thread's stack buffer
Program text / ROM
HW Stack overflow guard for supervisor mode

The set of active MPU regions for any given thread needs to be stored in thread->arch. Let us call this set X. On context switch, this set will be quickly programmed into the MPU. This needs to be as fast as possible. Just go down the list and program them all.

We need a per-thread set S which represents a bounded total of the number of "virtual" MPU regions. The maximum number of these can be some large value. This will need to be a data structure that is also stored in the thread struct. Of paramount importance is the lookup function:

/* Return the virtual MPU region corresponding to the provided memory
 * address and access type (read, write, execute)
 */
struct mpu_region *z_thread_mpu_region_find(void *addr, access_type_t access);

I think a good underlying data structure could be a red-black tree, which would store the MPU regions in order by base address. If it returns NULL, that means that S does not contain a memory region covering that address allowing the provided access type.

When memory domains are programmed for a thread, the partitions will be added to S. They will also be added to X until it fills up (no more available regions).

The next piece is the memory fault handler. Upon a memory access exception, the memory fault handler will query z_thread_mpu_region_find() using the faulting memory address and access type provided to the exception by the hardware.

If the region was found, we need to query the eviction function to determine what active MPU region to evict, and then program that region with the region that was found. The exception returns and the faulting instruction is restarted by the CPU, now succeeding.
If the region was not found or didn't have the right access attributes, trigger a fatal error.
We should update some bookkeeping to track how many faults have occurred

The last piece is the eviction algorithm. It will need to know that some regions are pinned and never evicted. So far it does not look like our MPUs set any kind of 'accessed' bit like in many MMU page tables. This limits our options for algorithms:

Random eviction
FIFO eviction
From Wikipedia: "Another example is used by the Linux kernel on ARM. The lack of hardware functionality is made up for by providing two page tables – the processor-native page tables, with neither referenced bits nor dirty bits, and software-maintained page tables with the required bits present. The emulated bits in the software-maintained table are set by page faults. In order to get the page faults, clearing emulated bits in the second table revokes some of the access rights to the corresponding page, which is implemented by altering the native table." We could possibly do something like this, and then implement some marking algorithms like LRU, Second Chance, Clock, etc.

We can do something simple for the initial eviction algorithm and iterate on this later.

Once this is all working we can then implement gap-filling on top of it, with all the gap-filling calculations taking place when the thread is created or the memory domain configuration adjusted, instead of on context switch.

andrewboie commented 5 years ago

I spent some more time thinking about this and also spoke in person with @vonhust . I've come to a few conclusions:

This is not a forward-looking feature. Newer MPUs (ARC MPU v3, ARM v8) have more regions with 16 being typical, where this problem largely goes away.
Reprogramming the MPU based on MPU faults really screws up determinism and latency.
We have limited eviction algorithm options due to the nature of the hardware.

I am thinking we could take a simpler, different approach: instead of reprogramming the MPU based on faults, allow user threads to adjust their memory domain configuration instead, such that the proper partitions can be added and removed at runtime as they are needed:

Track k_mem_partition and k_mem_domain as a kernel object.
Threads can be granted access to k_mem_partition and k_mem_domain like other kernel objects.
Add syscalls for some memory domain APIs. A user thread may add a partition to a domain if it has permission on both the domain and partition objects. A user thread may remove a partition if it has permission on the domain object.

I think this can help for infrequently used partitions. For example, suppose a thread knows that it needs to make some mbedtls calls. It has previously been granted permission on the mbedtls partition, as well as its own memory domain. It may then add the mbedtls partition to its domain, make mbedtls calls, and when finished remove that partition from its domain. Everything is done in a very predictable way, nothing is unexpected.

This is opposed to the current design, where memory domains cannot be adjusted by user threads at all, and some other supervisor thread would need to change the domain configuration; on a practical level a user thread is stuck with the partitions that were in its domain on startup.

I think this would not be very hard to implement, certainly simpler than the TLB idea, it would just need very good documentation and examples.

@ioannisg @ruuddw @vonhust @andyross @wentongwu any comments?

pizi-nordic commented 5 years ago

I think that the fact that we are developing an operating system should be taken under consideration. At the very high level, the OS takes care of the things to release application from low-level duties. And this is the direction we should go. I like the TLB approach, as this is 100% transparent to the App (excluding latencies) and does not force application developer to do a PhD about Zephyr userspace and MPU handling.

The latency could be controlled using the opposite approach to the proposed one: In the big OSes some line of TLB are locked and not taken under consideration during eviction to ensure that the OS response will be possible (there is kernel code mapped all the time) and fast. So instead of forcing the App to dynamically switch the memory areas, we should do something like madvise() in order to allow kernel to receive hints what memory is likely to be accessed.

andrewboie commented 5 years ago

I think that the fact that we are developing an operating system should be taken under consideration. At the very high level, the OS takes care of the things to release application from low-level duties. And this is the direction we should go. I like the TLB approach, as this is 100% transparent to the App (excluding latencies)

No I don't agree. We are creating a Real Time operating system that aims for functional safety certifications. I am vigorously opposed to adding a mechanism which adds random latencies to memory accesses based on MPU faults. It's not bounded or predictable, even with pinned partitions and access hints. We might be able to get it tolerable 99% of the time, but I don't think that is good enough.

does not force application developer to do a PhD about Zephyr userspace and MPU handling.

You are overstating the case here. The developer always needs to know what memory partitions to add, this does not change. Instead of the memory domain partitions being completely fixed wrt user threads, user threads can adjust their active partitions on the fly based on what code they are calling into. And this is really only something that needs to be managed on legacy MPU systems with a small number of MPU regions, if you have 16 or 32 or whatever you can just add them all when the domain is set up and you are done and don't need to worry about this. This is really only a problem for ARMv7 MPU, ARC MPU version 2, etc.

Worth noting also: adding syscalls for memory domain APIs and implementing TLB like semantics aren't mutually exclusive. But I feel the inherent unpredictability makes the latter not a good idea for an RTOS, and the amount of effort required isn't appropriate for a problem that only affects older MPUs.

wentongwu commented 5 years ago

memory domain and memory partition is to make every thread have its own "memory space" as linux process, thread A can't access the memory of thread B. If tracking k_mem_partition and k_mem_domain as a kernel object, that means accessing everyone (e.g. char key[]) belongs to one memory partition have to go though system call, otherwise other threads not given access of the above key can also read and write it, that maybe not efficient. @andrewboie

ruuddw commented 5 years ago

Good discussions. Predictability/determinism would rank first for me, efficiency 2nd, and ease of use 3rd. Having said that, I don't think that an on-demand, exception triggered MPU mechanism is always unpredictable or hampering real-time performance: if there are enough MPU regions to fit everything, no exceptions will trigger. And with some locking mechanism and control over the eviction, timing critical stuff could be locked. What remains is extra blocking time for MPU reprogramming, but that is probably not much worse for fault-triggered reprogramming compared to context switching. In summary, I'd prefer an automated reprogramming approach, but with user control over the replacement algorithm, to control what regions can be used dynamically. And probably explicit evict/load functions. The 'mbedtls' example could dynamically manipulate the eviction policy (mark a region for deletion, and/or use explicitly evict/load functions) to achieve the same.

andrewboie commented 5 years ago

I think I'm coming around to you all's thinking on this. Key thing is that memory domains only control access for user threads, so for example there should be no thrashing at all for any ISR. If we can additionally find a way to implement the MPU/Page fault handling to do the reprogramming in a way that doesn't leave IRQs locked the entire time that would greatly reduce my concerns about trying this.

Unfortunately I'm not personally going to have the bandwidth to work on this ticket for 2.1, I'm going to take this out of 2.1 scope for now and leave the assignee open.

d3zd3z commented 4 years ago

Arm had an implementation of this (I forget what it was called) that could be brought into MBED. As far as I know, at least on typical microcontroller devices, the overhead was quite significant. My kind of gut suspicion is that MCUs that are performant enough for this to work probably also don't have a tiny number of MPU regions. But, it might be worth investigating.

andrewboie commented 4 years ago

Still setting this aside for now, as most of the pain related to limited MPU regions comes from needing multiple partitions for library data: https://github.com/zephyrproject-rtos/zephyr/issues/25891

zephyrbot commented 9 months ago

Hi @dcpleung,

This issue, marked as an Enhancement, was opened a while ago and did not get any traction. It was just assigned to you based on the labels. If you don't consider yourself the right person to address this issue, please re-assing it to the right person.

Please take a moment to review if the issue is still relevant to the project. If it is, please provide feedback and direction on how to move forward. If it is not, has already been addressed, is a duplicate, or is no longer relevant, please close it with a short comment explaining the reason.

@andrewboie you are also encouraged to help moving this issue forward by providing additional information and confirming this request/issue is still relevant to you.

Thanks!

andyross commented 9 months ago

Just to vote that this stay open. I always thought this was a really clever idea.

zephyrproject-rtos / zephyr

Consider using MPU region register bank like a TLB #13074