xtensa: Enable userspace with Xtensa MMU

ceolin commented 9 months ago

Implement support for userspace on Xtensa using its MMU.

Syscalls are not inline functions like in other architectures because some compiler issues when using multiple registers to pass parameters to the syscall. So here we have a function call so we can use registers as we need.
TLS is not supported by xcc in xtensa and reading PS register is a privileged instruction. So, we have to use threadptr to know if a thread is an user mode thread.
Support double memory mapping

For more details look individual commit messages.

andyross commented 9 months ago

FWIW: Cadence seems to have plugged a few of these holes with XEA3, frankly I think they got ahead of themselves with the memory protection design and had to rework stuff after the fact. But that's still a futures kind of thing for us, all Zephyr hardware planned is still XEA2 AFAIK. (It might be worth checking Linux to see what that port does. IIRC it's doing the full register spill with non-HAL code already, we could probably use that for a starting point.)

ceolin commented 9 months ago

I've been thinking on the syscall entry problem, and I think we actually have a serious problem here. The handler here is doing SPILL_ALL_WINDOWS to write out the userspace window state, which I had fixated on as an inneficiency vs. the CROSS_STACK_CALL that the standard handler does (which leaves the window state in place and arranges the calls to spill naturally via exceptions during the handler).

Yep, we can't let window exception happens with different contents from different contexts.

But... those are BOTH very wrong when you have an untrusted userspace! The window spills work by writing the various 4-register quads into their spill locations found via offsets against the stack pointers also found in those registers. Which is to say, any exception, interrupt or syscall is an instruction to the kernel to write arbitrary userspace-thread-controlled data to arbitrary addresses. Needless to say, we can't do that.

I think I have a few ideas here: for the syscall case, it's actually not so bad because the process can just spill its own registers using the exceptions (although it can't do it with the assembly we use for SPILL_ALL_WINDOWS: ROTW is a priviledged instruction, we'd have to arrange to do it with callx12 et. al.) and then wipe the WINDOWSTART bits from kernel mode knowing that the registers are clear.

The interrupt/exception case is hardest, since we can't rely on that. We'd need to walk the registers manually and validate all the stack pointers ourselves, by testing directly against the page tables (or MPU table) I guess, or maybe just validating against the thread's stack extents if we want to cheat (although that would break if we have userspace wanting to do things like green threads).

Which... brings up another idea I've had bouncing around in my head for a long time: what if we just give up and build using the CALL0 ABI (i.e. just use a standard 16 register set and don't do any window nonsense) instead? It eliminates this problem entirely. And in fact there are significant advantages, because you can use the extra registers for e.g. interrupted frames (i.e. just ROTW on interrupt/exception entry and you don't need to do any context switch!) or as "cached" context switch handles (i.e. the last thread or two would still be "stored" in registers waiting to resume).

Entirely remove window abi and using call0 abi everywhere ? Even ask compiler to follow this ABI ?

As mentioned, I have a MPU rig more or less waiting to play with, so I think we can parallelize as I don't have MTL hardware (though I do have the toolchain -- do you have a recipe for running this against xt-sim[1]?)

I was using an emulator built on top of xt-sim, so it is possible. But there are things that it does not emulates well, the cache aliasing issue just happened in the hw.

(Sticking a -1 on this just to prevent anyone from merging it accidentally while we work out the security design.)

[1] Another digression is that we don't have a qemu emulator working for either the MPU or MMU. Has anyone looked at that? It would be pretty annoying if the only emulation environment for this code was either hardware or a proprietary simulator. We really need CI validation of userspace if we want to ship it.

Qemu MMU implementation is quite limited (I don't remember all limitations, but auto-refill is one thing that is not there)

jcmvbkbc commented 9 months ago

Qemu MMU implementation is quite limited (I don't remember all limitations, but auto-refill is one thing that is not there)

@ceolin This is not true. QEMU MMU implementation is rather full, and specifically auto-refill is there, otherwise Xtensa linux would not work in QEMU. Please let me know if you have more questions regarding xtensa support in QEMU.

jcmvbkbc commented 9 months ago

I was using an emulator built on top of xt-sim, so it is possible. But there are things that it does not emulates well, the cache aliasing issue just happened in the hw.

@ceolin xt-sim is capable of precise hardware emulation, just run it without the --turbo option.

ceolin commented 9 months ago

Qemu MMU implementation is quite limited (I don't remember all limitations, but auto-refill is one thing that is not there)

@ceolin This is not true. QEMU MMU implementation is rather full, and specifically auto-refill is there, otherwise Xtensa linux would not work in QEMU. Please let me know if you have more questions regarding xtensa support in QEMU.

@jcmvbkbc I meant the sample controller target (the one we have).

File: zephyr/soc/sample_controller/xtensa/config/core-isa.h
 617  24 #define XCHAL_HAVE_PTP_MMU     0   /* full MMU (with page table
 619  24                           usable for an MMU-based OS */
 623  15 #define XCHAL_MMU_ASID_BITS        0   /* number of bits in ASIDs */
 624  15 #define XCHAL_MMU_RINGS            1   /* number of rings (1..4) */
 625  15 #define XCHAL_MMU_RING_BITS        0   /* num of bits in RING field */

Please if there is another target that we can use I would be more than happy to check.

ceolin commented 9 months ago

I was using an emulator built on top of xt-sim, so it is possible. But there are things that it does not emulates well, the cache aliasing issue just happened in the hw.

@ceolin xt-sim is capable of precise hardware emulation, just run it without the --turbo option.

Good to know, will try it out. It was really annoying figure out this problem late.

jcmvbkbc commented 9 months ago

I meant the sample controller target (the one we have).

That core does not have the MMU feature configured, it only has the region protection.

if there is another target that we can use I would be more than happy to check.

dc233c has been the default MMU-capable core for the xtensa linux, the configuration overlay for it is available e.g. here.

ceolin commented 9 months ago

Hi @andyross, I've addressed some comments and replied for many others, could you take another look please ? Specially the comment regarding re-using the same L1 handler.

andyross commented 9 months ago

dc233c has been the default MMU-capable core for the xtensa linux,

OK, excellent. For sure we want to get that integrated to the toolchain and a platform added for it. FWIW, for anyone interested in picking that up: note that we don't want to eliminate sample_controller, as that's a good exerciser for the XCHAL_NUM_AREGS=32 case, which is otherwise a somewhat obscure configuration (though NXP and one of the AMD DSPs do that, per my local collection of core-isa.h files).

Though the more I think about this the more I really would like to see numbers on the CALL0 idea if I can find some way to get them cleanly. Register windows only really help in the "deep leafy" call tree case, but all the performance cases we're looking at for DSP work tend to be flat compute work in single stack frames with lots of inlining and[1] lots of use of SIMD registers that notably don't even have windowing![2].

Having a "userspace needs CALL0" rule really seems like not so much a hardship to me...

[1] In a handful of backend cases, though frustratingly not in SOF itself yet due to issues with toolchain availability.

[2] i.e. what's the point of having clean lazy spill/fill of your GPRs when most of your data is in SIMD registers that the ABI is going to make you spill anyway

andyross commented 9 months ago

Coming back. I... really don't think we can merge this with a giant security hole. That's just not being responsible. I know (or think) we still have some weasel wording in the userspace docs about not trusting it as a security boundary, so maybe there's some flexibility there. But we all know that's not how people interpret it and for sure that's not where this particular userspace implementation is aimed.

And, less firmly, we really "shouldn't" merge it without a mechanism for exercising the code in community work and in particular in the project CI. We know this is going to need significant rework (not the TLB handling per se, but all the entry/exit/context code which involves the MMU), and that becomes impossible if the only people who can do that are working on pre-production hardware for one vendor[1]. I mean, consider that this merges and then I start work on CALL0 arch variant and/or a full-window-spill context layer (something I've kinda/sorta already started); now I really can't without breaking things, because I can't run the code I need to modify.

Bottom line: this just isn't finished and can't be used in the form it's in right now. Let's hold off and get it right before merge.

As a compromise, maybe we can prune out a utility layer to the TLB/MMU handling such that you can exercise it with the mem_domain/partition code, but leave all the thread lifecycle stuff unchanged? That actually might make for good cleanup regardless.

[1] Though FWIW: I have a MTL board on the way now for unrelated work, so for me personally this looks like it may resolve itself. But the principle remains the same: we're an open source project and shouldn't be rushing to merge stuff from hardware vendors that can't be exercised in the community.

nashif commented 9 months ago

dc233c has been the default MMU-capable core for the xtensa linux,

OK, excellent. For sure we want to get that integrated to the toolchain and a platform added for it. FWIW, for anyone interested in picking that up: note that we don't want to eliminate sample_controller, as that's a good exerciser for the XCHAL_NUM_AREGS=32 case, which is otherwise a somewhat obscure configuration (though NXP and one of the AMD DSPs do that, per my local collection of core-isa.h files).

yes, we need a new target with that, leave everything else as it is and introduce qemu_xtensa_mmu or whatever.

And, less firmly, we really "shouldn't" merge it without a mechanism for exercising the code in community work and in particular in the project CI.

We did introduce ARC userspace support in the past without CI (f81dee0b2b87), actually the architecture was maintained with any means to do anything in CI until Qemu was added later. This comes with its own issues and we had a few of those because of lack of any CI, however, we currently have Xtensa MMU in the tree, also without CI, userspace is on top of that and userspace itself and in general is covered by other architectures. Doing this will need to come with a commitment to run some CI, somewhere and address issues similar to how we do anything else without CI coverage (and we have plenty of HW features not covered in CI).

[1] Though FWIW: I have a MTL board on the way now for unrelated work, so for me personally this looks like it may resolve itself. But the principle remains the same: we're an open source project and shouldn't be rushing to merge stuff from hardware vendors that can't be exercised in the community.

Not sure his will help, MTL ADAP HW does not have MMU support enabled.

jcmvbkbc commented 9 months ago

we don't want to eliminate sample_controller, as that's a good exerciser for the XCHAL_NUM_AREGS=32 case

dc233c also has 32 windowed registers.

I really would like to see numbers on the CALL0 idea if I can find some way to get them cleanly.

FWIW a while ago I've measured linux kernel boot times (specifically timestamp of the message "Run /sbin/init as init process" that marks start of the userspace) on esp32s3 for the kernels built with windowed and call0 ABIs. Got 3.164921s for the call0 and 2.798135s for the windowed ABI, all other settings being the same.

Register windows only really help in the "deep leafy" call tree case,

My observation is that generally windowed ABI generates smaller code (above mentioned kernels were 1803172 bytes for call0 and 1595820 bytes for windowed ABI, that's more than 10% difference).

andyross commented 9 months ago

[Max's numbers showing CALL0 Linux builds having a 13% penalty on initialization time and 10% on code size]

So... actually I'm thinking that doesn't look so bad at all? :)

I mean, kernel boot is basically pessimal for CALL0: it's a huge tree of rapid calls being done all at once all across every subsystem and driver in the image. By definition, it's not spending time doing "compute", it's not touching HiFi registers at all, it's not in hot loops, it's showing very poor locality of reference. And... just 13%!

Did you get numbers on syscall performance or interrupt latency? That's where a CALL0 scheme is going to shine. (I'll straight up say that I'm salivating at the idea of interrupt entry just doing the equivalent of "rotw 4; mov a1, _isr_stack; call0 _isr_handler_in_c"!).

And FWIW, the code size issue is surely real, but in a uC/DSP environment it needs to be weighed against the much smaller context code in the kernel needed to support it. We're spending like 2kb or something like that on this stuff (c.f. this PR that had to fork off an entirely separate entry path) that would shrink very significantly.

Also we'd almost certainly be deploying CALL0 only on userspace-enabled apps, where we're already committing to significant excess memory usage due to all the page alignment. 10% on the .text segment isn't that big a deal in comparison.

nashif commented 9 months ago

we don't want to eliminate sample_controller, as that's a good exerciser for the XCHAL_NUM_AREGS=32 case

dc233c also has 32 windowed registers.

FWIW, someone is already looking into enabling this with the right toolchain. Let's see how this will work out.

dcpleung commented 9 months ago

@andyross how soon do you think you can get call0 enabled on current main? Call0 would probably simplify the interrupt handling since we don't have to worry about switching stack and how registers are spilled.

andyross commented 9 months ago

I'm really hoping to have something prototyped this weekend? :) With.... like a metric ton of salt applied for all the unknowns?

We could try to parallelize too. The "modify the existing code to save/restore all 64 GPRs if the faulting/interrupted process is userspace" isn't too outrageously hard, though naively it would end up with a lot of duplicated code paths (i.e. you couldn't use the process stack and would need space in the thread struct, nested interrupts would need the old code, surely more complexities).

jcmvbkbc commented 9 months ago

Did you get numbers on syscall performance or interrupt latency?

I didn't. Given how heavy linux kernel entry/exit is I didn't expect to see significant difference because of the number of saved registers.

ceolin commented 9 months ago

@andyross how soon do you think you can get call0 enabled on current main? Call0 would probably simplify the interrupt handling since we don't have to worry about switching stack and how registers are spilled.

We would still need to switch stack, what we will probably not need is that cross stack call and worrying about spilling registers, there are other few bits that need to change like not enable window overflow exceptions that is done in many places, but it is minor. The major work is in the interruption handler and context switch.

@andyross I will be out for two weeks but will try to catch up if you send changes to use call0 ABI, otherwise I will look it when I come back.

andyross commented 9 months ago

Those interested can check out very early work on call0 at https://github.com/andyross/zephyr/tree/xtensa-win0

It's coming together now. The code is complete (modulo a bunch of minor features that are still stubbed) and it's running hello_world (which requires a context switch) and traps a divide by zero (so exception entry works). I figure it's above 50% coverage now. Indeed it's a LOT simpler than asm2 has become (all the assembly lives in one 460 line file). No numbers on size/performance yet, but it's looking really not so bad. I'm particularly pleased with some areas, like system call entry[1] which works in (I swear I'm not joking) exactly 45 instructions from (and including) the syscall trap to the first instruction of the C handler. And that includes 9 instructions required to wipe unused registers on return (to hide kernel data) that might be elidable on audio DSPs where data security isn't part of the threat model.

[1] Which is actually mocked right now as I don't have a userspace device to test! I hacked up a table of function pointers as a "system call" table and stubbed out some CONFIG_USERSPACE #ifdefs to test it. Seemed to work (maybe), but I'm eager to get it up on the qemu userspace rig.

dcpleung commented 9 months ago

Those interested can check out very early work on call0 at https://github.com/andyross/zephyr/tree/xtensa-win0

Errr... toolchain complains that libgcc.a is compiled for windowed ABI. Toolchain will need to support both (via multilib) or we will need to have two separate toolchains for one target.

andyross commented 9 months ago

Errr... toolchain complains that libgcc.a is compiled for windowed ABI.

See details in #62117 , as of right now you don't need to use a separate toolchain, but you do need to build one. Ideally we'd find some kind of multilib configuration in the linker, but we can do it ourselves with cmake hackery too. I'm wide open to suggestions, SDK configuration isn't really my area of expertise.

ceolin commented 8 months ago

@andyross new series with userspace enabled in dc233c (qemu_xtensa_mmu) target. Would you mind take a look again ? I'd like to propose to have this implementation in (possibly documenting the problem with spilling register in usrespace provided memory) and after that focus in the call0 ABI. Now we have a way to test the implementation in CI and it should bring some sort of confidence ...

andyross commented 8 months ago

That seems a little questionable: we can't actually ship this userspace implementation as-is, at least not with current security models (i.e. it provides memory separation but no security). Can we really not unify the two PRs? This introduces a ton of changes that will need to be essentially reverted. I really thought @dcpleung had it mostly put together already?

Tell you what, give me until early next week to see if I can do the unification myself. But if I can't do it then there's no particular reason to hold this up from further testing; we should just hide it behind a CONFIG_XTENSA_INSECURE_USERSPACE=y setting or something so no one gets confused.

dcpleung commented 8 months ago

I haven't been able to get CALL0 and userspace to work. I hit a store privilege error at start of xtensa_excint1_c on the L2 page table array which is weird since the array is mapped. I am definitely missing something. The latest I have is @ https://github.com/dcpleung/zephyr/tree/xtensa/call0_userspace

andyross commented 8 months ago

OK, challenge accepted; I'll see what I can do this week. FWIW, we should absolutely get to some kind of at least "demonstration quality example" of a userspace syscall entry and exit using call0 before this merges. It's one thing to say "this other feature is required but not ready" and quite another to have something in the tree we can't ship even theoretically.

ceolin commented 7 months ago

That seems a little questionable: we can't actually ship this userspace implementation as-is, at least not with current security models (i.e. it provides memory separation but no security). Can we really not unify the two PRs? This introduces a ton of changes that will need to be essentially reverted. I really thought @dcpleung had it mostly put together already?

That is true at certain level. We are talking about something marked as experimental, it should be acceptable. Put in another perspective, we have other areas were we didn't assign vulnerabilities because the code was considered experimental.

Tell you what, give me until early next week to see if I can do the unification myself. But if I can't do it then there's no particular reason to hold this up from further testing; we should just hide it behind a CONFIG_XTENSA_INSECURE_USERSPACE=y setting or something so no one gets confused.

ceolin commented 7 months ago

OK, challenge accepted; I'll see what I can do this week. FWIW, we should absolutely get to some kind of at least "demonstration quality example" of a userspace syscall entry and exit using call0 before this merges. It's one thing to say "this other feature is required but not ready" and quite another to have something in the tree we can't ship even theoretically.

I disagree ... there are different ways to see it. First, code can always be removed later, we are not blocking incomplete / experimental code to get in because of future changes. Also, nothing block us to have an implementation that contains some constraints (obviously well documented and with all possible alerts). Just to be clear, I am NOT, AT ALL, saying that is what should be done or the plan, period. Just saying that is arguable.

andyross commented 7 months ago

Just to update: I had a good run over the weekend and think I have a prototype working (it traps out the the call0 state and for sure lands in the handler, though there's plenty of wiring left to do).

And I think I see how this should work: port/modify the lowest-level MMU code (the entry handlers, a "swap page tables" call, the syscall wrappers) on top of the call0 PR. Rework the content here so the higher level management stuff works in terms of those, and drop this on top. Really shouldn't be hard, I don't think. Give me a few more days and maybe one more weekend. I'm just wary of having to merge all these changes to the old asm2 code just to have to back them out later.

andyross commented 7 months ago

Oh, and I realized that AFAICT there's no cache emulation in qemu, so the ASID handling really can't be tested outside of xt-sim with the appropriate configuration. So that will require some back and forth with you guys, or else stub with a complete cache flush on table changes.

ceolin commented 7 months ago

Oh, and I realized that AFAICT there's no cache emulation in qemu, so the ASID handling really can't be tested outside of xt-sim with the appropriate configuration. So that will require some back and forth with you guys, or else stub with a complete cache flush on table changes.

I will take a look.

jennifersmith94538 commented 7 months ago

there's no cache emulation in qemu, so the ASID handling really can't be tested

ASIDs are not really related to cache, they can be tested even without it. Managing them incorrectly may result in getting TLB multi hit exceptions.

stub with a complete cache flush on table changes

Why? xtensa caches are physically tagged. In the absence of cache aliasing there's no reason to flush caches on context switch. In the presence of aliasing caches flushing is needed for physical pages whose color is about to change.

andyross commented 7 months ago

Ah, OK. I swear I had it in my head that the L1 on Xtensa cores was VIVT with an ASID tag? Thanks, that's the kind of expertise that is a little lacking in this effort.

ceolin commented 7 months ago

there's no cache emulation in qemu, so the ASID handling really can't be tested

ASIDs are not really related to cache, they can be tested even without it. Managing them incorrectly may result in getting TLB multi hit exceptions.

It is not related indeed, but you have to be careful with pages in the cache and auto-refill

stub with a complete cache flush on table changes

Why? xtensa caches are physically tagged. In the absence of cache aliasing there's no reason to flush caches on context switch. In the presence of aliasing caches flushing is needed for physical pages whose color is about to change.

Yep, the hw we are using has cache aliasing and since we are not coloring the pages the safest thing is to flush it.

jennifersmith94538 commented 7 months ago

since we are not coloring the pages the safest thing is to flush it

that's kind of defeating the purpose of having the big cache? also it's hard to imagine a situation when a lot of pages need to change color between context switches. even the mappings shared between processes can be easily kept coherent by placing them at addresses that are multiple of a way size.

ceolin commented 7 months ago

since we are not coloring the pages the safest thing is to flush it

that's kind of defeating the purpose of having the big cache? also it's hard to imagine a situation when a lot of pages need to change color between context switches. even the mappings shared between processes can be easily kept coherent by placing them at addresses that are multiple of a way size.

Don't disagree, but that is something that can be done later. Initially, should be only the page table itself that are mapped in two different virtual address, but then you have MMIO and possible other things.

Anyway, patches are welcome :)

andyross commented 7 months ago

OK... I think I'm seeing this a little more clearly. @jennifersmith94538 please correct whatever's wrong here. But this is another sorta deep problem I think:

The TLB refill fetches are downstream of the dcache on the local CPU (which is a little surprising as the TLB is itself a cache and this ends up with the same data stored twice on the CPU).
The TLB refill fetches will also populate the dcache like any other load
The dcache is local to the current CPU, and we're an SMP platform
The Zephyr page table management code works on physical addresses for the pages in the page table, and not the virtual one, leading to an aliasing issue.

Add all that up and... we're missing bits:

Our current flush at context switch time is an artifact of the aliasing issue above (any page table changes made were to the "wrong address" from the perspective of the MMU which will read the same data from a different address). We need to be flushing a table modification time.
Even then, that's not sufficient as on SMP platforms we can't guarantee that the same mappings won't be in use on another CPU. We need a mechanism to lock out memory_partition users on other CPUs while we muck with mappings and guarantee that the partitions are unused until they're flushed and ready[1].

I'll see if I can work a reasonable set of choices in. The page table code is actually in-tree already, we just need to add some flush calls to it.

[1] @peter-mitsis has some work in progress for a generic-ish "get this other thread to halt on all CPUs and block until it does" tool. I keep being picky about it, but the problem is real and this would be a good place to use it.

andyross commented 7 months ago

Forgot a third point:

We need a mechanism for telling other CPUs to flush their page table dcache regions on change, maybe with an optimization to specify specific ranges or whatever. They might have data sitting around in their dcache that will shadow anything we've flushed out underneath it.

nashif commented 7 months ago

@jennifersmith94538 you work for cadence?

andyross commented 7 months ago

Actually, reading through the ISA Reference as carefully as I can (which I gotta be honest isn't super clear about this stuff) the root of the cache interactions here is actually not special CPU behavior at all? It's just that we have the page tables mapped (twice) as cached memory. The CPU populates the dcache on a TLB refill fetch because the TLB entry controlling that page table memory (itself part of the page tables in memory!) told it to. We should be mapping this uncached, both for simplicity/correctness and performance (to avoid polluting the dcache with TLB entries that are supposed to be cached in the TLB)

jennifersmith94538 commented 7 months ago

The TLB refill fetches are downstream of the dcache on the local CPU

Don't know what that means. I'm sure that TLB refill fetches use the dcache as all other fetches do, as you point later.

The dcache is local to the current CPU, and we're an SMP platform

I'm sure that's one possible implementation, but not the only. The processor data book should have the details.

I agree with all other points.

jennifersmith94538 commented 7 months ago

@jennifersmith94538 you work for cadence?

I do not.

ceolin commented 7 months ago

Added a new Kconfig symbol and a build warning to alert about the spilling issue.

andyross commented 7 months ago

FWIW, still working on my end with a call0 integration, which is starting to subsume more of the core stuff. Didn't have as much time over the weekend as I'd like but still really hoping to have something this week.

andyross commented 7 months ago

let's get this in with the disclaimer that it is experimental

Please don't. I'm really very close, and it's a ton of reversion needed to do this incrementally. It's the start of a release cycle, no need to rush.

andyross commented 7 months ago

Among other things, I have ASID integration working (er, "working") now that avoids the need to do any TLB invalidation or cache flushing at all on the context switch path.

andyross commented 7 months ago

OK, I think I'm there. After three (and a half, sorta) rewrites, I have a MMU integration that is mostly in its final form now. See the call0 PR which now contains it at https://github.com/zephyrproject-rtos/zephyr/pull/62117

This is enough to boot the MMU platform with full page table, includes a complete API-compliant syscall implementation (wired up through the same mocking layer and hello_world rig as before), supports clean ASID-based context switching without needs for TLB or cache flushing (in principle the overhead is as low as 6 instructions to switch memory spaces!).

Now I'm going to sit down and (for the second and a half time) fork lift the userspace changes on top of it. Shouldn't take too long, the API between the layers is only three functions. But I'm sure that all the edge cases will take a while to stabilize.

Also please review (@jcmvbkbc especially) the README-MMU.rst file in that PR to make sure it's clear and not incorrect. Getting clued in on this hardware is sort of a painful journey and I'm hoping to leave some better breadcrumbs for posterity.

nashif commented 7 months ago

Please don't. I'm really very close, and it's a ton of reversion needed to do this incrementally. It's the start of a release cycle, no need to rush.

@andyross https://github.com/zephyrproject-rtos/zephyr/pull/62117 is a complete overhaul of xtensa it seems and anything but incremental, i.e. adding new calling conventions, reimplementing MMU and trying to add userspace, caching and TLB all at the same time.

Welcoming the improvements and new things being added, but we all know how such an effort will go when we try to do too many things at the same time. It will take a lot of time and lots of iterations and we will be stuck for a while.

I suggest to proceed incrementaly and get changes in one by one instead of trying to redesign xtensa architecture in one single step.

I still think we should get this PR in. Getting this right now where userspace already works will also give us a baseline and we should be able to compare and monitor KPIs and performance changes if any.

andyross commented 7 months ago

Right, but... we just can't keep the asm2/window-based code. Can we all please work together on just one version that meets all requirements? I'm not trying to rewrite stuff, I swear I'm not (really I thought you guys were going to do the MMU rework). But we have to have code that's going to be shippable, we all agree on that part, right?

What's the plan for what happens if we merge this? Doesn't all the same stuff have to happen, just with more work on everyone's part?

ceolin commented 7 months ago

Right, but... we just can't keep the asm2/window-based code. Can we all please work together on just one version that meets all requirements? I'm not trying to rewrite stuff, I swear I'm not (really I thought you guys were going to do the MMU rework). But we have to have code that's going to be shippable, we all agree on that part, right?

What's the plan for what happens if we merge this? Doesn't all the same stuff have to happen, just with more work on everyone's part?

Not necessarily ... yep, there is a problem where a malicious user thread tampering call stack can cause a kernel issue, but if the target trust the application and wants to catch programming mistakes and provide memory isolation that is a valid implementation. For a certain period we can have two implementations living side-by-side.

You are talking about re-work, but actually migrating everything to call0 ABI also needs more work beyond this qemu target. That's said, that other pr is far from finished, having some syscalls working is very different from having the whole test suite passing.

andyross commented 7 months ago

Why is this turning into a fight? It wasn't supposed to be a fight. I thought there was consensus on the call0 path months ago, no? If there's not, what's the plan?

Not necessarily ... yep, there is a problem where a malicious user thread tampering call stack can cause a kernel issue, but if the target trust the application and wants to catch programming mistakes and provide memory isolation that is a valid implementation.

Yeah, I gotta stop that right there. I can guarantee that's not how this feature was sold to partners, it's supposed to be a security feature. It doesn't provide security.

And we have a path to a solution that does, why aren't we just working at getting that finished? Why the rush at merging the incomplete one? Hardware to do this is still way out in the future, let's take the time to get this right.

zephyrproject-rtos / zephyr

xtensa: Enable userspace with Xtensa MMU #61303