zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

Improve Arm (And other Arch) Context Swap Performance #79069

Open teburd opened 2 weeks ago

teburd commented 2 weeks ago

Is your enhancement proposal related to a problem? Please describe.

Some benchmarks show Zephyr behind ThreadX in context swap performance.

Describe the solution you'd like

Avoid any branch-link (bl, i.e. function call) operations in PendSV handling on Arm. Other archs could likely implement the same idea.

Every bl op is a potential pipeline flush and certainly some lost context, yet in PendSV handling (used for Arm context swap) we call out to a C function handler almost immediately. Several other bl ops are involved as well, depending on which options are enabled.

ThreadX avoids almost all bl ops, except for an opt-in hook for swap in/swap out. Otherwise it has ~80 asm instructions for PendSV handling. This likely has performance implications somewhere, maybe partially due to the branch out of inline asm. There may be other causes; this needs investigating.

https://github.com/eclipse-threadx/threadx/blob/master/ports_arch/ARMv7-M/threadx/gnu/src/tx_thread_schedule.S#L131

https://github.com/zephyrproject-rtos/zephyr/blob/main/arch/arm/core/cortex_m/swap_helper.S#L56
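To make the "every bl op is a potential pipeline flush" point concrete, here is a rough back-of-the-envelope model, not a measurement. It assumes each out-of-line call costs one branch out (bl) and one branch back (bx lr), and that each branch costs one issue cycle plus a pipeline refill penalty. The refill penalty range and the number of calls are parameters supplied by the caller, not values taken from Zephyr or ThreadX, and the names `cycle_range` and `call_overhead` are made up for this sketch.

```c
#include <assert.h>

/* Lower and upper bound on cycles spent purely on branching. */
struct cycle_range {
    unsigned min;
    unsigned max;
};

/*
 * Estimate branch overhead for `calls` out-of-line function calls.
 * Assumption: each call is one branch out and one branch back, and each
 * branch costs 1 cycle plus a refill penalty in [refill_min, refill_max].
 * Register save/restore around the call is deliberately not modeled.
 */
static struct cycle_range call_overhead(unsigned calls,
                                        unsigned refill_min,
                                        unsigned refill_max)
{
    assert(refill_min <= refill_max);

    struct cycle_range r;
    r.min = calls * 2u * (1u + refill_min);
    r.max = calls * 2u * (1u + refill_max);
    return r;
}
```

For example, with a hypothetical 3 calls on the swap path and an assumed refill penalty of 1 to 3 cycles, `call_overhead(3, 1, 3)` gives a range of 12 to 24 cycles, before counting any caller-saved register traffic. Real numbers would have to come from a cycle counter on target.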

Describe alternatives you've considered

Not doing anything.

Additional context

Benchmark report showing the difference in context swap performance on a Cortex-M4: https://www.dropbox.com/scl/fi/opimwfbvkd9coeprc7d5h/Beningo_RtosPerformance_2024_Report.pdf?rlkey=s3n007s6hgubnj37ovto88bs2&e=3&dl=0

In large part the difference is due to our default use of the MPU for hardware stack protection, but that isn't the only thing playing a part.

stephanosio commented 2 weeks ago

Reminds me of https://github.com/zephyrproject-rtos/zephyr/pull/65071#pullrequestreview-2020028109 ...

andyross commented 1 week ago

I still argue that eliminating PendSV on the context switch path entirely would be even better. Other architectures don't work that way, it's specific to cortex-m.

teburd commented 1 week ago

There may be some evidence to support this idea here: https://gvpress.com/journals/IJSH/vol9_no2/10.pdf

JarmouniA commented 1 week ago

> I still argue that eliminating PendSV on the context switch path entirely would be even better. Other architectures don't work that way, it's specific to cortex-m.

I'm new to this subject, but isn't the benefit of PendSV that it allows minimal context switching code for all Cortex-M CPUs? Also, the fact that it is asynchronous avoids context switching in the middle of an ISR and thus arguably improves ISR handling.

https://developer.arm.com/documentation/107706/0100/System-exceptions/Pended-SVC---PendSV
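The deferral behaviour described above can be sketched as a small host-side simulation. This is a model only, not Zephyr or CMSIS code: the `sim_*` names and the `sim_cpu` struct are hypothetical, and the only hardware detail assumed is that PENDSVSET is bit 28 of the Cortex-M ICSR register. The idea is that a switch request just sets a pending bit, and the switch itself runs only once no ISR is active.

```c
#include <assert.h>
#include <stdint.h>

/* PENDSVSET is bit 28 of the Cortex-M ICSR register. */
#define ICSR_PENDSVSET (1u << 28)

/* Host-side stand-in for the relevant CPU state (hypothetical). */
struct sim_cpu {
    uint32_t icsr;       /* simulated ICSR */
    int isr_depth;       /* number of ISRs currently active */
    int current_thread;  /* thread whose context is on the CPU */
    int next_thread;     /* thread most recently picked by the scheduler */
    int switches;        /* context switches actually performed */
};

/* Request a switch: only records the choice and pends PendSV. */
static void sim_request_switch(struct sim_cpu *cpu, int next)
{
    cpu->next_thread = next;
    cpu->icsr |= ICSR_PENDSVSET;
}

/* PendSV has the lowest priority, so it runs only when no ISR is active.
 * The simulation clears the pending bit by hand; on hardware the pending
 * state is cleared when the exception is entered. */
static void sim_run_pendsv_if_idle(struct sim_cpu *cpu)
{
    if (cpu->isr_depth == 0 && (cpu->icsr & ICSR_PENDSVSET)) {
        cpu->icsr &= ~ICSR_PENDSVSET;
        cpu->current_thread = cpu->next_thread;
        cpu->switches++;
    }
}

static void sim_isr_enter(struct sim_cpu *cpu)
{
    cpu->isr_depth++;
}

static void sim_isr_exit(struct sim_cpu *cpu)
{
    assert(cpu->isr_depth > 0);
    cpu->isr_depth--;
    sim_run_pendsv_if_idle(cpu);
}
```

In this model, two nested ISRs that each request a switch result in a single context switch, performed after the outermost ISR exits. A cooperative switch from thread context goes through the same `sim_request_switch` plus exception path, which is the unification (and the cost) being discussed.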

andyross commented 1 week ago

@JarmouniA it does reduce code size by unifying the "context switch on interrupt exit" and "synchronous/cooperative context switch" cases (by effectively making the latter a trap to an interrupt). But:

  1. As a cost, it pays for a full interrupt entry/exit in the cooperative case, which is how we ended up here.
  2. It's actually something you can do on almost all architectures if you think about it: almost everything has the ability to flag a low-priority interrupt to be handled after the current ISR returns. The fact that no one else does this should be informative[1].
  3. And my personal peeve: it makes context-switch non-atomic. The kernel likes to be able to set some state based on the old/new threads it's selected and then call swap()/switch() to effect that. But that breaks on cortex-m because the actual context switch won't happen until the PendSV exception, which can be preempted by an interrupt! There's some really subtle/horrible code gated under CONFIG_SWAP_NONATOMIC that we have to carry to track this, and I'd love to see it die.

[1] OK, it does need to be mentioned that cortex-m specifically has very lightweight interrupts, something other architectures are much weaker at. But even this falls down when you start turning features on: nested interrupts, MPU/MMU/FPU state handling, stack switching, etc. pollute that pretty badly.
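Point 3 above can be illustrated with a toy model. This is a host-side sketch under simplifying assumptions, not the actual logic behind CONFIG_SWAP_NONATOMIC; `sim_kernel` and the function names are hypothetical. It contrasts a synchronous swap, where the scheduler's bookkeeping and the CPU context change together, with a PendSV-style swap, where an interrupt can land in the window between the two.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy kernel state (hypothetical, for illustration only). */
struct sim_kernel {
    int selected;       /* thread the scheduler has committed to */
    int running;        /* thread whose context is actually on the CPU */
    bool swap_pending;  /* PendSV requested but not yet taken */
};

/* Synchronous switch: bookkeeping and CPU context change in one step. */
static void swap_atomic(struct sim_kernel *k, int next)
{
    k->selected = next;
    k->running = next;
}

/* PendSV-style switch: bookkeeping changes now, the context later. */
static void swap_deferred(struct sim_kernel *k, int next)
{
    k->selected = next;
    k->swap_pending = true;
}

/* The deferred half, run when the PendSV exception is finally taken. */
static void pendsv_handler(struct sim_kernel *k)
{
    if (k->swap_pending) {
        k->running = k->selected;
        k->swap_pending = false;
    }
}

/* What an ISR preempting at this instant would observe. */
static bool isr_sees_consistent_state(const struct sim_kernel *k)
{
    return k->selected == k->running;
}
```

After `swap_atomic()` an interrupt always sees `selected == running`. After `swap_deferred()` and before `pendsv_handler()` runs it does not, and that window is the state the kernel has to track in the PendSV design.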

teburd commented 1 week ago

It’s also potentially faster, given the paper I linked. Seems like a compounding set of reasons to try making a smaller swap with inline asm and no PendSV, avoiding many quirks. Who’s gonna try it?