jharris-intel opened this issue 3 years ago
Also see https://github.com/zephyrproject-rtos/zephyr/pull/32557#discussion_r583292495 for some context.
@carlocaione - you may be interested in this (you seem to be interested in AArch64 improvements)
Hi @carlocaione,
This issue, marked as an Enhancement, was opened a while ago and did not get any traction. It was just assigned to you based on the labels. If you don't consider yourself the right person to address this issue, please re-assign it to the right person.
Please take a moment to review if the issue is still relevant to the project. If it is, please provide feedback and direction on how to move forward. If it is not, has already been addressed, is a duplicate, or is no longer relevant, please close it with a short comment explaining the reason.
@jharris-intel you are also encouraged to help move this issue forward by providing additional information and confirming this request/issue is still relevant to you.
Thanks!
**Is your enhancement proposal related to a problem? Please describe.**
The current spinlock implementation on AArch64 SMP leaves a lot of performance on the floor.
**Describe the solution you'd like**
As spinlock performance can quickly become a bottleneck in SMP, I would like to open a discussion on how to improve spinlock performance on AArch64 SMP.
I have collected a fair bit of data on the subject locally, using a microbenchmark on a bare-metal quad-core A53 system. Some setup:
- Cores not exercising the lock sit in a `while (1) { wfe; }` loop (again, with interrupts disabled; I'd use `wfi` instead of `wfe`, but `wfi` wakes for physical interrupts regardless of if the interrupts are masked).
- The core(s) under test run a tight loop of `k_spin_lock` / increment a 64-bit variable in memory / `k_spin_unlock` (a sketch of this loop is below).

(I'm going to be adding more later here as comments; this is just a start.)
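For concreteness, here is a minimal sketch of that measured loop (my illustration, not the author's actual benchmark harness):

```c
#include <zephyr/kernel.h>

static struct k_spinlock lock;
static volatile uint64_t counter;  /* the shared 64-bit variable */

/* Hot loop on the core under test: lock, bump the counter, unlock. */
void bench_loop(void)
{
	while (1) {
		k_spinlock_key_t key = k_spin_lock(&lock);

		counter++;
		k_spin_unlock(&lock, key);
	}
}
```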
Baseline is what is on trunk (roughly; this branch does have differences from trunk to enable this to work at all, but the code that the test is exercising is what's on trunk).
SeqRel is the following patch to both `k_spin_unlock` and `k_spin_release`:
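A minimal sketch of the idea, assuming GCC's `__atomic` builtins and the SMP `struct k_spinlock` layout with a `locked` word (this is an illustration, not the actual diff): downgrade the unlock path from a full-barrier atomic to a release-only store.

```c
#include <zephyr/spinlock.h>

/* Sketch only (hypothetical helper, not the original patch): release
 * the lock word with release ordering instead of the full-barrier
 * atomic_clear().
 */
static ALWAYS_INLINE void spin_store_release(struct k_spinlock *l)
{
	__atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}
```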
(Which obviously isn't the "proper" way to do it, but hopefully suffices for a quick test.)
Prefetch is SeqRel + adding the ARM-recommended prefetch before taking the spinlock. (Note: going in, I suspect this won't matter much on the A53, but it may on larger cores.)
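By way of illustration (my sketch, not the author's exact change), the ARM ARM's lock litmus tests prefetch the lock word with store intent before the exclusive sequence:

```c
#include <zephyr/spinlock.h>

/* Sketch: prefetch the lock word into L1 with store intent
 * ("pstl1keep") before the exclusive load/store pair that takes
 * the lock.
 */
static ALWAYS_INLINE void spin_prefetch(struct k_spinlock *l)
{
	__asm__ volatile("prfm pstl1keep, %0" : : "Q" (l->locked));
}
```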
(Again, obviously non-portable as-is, but suffices for this.)
Spoiler: I'm not going to bother going through all of the cases here, because both the single-core (1) and four-core (4C) cases are slower than without the prefetch (although still faster than baseline). Dead end (for the A53; it may be better on other cores), so moving on.
Litmus is SeqRel + replacing the CAS in lock with this monstrosity taken from the ARMv8-A ARM:
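A reconstruction sketch of that sequence (the ARMv8-A ARM's WFE-based acquire loop, minus the preload; the inline-asm wrapper, register choices, and 64-bit `locked` word are my assumptions):

```c
#include <zephyr/spinlock.h>

/* Sketch of the ARM ARM's WFE-based lock acquire. Note that val is
 * reused for two things: the loaded lock value and the STXR status.
 */
static ALWAYS_INLINE void spin_acquire_wfe(struct k_spinlock *l)
{
	unsigned long val;

	__asm__ volatile(
		"	sevl\n"			/* make the first wfe fall through */
		"1:	wfe\n"			/* sleep until a monitor event */
		"2:	ldaxr	%0, [%1]\n"	/* load-acquire exclusive */
		"	cbnz	%0, 1b\n"	/* lock held: back to wfe */
		"	stxr	%w0, %2, [%1]\n"	/* try to claim the lock */
		"	cbnz	%0, 2b\n"	/* lost the exclusive: retry */
		: "=&r" (val)
		: "r" (&l->locked), "r" (1UL)
		: "memory");
}
```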
(Which is pretty much verbatim what ARM recommends here, with the exception of dropping the preload.) (Yes, this does reuse `val` for two different things here.) (I'm personally not convinced that the SEVL is worth it here over an unconditional branch, but meh.) (Interestingly, even though we're using WFE here we don't need a corresponding SEV: ARMv8 changed it so that clearing a global monitor sends an event, so another core writing to `locked` will wake us up.)

The main advantage of this is that cores waiting for the spinlock generate substantially less memory traffic while doing so. It also comes with two major wrinkles.
The major wrinkles are as follows:

- This doesn't map onto any existing atomic API; you'd need something like a `blocking_cas` or somesuch (a possible shape is sketched after this list). (You can get close by adding a weak CAS + a wfe hint API, but even that has issues.)
- It weakens the lock's memory-ordering guarantees, which I'll get into below.
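A hypothetical shape for that API (the name and signature are placeholders, not an existing Zephyr call):

```c
#include <zephyr/sys/atomic.h>

/* Hypothetical: like atomic_cas(), but on failure waits for the
 * location to change (e.g. via wfe on AArch64) instead of returning
 * immediately so the caller can spin.
 */
bool atomic_blocking_cas(atomic_t *target, atomic_val_t old_value,
			 atomic_val_t new_value);
```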
Comparison:

(So e.g. two contended cores trying to take a spinlock with the spin release patch each take ~192 cycles to do a lock + unlock.)
All told, just the SeqRel change ends up at about 140% of baseline performance when unloaded, dropping to 110-120% of baseline when loaded.
(Note that even in the best case, N cores would take about N times the cycles to grab a single contended spinlock.)
And the Litmus version is ~1.5x the speed unloaded, increasing (!) to ~2x baseline when loaded.
So, what's the catch? There are two: one less major, one major.
So let's talk about the memory model for a moment. (You may wish to look at #32133 for some background.)
The current spinlocks are documented as a full memory barrier. This optimized lock has `k_spin_lock` function as an acquire barrier only, and `k_spin_unlock`/`k_spin_release` function as a release barrier only. Or, translated somewhat: accesses inside the critical section still cannot escape it, but accesses outside the critical section may now be reordered into it.
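An illustration of the weaker contract at a call site (my sketch, not from the issue):

```c
#include <zephyr/kernel.h>

static struct k_spinlock l;
static volatile int x, y;
static volatile uint64_t counter;

/* Accesses inside the critical section cannot escape, but accesses
 * outside it may now move inward, which a full barrier would forbid.
 */
void ordering_example(void)
{
	x = 1;						/* may sink past the acquire */
	k_spinlock_key_t key = k_spin_lock(&l);	/* acquire barrier only */

	counter++;					/* pinned inside the section */
	k_spin_unlock(&l, key);				/* release barrier only */
	y = 1;						/* may hoist above the release */
}
```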
Some potential approaches here:
1. Add a `k_spin_lock_acquire_barrier_only` (and ditto for release) (or some such - I am terrible at names). Upside: compatible with existing code. Downside: existing code cannot take advantage of the new API (or its performance) without changes. Downside: additional API surface to maintain.
2. Change `k_spin_lock`/`release` to an acquire/release only, but add memory barriers to all current callers, with the intent of removing barriers over time as appropriate (a caller sketch follows). Upside: keeps API surface the same (ish). Downside: requires (rote) changes to application code to restore current behavior.
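For concreteness, a sketch of the rote change at a call site under approach 2 (the barrier helper is illustrative, not an existing Zephyr API):

```c
#include <zephyr/kernel.h>

/* Illustrative stand-in for a full data memory barrier (e.g. a
 * "dmb ish" on AArch64); not an existing Zephyr API.
 */
static inline void full_memory_barrier(void)
{
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
}

void caller_after_migration(struct k_spinlock *l)
{
	k_spinlock_key_t key = k_spin_lock(l);	/* now acquire-only */

	full_memory_barrier();			/* restore today's semantics */

	/* ... critical section ... */

	full_memory_barrier();			/* restore today's semantics */
	k_spin_unlock(l, key);			/* now release-only */
}
```

Thoughts?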