zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

AArch64 SMP spinlock enhancements #32686

Open jharris-intel opened 3 years ago

jharris-intel commented 3 years ago

Is your enhancement proposal related to a problem? Please describe. The current spinlock implementation on AArch64 SMP leaves a lot of performance on the floor.

Describe the solution you'd like As spinlock performance can quickly become a bottleneck in SMP, I would like to open the discussion to how to improve spinlock performance on AArch64 SMP.

I have collected a fair bit of data on the subject locally, using a microbenchmark on a bare-metal quad-core A53 system. Some setup:

  1. All cores have interrupts disabled.
  2. N cores are running the testing loop, and the other 4-N cores are running while (1) { wfe } (again, with interrupts disabled).
    1. This results in minimal memory traffic on the unused cores. I would use wfi instead of wfe, but wfi wakes on physical interrupts regardless of whether interrupts are masked.
  3. Each testing loop is just a loop of k_spin_lock / increment a 64-bit variable in memory / k_spin_unlock (sketched after this list).
  4. The "uncontended" ("nc") cases are where each testing loop has its own spin lock and counter in a unique cache line (read: spaced far enough in memory that you don't have false sharing); the "contended" ("c") cases are where all testing loops are using the same spin lock and variable.
  5. The cycle counts are captured over a ~50 Mcycle period (note: that is the total, which is then divided by the number of active cores).
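
For concreteness, here is a minimal sketch of what each testing loop looks like (illustrative only - the names, the cache-line padding, and the include are my assumptions here, not the actual benchmark source):

    #include <zephyr/kernel.h>  /* <kernel.h> on older trees */

    /* One lock + counter pair, padded out to its own cache line so the
     * "uncontended" cases don't see false sharing. The contended cases
     * point every core at the same instance instead.
     */
    struct bench_slot {
        struct k_spinlock lock;
        uint64_t counter;
    } __aligned(64);

    static struct bench_slot slot;

    static void testing_loop(void)
    {
        while (1) {
            k_spinlock_key_t key = k_spin_lock(&slot.lock);

            slot.counter++;
            k_spin_unlock(&slot.lock, key);
        }
    }

    /* Parked cores just sit in a WFE loop with interrupts masked. */
    static void parked_loop(void)
    {
        while (1) {
            __asm__ volatile("wfe");
        }
    }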

(I'm going to be adding more later here as comments; this is just a start.)

Baseline is roughly what is on trunk (this branch does have differences from trunk to get this working at all, but the code the test is exercising is what's on trunk).

SeqRel is the following patch on both k_spin_unlock and k_spin_release:

-   // atomic_clear(&l->locked);
+   __atomic_store_n(&l->locked, 0, __ATOMIC_SEQ_CST);

(Which obviously isn't the "proper" way to do it, but hopefully suffices for a quick test.)
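
For context, here's roughly what the unlock path looks like with that patch applied (a sketch based on the current spinlock.h, with the CONFIG_SPIN_VALIDATE paths omitted):

    static ALWAYS_INLINE void k_spin_unlock(struct k_spinlock *l,
                                            k_spinlock_key_t key)
    {
    #ifdef CONFIG_SMP
        /* Was: atomic_clear(&l->locked); -- an atomic read-modify-write.
         * A plain sequentially-consistent store is enough to publish the
         * unlock, and compiles down to a single STLR on AArch64.
         */
        __atomic_store_n(&l->locked, 0, __ATOMIC_SEQ_CST);
    #endif
        arch_irq_unlock(key.key);
    }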

Prefetch is SeqRel plus the ARM-recommended prefetch added before the spinlock loop. (Note: going in, I suspect this won't matter much on the A53, but it may on larger cores.)

+   __asm__("PRFM PSTL1KEEP, %0" : : "Q"(l->locked));
    while (!atomic_cas(&l->locked, 0, 1)) {
    }

(Again, obviously non-portable as-is, but suffices for this.)

Spoiler: I'm not going to bother going through all of the cases here, because both the 1-core and 4 C cases are slower than without the prefetch (although still faster than baseline). Dead end (for the A53; it may be better on other cores), so moving on.

Litmus is SeqRel + replacing the CAS in lock with this monstrosity taken from the ARMv8-A ARM:

    atomic_val_t val;
    atomic_val_t one = 1;
    __asm__ volatile(
        /* SEVL sets the local event register so the first WFE falls
         * straight through instead of waiting for an event. */
        "SEVL" "\r\n"
        "1:" "\r\n"
        /* Wait for an event; another core clearing the global exclusive
         * monitor (i.e. writing to locked) will wake us. */
        "WFE" "\r\n"
        /* Load-acquire exclusive of the lock word. */
        "LDAXR %w[val], %[where]" "\r\n"
        /* Lock still held -> back to WFE. */
        "CBNZ %w[val], 1b" "\r\n"
        /* Try to claim it; val is reused for the STXR status (0 = success). */
        "STXR %w[val], %w[one], %[where]" "\r\n"
        /* Lost the exclusive reservation -> retry. */
        "CBNZ %w[val], 1b" "\r\n" :
        [where]"+Q"(l->locked), [val]"=&r"(val) :
        [one]"r"(one)
        :
        "memory");

(Which is pretty much verbatim what ARM recommends here, with the exception of dropping the preload.) (Yes, this does reuse val for two different things here.) (I'm personally not convinced that the SEVL is worth it here over an unconditional branch, but meh.) (Interestingly, even though we're using WFE here we don't need a corresponding SEV. ARMv8 changed it so that clearing a global monitor sends an event, so another core doing a write to locked will wake us up.)

The main advantage of this is that cores waiting for the spinlock generate substantially less memory traffic while doing so. There are, however, two major wrinkles that I will get into below.

The major wrinkles are as follows:

  1. There is no simple extension to the current atomics API that permits the use of WFE, short of adding e.g. blocking_cas or some such (a hypothetical sketch follows this list). (You can get close by adding a weak CAS + a wfe hint API, but even that has issues.)
  2. This relaxes the memory model. More on that below.
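
To make wrinkle 1 concrete, one possible shape for such an extension - entirely hypothetical, atomic_cas_wait does not exist, and the hard part is precisely the architecture-specific body:

    /* Hypothetical sketch only. A "blocking" CAS that is allowed to park
     * the core (e.g. WFE on AArch64) until the target looks like it may
     * contain `expected`, instead of hammering it with ordinary loads.
     */
    static inline void atomic_cas_wait(atomic_t *target,
                                       atomic_val_t expected,
                                       atomic_val_t desired)
    {
        while (!atomic_cas(target, expected, desired)) {
            /* Generic fallback: plain spin. An AArch64 implementation
             * would instead use the LDAXR/WFE sequence above so that the
             * exclusive monitor wakes the core on a remote write.
             */
        }
    }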

Comparison:

(Cycle counts are average cycles per call per core; the "norm." columns are baseline cycles divided by that variant's cycles, so higher is better.)

k_spin_lock:

| Cores | Baseline | SeqRel | Litmus | SeqRel (norm.) | Litmus (norm.) |
|-------|---------:|-------:|-------:|---------------:|---------------:|
| 1     | 38.50    | 41.00  | 32.20  | 93.90%         | 119.57%        |
| 4 NC  | 38.50    | 38.75  | 32.20  | 99.35%         | 119.57%        |
| 2 C   | 173.29   | 180.99 | 125.67 | 95.75%         | 137.90%        |
| 3 C   | 398.20   | 433.76 | 251.34 | 91.80%         | 158.43%        |
| 4 C   | 680.45   | 664.65 | 384.82 | 102.38%        | 176.82%        |

k_spin_unlock:

| Cores | Baseline | SeqRel | Litmus | SeqRel (norm.) | Litmus (norm.) |
|-------|---------:|-------:|-------:|---------------:|---------------:|
| 1     | 30.50    | 9.00   | 12.00  | 338.89%        | 254.17%        |
| 4 NC  | 30.50    | 9.75   | 12.00  | 312.82%        | 254.16%        |
| 2 C   | 54.66    | 10.64  | 12.09  | 513.72%        | 452.08%        |
| 3 C   | 78.04    | 10.44  | 13.10  | 747.51%        | 595.74%        |
| 4 C   | 95.01    | 11.47  | 13.37  | 828.33%        | 710.73%        |

Total (lock + unlock):

| Cores | Baseline | SeqRel | Litmus | SeqRel (norm.) | Litmus (norm.) |
|-------|---------:|-------:|-------:|---------------:|---------------:|
| 1     | 69.00    | 50.00  | 44.20  | 138.00%        | 156.11%        |
| 4 NC  | 69.00    | 48.50  | 44.20  | 142.27%        | 156.11%        |
| 2 C   | 227.96   | 191.62 | 137.76 | 118.96%        | 165.48%        |
| 3 C   | 476.24   | 444.20 | 264.44 | 107.21%        | 180.09%        |
| 4 C   | 775.46   | 676.12 | 398.19 | 114.69%        | 194.75%        |

(So e.g. two contended cores trying to take a spinlock with the spin release patch each take ~192 cycles to do a lock + unlock.)

All told, just the SeqRel change gives about 140% of baseline performance when uncontended, dropping to 110-120% when contended.

(Note that even in the best case, N cores would take about N times the cycles to grab a single contended spinlock.)

And the Litmus version is ~1.5x baseline speed when uncontended, increasing (!) to ~2x baseline when contended.


So, what's the catch? There are two: one less major, one major:

  1. Although the SeqRel change can be done using the current indirection-through-atomics approach, I have been unable to find a good way to do the Litmus change in the same manner.
  2. These relax the spinlock memory model from what it is documented as. (Even the "SeqRel" version as-is - it turns into a store-release, which does not prevent ordinary stores from being reordered around it; C11 atomics, "amusingly", only guarantee ordering with respect to other atomics.)

So let's talk about the memory model for a moment. (You may wish to look at #32133 for some background.)

The current spinlocks are documented as a full memory barrier. With this optimized lock, k_spin_lock functions as an acquire barrier only, and k_spin_unlock functions as a release barrier only. Or, translated somewhat:

  1. If the spinlock is used purely to ensure ordering of operations inside the spinlock, there's no behavioral difference.
  2. There are cases where an external observer can observe reordering of a write done inside a spinlock with a write done outside of said spinlock.
  3. There are cases where an external observer can observe reordering of a write done before the spinlock with a write done after the spinlock (see the sketch after this list).
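
To make case 3 concrete, a contrived illustration (flag_a, flag_b, and the surrounding code are made up for this example):

    /* With the documented full-barrier spinlock, an observer on another
     * core that sees flag_b == 1 is guaranteed to also see flag_a == 1.
     * With acquire-only lock / release-only unlock, flag_a is only
     * ordered before the unlock's store-release, while flag_b may be
     * observed before that store-release - so a remote observer can see
     * flag_b == 1 while flag_a is still 0.
     */
    flag_a = 1;                        /* plain store before the lock  */

    key = k_spin_lock(&lock);          /* acquire barrier only         */
    shared_count++;                    /* protected by the lock        */
    k_spin_unlock(&lock, key);         /* release barrier only         */

    flag_b = 1;                        /* plain store after the unlock */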

For reference:

  1. Any issues here will likely mainly affect ARM (and other weak memory models).
  2. Any issues here will only affect SMP (every architecture under the sun observes sequentially-consistent stores within its own core)
  3. In practice my local smoketests of AArch64 SMP didn't find any cases where Zephyr was relying on spinlocks as a full memory barrier.
  4. In practice my quick glance through uses of spinlock didn't find any cases where Zephyr was relying on spinlocks as a full memory barrier - though again, quick glance.
  5. Any issues here would likely be very much long-tail in nature. (So not so much "it always fails" as "it fails once in a billion operations").

Some potential approaches here:

  1. Add new API calls k_spin_lock_acquire_barrier_only (and ditto for release) (or some such - I am terrible at names). Upside: compatible with existing code. Downside: existing code cannot take advantage of new API (or performance) without changes. Downside: additional API surface to maintain.
  2. Change k_spin_lock / k_spin_unlock to acquire/release-only semantics, but add explicit memory barriers to all current callers (see the sketch after this list), with the intent of removing the barriers over time as appropriate. Upside: keeps the API surface the same (ish). Downside: requires (rote) changes to application code to restore the current behavior.
  3. Ignore it (always worth calling out as an option). Upside: keeps API surface the same. Upside: no application breakage. Downside: performance.
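
As an illustration of option 2, a call site known to depend on the full-barrier behavior might (temporarily) become something like the following. This is a sketch only; it uses the compiler builtin fence on the assumption that there is no portable Zephyr barrier API to reach for:

    k_spinlock_key_t key = k_spin_lock(&lock);   /* now acquire-only     */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);     /* restore full barrier */

    /* ... critical section ... */

    __atomic_thread_fence(__ATOMIC_SEQ_CST);     /* restore full barrier */
    k_spin_unlock(&lock, key);                   /* now release-only     */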

Thoughts?

jharris-intel commented 3 years ago

Also see https://github.com/zephyrproject-rtos/zephyr/pull/32557#discussion_r583292495 for some context.

jharris-intel commented 3 years ago

@carlocaione - you may be interested in this (you seem to be interested in AArch64 improvements)

zephyrbot commented 9 months ago

Hi @carlocaione,

This issue, marked as an Enhancement, was opened a while ago and did not get any traction. It was just assigned to you based on the labels. If you don't consider yourself the right person to address this issue, please re-assign it to the right person.

Please take a moment to review if the issue is still relevant to the project. If it is, please provide feedback and direction on how to move forward. If it is not, has already been addressed, is a duplicate, or is no longer relevant, please close it with a short comment explaining the reason.

@jharris-intel you are also encouraged to help move this issue forward by providing additional information and confirming this request/issue is still relevant to you.

Thanks!