zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

live-lock on 4-core SMP platform during the tests/lib/spsc_pbuf/ #62043

Closed: amr-sc closed this issue 11 months ago

amr-sc commented 1 year ago

Describe the bug: The tests/lib/spsc_pbuf/ test hangs due to a live-lock on a RISC-V 4-core SMP platform. Most of the time the live-lock happens on sched_spinlock: CPU0 always wins the contention for this spinlock while the other cores continue to spin on it.

To work around this issue, an additional delay can be added to k_spin_unlock(); 15 divisions by zero were enough for me to resolve the live-lock:

static inline void cpu_relax(void)
{
    int dummy;

    /* Integer division is a multi-cycle operation, and dividing by the
     * zero register does not trap on RISC-V, so this just burns a few
     * cycles as a delay.
     */
    __asm__ volatile ("div %0, %0, zero" : "=r" (dummy));
    fence(); /* full memory barrier ("fence" on RISC-V) */
}

The other way to resolve this live-lock is to "halt" and then "resume" execution via OpenOCD.

To Reproduce Steps to reproduce the behavior:

  1. west build -p -b riscv_scr_custom tests/lib/spsc_pbuf/
  2. flash the riscv_scr_custom board and run the tests/lib/spsc_pbuf/
  3. Observe that test execution stops after a while. To kick the test further, the "halt" + "resume" OpenOCD commands can be used; the test may eventually pass successfully if kicked this way.

Expected behavior: tests/lib/spsc_pbuf/ passes on all platforms; no live-locks happen due to spinlock acquisition unfairness.

Impact: The Zephyr kernel could live-lock on some real-life multi-core SMP platforms due to spinlock acquisition unfairness.

Logs and console output

*** Booting Zephyr OS build zephyr-v3.4.0-112-gdd2ccd3a3650 ***        
Running TESTSUITE test_spsc_pbuf                                       
===================================================================      
START - test_0cpy                                                        
 PASS - test_0cpy in 0.001 seconds                                       
===================================================================      
START - test_0cpy_corner1                                                
 PASS - test_0cpy_corner1 in 0.001 seconds                               
===================================================================      
START - test_0cpy_corner2                                                
 PASS - test_0cpy_corner2 in 0.001 seconds                               
===================================================================      
START - test_0cpy_discard                                                
 PASS - test_0cpy_discard in 0.001 seconds                               
===================================================================      
START - test_0cpy_smaller                                                
 PASS - test_0cpy_smaller in 0.001 seconds                               
===================================================================      
START - test_largest_alloc                                               
 PASS - test_largest_alloc in 0.001 seconds                              
===================================================================      
START - test_spsc_pbuf_ut                                                
 PASS - test_spsc_pbuf_ut in 0.001 seconds                               
===================================================================      
START - test_spsc_pbuf_ut_cache                                          
 PASS - test_spsc_pbuf_ut_cache in 0.001 seconds                         
===================================================================      
START - test_stress                                                      
Reading from an interrupt, writing from a thread                         
66% remaining:0 ms mss                                                   
Ztress execution report:                                                 
         context 0:                                                      
                 - executed:37475, preempted:0                           
                 - ticks initial:4, optimized:4                          
         context 1:                                                      
                 - executed:18692, preempted:1                           
                 - ticks initial:4, optimized:4                          
        Average CPU load:18%, measurements:150                           
Writes:28041 failures: 0                                                 
Writing from an interrupt, reading from a thread                         
100% remaining:0 ms mss                                                  
Ztress execution report:                                                 
         context 0:                                                      
                 - executed:36937, preempted:0                           
                 - ticks initial:4, optimized:4                          
         context 1:                                                      
                 - executed:18407, preempted:2                           
                 - ticks initial:4, optimized:4                          
        Average CPU load:19%, measurements:150                           
Writes:72147 failures: 11357                                             
 PASS - test_stress in 30.081 seconds                                    
===================================================================      
START - test_stress_0cpy                                                 
1% remaining:7000 mss                                                    

Environment (please complete the following information):

Additional context: The current Zephyr spinlock implementation is based on a single atomic variable and doesn't guarantee locking fairness across multiple CPUs, so such live-locks are expected under certain conditions. One solution is to use a different spinlock implementation which guarantees locking order and lock acquisition fairness; in Linux, for example, such issues are addressed with ticket spinlocks: https://lwn.net/Articles/267968/

A ticket spinlocks proposal for Zephyr can be found here: https://github.com/zephyrproject-rtos/zephyr/pull/61541
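For illustration, a minimal ticket-spinlock sketch using C11 atomics (names are illustrative; this is not the implementation from the PR): each CPU atomically takes a ticket and then spins until its turn comes up, so the lock is granted in FIFO order no matter which CPU has the fastest path to memory.

#include <stdatomic.h>

struct ticket_lock {
    atomic_uint next;  /* next ticket to hand out */
    atomic_uint owner; /* ticket currently allowed to hold the lock */
};

static inline void ticket_lock(struct ticket_lock *l)
{
    /* atomically take the next ticket in line */
    unsigned int ticket = atomic_fetch_add(&l->next, 1);

    /* spin until it is this ticket's turn; the wait loop is a plain
     * load, not a CAS/LL-SC, so waiters don't fight over the cache
     * line while spinning
     */
    while (atomic_load(&l->owner) != ticket) {
    }
}

static inline void ticket_unlock(struct ticket_lock *l)
{
    /* pass the lock to the next waiter in FIFO order */
    atomic_fetch_add(&l->owner, 1);
}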

andyross commented 1 year ago

Why can't you just put that delay in arch_spin_relax(), which is there for exactly this purpose?

andyross commented 1 year ago

Also, is there an understanding of what lock is being contended so as to produce the failure? A single core winning every time is not normally the same thing as "live lock" (in fact sometimes that's desirable for performance reasons due to cache residency concerns!). Are we 100% sure this isn't a bug in the test or spsc_pbuf? Is something expecting (or promising) some kind of "fairness" that spinlocks aren't ever really going to provide? Maybe there's a different IPC abstraction that would work better?

amr-sc commented 1 year ago

Why can't you just put that delay in arch_spin_relax(), which is there for exactly this purpose?

Hi @andyross, currently arch_spin_relax() is used to relax the CPUs which are spinning, not the CPU which has just released the spinlock, so it will not help out of the box. I ran some experiments and, to work around this issue, added an additional delay to k_spin_unlock(): 15 divisions by zero were enough for me to resolve the live-lock:

static inline void cpu_relax(void)
{
    int dummy;

    /* Integer division is a multi-cycle operation, and dividing by the
     * zero register does not trap on RISC-V, so this just burns a few
     * cycles as a delay.
     */
    __asm__ volatile ("div %0, %0, zero" : "=r" (dummy));
    fence(); /* full memory barrier ("fence" on RISC-V) */
}

So, you are right, this may work. But it doesn't look efficient to always relax this way.
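For context, this is roughly where the two hooks sit. A simplified sketch of the CONFIG_SMP lock/unlock structure, with IRQ-key handling, validation and memory-ordering details omitted and placeholder function names, so not the actual upstream code:

#include <zephyr/spinlock.h>

void lock_path(struct k_spinlock *l)
{
    while (!atomic_cas(&l->locked, 0, 1)) {
        arch_spin_relax(); /* existing hook: relaxes the CPUs that are spinning */
    }
}

void unlock_path(struct k_spinlock *l)
{
    atomic_clear(&l->locked);
    cpu_relax(); /* the workaround above: delay the CPU that just released
                  * the lock so the spinners get a chance to observe the
                  * free lock before this CPU re-acquires it
                  */
}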

andyross commented 1 year ago

Wait.... You need backoff on the locking thread (edit: sorry, should have said "winner thread")? Now you have me really confused. I don't see how you get that without code doing the equivalent of a synchronous "unlock();lock()" (e.g. to "let the other threads run" or whatever). Any non-trivial amount of code (I mean, seriously, like a dozen instructions even) between the two should be letting another CPU in to get the lock.

And that's 100% an abuse. Spinlocks aren't semaphores, you don't use them for telling someone else something, they only protect critical state.

Let's find this thing. I can almost guarantee that there's a root cause to this bug that doesn't involve the spinlock internals.

amr-sc commented 1 year ago

Also, is there an understanding of what lock is being contended so as to produce the failure? A single core winning every time is not normally the same thing as "live lock" (in fact sometimes that's desirable for performance reasons due to cache residency concerns!). Are we 100% sure this isn't a bug in the test or spsc_pbuf? Is something expecting (or promising) some kind of "fairness" that spinlocks aren't ever really going to provide? Maybe there's a different IPC abstraction that would work better?

Most of the time (about 90%) sched_spinlock was the source of the live-lock, but sometimes it was a spinlock in the timing subsystem. sched_spinlock is a hot synchronization element in the kernel. I believe how multiple CPUs contend for such "hot" spinlocks is HW-specific (CPU pipelines, caches, memory latency, NUMA, etc.). A live-lock with the "simple" spinlock implementation is expected for some HW configurations. This is a known problem and it was addressed in Linux, for example, a long time ago.

andyross commented 1 year ago

Do you have backtraces? You can't get a live lock with just the lock itself, you need some code in a loop spinning on it. Neither the scheduler nor the timer behave like that: they get into the critical section, do their work, and get back out, with presumptively-non-trivial app code interspersed. More to the point: neither of these things should be caring about fairness at all. We need to find that loop, because without it this is all voodoo.

Among other things: are we sure this isn't a memory ordering bug somewhere else causing a broken or double-locked lock? Did you try stuffing extra barriers into the spinlock paths instead?

andyross commented 1 year ago

Also: what are the platforms that show the issue, so I can try to reproduce? FWIW: the behavior of spinlocks is essentially identical in qemu and other emulators, so this should show up there too. It's possible to hack at the qemu_x86_64 configuration to enable more cores, though I haven't done it in a long time.

andyross commented 1 year ago

Just to clarify from above: point me to the place(s) in code where you can say "CPU A just released this lock <here>, but it's going to grab it again <here> before CPU B gets a chance, and then it's going to loop back around via <there> without ever exiting until CPU B gets to run".

That documents the live lock. But... again I can't find any spots like that in the scheduler or timeout subsystems. It's just not how they behave, they don't have loops with unbounded exit conditions that locking might extend, everything that happens is done on finite stuff.

amr-sc commented 1 year ago

@andyross ,

The backtraces look more or less the same as the one presented in the bug description. The test just stops during test_stress_0cpy, and when I look at the debugger (after a while; you can take a few minutes' nap or check immediately),

I always see the same picture: one of the CPUs owns sched_spinlock while the others spin on it. Next comes the funny part: the debugger uses the "halt" and "resume" OpenOCD commands, and when I push "resume", the live-lock gets resolved and the test runs further until it locks up again. The test can even pass successfully if "kicked" this way with "halt" + "resume".

I will try to get access to the system again and reconstruct the conditions under which the issue was reproducing (we have been living with ticket spinlocks internally for a few months already), and will provide you with call stacks showing the live-lock on sched_spinlock.

The platform I used was a 4-core RISC-V SMP system with multi-level caches, kind of big.LITTLE (2 big cores, 2 smaller ones); it's not yet upstreamed, but we are working on it.

And tests/lib/spsc_pbuf/ is the single test in the entire tests/ folder which shows such behavior. There is nothing special about it; it is focused on pipe-like IPC, but it applies the "right" kind of pressure to the system.

I'll prepare a spinlock test to verify the fairness of spinlock acquisition. Let's see how it works with different spinlock implementations and draw conclusions.
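A rough sketch of what such a fairness test could look like, assuming standard Zephyr thread and atomic APIs (thread setup, priorities, counters and names here are illustrative, not the actual test): each contender hammers a shared k_spinlock and counts its acquisitions; with a fair lock, every counter should end up close to the expected share.

#include <zephyr/kernel.h>
#include <zephyr/sys/printk.h>

#define NUM_THREADS CONFIG_MP_MAX_NUM_CPUS
#define ITERATIONS  1000
#define STACKSZ     1024

static struct k_spinlock lock;
static atomic_t acquired[NUM_THREADS];
static atomic_t remaining = ATOMIC_INIT(ITERATIONS * NUM_THREADS);

static K_THREAD_STACK_ARRAY_DEFINE(stacks, NUM_THREADS, STACKSZ);
static struct k_thread threads[NUM_THREADS];

static void contender(void *p1, void *p2, void *p3)
{
    int idx = (int)(uintptr_t)p1;

    while (atomic_get(&remaining) > 0) {
        k_spinlock_key_t key = k_spin_lock(&lock);

        /* count this acquisition against the contender that won it */
        atomic_inc(&acquired[idx]);
        atomic_dec(&remaining);
        k_spin_unlock(&lock, key);
    }
}

void run_fairness_test(void)
{
    for (int i = 0; i < NUM_THREADS; i++) {
        /* one contender per CPU; pinning each thread to its own CPU
         * (e.g. k_thread_cpu_pin() with CONFIG_SCHED_CPU_MASK=y) makes
         * the per-CPU attribution exact
         */
        k_thread_create(&threads[i], stacks[i], STACKSZ, contender,
                        (void *)(uintptr_t)i, NULL, NULL,
                        K_PRIO_COOP(1), 0, K_NO_WAIT);
    }

    k_sleep(K_SECONDS(5));

    for (int i = 0; i < NUM_THREADS; i++) {
        printk("contender %d acquired the lock %ld times, expected %d\n",
               i, (long)atomic_get(&acquired[i]), ITERATIONS);
    }
}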

Again, I believe there is no bug in the kernel; this kind of behavior is simply expected under some conditions. The most representative example is NUMA, where some CPU simply has much faster/shorter access to memory and will always win the contention. A lot of this behavior depends on caches and branch prediction; we can't expect such effects to apply exclusively to NUMA systems.

andyross commented 1 year ago

I always see the same picture: one of the CPUs owns sched_spinlock while the others spin on it.

So... just to be nitpicky: that's a sufficient description of a deadlock, not a live lock. And indeed, changing the order of operations (e.g. with a fairness patch to spinlocks) can be sufficient to resolve either without actually fixing the root cause.

Again, I believe there is no bug in the kernel; this kind of behavior is simply expected under some conditions.

Repeating: that's... not really correct. You can only get a lockup if there's a cycle between two contexts waiting on each other to progress. The kernel should 100% be tolerant of arbitrary unfairness in locking. It doesn't matter if CPU0 "always wins", because it will only ever hammer the lock for finite time before going back to app code (or the idle thread, or whatever). You can only get a lockup if CPU0's work is unbounded, and if we have that in the kernel it's a bug.

FWIW: knowing that this is on new hardware (and complicated hardware), and only on that new hardware, my bet is that we have a memory ordering issue instead that's left a stuck lock from the perspective of one CPU, and that adding extra atomic operations is just pushing it out of the reorder window. Alternatively you've found an equivalent issue with e.g. the bespoke spinlock on the thread.switch_handle field, either within the kernel (see kernel/include/kswap.h) or the arch layer that writes it. Maybe sprinkle memory fences around and see if something turns up?

amr-sc commented 1 year ago

A deadlock wouldn't be resolved by relaxing the last-owner CPU, right? So in my case, "holding" that CPU out of the spinlock for a while helps the other CPUs detect the free spinlock before the "lucky" CPU grabs it again.

I understand what you are saying: we shouldn't sweep a potential bug in the kernel under the carpet by using fair spinlocks as a workaround for it. For sure we should investigate and fix the kernel if some flaw is really there. But at the same time this is not a justification for ruling out spinlock acquisition fairness, because, as you say, the kernel should be tolerant of how the HW behaves with respect to memory access. Let's say there is just a certain level of unfairness the HW can sustain without live-locking (NUMA), and SW can't really control the "portion" of unfairness. Either we have predictable behavior, or there will always be a probability of a live-lock. This is not an invention of mine; it is a known, fundamental problem, please see https://lwn.net/Articles/267968/

Bringing in ticket spinlocks doesn't mean we replace the original spinlocks with them, since the feature is disabled by default. We just provide a solution for users with a specific kind of HW, the way it was done in Linux. The kernel works with the basic spinlocks by default, and we are still able to reproduce and debug the potential SW issue.

andyross commented 1 year ago

Let's say there is just a certain level of unfairness the HW can sustain without live-locking (NUMA)

Again, you keep saying that[1], but that's not correct. Unfairness and "live lock" are different things. Live lock requires a cycle of dependency: CPU A can't finish its work until CPU B does something, but CPU B doesn't get to run because of unfairness. And I'm saying that almost always (and in the kernel, always always!), that kind of dependency in spinning code represents a bug that needs to be fixed[2]; you don't magically make it better with fairness.

[1] And also citing that LWN article, which I promise you I've read. Pretty sure I read it 15 years ago when it was new, too. It's not about live lock conditions, it's about performance. And I 100% agree that there are regimes where spinlock fairness can impact performance. But there are regimes where fair spinlocks are a net loss too, and IMHO Zephyr sits more in the latter than the former. I'm not saying no, I'm saying not like this. (And I'm also saying that the circumstances argue strongly that this is a hardware interaction with e.g. a missing or incorrectly-applied barrier and not a logic bug.)

[2] Because we're an RTOS, obviously. Literally everything the kernel does needs to represent finite/bounded work guaranteed to finish before whatever time metric the app has decided is its latency tolerance.

npitre commented 1 year ago

Repeating: that's... not really correct. You can only get a lockup if there's a cycle between two contexts waiting on each other to progress. The kernel should 100% be tolerant of arbitrary unfairness in locking. It doesn't matter if CPU0 "always wins", because it will only ever hammer the lock for finite time before going back to app code (or the idle thread, or whatever).

This is true of Linux as well. Yet, user space apps may experience fairness issues because of the kernel space spinlock implementation.

Because we're an RTOS, obviously. Literally everything the kernel does needs to represent finite/bounded work guaranteed to finish before whatever time metric the app has decided is its latency tolerance.

Predictability goes a long way towards that. Unfair spinlocks open the way for jitter and unpredictability, two things that are unwelcome in an RTOS.

Quoting a Linux commit log:

commit 314cdbefd1fd0a7acf3780e9628465b77ea6a836
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Wed Jan 30 13:31:21 2008 +0100

x86: FIFO ticket spinlocks

Introduce ticket lock spinlocks for x86 which are FIFO. The implementation
is described in the comments. The straight-line lock/unlock instruction
sequence is slightly slower than the dec based locks on modern x86 CPUs,
however the difference is quite small on Core2 and Opteron when working out of
cache, and becomes almost insignificant even on P4 when the lock misses cache.
trylock is more significantly slower, but they are relatively rare.

On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticable,
with a userspace test having a difference of up to 2x runtime per thread, and
some threads are starved or "unfairly" granted the lock up to 1 000 000 (!)
times. After this patch, all threads appear to finish at exactly the same
time.

Zephyr is not immune to such issues either. And when apps are made into kernel threads, spinlock usage may induce active starvation beyond simple unfairness.

The quoted hardware above isn't particularly embedded-like, but such issues were present a long time ago. And strange hardware is likely to come about in the embedded world more than ever before.

So why such resistance towards a generic solution that has proven itself elsewhere?

You keep mentioning performance issues. Can you substantiate that? The Linux community is full of performance freaks, so I really doubt they'd have kept the ticket implementation if the cost was significant. In fact, I don't see how the performance could be measurably different in the uncontended case. And not having an LL/SC sequence in the contended loop would definitely improve the contended case by not having to lock the affected cache line over and over (at least on ARM64, most likely on RISC-V too).

And, if I recall correctly, predictability is more important than performance for an RTOS.

As to the size increase... I tried the philosophers sample on qemu_riscv64_smp and the difference is around 0.1% of the total size, which is not worth writing home about. If we duplicate the smartness from the Linux implementation (like packing both indexes in the same word) then we'd get back most of the memory usage increase.
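For reference, a sketch of that packing idea with C11 atomics (illustrative only, not the Linux or proposed Zephyr code): both ticket counters share a single 32-bit word, so a fair lock need not be bigger than the existing one-word spinlock.

#include <stdatomic.h>
#include <stdint.h>

struct packed_ticket_lock {
    atomic_ushort next;  /* ticket dispenser */
    atomic_ushort owner; /* whose turn it is */
};                       /* both indexes fit in one 32-bit word */

static inline void packed_ticket_lock_acquire(struct packed_ticket_lock *l)
{
    /* take a ticket from the high half of the word */
    uint16_t my_ticket = atomic_fetch_add(&l->next, 1);

    /* wait until the owner index reaches our ticket */
    while (atomic_load(&l->owner) != my_ticket) {
    }
}

static inline void packed_ticket_lock_release(struct packed_ticket_lock *l)
{
    /* hand over to the next ticket; only the owner half changes */
    atomic_fetch_add(&l->owner, 1);
}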

we have been living with ticket spinlocks internally for a few months already

Good to know. We'll simply do the same and recommend it to our customers over the upstream version as well. The tradeoff is definitely in favor of this.

andyross commented 1 year ago

This is going in circles. The putative live lock condition is unsubstantiated, period. Show me where it is and I'll fix it. Don't demand we merge an extremely impactful (and frankly quite risky) feature to fix a bug we can't actually find, and that shows up in only one test on one non-public hardware platform.

I'm not saying no to ticket spinlocks. I'm just saying we aren't going to replace the existing spinlocks on what amounts to a hunch.

andyross commented 1 year ago

Sorry, realized I missed one bit I should have replied to:

The deadlock wouldn't be resolved with relaxing the last-owner CPU, right?

No, it absolutely could be. Putting that relaxation step in doesn't change the logic of the behavior, but it changes the ordering of operations and the sizes of the various critical windows needed to exercise things like races and deadlocks. It doesn't "fix" the bug, but it might make it a million times (or more) less likely.

I really think we have two separate things being conflated here:

cfriedt commented 11 months ago

Some numbers from today's main. It isn't a problem limited to one out-of-tree platform, but I would agree with Andy that it is more a question of fairness (and of course, QEMU is very finicky; mileage may vary from one run to another).

qemu_x86_64:
CPU0 acquired spinlock 0 times, expected 1000
CPU1 acquired spinlock 2000 times, expected 1000

qemu_riscv64_smp:
CPU0 acquired spinlock 275 times, expected 1000
CPU1 acquired spinlock 1725 times, expected 1000

qemu_cortex_a53_smp:
CPU0 acquired spinlock 1599 times, expected 1000
CPU1 acquired spinlock 401 times, expected 1000

amr-sc commented 11 months ago

Hi @cfriedt , could you please clarify: is it with ticket spinlocks, or without? You've just run the fairness_test "as is" from the review branch, right?

cfriedt commented 11 months ago

Hi @cfriedt , could you please clarify: is it with ticket spinlocks, or without? You've just run the fairness_test "as is" from the review branch, right?

That is without ticket spinlocks

npitre commented 11 months ago

Fixed with #61541