zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

kernel crashes with small CONFIG_TIMESLICE_SIZE #27533

Closed: andrewboie closed this issue 4 years ago

andrewboie commented 4 years ago

Describe the bug

For some time now we've received reports that two tests are failing sporadically in CI for mysterious reasons: #13813 and #19701. As it turns out, what these tests have in common is that they both set CONFIG_TIMESLICE_SIZE=1 in their prj.conf.

There seems to be some kind of race condition in the kernel which is made worse by setting the timeslice this small. Now that almost all of our platforms enable icount in the QEMU configuration, tests run deterministically: the failures no longer happen at random, and it's harder to get unlucky.

Looking through test configurations, we do not have any dedicated tests for timeslicing. Only three tests change CONFIG_TIMESLICE_SIZE from its default of 0, which means that unless I'm missing something, only these tests exercise the timeslice feature, and all of them are actually testing something else:

tests/kernel/mem_slab/mslab_threadsafe/prj.conf sets it to 1
tests/kernel/mem_pool/mem_pool_threadsafe/prj.conf sets it to 1
tests/kernel/mem_protect/syscalls/prj.conf sets it to 10 (for the syscall torture test)
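
For reference, the relevant prj.conf fragment in these tests is minimal; a representative sketch (the exact surrounding options vary per test, and CONFIG_TIMESLICING, which this option depends on, defaults to y):

```
CONFIG_TIMESLICE_SIZE=1
```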

To Reproduce

To best show this issue, we need to default the timeslice size to 1 and disable icount:

diff --git a/cmake/emu/qemu.cmake b/cmake/emu/qemu.cmake
index 26e7622dee..a05ca93cd3 100644
--- a/cmake/emu/qemu.cmake
+++ b/cmake/emu/qemu.cmake
@@ -60,12 +60,6 @@ endif()
 # Connect monitor to the console chardev.
 list(APPEND QEMU_FLAGS -mon chardev=con,mode=readline)

-if(CONFIG_QEMU_ICOUNT)
-  list(APPEND QEMU_FLAGS
-         -icount shift=${CONFIG_QEMU_ICOUNT_SHIFT},align=off,sleep=off
-         -rtc clock=vm)
-endif()
-
 # Add a BT serial device when building for bluetooth, unless the
 # application explicitly opts out with NO_QEMU_SERIAL_BT_SERVER.
 if(CONFIG_BT)
diff --git a/kernel/Kconfig b/kernel/Kconfig
index dda626e136..198f646744 100644
--- a/kernel/Kconfig
+++ b/kernel/Kconfig
@@ -399,7 +399,7 @@ config TIMESLICING

 config TIMESLICE_SIZE
        int "Time slice size (in ms)"
-       default 0
+       default 1
        range 0 2147483647
        depends on TIMESLICING
        help

Then run sanitycheck. In a typical run of sanitycheck -T tests/kernel I would see 43 total failures, which seems abnormal even with icount disabled. I looked through the logs:

qemu_cortex_r5 tests/kernel/mbox/mbox_api/kernel.mailbox.api

START - test_mbox_async_put_to_waiting_get
E: ***** HARD FAULT *****
E: r0/a1:  0xfffffffd  r1/a2:  0xe000ed00  r2/a3:  0xffffffff
E: r3/a4:  0xffffffff r12/ip:  0x00000000 r14/lr:  0x00003f87
E:  xpsr:  0x81000000
E: Faulting instruction address (r15/pc): 0x00005dec
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x200005a0 (unknown)
E: Halting system

qemu_cortex_m0 tests/kernel/sched/schedule_api/kernel.scheduler.multiq

START - test_lock_preemptible
E: ***** HARD FAULT *****
E: r0/a1:  0xfffffffd  r1/a2:  0xe000ed00  r2/a3:  0xffffffff
E: r3/a4:  0xffffffff r12/ip:  0x00000000 r14/lr:  0x00003ee3
E:  xpsr:  0x81000000
E: Faulting instruction address (r15/pc): 0x00005a84
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x20000758 (unknown)
E: Halting system

qemu_cortex_m0 tests/kernel/sched/schedule_api/kernel.scheduler

START - test_lock_preemptible
E: ***** HARD FAULT *****
E: r0/a1:  0xfffffffd  r1/a2:  0xe000ed00  r2/a3:  0xffffffff
E: r3/a4:  0xffffffff r12/ip:  0x00000000 r14/lr:  0x00003ee3
E:  xpsr:  0x81000000
E: Faulting instruction address (r15/pc): 0x00005aa4
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x20000758 (unknown)
E: Halting system

qemu_cortex_r5 tests/kernel/mem_protect/sys_sem/kernel.memory_protection.sys_sem.nouser

START - test_sem_take_multiple
ASSERTION FAIL [thread->base.pended_on] @ WEST_TOPDIR/zephyr/kernel/sched.c:475
@ WEST_TOPDIR/zephyr/lib/os/assert.c:45
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Current thread: 0x000060b0 (unknown)
E: Halting system

qemu_cortex_r5 tests/kernel/mem_pool/mem_pool_threadsafe/kernel.memory_pool.threadsafe

START - test_mpool_threadsafe
E: ***** PREFETCH ABORT *****
E: Permission Fault @ 0xffb41092
E: r0/a1:  0x00004c58  r1/a2:  0x00004ec0  r2/a3:  0x00004ec0
E: r3/a4:  0x00000000 r12/ip:  0x00000002 r14/lr:  0x00003098
E:  xpsr:  0x600001ff
E: Faulting instruction address (r15/pc): 0xffb41092
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x00004ec0 (main)
E: Halting system

I also caught this failing a timeout assertion on x86_64:

Booting from ROM..*** Booting Zephyr OS build zephyr-v2.3.0-1198-g849ff583968e  ***
Running test suite mpool_threadsafe
===================================================================
START - test_mpool_threadsafe
ASSERTION FAIL [!sys_dnode_is_linked(&to->node)] @ WEST_TOPDIR/zephyr/kernel/timeout.c:103

E: RAX: 0x0000000000000004 RBX: 0x0000000000109390 RCX: 0x0000000000000001 RDX: 0x0000000000000046
E: RSI: 0x0000000000000067 RDI: 0x000000000010847b RBP: 0x0000000000000015 RSP: 0x0000000000111088
E: Page fault at address 0xb9 (error code 0x11)
E:  R8: 0x0000000000000001  R9: 0x0000000000000020 R10: 0x0000000000000202 R11: 0x0000000000111098
E: Access violation: supervisor thread not allowed to execute
E: R12: 0xfffffffffffffffe R13: 0x00000000001034c3 R14: 0x0000000000000016 R15: 0x0000000000000010
E: PTE: 0x0000000000000000 -> 0x0000000000000000: RW XD 
E: RAX: 0x0000000000000001 RBX: 0x0000000000111050 RCX: 0x0000000000000001 RDX: 0x0000000000000046
E: RSI: 0x0000000000000000 RDI: 0x000000000010bc64 RBP: 0x00000000001110d0 RSP: 0x0000000000111048
E:  R8: 0x000000000010bb30  R9: 0x0000000000000001 R10: 0x0000000000000202 R11: 0x0000000000109da0
E: RSP: 0x0000000000111088 RFLAGS: 0x0000000000000006 CS: 0x0018 CR3: 0x0000000000141000
E: R12: 0x0000000000000000 R13: 0x0000000000000001 R14: 0x000000000010bb10 R15: 0x000000000013c1c0
E: RSP: 0x0000000000111048 RFLAGS: 0x0000000000000247 CS: 0x0018 CR3: 0x0000000000141000
E: RIP: 0x00000000000000b9
E: RIP: 0x0000000000100dde
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 1
E: Attempt to resume un-suspended thread object
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Current thread: 0x0000000000109360 (unknown)
E: Current thread: 0x0000000000109360 (unknown)
E: Halting system

Expected behavior

No failures like this with timeslicing turned on.

Impact

This appears to be a showstopper for anyone who uses time slicing and, depending on where the root cause is, possibly others too.

andrewboie commented 4 years ago

If you don't disable icount and just change the timeslice default, I found one test that fails reliably:

apboie@apboie-mobl4:~/projects/zephyr2/zephyr master /home/apboie/projects/zephyr2/zephyr
$ git diff
diff --git a/kernel/Kconfig b/kernel/Kconfig
index 702d908d4b..fc94ca3111 100644
--- a/kernel/Kconfig
+++ b/kernel/Kconfig
@@ -404,7 +404,7 @@ config TIMESLICING

 config TIMESLICE_SIZE
        int "Time slice size (in ms)"
-       default 0
+       default 1
        range 0 2147483647
        depends on TIMESLICING
        help

I then ran sanitycheck -T tests/kernel to see what shakes out.

I then got a reproducible failure in tests/kernel/sched/deadline/kernel.scheduler.deadline on qemu_riscv32 where we are also jumping to an illegal location:

*** Booting Zephyr OS build zephyr-v2.3.0-1056-g568211738d30  ***
Running test suite suite_deadline
===================================================================
START - test_deadline
E: Exception cause Illegal instruction (2)
E: Faulting instruction address = 0x204060ac
E:   ra: 0x204061a8  gp: 0x00000000  tp: 0x00000000  t0: 0x00000000
E:   t1: 0x00000000  t2: 0x00000000  t3: 0x00000000  t4: 0x00000000
E:   t5: 0x00000000  t6: 0x00000000  a0: 0x000186a0  a1: 0x00000000
E:   a2: 0x204066f8  a3: 0x00000021  a4: 0x00000000  a5: 0x00000008
E:   a6: 0x00000000  a7: 0x00000000

E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x800004d8 (unknown)
E: Halting system
andrewboie commented 4 years ago

Looking through test configurations, we do not have any dedicated tests for timeslicing.

Correction: the schedule_api test sets this, so we do have some coverage. It doesn't change the default of 0 (which means the slices are infinitely large), but it does adjust the slice size at runtime.

andrewboie commented 4 years ago

Found something interesting.

I did some sanitycheck runs with https://github.com/andrewboie/zephyr/tree/this-should-not-happen-27533, which is a version of https://github.com/zephyrproject-rtos/zephyr/pull/27995 plus this patch applied:

diff --git a/kernel/Kconfig b/kernel/Kconfig
index 3e72f4acfd..d80d79e4d1 100644
--- a/kernel/Kconfig
+++ b/kernel/Kconfig
@@ -400,7 +400,7 @@ config TIMESLICING

 config TIMESLICE_SIZE
        int "Time slice size (in ms)"
-       default 0
+       default 5
        range 0 2147483647
        depends on TIMESLICING
        help

The test tests/kernel/mem_protect/sys_sem fails like so. Thankfully, with icount it does so every time:

START - test_sem_take_multiple
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/kernel/thread_abort.c:47
        should never get here
E: r0/a1:  0x00000004  r1/a2:  0x0000002f  r2/a3:  0x00000000
E: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x0000771b
E:  xpsr:  0x61000000
E: Faulting instruction address (r15/pc): 0x0000b406
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Current thread: 0x200005c0 (unknown)
E: Halting system
qemu-system-arm: terminating on signal 2
ninja: build stopped: interrupted by user.

The assertion in question is here:

FUNC_NORETURN void z_self_abort(void)
{
    /* Self-aborting threads don't clean themselves up, we
     * have the idle thread for the current CPU do it.
     */
    int key;
    struct _cpu *cpu;

    /* Lock local IRQs to prevent us from migrating to another CPU
     * while we set this up
     */
    key = arch_irq_lock();
    cpu = _current_cpu;
    cpu->pending_abort = _current;

    LOG_DBG("%p self-aborting, handle on idle thread %p",
        _current, cpu->idle_thread);

    k_thread_suspend(_current);
    z_swap_irqlock(key);
    __ASSERT(false, "should never get here");
    CODE_UNREACHABLE;
}

We're in here because the thread exited. It is calling k_thread_suspend() on itself (with IRQs locked so execution continues) and then z_swap_irqlock(), and we should never get back from that because we have suspended ourselves; the idea is the next context switch will be to idle to clean up. But instead we are somehow getting scheduled again and the assertion fails.

The thread is sem_tid2 in the test case, which runs this function:

void sem_take_multiple_high_prio_helper(void *p1, void *p2, void *p3)
{
    int32_t ret_value;

    ret_value = sys_sem_take(&high_prio_sem, K_FOREVER);
    zassert_true(ret_value == 0, "sys_sem_take failed");

    ret_value = sys_sem_take(&multiple_thread_sem, K_FOREVER);
    zassert_true(ret_value == 0, "sys_sem_take failed");

    sys_sem_give(&high_prio_sem);
}

The documentation says:

/**
 * @brief Suspend a thread.
 *
 * This routine prevents the kernel scheduler from making @a thread
 * the current thread. All other internal operations on @a thread are
 * still performed; for example, kernel objects it is waiting on are
 * still handed to it.  Note that any existing timeouts
 * (e.g. k_sleep(), or a timeout argument to k_sem_take() et. al.)
 * will be canceled.  On resume, the thread will begin running
 * immediately and return from the blocked call.
 *
 * If @a thread is already suspended, the routine has no effect.

The test case is a set of threads giving and taking semaphores on each other. Since the thread self-suspended, it's not waiting on anything, so "kernel objects it is waiting on are still handed to it" shouldn't apply. Yet it is waking up anyway, and it's not clear why it's being woken up like this.

The whole thing is very timing-sensitive; adding printk() calls or enabling logs makes it go away.

Apologies if this is a wild goose chase and my z_self_abort() is unsound.

andrewboie commented 4 years ago

One other thing I found, and I think this might be a real bug.

void z_impl_k_thread_suspend(struct k_thread *thread)
{
    (void)z_abort_thread_timeout(thread);

    LOCKED(&sched_spinlock) {
        if (z_is_thread_queued(thread)) {
            _priq_run_remove(&_kernel.ready_q.runq, thread);
            z_mark_thread_as_not_queued(thread);
        }
        z_mark_thread_as_suspended(thread);
        update_cache(thread == _current);
    }

    if (thread == _current) {
        z_reschedule_unlocked();
    }
}

We grab sched_spinlock, take the thread off the ready queue, mark ourselves as suspended, unlock, and then reschedule.

But with time slicing turned on, looking at this function which runs on the timer interrupt:

static inline int sliceable(struct k_thread *thread)
{
    return is_preempt(thread)
        && !z_is_prio_higher(thread->base.prio, slice_max_prio)
        && !z_is_idle_thread_object(thread)
        && !z_is_thread_timeout_active(thread);
}

Seems like if we get a timer interrupt at a bad time, this check will pass and we'll get put right back on the run queue by z_time_slice().

Is this a proper fix?

diff --git a/kernel/sched.c b/kernel/sched.c
index 0dfb8a0254..b8fa04aeef 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -312,6 +312,7 @@ void k_sched_time_slice_set(int32_t slice, int prio)
 static inline int sliceable(struct k_thread *thread)
 {
        return is_preempt(thread)
+               && !z_is_thread_prevented_from_running(thread)
                && !z_is_prio_higher(thread->base.prio, slice_max_prio)
                && !z_is_idle_thread_object(thread)
                && !z_is_thread_timeout_active(thread);

The other thing I notice is that z_time_slice() itself isn't holding the sched spinlock, and I am wondering if concurrent activity on another CPU could cause it to act on stale assumptions. For instance, what if another CPU did something that made the return value of sliceable() no longer true for _current by the time we get to z_move_thread_to_end_of_prio_q(_current)?

andrewboie commented 4 years ago

I put together a draft PR https://github.com/zephyrproject-rtos/zephyr/pull/28088 which has patches for some things that I am mostly convinced are bugs.

Not completely out of the woods yet, still can reproduce some problems. Here's what I can still see (with CONFIG_TIMESLICE_SIZE hacked to default to 5):

An issue on x86_64 where do_swap() seems to be invoked with interrupts enabled:

$ cat /dev/shm/sanity-out/qemu_x86_64/tests/kernel/lifo/lifo_api/kernel.lifo/handler.log
SeaBIOS (version rel-1.12.1-0-ga5cab58-dirty-20200625_115407-9426dddc0a1f-zephyr
)
Booting from ROM..*** Booting Zephyr OS build zephyr-v2.3.0-2722-gf621d4a8d9bc  ***
Running test suite lifo_api
===================================================================
START - test_lifo_thread2thread
 PASS - test_lifo_thread2thread
===================================================================
START - test_lifo_thread2isr
 PASS - test_lifo_thread2isr
===================================================================
START - test_lifo_isr2thread
ASSERTION FAIL [!z_smp_cpu_mobile()] @ WEST_TOPDIR/zephyr/kernel/include/kswap.h:111
E: Page fault at address 0x46 (error code 0x10)
E: Linear address not present in page tables
E: PTE: not present
E: RAX: 0x0000000000000001 RBX: 0x0000000000000046 RCX: 0x0000000000000001 RDX: 0x0000000000000046
E: RSI: 0x000000000000000a RDI: 0x000000000010c7b0 RBP: 0x000000000010c3a0 RSP: 0x00000000001263c0
E:  R8: 0x0000000000000001  R9: 0x0000000000000020 R10: 0x0000000000000000 R11: 0x0000000000000000
E: R12: 0x0000000000000246 R13: 0x0000000000000046 R14: 0x0000000000000000 R15: 0x0000000000000000
E: RSP: 0x00000000001263c0 RFLAGS: 0x0000000000000006 CS: 0x0018 CR3: 0x0000000000140000
E: RIP: 0x0000000000000046
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 1
E: Current thread: 0x000000000010b980 (unknown)
E: Halting system

I have occasionally seen the dreaded "Attempt to resume un-suspended thread object" error, even with my thread abort PR applied. It most recently happened on the fifo test, so given that and the above crash on the lifo test, I'm reviewing the queue code to see if something doesn't look right.

Also occasional other weird pagefaults only on x86_64, here's one:

$ cat /dev/shm/sanity-out/qemu_x86_64/tests/kernel/device/kernel.device.pm/handler.log
SeaBIOS (version rel-1.12.1-0-ga5cab58-dirty-20200625_115407-9426dddc0a1f-zephyr
)
Booting from ROM..*** Booting Zephyr OS build zephyr-v2.3.0-2722-gf621d4a8d9bc  ***
Running test suite device
===================================================================
START - test_dummy_device_pm
 PASS - test_dummy_device_pm
===================================================================
START - test_build_suspend_device_list
 PASS - test_build_suspend_device_list
===================================================================
START - test_dummy_device
 PASS - test_dummy_device
===================================================================
START - test_pre_kernel_detection
 PASS - test_pre_kernel_detection
===================================================================
START - test_bogus_dynamic_name
 PASS - test_bogus_dynamic_name
===================================================================
START - test_dynamic_name
E: Page fault at address 0x9ade6c (error code 0x4)
E: Linear address not present in page tables
E: PDE: not present
E: RAX: 0x0000000000000000 RBX: 0x0000000000000000 RCX: 0x0000000000102077 RDX: 0x000000000011292a
E: RSI: 0x0000000000112b85 RDI: 0x0000000000118240 RBP: 0x0000000000000000 RSP: 0x0000000000145fd8
E:  R8: 0x0000000000111e00  R9: 0x0000000000114191 R10: 0x0000000000000000 R11: 0x0000000000000206
E: R12: 0x0000000000000000 R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x0000000000000000
E: RSP: 0x0000000000145fd8 RFLAGS: 0x0000000000000206 CS: 0x003b CR3: 0x000000000013d000
E: RIP: 0x000000000010207e
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x0000000000118240 (ztest_thread)
E: Halting system

Failed assertions in tests/kernel/sched/schedule_api/ on mps2_an521. This might just be because I hacked the timeslice value, which breaks the test's assumptions; not sure:

START - test_time_slicing_preemptible

    Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_sched_timeslice_and_lock.c:255: test_time_slicing_preemptible: tdata[1].executed == 1 is false

 FAIL - test_time_slicing_preemptible
===================================================================
START - test_time_slicing_disable_preemptible

    Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_sched_timeslice_and_lock.c:293: test_time_slicing_disable_preemptible: tdata[1].executed == 0 is false

 FAIL - test_time_slicing_disable_preemptible

The same failed assertion, plus a branch to NULL on qemu_x86_coverage:

START - test_time_slicing_preemptible

    Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_sched_timeslice_and_lock.c:255: test_time_slicing_preemptible: tdata[1].executed == 1 is false

 FAIL - test_time_slicing_preemptible
===================================================================
START - test_time_slicing_disable_preemptible

    Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_sched_timeslice_and_lock.c:293: test_time_slicing_disable_preemptible: tdata[1].executed == 0 is false

 FAIL - test_time_slicing_disable_preemptible
===================================================================
START - test_lock_preemptible
E: Page fault at address 0x0 (error code 0x10)
E: Linear address not present in page tables
E: PTE: not present
E: EAX: 0x001e0b80, EBX: 0x00110880, ECX: 0x00000246, EDX: 0x001e3260
E: ESI: 0x00000000, EDI: 0x00110829, EBP: 0x0019dfe4, ESP: 0x0019dfe0
E: EFLAGS: 0x00000202 CS: 0x0008 CR3: 0x001f3000
E: call trace:
E: EIP: 0x00000000
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x0016e900 (unknown)
E: Halting system

Currently looking at this last one since it reproduces consistently with icount.

Some other questions from code inspection. It's possible all of these have good answers and I just don't understand the code well enough:

andrewboie commented 4 years ago

The weirdness with the schedule_api test was because if one of the scenarios fails, it doesn't clean up its thread objects, and the kernel goes haywire when it attempts to re-use them. I added a patch to https://github.com/zephyrproject-rtos/zephyr/pull/27995 to catch this; I've been bitten by this so many times before in test cases...

I think I've successfully solved this (although my solutions may or may not be the right way to do it!), with one nagging detail. With all my recent patches combined, I do not see this random crash behavior any more, but there is a problem with the schedule_api test if I set a nonzero TIMESLICE_SIZE on some ARM boards:

===================================================================
START - test_time_slicing_preemptible

    Assertion failed at ../src/test_sched_timeslice_and_lock.c:255: test_time_slicing_preemptible: tdata[1].executed == 1 is false

 FAIL - test_time_slicing_preemptible

It's this code:

void test_time_slicing_preemptible(void)
{
#ifdef CONFIG_TIMESLICING
    /* set current thread to a preemptible priority */
    init_prio = 0;
    setup_threads();

    k_sched_time_slice_set(200, 0); /* 200 ms */
    spawn_threads(0);
    /* checkpoint: higher priority threads get executed immediately */
    zassert_true(tdata[0].executed == 1, NULL);
    k_busy_wait(500000); /* 500 ms */
    /* checkpoint: equal priority threads get executed every time slice */
    zassert_true(tdata[1].executed == 1, NULL);
    for (int i = 2; i < THREADS_NUM; i++) {
        zassert_true(tdata[i].executed == 0, NULL);
    }

    /* restore environment */
    k_sched_time_slice_set(0, 0); /* disable time slice */
    teardown_threads();
#else /* CONFIG_TIMESLICING */
    ztest_test_skip();
#endif /* CONFIG_TIMESLICING */
}

If I increase the k_busy_wait() value it passes. But 500 ms should be plenty, since the timeslice is only 200 ms. I have no idea why the default timeslice value affects this, because this test case specifically calls k_sched_time_slice_set(200, 0). Different timer interrupt delivery, perhaps?

Doing a few more sanitycheck runs with CONFIG_TIMESLICE_SIZE forced to a nonzero value and all my patches applied, to confirm. This is the test branch revision I'm using: 6298a33f746c9d977fa3aa79a834532f28a7ee8f