Closed: andrewboie closed this 4 years ago
If you don't disable icount and just change the timeslice default, I found one test that fails reliably:
apboie@apboie-mobl4:~/projects/zephyr2/zephyr master /home/apboie/projects/zephyr2/zephyr
$ git diff
diff --git a/kernel/Kconfig b/kernel/Kconfig
index 702d908d4b..fc94ca3111 100644
--- a/kernel/Kconfig
+++ b/kernel/Kconfig
@@ -404,7 +404,7 @@ config TIMESLICING
config TIMESLICE_SIZE
int "Time slice size (in ms)"
- default 0
+ default 1
range 0 2147483647
depends on TIMESLICING
help
I then ran sanitycheck -T tests/kernel to see what shakes out.
I then got a reproducible failure in tests/kernel/sched/deadline/kernel.scheduler.deadline on qemu_riscv32 where we are also jumping to an illegal location:
*** Booting Zephyr OS build zephyr-v2.3.0-1056-g568211738d30 ***
Running test suite suite_deadline
===================================================================
START - test_deadline
E: Exception cause Illegal instruction (2)
E: Faulting instruction address = 0x204060ac
E: ra: 0x204061a8 gp: 0x00000000 tp: 0x00000000 t0: 0x00000000
E: t1: 0x00000000 t2: 0x00000000 t3: 0x00000000 t4: 0x00000000
E: t5: 0x00000000 t6: 0x00000000 a0: 0x000186a0 a1: 0x00000000
E: a2: 0x204066f8 a3: 0x00000021 a4: 0x00000000 a5: 0x00000008
E: a6: 0x00000000 a7: 0x00000000
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x800004d8 (unknown)
E: Halting system
Looking through test configurations, we do not have any dedicated tests for timeslicing.
Correction: the schedule_api test sets this, so we do have some coverage. It doesn't change the default of 0 (which means the slices are infinitely large), but there are some runtime adjustments made.
Found something interesting.
I ran some sanitycheck passes with this revision: https://github.com/andrewboie/zephyr/tree/this-should-not-happen-27533, which is a version of https://github.com/zephyrproject-rtos/zephyr/pull/27995 plus this patch applied:
diff --git a/kernel/Kconfig b/kernel/Kconfig
index 3e72f4acfd..d80d79e4d1 100644
--- a/kernel/Kconfig
+++ b/kernel/Kconfig
@@ -400,7 +400,7 @@ config TIMESLICING
config TIMESLICE_SIZE
int "Time slice size (in ms)"
- default 0
+ default 5
range 0 2147483647
depends on TIMESLICING
help
The test tests/kernel/mem_protect/sys_sem fails like so. Thankfully we have icount, so it reproduces every time:
START - test_sem_take_multiple
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/kernel/thread_abort.c:47
should never get here
E: r0/a1: 0x00000004 r1/a2: 0x0000002f r2/a3: 0x00000000
E: r3/a4: 0x00000000 r12/ip: 0x00000000 r14/lr: 0x0000771b
E: xpsr: 0x61000000
E: Faulting instruction address (r15/pc): 0x0000b406
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Current thread: 0x200005c0 (unknown)
E: Halting system
qemu-system-arm: terminating on signal 2
ninja: build stopped: interrupted by user.
The assertion in question is here:
FUNC_NORETURN void z_self_abort(void)
{
/* Self-aborting threads don't clean themselves up, we
* have the idle thread for the current CPU do it.
*/
int key;
struct _cpu *cpu;
/* Lock local IRQs to prevent us from migrating to another CPU
* while we set this up
*/
key = arch_irq_lock();
cpu = _current_cpu;
cpu->pending_abort = _current;
LOG_DBG("%p self-aborting, handle on idle thread %p",
_current, cpu->idle_thread);
k_thread_suspend(_current);
z_swap_irqlock(key);
__ASSERT(false, "should never get here");
CODE_UNREACHABLE;
}
We're in here because the thread exited. It is calling k_thread_suspend() on itself (with IRQs locked, so execution continues) and then z_swap_irqlock(), and we should never get back from that because we have suspended ourselves; the idea is that the next context switch will be to the idle thread, which cleans up. But instead we are somehow getting scheduled again and the assertion fails.
The thread is sem_tid2 in the test case, which runs this function:
void sem_take_multiple_high_prio_helper(void *p1, void *p2, void *p3)
{
int32_t ret_value;
ret_value = sys_sem_take(&high_prio_sem, K_FOREVER);
zassert_true(ret_value == 0, "sys_sem_take failed");
ret_value = sys_sem_take(&multiple_thread_sem, K_FOREVER);
zassert_true(ret_value == 0, "sys_sem_take failed");
sys_sem_give(&high_prio_sem);
}
The documentation says:
/**
* @brief Suspend a thread.
*
* This routine prevents the kernel scheduler from making @a thread
* the current thread. All other internal operations on @a thread are
* still performed; for example, kernel objects it is waiting on are
* still handed to it. Note that any existing timeouts
* (e.g. k_sleep(), or a timeout argument to k_sem_take() et. al.)
* will be canceled. On resume, the thread will begin running
* immediately and return from the blocked call.
*
* If @a thread is already suspended, the routine has no effect.
The test case is a set of threads giving and taking semaphores on each other. Since the thread self-suspended, it's not waiting on anything, so "kernel objects it is waiting on are still handed to it" shouldn't apply. Yet it is waking up anyway. It's not clear why it's being woken up like this.
The whole thing is very timing sensitive; adding printk() calls or enabling logs makes it go away.
Apologies if this is a wild goose chase and my z_self_abort() is unsound.
One other thing I found, and I think this might be a real bug.
void z_impl_k_thread_suspend(struct k_thread *thread)
{
(void)z_abort_thread_timeout(thread);
LOCKED(&sched_spinlock) {
if (z_is_thread_queued(thread)) {
_priq_run_remove(&_kernel.ready_q.runq, thread);
z_mark_thread_as_not_queued(thread);
}
z_mark_thread_as_suspended(thread);
update_cache(thread == _current);
}
if (thread == _current) {
z_reschedule_unlocked();
}
}
We grab sched_spinlock, take the thread off the ready queue, mark ourselves as suspended, unlock, and then reschedule.
But with time slicing turned on, look at this function, which runs from the timer interrupt:
static inline int sliceable(struct k_thread *thread)
{
return is_preempt(thread)
&& !z_is_prio_higher(thread->base.prio, slice_max_prio)
&& !z_is_idle_thread_object(thread)
&& !z_is_thread_timeout_active(thread);
}
Seems like if we get a timer interrupt at a bad time, this check will pass and we'll get put right back on the queue by z_time_slice().
Is this a proper fix?
diff --git a/kernel/sched.c b/kernel/sched.c
index 0dfb8a0254..b8fa04aeef 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -312,6 +312,7 @@ void k_sched_time_slice_set(int32_t slice, int prio)
static inline int sliceable(struct k_thread *thread)
{
return is_preempt(thread)
+ && !z_is_thread_prevented_from_running(thread)
&& !z_is_prio_higher(thread->base.prio, slice_max_prio)
&& !z_is_idle_thread_object(thread)
&& !z_is_thread_timeout_active(thread);
The other thing I notice is that z_time_slice() itself isn't holding the sched spinlock, and I am wondering whether concurrent activity on another CPU could cause it to act on incorrect assumptions. For instance, what if another CPU did something that made the return value of sliceable() no longer true for _current by the time we get to z_move_thread_to_end_of_prio_q(_current)?
I put together a draft PR https://github.com/zephyrproject-rtos/zephyr/pull/28088 which has patches for some things that I am mostly convinced are bugs.
Not completely out of the woods yet; I can still reproduce some problems. Here's what I can still see (with CONFIG_TIMESLICE_SIZE hacked to default to 5):
An issue on x86_64 where do_swap() seems to be invoked with interrupts enabled:
$ cat /dev/shm/sanity-out/qemu_x86_64/tests/kernel/lifo/lifo_api/kernel.lifo/handler.log
SeaBIOS (version rel-1.12.1-0-ga5cab58-dirty-20200625_115407-9426dddc0a1f-zephyr
)
Booting from ROM..*** Booting Zephyr OS build zephyr-v2.3.0-2722-gf621d4a8d9bc ***
Running test suite lifo_api
===================================================================
START - test_lifo_thread2thread
PASS - test_lifo_thread2thread
===================================================================
START - test_lifo_thread2isr
PASS - test_lifo_thread2isr
===================================================================
START - test_lifo_isr2thread
ASSERTION FAIL [!z_smp_cpu_mobile()] @ WEST_TOPDIR/zephyr/kernel/include/kswap.h:111
E: Page fault at address 0x46 (error code 0x10)
E: Linear address not present in page tables
E: PTE: not present
E: RAX: 0x0000000000000001 RBX: 0x0000000000000046 RCX: 0x0000000000000001 RDX: 0x0000000000000046
E: RSI: 0x000000000000000a RDI: 0x000000000010c7b0 RBP: 0x000000000010c3a0 RSP: 0x00000000001263c0
E: R8: 0x0000000000000001 R9: 0x0000000000000020 R10: 0x0000000000000000 R11: 0x0000000000000000
E: R12: 0x0000000000000246 R13: 0x0000000000000046 R14: 0x0000000000000000 R15: 0x0000000000000000
E: RSP: 0x00000000001263c0 RFLAGS: 0x0000000000000006 CS: 0x0018 CR3: 0x0000000000140000
E: RIP: 0x0000000000000046
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 1
E: Current thread: 0x000000000010b980 (unknown)
E: Halting system
I have occasionally seen the dreaded "Attempt to resume un-suspended thread object" error, even with my thread abort PR applied. This was most recently on the fifo test, so given that and the above crash on the lifo test, I'm reviewing the queue code to see if something doesn't look right.
Also occasional other weird pagefaults only on x86_64, here's one:
$ cat /dev/shm/sanity-out/qemu_x86_64/tests/kernel/device/kernel.device.pm/handler.log
SeaBIOS (version rel-1.12.1-0-ga5cab58-dirty-20200625_115407-9426dddc0a1f-zephyr
)
Booting from ROM..*** Booting Zephyr OS build zephyr-v2.3.0-2722-gf621d4a8d9bc ***
Running test suite device
===================================================================
START - test_dummy_device_pm
PASS - test_dummy_device_pm
===================================================================
START - test_build_suspend_device_list
PASS - test_build_suspend_device_list
===================================================================
START - test_dummy_device
PASS - test_dummy_device
===================================================================
START - test_pre_kernel_detection
PASS - test_pre_kernel_detection
===================================================================
START - test_bogus_dynamic_name
PASS - test_bogus_dynamic_name
===================================================================
START - test_dynamic_name
E: Page fault at address 0x9ade6c (error code 0x4)
E: Linear address not present in page tables
E: PDE: not present
E: RAX: 0x0000000000000000 RBX: 0x0000000000000000 RCX: 0x0000000000102077 RDX: 0x000000000011292a
E: RSI: 0x0000000000112b85 RDI: 0x0000000000118240 RBP: 0x0000000000000000 RSP: 0x0000000000145fd8
E: R8: 0x0000000000111e00 R9: 0x0000000000114191 R10: 0x0000000000000000 R11: 0x0000000000000206
E: R12: 0x0000000000000000 R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x0000000000000000
E: RSP: 0x0000000000145fd8 RFLAGS: 0x0000000000000206 CS: 0x003b CR3: 0x000000000013d000
E: RIP: 0x000000000010207e
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x0000000000118240 (ztest_thread)
E: Halting system
Failed assertions in tests/kernel/sched/schedule_api/ on mps2_an521. This might just be because I hacked the timeslice value, which breaks the test; not sure:
START - test_time_slicing_preemptible
Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_sched_timeslice_and_lock.c:255: test_time_slicing_preemptible: tdata[1].executed == 1 is false
FAIL - test_time_slicing_preemptible
===================================================================
START - test_time_slicing_disable_preemptible
Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_sched_timeslice_and_lock.c:293: test_time_slicing_disable_preemptible: tdata[1].executed == 0 is false
FAIL - test_time_slicing_disable_preemptible
The same failed assertion, plus a branch to NULL on qemu_x86_coverage:
START - test_time_slicing_preemptible
Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_sched_timeslice_and_lock.c:255: test_time_slicing_preemptible: tdata[1].executed == 1 is false
FAIL - test_time_slicing_preemptible
===================================================================
START - test_time_slicing_disable_preemptible
Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/sched/schedule_api/src/test_sched_timeslice_and_lock.c:293: test_time_slicing_disable_preemptible: tdata[1].executed == 0 is false
FAIL - test_time_slicing_disable_preemptible
===================================================================
START - test_lock_preemptible
E: Page fault at address 0x0 (error code 0x10)
E: Linear address not present in page tables
E: PTE: not present
E: EAX: 0x001e0b80, EBX: 0x00110880, ECX: 0x00000246, EDX: 0x001e3260
E: ESI: 0x00000000, EDI: 0x00110829, EBP: 0x0019dfe4, ESP: 0x0019dfe0
E: EFLAGS: 0x00000202 CS: 0x0008 CR3: 0x001f3000
E: call trace:
E: EIP: 0x00000000
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x0016e900 (unknown)
E: Halting system
Currently looking at this last one since it reproduces consistently with icount.
Some other questions from code inspection. It's possible all of these have good answers and I just don't understand the code well enough:
- sched_spinlock, but I see z_add_thread_timeout() and z_abort_thread_timeout() called with a variety of other spinlocks and not sched_spinlock
- z_unpend_all()
The weirdness with the schedule_api test was because, if one of the scenarios fails, it doesn't clean up thread objects, and the kernel goes haywire when it attempts to re-use them. I added a patch to https://github.com/zephyrproject-rtos/zephyr/pull/27995 to catch this; I've been bitten by this so many times before in test cases...
I think I've successfully solved this (although my solutions may or may not be the right way to do it!), with one nagging detail. With all my recent patches combined, I do not see this random crash behavior any more, but there is a problem with the schedule_api test if I set a nonzero TIMESLICE_SIZE on some ARM boards:
===================================================================
START - test_time_slicing_preemptible
Assertion failed at ../src/test_sched_timeslice_and_lock.c:255: test_time_slicing_preemptible: tdata[1].executed == 1 is false
FAIL - test_time_slicing_preemptible
It's this code:
void test_time_slicing_preemptible(void)
{
#ifdef CONFIG_TIMESLICING
/* set current thread to a preemptible priority */
init_prio = 0;
setup_threads();
k_sched_time_slice_set(200, 0); /* 200 ms */
spawn_threads(0);
/* checkpoint: higher priority threads get executed immediately */
zassert_true(tdata[0].executed == 1, NULL);
k_busy_wait(500000); /* 500 ms */
/* checkpoint: equal priority threads get executed every time slice */
zassert_true(tdata[1].executed == 1, NULL);
for (int i = 2; i < THREADS_NUM; i++) {
zassert_true(tdata[i].executed == 0, NULL);
}
/* restore environment */
k_sched_time_slice_set(0, 0); /* disable time slice */
teardown_threads();
#else /* CONFIG_TIMESLICING */
ztest_test_skip();
#endif /* CONFIG_TIMESLICING */
}
If I increase the k_busy_wait() value it passes. But 500 ms should be enough, since the timeslice is only 200 ms. I have no idea why the default timeslice value affects this, because this test case specifically calls k_sched_time_slice_set(200, 0) /* 200 ms */. Different timer interrupt delivery, perhaps?
Doing a few more sanitycheck runs with CONFIG_TIMESLICE_SIZE forced to a nonzero value and all my patches applied, to confirm. This is the test branch I'm using: 6298a33f746c9d977fa3aa79a834532f28a7ee8f
Describe the bug
For some time now we've received reports that two tests are failing sporadically in CI for mysterious reasons: #13813 and #19701. As it turns out, what these tests have in common is that in their prj.conf, they both set CONFIG_TIMESLICE_SIZE=1. There seems to be some kind of race condition in the kernel which is made worse by setting timeslicing to this. Now that almost all of our platforms enable icount in the QEMU configuration, this doesn't happen randomly any more: all tests run in a deterministic fashion, and it's harder to get unlucky.
Looking through test configurations, we do not have any dedicated tests for timeslicing. Only three tests change the default value of CONFIG_TIMESLICE_SIZE=0, which means, unless I'm missing something, only these tests are exercising the timeslice feature, and all of them are actually testing something else:
- tests/kernel/mem_slab/mslab_threadsafe/prj.conf sets it to 1
- tests/kernel/mem_pool/mem_pool_threadsafe/prj.conf sets it to 1
- tests/kernel/mem_protect/syscalls/prj.conf sets it to 10 (for the syscall torture test)
To Reproduce
To best show this issue, we need to default the timeslice size to 1 and disable icount:
Then run sanitycheck. In a typical run of sanitycheck -T tests/kernel I would see 43 total failures, which seems abnormal even with icount disabled. I looked through the logs:
qemu_cortex_r5 tests/kernel/mbox/mbox_api/kernel.mailbox.api
qemu_cortex_m0 tests/kernel/sched/schedule_api/kernel.scheduler.multiq
qemu_cortex_m0 tests/kernel/sched/schedule_api/kernel.scheduler
qemu_cortex_r5 tests/kernel/mem_protect/sys_sem/kernel.memory_protection.sys_sem.nouser
qemu_cortex_r5 tests/kernel/mem_pool/mem_pool_threadsafe/kernel.memory_pool.threadsafe
Caught this failing a timeout assertion on x86_64:
Expected behavior
No failures like this with timeslicing turned on.
Impact
This appears to be a showstopper for anyone who uses time slicing and, depending on where the root cause is, possibly others too.