zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.43k stars 6.4k forks source link

it8xxx2_evb: Combined timer anomaly after boot, making k_busy_wait() misbehave. #76022

Closed edersondisouza closed 2 weeks ago

edersondisouza commented 1 month ago

Describe the bug Several timer/clock related tests fail after rebooting the it8xxx2_evb. A reboot that triggers the anomaly is uncommon, usually < 5% of time. But when it happens, the anomaly will persist until next reboot. The offending reboot can be detected because the board doesn't start properly - it won't print anything on the serial, but a subsequent flash will work just fine, with the timer anomaly.

The tests failure can be traced to k_busy_wait() taking a different time from that expected. When the anomaly is happening, the k_busy_wait() seems to happen at 3x the expected speed, for instance, a 10s wait will take 3.3s. Note that other methods to count the time will also note this discrepancy (e.g., by using k_cycle_get_32()), hence the tests failures.

On the it8xxx2_evb, there's a custom busy wait, which uses the "combined timer" feature to track the passage of time. When the anomaly is active, the combined timer itself behaves anomalously, but not the other timers. Here I made some tests and noticed that:

From the datasheet and Zephyr timer initialisation code, I couldn't see any obvious issue. I can't rule out a defective hardware, but I'm creating this issue to see if anyone saw something similar and has any input.

To Reproduce

Keep rebooting the board (in my tests, it seems it happens more likely with a seemingly random combination of rebooting the board, the debug one or both). At some point, the application flashed won't start (no output on serial). After that, flash it with an application like the following, and it will print !!! k_busy_wait() MISMATCH! !!!. Several tests that rely on k_busy_wait() will also fail.

#include <stdlib.h>

#include <zephyr/kernel.h>

#ifndef TICKS_TOLERANCE
#define TICKS_TOLERANCE 1
#endif

#define WAIT_US 100000

int main(void)
{
    uint32_t a, b, exp;

    a = k_cycle_get_32();
    k_busy_wait(WAIT_US);
    b = k_cycle_get_32();

    exp = k_us_to_cyc_ceil32(WAIT_US);

    printk("a: %u, b: %u, exp: %u, real: %u cyc per tick: %u\n", a, b, exp, b - a, k_ticks_to_cyc_ceil32(TICKS_TOLERANCE));

    if (abs((b - a) - exp) > k_ticks_to_cyc_ceil32(TICKS_TOLERANCE)) {
        printk("!!! k_busy_wait() MISMATCH! !!!\n");
        return 1;
    } else {
        printk("!!! k_busy_wait() OK! !!!\n");
    }

    return 0;
}

Expected behavior

k_busy_wait() time should agree with other timers in the system, as well as "real world". Several timer/clock related tests also fail.

Impact

Tests fail.

Logs and console output

See "additional context".

Environment (please complete the following information):

Additional context

Some of the tests that will fail:

FAILED: tests/kernel/sched/schedule_api/kernel.scheduler.dumb_timeslicing 
FAILED: tests/kernel/sched/schedule_api/kernel.scheduler.multiq 
FAILED: tests/kernel/sched/schedule_api/kernel.scheduler.slice_perthread 
FAILED: tests/kernel/sched/schedule_api/kernel.scheduler

FAILED: tests/kernel/timer/timer_api/kernel.timer.tickless
FAILED: tests/kernel/timer/timer_api/kernel.timer

FAILED: tests/kernel/common/kernel.common.picolibc
FAILED: tests/kernel/common/kernel.common.nano64
FAILED: tests/kernel/common/kernel.common.nano32
FAILED: tests/kernel/common/kernel.common.misra
FAILED: tests/kernel/common/kernel.common
FAILED: tests/kernel/common/kernel.common.tls
FAILED: tests/kernel/common/kernel.common.minimallibc 

Some more details for some of those:

...
===================================================================
START - test_ms_time_duration Assertion failed at WEST_TOPDIR/os.rtos.zephyr.zephyr/tests/kernel/common/src/clock.c:225:
clock_test_ms_time_duration: (tdata.duration_count == 1 is false)  duration 0 not 1
FAIL - test_ms_time_duration in 0.075 seconds
=================================================================== 
...
...
=================================================================== 
START - test_time_slicing_preemptible Assertion failed at WEST_TOPDIR/os.rtos.zephyr.zephyr/tests/kernel/sched/schedule_api/src/test_sched_timeslice_and_lock.c:255: threads_scheduling_test_time_slicing_preemptible: tdata[1].executed == 1 is false 
FAIL - test_time_slicing_preemptible in 0.188 seconds 
=================================================================== 
START - test_unlock_nested_sched_lock Assertion failed at WEST_TOPDIR/os.rtos.zephyr.zephyr/tests/kernel/sched/schedule_api/src/test_sched_timeslice_and_lock.c:418: threads_scheduling_test_unlock_nested_sched_lock: tdata[2].executed == 0 is false 
FAIL - test_unlock_nested_sched_lock in 0.055 seconds 
=================================================================== 
START - test_unlock_preemptible Assertion failed at WEST_TOPDIR/os.rtos.zephyr.zephyr/tests/kernel/sched/schedule_api/src/test_sched_timeslice_and_lock.c:367: threads_scheduling_test_unlock_preemptible: tdata[2].executed == 0 is false 
FAIL - test_unlock_preemptible in 0.054 seconds 
=================================================================== 
...
Dino-Li commented 1 month ago

@edersondisouza how did you reboot the SOC? Asserting its #WRST pin? Symptom of timer running 3 times faster than expected should be caused by a failure to change PLL. When SOC is running change PLL sequence, the following HW mechanism will prevent SOC from entering sleep mode and cause the sequence to fail:

Or the patch can be used as a workaround for this symptom. NOTE: after applying the patch, clock frequence of i2c0, i2c1, and i2c2 can only run at 400kHz. i2c testing requires revalidation.

edersondisouza commented 1 month ago

@Dino-Li I'm simply taking the power off - the board is connected to a ykush, which is used to power the USB port on and off. Will try those and report, thanks!

pkunieck commented 1 month ago

We checked on backup board, and currently we can't reproduce this issue. However, I connected board that is failing to my debug machine and when i change the code following patch the board is stable, but I will run more iterations of tests. Also I checked states on pins: GPM5 low GPH1 low GPE0 high GPA4 high

pkunieck commented 2 weeks ago

we can't reproduce this issue on backup board, closing for now, will reopen if the issue appear on new board