zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.65k stars 6.52k forks source link

k_sleep/k_msleep ends too early on UP_squared board #26271

Closed starmachine24 closed 4 years ago

starmachine24 commented 4 years ago

Hi

It seems that the k_sleep/k_msleep ends too early on our board and this happens no matter how many tasks we run in the system. We are running version 2.2.99. (forgot to mention this at first)

We have a task in the system that executes for one tick and has a periodicity of 8 ticks. Then it should go to sleep until its period comes up again. The deadline is set from a point in time that is counted as 0.

Our system is running with a tick rate of 1000 ticks/s (1ms) which means that it takes 19200 cycles per tick as the system clock is running on 19200000Hz.

The CTF trace is read using babeltrace, now is our zero in the system. start_of_work and end of work are basically markers for when a task start and end it's work.

The task starts running on basically tick 0 and then as mentioned before runs for one tick and then calculates the difference between the current time and the deadline, and sleeps for that amount of time.

[00000000000376635624] now_is: { now = 376635624 }
[00000000000376635682] start_of_work: {name = "Task 1" }
[00000000000376648426] isr_enter:
[00000000000376648443] start_call:
[00000000000376648450] end_call:
[00000000000376654942] end_of_work: {name = "Task 1" }
[00000000000376654953] current_deadline: {name = "Task 1", deadline = 376789224 }
[00000000000376654989] diff_cycles_to_deadline: {name = "Task 1", diff = 134238 }

Here we can see that the task has a deadline of 376789224 which equates to tick 8 from our zero. But after it has finished it's work, it will update it's deadline to set a new one and then sleep to the old deadline which is in this case is tick 8, as the new deadline will be tick 16. And the task will sleep for diff amount of milliseconds (in this case its 134238 / 19200 = 6.99 ticks).

Which means that the task should start again sometime around tick 8, might be a little before or a little after but around it.

[00000000000376753979] start_of_work: { name = "Task 1" } starts at tick 6.16
[00000000000376773191] end_of_work: { name = "Task 1" }

Here we can see that the task starts way too early to be around tick 8 as: (376789224 - 376753979) / 19200 = 1.835677 difference in ticks. Which means that it starts around 2 ticks too early and we have no clue what the problem is.

andyross commented 4 years ago

I'm a little confused from that description. Can you give the code (or at least some pseudocode that shows me the structure) that implements your deadline algorithm? Are your timestamps the result of k_cycle_get() or some other mechanism (in particular are you reading the TSC instead)? Are you using the HPET driver (the default for up_squared) or APIC?

Your assumption is absolutely correct: k_sleep() (and wrappers) is required to sleep for at least the amount of time specified, so if you're grabbing a cycle count before and after a sleep and it's less than the duration, that's a bug.

andyross commented 4 years ago

I happen to have an up_squared rigged up on my desk right now, and trying the simple code below produces exactly the results I expect (log snippet below). So something more complicated must be going on.

#include <zephyr.h>
#include <sys/printk.h>

void main(void)
{
    printk("Hello World! %s\n", CONFIG_BOARD);

    while (true) {
        uint32_t dt, t0 = k_cycle_get_32();

        k_sleep(K_TICKS(6));
        dt = k_cycle_get_32() - t0;

        printk("dt %d ticks %d\n", dt, k_cyc_to_ticks_floor32(dt));
    }
}
*** Booting Zephyr OS build zephyr-v2.3.0-442-g150e18de6e18  ***
Hello World! up_squared_32
dt 12312 ticks 6
dt 12321 ticks 6
dt 12305 ticks 6
dt 12306 ticks 6
dt 12312 ticks 6
dt 12320 ticks 6
dt 12289 ticks 6
dt 12304 ticks 6
...
starmachine24 commented 4 years ago

Ok so we tested taking timestamps berfore and after the sleep and also printing out the duration we wanted the task to sleep for.

We are using the default setting for the clock in the system.

[00000000000330889986] now_is: { now = 330889986 }
[00000000000330909340] before_sleep: { thread_id = 1080512, name = "Task 1", cycles = 330909336 }
[00000000000330909351] sleep_dur_is: { sleep_dur = 134236 }
[00000000000330909375] thread_suspend: { thread_id = 1080512 }
[00000000000330909381] thread_switched_out: { thread_id = 1080512 }
[00000000000330909387] thread_switched_in: { thread_id = 1080512 }
[00000000000331008426] thread_resume: { thread_id = 1080512 }
[00000000000331008433] thread_ready: { thread_id = 1080512 }
[00000000000331008456] thread_switched_in: { thread_id = 1080512 }
[00000000000331008467] after_sleep: { thread_id = 1080512, name = "Task 1", cycles = 331008465 }

So if we take the difference between the before and after_sleep we get the duration of the sleep: 331008465 - 330909336 = 99129 / 19200 = 5.16 ticks. While our calculated sleep_dur is: 134236 / 19200 = 6.99 ticks.

but the next instance of the task starts earlier than expected based on these numbers.

[00000000000331008475] start_of_work: { thread_id = 1080512, name = "Task 1" }

Task next instance = (331008475 - 330889986) / 19200 = 6.17 Which means that the task start at tick 6 which is odd based on the calculated sleep duration.

So we made some changes to how the deadlines are set, so now they are set absolutely, because we wanted easier periodicity. We also changed how it determines the priority of a task when determining which task to be run when EDF is used. We changed it so that it only compares the deadlines without subtracting the current time.

I've added these changes to a file so you can have a look at them in case that might be the problem but also so that you are able to run the program with the same changes.

The CJ_busy_wait is our version of a busy_wait because the currently implemented version in Zephyr does not account for preemption which could lead to the busy wait being shorter if preempted.

files.zip

andyross commented 4 years ago

I'm still a little confused. The code you posted doesn't correspond to the logging above. Where are the bits where you're capturing "before_sleep" and "after_sleep", and logging the calculated sleep duration? I'll give this a shot on my board later today; nothing looks immediately weird as far as timeout subsystem usage.

As far as your changes to the API convention of k_thread_deadline_set(), none of that will be relevant to the behavior of the timeouts, totally separate areas of the kernel.

starmachine24 commented 4 years ago

Sorry that was our fault, accidently sent the wrong file

Could this problem with the sleep be related to our board only? i.e. that the clock or something similar is faulty and that causes the issue?

program.zip

github-actions[bot] commented 4 years ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.