zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

Undesirable behaviour of z_clock_announce() when ticks argument is greater by 1 #12332

Open pizi-nordic opened 5 years ago

pizi-nordic commented 5 years ago

z_clock_announce() unconditionally adjusts the dticks of the first timeout in the list:

if (first() != NULL) {
    first()->dticks -= announce_remaining;
}

However, this adjustment is not always valid:

  1. If the adjusted timeout was on the list before the call to z_clock_announce(), we have to make this correction in order not to delay the next timeout.

  2. If the adjusted timeout belongs to a periodic k_timer which was handled during the current z_clock_announce() execution, we should make this correction if we do not want to accumulate error.

  3. If the adjusted timeout was added during the current z_clock_announce() execution but is not related to a periodic timer that was handled, we have to skip the adjustment. If we perform it, the newly scheduled timeout will fire earlier than expected.

This issue was initially mentioned in https://github.com/zephyrproject-rtos/zephyr/issues/11722. The affected use case is presented here: https://github.com/zephyrproject-rtos/zephyr/issues/12247#issuecomment-451456702

pizi-nordic commented 5 years ago

@andyross, @pabigot, @carlescufi: FYI

andyross commented 5 years ago

@pizi-nordic, this is getting out of hand. Not. A. Bug. I don't know how else to explain it. Existing tests require the behavior as implemented. What you are worrying about is an untested edge case that results from an API convention issue and happens only in latency crisis situations where interrupts have been delayed by more than a tick.

It can be fixed via API (and worked around right now), but it requires design and ideas and discussion. I tried to isolate that discussion in #12247. But now it's leaked out as a "bug" and I genuinely don't know what to do.

If you have code that actually fails and has no workaround, submit it as a test first and then submit the bug. But for now... I just don't know how else to deal with this. Stop this senseless argument. Please do not resubmit this without an exercising test case (and not a whiteboxed one!)

pabigot commented 5 years ago

We need consensus on whether this is a bug, since @pizi-nordic and I disagree with your assessment. If it is a bug, it may very well require API and other changes to resolve. That's fine, and not a motivation for closing it until we have consensus.

andyross commented 5 years ago

(And yes, I know you think you understand this better and that I'm missing something important. But without a failing test case it's not a bug, and I'm tired of typing about it)

andyross commented 5 years ago

Without a failing test case, it's not a bug. Please stop this senseless fight. There's a perfectly good discussion elsewhere.

pabigot commented 5 years ago

I'm reopening this after changing it to an enhancement and changing the description slightly.

I asked @pizi-nordic to open this as an issue because it had previously been discussed in comments to #11722, #12204, and #12247. At least two collaborators involved in those discussions believe the behavior of the current system does not meet our needs in cases of the sort described here, and that working around the existing behavior is not easy, if it is possible at all.

The concern raised is possibly but not necessarily related to #12247; for #11722/#12204 it was unrelated to what turned out to be the real issue, but resulted from the same code. Putting it in its own issue allows us to remember to address it if and when the timer architecture is revisited, and makes it visible to others who might have an opinion.

I understand there is disagreement on whether the behavior is a bug. I hope changing the issue to an enhancement is acceptable to everybody. If not, we have a triage meeting once a week where disagreements on categorization and validity of issues can be raised, discussed, and resolved in a less abrupt manner.

pabigot commented 5 years ago

I've been doing some work on a path toward alarms that use the system clock rather than the tick clock. In this context it is almost certain that the corresponding clock increment will be greater than one when carving off ready-to-run timeouts.

We don't have a test case that demonstrates the expected behavior described in this issue, but the adjustment to a newly added alarm can be eliminated without introducing a failure in any existing test (i.e., delays for rescheduled alarms are still relative to the original deadline). Code to do so can be seen on pabigot/zephyr:wip/20190106a.

This of course isn't an argument that the change should be made, but it does show that making the change is not difficult if it turns out to be desirable.

[Edit: Actually, no, the work-in-progress basically switches the implementation to #12247's reschedule management policy (b): delay timers reschedule with a delay. There's a second patch that's intended to restore latency-ignoring behavior within the current API, but it fails eight tests.]

pizi-nordic commented 5 years ago

@andyross: The test you want, the test you have :) https://github.com/pizi-nordic/zephyr/tree/timer-issue-12332/tests/kernel/timer/timer_issue_12332

This test validates only one thing: It fails if timer expired too early. The test scenario is described here: https://github.com/zephyrproject-rtos/zephyr/issues/12247#issuecomment-451456702

On my board the result is as follows (it looks like the z_clock_announce() issue is masked by something else):

***** Booting Zephyr OS zephyr-v1.13.0-3011-ge98e6a8a87 *****
Running test suite timer_api
===================================================================
starting test - test_issue_12332
Test Run 1 of 300...
 -> Timer 1 expired!
 -> Timer 2 expired! (after 47 ms)

    Assertion failed at /home/pizi/repos/zephyr/tests/kernel/timer/timer_issue_12332/src/main.c:52: timer2_expire: 50 <= elapsed_ms is false
pabigot commented 5 years ago

@pizi-nordic As an equal-opportunity complainer I have to say that test doesn't do what you think. 2ms of that 3ms lateness is due to the time it takes to printk the timer1 expiration within the callback. If the printk is removed it's 49.83 ms (1633 32 KiHz ticks), which is still 6 ticks too early to reach 50 ms.

Also the fact the check always fails on the first call suggests that either the test isn't correct or the timer system really is broken and a 50 ms timeout will normally complete in less than 50 ms. Adding the lets-synchronize-to-the-tick dance before timer1 is started doesn't make it succeed either. So I remain convinced timers in Zephyr are weird, but not that we've reproduced the theoretical problem.

pabigot commented 5 years ago

... a 50 ms timeout will normally complete in less than 50 ms.

That'd be it. A tick is 327 increments of the 32 KiHz clock. So 1 s as measured by Zephyr's tick clock on Nordic hardware is 997.92 ms in the real world. (So 50 "Zephyr-ms" is 49.896 ms).

andyross commented 5 years ago

Does the test fix itself with TICKLESS_KERNEL=n? That's the easiest way to rule the multi-tick expiry issue in or out.

pabigot commented 5 years ago

No, TICKLESS_KERNEL has no effect. The test can never pass because 50 ms isn't 50 ms. With enhanced output I get:

T1 1000 ms, T2 50 ms, 32768 sysclock per tick                                   
 -> Timer 2 run 2 expired 32700 ; 35643 - 34009 = 1634 ! (after 49865 us)       
 -> Timer 2 run 3 expired 32700 ; 68343 - 66709 = 1634 ! (after 49865 us)       
 -> Timer 2 run 4 expired 32700 ; 101043 - 99409 = 1634 ! (after 49865 us)      

where the displayed 32700 is the difference between the k_cycle_get_32() values captured in successive callbacks of timer 1, which is configured to callback every 1000 ms (32768 cycles). 49.865 ms is within one rounded-up system clock tick (30.5175... us) of the 49.896 ms interval that the timeout is set for when 50 ms is requested. So it's as accurate as it's going to be.

The test is unlikely to produce a multi-tick expiry issue; to do so there'd have to be an interrupt-blocking delay in excess of something like 20 ms somewhere.

andyross commented 5 years ago

Right right; what I meant is that if this test is exercising the bug it thinks it is, then it will always pass with tickless disabled, because all ticks come one at a time and the ambiguity we're discussing can't exist. I buy that there may be precision bugs too.

pabigot commented 5 years ago

That run I displayed is from # CONFIG_TICKLESS_KERNEL is not set, so it doesn't get fixed. The 1634 system clock ticks measured as latency is the same as the steady-state when CONFIG_TICKLESS_KERNEL=y.

andyross commented 5 years ago

Then I'm confused. You can't hit the situation @pizi-nordic is complaining about unless z_clock_announce() is called with a ticks argument of 2 or greater. The issue is whether or not it is correct to decrement timeouts that were "just added" to the list by another timeout handler. If the argument was one tick, then announce_remaining must have been decremented to zero inside the loop during the execution of the handler, and the bug won't happen. This is isomorphic to our old ticked behavior, thus it cannot happen if TICKLESS_KERNEL is unset.

If it is happening, then the problem is somewhere else. You can verify with some whiteboxed asserts in z_clock_announce(), if you like, to see that the decrement called out above never happens.

pabigot commented 5 years ago

Yes. What I've been trying to say all along is that this test is faulty: it never hits a condition where multiple ticks are being announced. It fails because of accumulated error in measuring the passage of time, not because the deadline was reduced by unaccounted ticks.

pabigot commented 5 years ago

I've corrected pabigot/zephyr:wip/20190106a to preserve the existing behavior (reschedule relative to original deadline) while avoiding the "undesirable" behavior (schedule relative to past time). I can't verify this without a test case, but by inspection I think it's right.

This patch passes sanitycheck, though one test passes only by retaining an override to ensure dticks is positive that I think is incorrect. Ways in which it differs from the existing implementation are documented in the commit message.

pizi-nordic commented 5 years ago

@andyross, @pabigot: That is true: the test does not trigger the z_clock_announce() issue. However, it shows the problematic behaviour that worries me: the timer fires prematurely, and it looks like we are happy with that. I apologize for the confusion I caused by sending a test that fails due to a different problem than the one discussed here.

static void timer1_expire(struct k_timer *timer)
{
    printk(" -> Timer 1 expired!\n");

    timer1_exp_timestamp = k_cycle_get_32();
    k_timer_start(&timer2, K_MSEC(T2_TIMEOUT), 0);
}

static void timer2_expire(struct k_timer *timer)
{
    unsigned int elapsed_cycles, elapsed_ms;

    elapsed_cycles = k_cycle_get_32() - timer1_exp_timestamp;
    elapsed_ms = MSEC_PER_SEC * elapsed_cycles /
             sys_clock_hw_cycles_per_sec();

    printk(" -> Timer 2 expired! (after %u ms)\n", elapsed_ms);

    /* TESTPOINT: Timer must never expire earlier than requested. */
    TIMER_ASSERT(T2_TIMEOUT <= elapsed_ms, &timer1);

    k_sem_give(&semaphore);
}

The time from k_timer_start(&timer2, K_MSEC(T2_TIMEOUT), 0); to elapsed_cycles = k_cycle_get_32() - timer1_exp_timestamp; should always be greater than T2_TIMEOUT, especially since there is overhead: I take the first timestamp before the timer starts, and the second inside the expiration function (which is running at that point, meaning the timer has already fired).

I have updated the test so that it does not fail due to the "50 ms is not 50 ms" issue (which was discussed a lot during the timer infrastructure rework and, as we can see, is still not fixed) and instead focuses on the z_clock_announce() problem discussed here. This is how the updated test behaves on my board:

Test Run 34 of 300 (will block interrupts for 34 ms)...
 -> Timer 1 expired!
 -> Timer 2 expired! (after 48 ms)
Test Run 35 of 300 (will block interrupts for 35 ms)...
 -> Timer 1 expired!
 -> Timer 2 expired! (after 48 ms)
 -> Timer 1 expired!
Test Run 36 of 300 (will block interrupts for 36 ms)...
 -> Timer 2 expired! (after 32 ms)

    Assertion failed at /home/pizi/repos/zephyr/tests/kernel/timer/timer_issue_12332/src/main.c:52: timer2_expire: 50 <= elapsed_ms + (50 / 10) is false

Also, if you would like to play a bit more with the timers, feel free to check out https://github.com/pizi-nordic/zephyr/commit/c670a3143443a7ba69c4a9d3d0104667bdbaef65 and run it. You will see that the timeout is rounded down to the nearest tick, which in my opinion is also not good.

Summarising: so far we have observed 3 problems in a simple program using 2 timers (I hope we all agree here):

pabigot commented 5 years ago

Oh, "we" are very much not happy about 50 ms not being 50 ms. That's a whole new issue, as is the rounding down the interval, and more generally the incredibly low resolution available to timers due to the implementation coupling them to the tick clock rather than the system clock. Discovering those here is good progress, and in my head I'm already drafting an issue that can cover them.

Does the test pass using the branch I mentioned where I believe I've fixed the problem with the timer firing too early because of multiple ticks in a z_clock_announce()?

pizi-nordic commented 5 years ago

Oh, "we" are very much not happy about 50 ms not being 50 ms. That's a whole new issue, as is the rounding down the interval, and more generally the incredibly low resolution available to timers due to the implementation coupling them to the tick clock rather than the system clock. Discovering those here is good progress, and in my head I'm already drafting an issue that can cover them.

What is a pity is that these issues are not so new... Please look at: https://github.com/zephyrproject-rtos/zephyr/issues/9904#issuecomment-420264184

Does the test pass using the branch I mentioned where I believe I've fixed the problem with the timer firing too early because of multiple ticks in a z_clock_announce()?

Unfortunately, blocking interrupts still makes timeouts shorter:

 -> Timer 2 expired! (after 48 ms)
Test Run 35 of 300 (will block interrupts for 35 ms)...
 -> Timer 1 expired!
 -> Timer 2 expired! (after 48 ms)
 -> Timer 1 expired!
Test Run 36 of 300 (will block interrupts for 36 ms)...
 -> Timer 2 expired! (after 42 ms)

    Assertion failed at /home/pizi/repos/zephyr-pabigot/tests/kernel/timer/timer_issue_12332/src/main.c:52: timer2_expire: 50 <= elapsed_ms + (50 / 10) is false
pabigot commented 5 years ago

@andyross CONFIG_TICKLESS_KERNEL=n does not affect the theoretical bug: all it does is ensure the system attempts to make sure ticks is always one when z_clock_announce() is invoked. The problem still occurs if the timer ISR is blocked so that multiple ticks are passed when it finally can call z_clock_announce(), that being the condition that produces the behavior.

@pizi-nordic Instrumenting the kernel timeout code confirms that the revised test is still not causing a premature decrease in the desired delay for T2, even though multiple ticks are passed to z_clock_announce() and the test shows there are anomalies in the delay.

Rather than try to figure out what it's doing wrong, my wip/20190108b branch has a revised sample (not a test, to avoid impacts from test infrastructure). It's simpler and forces the delay explicitly. Running on the current master it produces:

***** Booting Zephyr OS zephyr-v1.13.0-3079-gc4ab8dd1fb *****
T1 0x2000004c 1000 ms, T2 0x20000078 50 ms, 32768 sysclock per tick
Run 0 timer1 last 1020 interval 33355/1020; timer2 delay 1634/50 = 49865 us
Run 1 timer1 last 2020 interval 32700/1000; timer2 delay 1634/50 = 49865 us
Run 2 timer1 last 3020 interval 32700/1000; timer2 delay 1634/50 = 49865 us
blocking
Run 3 timer1 last 4020 interval 33866/1000; timer2 delay 468/50 = 14282 us
Run 4 timer1 last 5020 interval 31534/1000; timer2 delay 1634/50 = 49865 us
Run 5 timer1 last 6020 interval 32700/1000; timer2 delay 1634/50 = 49865 us

This shows that T1 was delayed by 1166 32 KiHz ticks (35.5 ms), and as a consequence the delay for T2 was reduced by that error, and so T2 fired early by that number of sysclock ticks.

Rebasing this on my wip/20190106a branch which attempts to preserve the requested delay I get this:

***** Booting Zephyr OS zephyr-v1.13.0-3079-g68435c755e *****
T1 0x2000004c 1000 ms, T2 0x20000078 50 ms, 32768 sysclock per tick
Run 0 timer1 last 1020 interval 33355/1020; timer2 delay 1634/50 = 49865 us
Run 1 timer1 last 2020 interval 32700/1000; timer2 delay 1634/50 = 49865 us
Run 2 timer1 last 3020 interval 32700/1000; timer2 delay 1634/50 = 49865 us
blocking
Run 3 timer1 last 4050 interval 33867/1030; timer2 delay 1448/50 = 44189 us
Run 4 timer1 last 5020 interval 31533/970; timer2 delay 1634/50 = 49865 us
Run 5 timer1 last 6020 interval 32700/1000; timer2 delay 1634/50 = 49865 us

Here the 1167 32 KiHz delay for T1 is not charged against the delay for T2. Also, the uptime clock shows that T1 executed 30 ms late, a behavior change documented in the commit message for the rework. Note that after the delay T1 continues to be invoked on the original schedule, accomplished by shortening the interval after the late invocation.

It's obviously still not right since T2 is early by 186 32 KiHz ticks, but it's on the right track. I suspect a mishandling of z_clock_elapsed() somewhere in the changes.

pizi-nordic commented 5 years ago

Thank you for your investigation, @pabigot. We are getting closer to having everything working.

@pizi-nordic Instrumenting the kernel timeout code confirms that the revised test is still not causing a premature decrease in the desired delay for T2, even though multiple ticks are passed to z_clock_announce() and the test shows there are anomalies in the delay.

I have to disagree. Please look at https://github.com/pizi-nordic/zephyr/commit/0f6f752e5235f7156031ec19ef370f40238e5d6e. The attached code detects the condition you mentioned: T1 adds the T2 timeout, which becomes first on the timeout list and is adjusted during the same call to z_clock_announce(). Here is the result:

Test Run 33 of 300 (will block interrupts for 33 ms)...                                                                                                                                        
 -> Timer 1 expired!                                                                                                                                                                           
 -> Timer 2 expired! (after 48 ms)                                                                                                                                                             
Test Run 34 of 300 (will block interrupts for 34 ms)...                                                                                                                                        
 -> Timer 1 expired!                                                                                                                                                                           
 -> Timer 2 expired! (after 48 ms)                                                                                                                                                             
Test Run 35 of 300 (will block interrupts for 35 ms)...                                                                                                                                        
 -> Timer 1 expired!                                                                                                                                                                           
 -> Timer 2 expired! (after 48 ms)                                                                                                                                                             
 -> Timer 1 expired!                                                                                                                                                                           
--- z_clock_announce(): Timer2 will fire prematuerly! ---                                                                                                                                      
Test Run 36 of 300 (will block interrupts for 36 ms)...                                                                                                                                        
 -> Timer 2 expired! (after 22 ms)                                                                                                                                                             

    Assertion failed at /home/pizi/repos/zephyr/tests/kernel/timer/timer_issue_12332/src/main.c:52: timer2_expire: 50 <= elapsed_ms + (50 / 10) is false
pabigot commented 5 years ago

OK; my instrumentation didn't produce such a case, but I probably didn't wait until the 36th iteration. All those printks cause problems of their own in timing.

pizi-nordic commented 5 years ago

All those printks cause problems of their own in timing.

That is true. But in that case we should observe late timer expiration, not premature expiration (especially since I am calling printk() under lock in z_clock_announce()).

pizi-nordic commented 5 years ago

@andyross: Could we agree that the behaviour visible on https://github.com/zephyrproject-rtos/zephyr/issues/12332#issuecomment-452369223 is a bug? If so, do you have an idea how to solve this?

pabigot commented 5 years ago

The base issue really is solved with my rework described in https://github.com/zephyrproject-rtos/zephyr/issues/12332#issuecomment-452351313.

The remaining T2 error visible there (T2 fires 186 32 KiHz increments early) is because the fractional part of the RTC counter's offset from the last recorded tick is discarded by z_clock_elapsed() when it converts to integral ticks, so the timeout deadline is calculated disregarding the fact that a significant portion of a tick had already passed. That'd be related to the "rounds down" problem @pizi-nordic noted.

As long as the timer resolution is 10 ms and deadlines are managed in increments of 10 ms there's no way to eliminate errors less than 10 ms. This comparison in your test can never be satisfied reliably because it only allows for 5 ms of error.

The fix for that part is to use native system clock counter values when managing deadlines so we don't accumulate truncation error.

pizi-nordic commented 5 years ago

The comparison you mention cannot be satisfied because we allow the timer to fire prematurely (due to rounding, in this case). If the rounding were correct, the timer would always fire later than requested due to the 10 ms tick abstraction (requested time: 10 ms, tick time: 9.97 ms, nearest time > 10 ms = 19.9 ms), satisfying the (timeout <= elapsed) requirement.

Putting the whole discussion in perspective (FYI: @carlescufi): Not so long ago we had a "Big timer rework" (https://github.com/zephyrproject-rtos/zephyr/pull/10160), which was intended to solve all our problems (https://github.com/zephyrproject-rtos/zephyr/issues/9904#issuecomment-420264184). Despite the suggestions, the delta-based timeout handling as well as the abstract tick as a time unit were kept in the design. Now all the problems we had remain, and correct management of delta-based timeouts is trickier than we thought. Could we stop here for a moment and re-think the situation?

Keeping all timeouts absolute will eliminate errors caused by late clock interrupts. Removing the tick abstraction and rounding all timeouts up to the nearest HW cycle will give us precise timers that fire with microsecond accuracy.

The statements above are not new. They keep popping up in every timer-related discussion. Why shouldn't we try this approach?

pabigot commented 5 years ago

I think we should. There are other reasons why the existing timer infrastructure doesn't meet my requirements, including that the API only supports 1 ms resolution. For low-power devices I should be able to sleep for periods significantly less than 1 ms, and I have no platform-independent way to do that.

andyross commented 5 years ago

FWIW: per a not-as-contentious-as-it-might-have-been call with most of the stakeholders just now, there's an emerging consensus for a fix that might make everyone happy. It works at the level of the k_timer_start() API (which is the higher level API, not the timeout internals).

The idea is the call takes two timeout values, a "duration" argument (IMHO poorly named) that indicates how long after the current time the first callback should arrive, and a "period" argument that indicates how long should pass between subsequent callbacks.

So we should make "duration" specify a "realtime" value based on current real time, irrespective of the current state of the timeout system (i.e. whether or not a multi-tick z_clock_announce() call is currently in progress). The "period" time, however, will honor the caller's intent to have a steady beat over time by adjusting its scheduling to account for late/missed interrupts.

That will be mildly hard to document, but probably not awful (realistically most users aren't going to care about the distinction). And it shouldn't be hard to implement. @pabigot has WIP code to be submitted that does something like this, I believe.

My only worry is that I'll wake up tomorrow and realize this scheme misses another edge case or has compatibility issues. I'll think on it.

andyross commented 5 years ago

Sorry, thought I had posted this but didn't. To repeat what was said in the call:

CONFIG_TICKLESS_KERNEL=n does not affect the theoretical bug: all it does is ensure the system attempts to make sure ticks is always one when z_clock_announce() is invoked.

That's really a bug in the timer drivers; I got too cute reusing the ISR implementations between tickless/non-tickless. The ticks argument to the z_clock_announce() call should always be 1 if tickless is not enabled, because that matches the behavior of true non-tickless systems.

Really the only value in disabling tickless on these drivers is code size. Given that we've never been able to detect or handle late interrupts, there's no reason to start now. We don't want to be doing that math if it isn't a requirement.

pabigot commented 5 years ago

FWIW: per a not-as-contentious-as-it-might-have-been call with most of the stakeholders just now, there's an emerging consensus for a fix that might make everyone happy. It works at the level of the k_timer_start() API (which is the higher level API, not the timeout internals).

Note that my solution does not change the API at all; it just clarifies how the existing arguments are interpreted (a documentation-only change). The behavioral modification is done internally through coordination between k_timer and the timeout code. If that's not what people expect, let's please wait until the PR is submitted and hash it out there.

That's really a bug in the timer drivers; I got too cute reusing the ISR implementations between tickless/non-tickless. The ticks argument to the z_clock_announce() call should always be 1 if tickless is not enabled, because that matches the behavior of true non-tickless systems.

I remain skeptical that it's that simple, but that can be hashed out in a separate issue.

pizi-nordic commented 5 years ago

That's really a bug in the timer drivers; I got too cute reusing the ISR implementations between tickless/non-tickless. The ticks argument to the z_clock_announce() call should always be 1 if tickless is not enabled, because that matches the behavior of true non-tickless systems.

I think you chose a good direction in making the timer drivers common. If we implement the z_clock_announce(1) requirement on a non-tickless system, we will end up in one of two situations:

andyross commented 5 years ago

Honestly I don't care whether this really gets fixed or not. The point is that the current state represents a regression from the way timekeeping was done prior to the rewrite, which honored single ticks as the sole reference for timekeeping.

With the current architecture, the only reason you'd ever choose to disable TICKLESS_KERNEL is for code size reasons. And if that's the case, then you almost certainly don't care about theoretical behavior and just want the way Zephyr (and basically everything) has worked since the invention of timer interrupts in the 1960's.

carlescufi commented 5 years ago

@andyross @pabigot @pizi-nordic

The proposal to solve the now months-old debate about kernel time vs. real time is described here:

FWIW: per a not-as-contentious-as-it-might-have-been call with most of the stakeholders just now, there's an emerging consensus for a fix that might make everyone happy. It works at the level of the k_timer_start() API (which is the higher level API, not the timeout internals). The idea is the call takes two timeout values, a "duration" argument (IMHO poorly named) that indicates how long after the current time the first callback should arrive, and a "period" argument that indicates how long should pass between subsequent callbacks.

I just want to make sure that, regardless of other considerations, this is still considered a good solution for all the people involved here. Can we get an acknowledgement in that direction?

pizi-nordic commented 5 years ago

I think that the proposed solution will fix k_timer. However the same problem could affect scheduling (causing premature thread wake-up) and we will have to fix that too.

pabigot commented 5 years ago

tl;dr: the proposal is still a good solution; it does not address the general question.

I have many concerns with k_timer. Once #12400 is merged, then #12248 can be rebased and merged, then I can submit the solution for the particular problem outlined in this issue. (Got a nice test and implementation, just waiting on the blockers....)

"kernel time" and "real time" are poorly defined. We know from #12409 that there is a non-linear relation between the system clock (k_uptime_get()) and the hardware clock (k_cycle_get()) when tickless is disabled; this probably cannot/should not be changed as it's historical and has a reasonable justification. In tickless mode the relationship should be linear, but it has an inherent 0.2% error for Nordic at 100 Hz ticks, 0.02% at 1000 Hz ticks. The error reduces to zero at 32768 Hz ticks, but the maximum documented Zephyr typical tick rate is 1000 Hz and somewhere there's a comment warning about exceeding that.

At this time I'm considering prototyping a k_alarm infrastructure that will use the undivided hardware clock and take more care to deal with any remaining truncation due to unit conversion. The main driver for that effort is to eliminate the continuing impact of the concept of a "tick" on timer operations, which currently cannot express sub-millisecond sleeps, a feature lack that for me blocks use of Zephyr in low-power applications. k_timer could be implemented on top of the new infrastructure. All that, though, is a different issue.

pizi-nordic commented 5 years ago

@pabigot: I fully support removing the tick abstraction and sticking to (optionally prescaled) hardware cycles. The same was suggested months ago. However, I am not sure we should implement another subsystem alongside the existing timeout management (I assume you would like a clean start :)). In my opinion that direction will consume much more effort.

zephyrbot commented 5 months ago

Hi @nordic-krch,

This issue, marked as an Enhancement, was opened a while ago and did not get any traction. Please confirm the issue is correctly assigned and re-assign it otherwise.

Please take a moment to review if the issue is still relevant to the project. If it is, please provide feedback and direction on how to move forward. If it is not, has already been addressed, is a duplicate, or is no longer relevant, please close it with a short comment explaining the reason.

@pizi-nordic you are also encouraged to help moving this issue forward by providing additional information and confirming this request/issue is still relevant to you.

Thanks!