kernel idling / timer API, idle parameter limitations

zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.

https://docs.zephyrproject.org

Apache License 2.0

10.84k stars 6.6k forks source link

kernel idling / timer API, idle parameter limitations #30590

Open noelleclement opened 3 years ago

noelleclement commented 3 years ago

Is your enhancement proposal related to a problem? Please describe. At the moment the only way that [timer API's] z_clock_set_timeout() can be called with is_idle=true, is in [kernel/timeout.c] z_set_timeout_expiry(), with the condition that the new timeout will only be set (z_clock_set_timeout() is called with that timeout value) when it is sooner than the existing (soonest?) timeout. This creates the consequence that the is_idle parameter in z_clock_set_timeout() can only be true when this condition is met.

When combined with the power management subsystem this can potentially create situations where is_idle=true, but z_set_timeout_expiry() does not call z_clock_set_timeout(), before the power management subsystem is triggered. This can create problems when actions in the timer driver need to be taken specifically when the kernel goes into the idle-state.

Besides (/ because of?) this problem, the is_idle parameter in z_clock_set_timeout() isn't used in most, if not all, timer drivers. This seems inefficient, since it must have been put there for a reason, and I personally see large advantages when it can be used properly, especially regarding power management.

Describe the solution you'd like I'm not fully sure what the appropriate solution would be for this issue, since it ties into multiple other things. Comments are highly appreciated and requested.

However, here are my 2cents:

First of all I think it would be useful to document this dependency clearly in /include/drivers/timer/system_timer.h, since at the moment the idle parameter is still part of the timer API.

Secondly, I think it would be beneficial to keep the parameter, but then a larger solution is required in the kernel. Probably an analysis is required of the extend of the problem, i.e. are there indeed cases where the requested new timeout isn't earlier than the existing soonest timeout, but the idle-state should still be communicated to the timer driver? I can imagine it's also useful to analyse whether this truly should be the only location that z_clock_set_timeout() is called with is_idle=true.

Additional context & considered alternatives I ran into this issue during the development of a STM32 RTC timer driver. The vision for the implementation was to base the alarm configuration on the idle parameter in z_clock_set_timeout(). For non-idle, the subsecond alarm would be configured and kept, for a higher-granularity (1/128 sec) 'system tick' (but still with the tickless kernel config). For idle, subsecond 'system tick' disabled and the calendar alarm could be configured for a low-granularity (1sec) wake-up alarm.

I would like to point out that I'm still in the middle of developing this driver, so there could still be some bugs in there. If the condition in z_set_timeout_expiry() *should* cover all cases, please explain so below.

In my observations during debugging/testing I have found the following situation to occur:

z_clock_set_timeout() is called with the expected higher (thread-sleep) ticks duration (long enough to trigger the power management subsystem), idle=false
[timer specific timeout actions, meaning in my current implementation the non-idle timeout actions are taken]
z_set_timeout_expiry() is called, is_idle=true & the requested timeout being equal or 1 tick more than the currently set next_timeout (soonest timeout) > condition not met, timer driver not informed of idle state to come
power management subsystem actions: SoC specific sleep state entry actions taken
'wrong' timer timeout actions taken, undesired result

Because of this, I have been thinking about the fact that this could also suggest that (intentionally) the idle-state can only be triggered when the timeout has already been set, and the final z_clock_set_timeout() call in z_set_timeout_expiry() is to ensure that it truly is the timeout value that needs to be set before entering idle (including the power management actions). If this is the case, I'm not fully sure why the idle parameter has been included in z_clock_set_timeout().

If this issue isn't changed in the kernel I could still probably figure out a solution in the timer driver to still get (part of) the result I want. However, I think it would be smart to think about whether this currently limited functionality of the idle parameter in the timer API is desired, since it in turn could have a large impact on the power management subsystem if development on this keeps on continuing.

noelleclement commented 3 years ago

Issue created per suggestion by @pabigot. Referenced limitation in kernel/timeout commited by @andyross (#12204), probably would be great to get your perspective on this too!

nashif commented 3 years ago

@andyross Can you please take a look?

pabigot commented 3 years ago

Background: See also #28572 which I believe is related, in particular this comment. Also SYSTEM_CLOCK_SLOPPY_IDLE. I see only these system timers doing something with the idle parameter: arcv2 arm_arch cortex_m_systick hpet mchp_xec_rtos riscv_machine

I'm not sure how much this should be a question focused on the current idle parameter and how it does or should behave, or a higher level question of what infrastructure do we need to be able to support power modes where the system clock is shut down and a wakeup will be initiated by a time-based external event (which may mean the current approach is not what we want at all). AFAICT the idle parameter in z_set_timeout_expiry() is essentially independent of current power management plans, depending only on TICKLESS_IDLE_THRESH.

In any case, I'd like to propose these architectural concepts in hopes we all agree on the basic behavior of the system clock and its relation to Zephyr time:

We should expect that exactly one timing infrastructure is used for the timeout subsystem curr_tick counter, that infrastructure being the one selected as the system clock. The complexities of running a high-frequency clock in one power mode, then a completely different low-power clock in another (non-suspend) mode, should be avoided if at all possible.
The system clock should be monotonic non-decreasing from startup until the system enters a power mode where it is no longer running. When TICKLESS the running clock should advance at a nominal constant rate corresponding to TICKS_PER_SEC.
When resuming from a power mode where the system clock has stopped the clock may be discontinuous, i.e. it may reset to zero. However, the restarted clock must coordinate with the timeout infrastructure such that in any state that retains curr_tick then the timeout infrastructure should observe one of these two behaviors:
- the tick counter continues to advance from its last value with no record of elapsed time while the system clock was stopped; or
- during wakeup the tick counter is atomically advanced by a value corresponding to an estimate of elapsed time while the system clock was stopped.

In short, the tick counter must always be monotonic non-decreasing, at a nominal fixed rate, but may have discontinuities in its relation to the passage of real time.

In the cases where not all context is lost, this would mean that pending alarms would fire when the system wakes up in an order consistent with the system's concept of the passage of time.

Skipping some steps, what this suggests to me is that scheduled wakeups should not (necessarily) be performed through the standard timeout infrastructure. But I'm not sure how to deal with them instead. One possibility is to have z_set_timeout_expiry() divert to something other than z_clock_set_timeout() when is_idle is true.

noelleclement commented 3 years ago

So, I obviously don't have as much experience with this as you guys (and my knowledge is limited to the STM32 SoCs I've worked with), but I feel like these points are important to take into account if a larger architectural change is going to be considered:

We should expect that exactly one timing infrastructure is used for the timeout subsystem curr_tick counter, that infrastructure being the one selected as the system clock. The complexities of running a high-frequency clock in one power mode, then a completely different low-power clock in another (non-suspend) mode, should be avoided if at all possible.

I think it's important to note that this (obviously) requires the chosen kernel system timer clock source to be available at all (available / chosen for SoC) power modes then. This will be very important to state clearly in the documentation. This might be debatable (and also dependent on how it is configured / implemented for each SoC that implements power management), but this IMO inherently means that if one would want to have the same kernel timer clock source for all power modes, probably the (timer) clock source needs to be chosen that is available during the lowest possible power mode, where context is still kept (since this isn't an issue with deepsleep, where context isn't kept and the system timer starts counting from 0 again, right?). Also, an approach could be chosen where depending on the selected SoC, and the (application-)selected sleep mode(s), automatically the kernel system timer is selected.

When resuming from a power mode where the system clock has stopped the clock may be discontinuous, i.e. it may reset to zero.

What situation would cause this to happen? Wouldn't this only happen when the kernel system clock source isn't 'supported' by the low-power mode, in which case it shouldn't be used in the first place (since then it will not be known how much time has elapsed during the low-power mode)? Or are there implementations where the amount of time spent in the low-power mode (and where the kernel system clock source == SoC high frequency system clock source that isn't available during this low-power mode) doesn't matter? I'm curious to what extend the deep-sleep infrastructure would overlap with this, or whether there's a use-case for which this is specifically required (i.e. context is kept, but kernel system timer clock source is reset).

///

I guess all of this also ties into the fact that with these types of situations the kernel system timer clock source will be different than the SoC system clock source. This raised the following issue for me: There are multiple locations in different types of drivers where the (STM32) clock configuration assumes that the SoC system clock has the same source as the kernel system clock. In the case of our chosen SoC (STM32L1x) this is not possible, as the SoC system clock can only be based on HSI, HSE or MSI, but only LSI/LSE are available during the lower power modes. This inherently creates two clock sources that need to be used and 'handled' during configuration in the kernel. AFAICT the STM32 clock_control driver is able to handle this difference (although the kernel system timer clock configuration is now being done in the timer driver), but in some drivers I found that sys_clock_hw_cycles_per_sec() is used for timing calculations that don't necessarily have to do with the kernel timing but with the hardware timing. It might be necessary to split the clock configuration up in two different subsystems, or to state more clearly which configuration is for the kernel system timer clock source, and which for the SoC system clock, to ensure that the different configurations don't interfere with each other.

///

In the cases where not all context is lost, this would mean that pending alarms would fire when the system wakes up in an order consistent with the system's concept of the passage of time.

Could you elaborate what you mean by this, what types of alarms are you referring to?

///

Lastly, SYSTEM_CLOCK_SLOPPY_IDLE sounds very interesting. To what extend is this already supported? And what is meant by 'maintain a correct system uptime count'? What is 'correct'? Meaning it's continuous, or does it refer to the precision, or something else entirely?

/////

Besides the questions, a summary of my suggestions:

Potential auto-select kernel system timer, based on SoC and application-selected power modes (to ensure that proper kernel system timer clock source is configured, when power management selected)
Split up / adapt clock configuration (clock_control API), for SoC system clock != kernel system timer clock source

pabigot commented 3 years ago

We should expect that exactly one timing infrastructure is used for the timeout subsystem curr_tick counter

I think it's important to note that this (obviously) requires the chosen kernel system timer clock source to be available at all (available / chosen for SoC) power modes then.

Not exactly. I tried to word the concepts so that the system clock can be turned off (stop advancing) in a sleep mode without losing context, including pending alarms. However, if nothing returns the system to a mode where the clock is running before the first deadline, when the alarm fires depends on the clock restoration process (all fire at once if the clock jumps to cover the sleep duration, all fire with a constant late offset if the duration isn't applied).

pending alarms would fire when the system wakes up in an order consistent with the system's concept of the passage of time. Could you elaborate what you mean by this, what types of alarms are you referring to?

Anything that uses struct _timeout and is processed by z_clock_announce() from the system clock interrupt. That includes k_alarm, k_delayed_work, k_sleep, anything using K_MSEC() and related helpers, ....

When resuming from a power mode where the system clock has stopped the clock may be discontinuous, i.e. it may reset to zero.

What situation would cause this to happen?

I was being imprecise. I posit three elements:

The timing infrastructure (counter advancing at a nominal constant rate) that is used by the clock driver;
The clock driver selected as the system clock
The timeout infrastructure (curr_ticks counter) used for all timing in Zephyr[%] which advances based on deltas between clock driver counter values. (% except the architecture-specific cycle counter)

I should have said something like:

When resuming from a power mode where the system clock has stopped the timing infrastructure (counter) underlying it may reset to zero

because some hardware counters will always reset to zero when they're started. In this case a naively-implemented clock driver may improperly subtract a captured pre-sleep value from the current value, and return a wildly inaccurate estimate of elapsed time since the last tick.

I should also clarify:

monotonic never goes backwards
continuous the span of time between two consecutive instants (counter values) is (nominally) constant. Think "linear". (NB: This term is not precisely defined; this is my interpretation from its use in ISO/WD 8601-1 section 2.1.4. My definition may instead describe "continuous without discontinuities" (!). Dealing with time is messy; just ask The Doctor.)

So with this terminology a 1 Hz clock that is stopped when it reaches 10 and restarted 5 seconds later is:

not monotonic if the next instant it produces is 0;
monotonic but not continuous if the next instant it produces is 11;
continuous if the next instant it produces is 15.

Lastly, SYSTEM_CLOCK_SLOPPY_IDLE sounds very interesting. To what extend is this already supported? And what is meant by 'maintain a correct system uptime count'? What is 'correct'? Meaning it's continuous, or does it refer to the precision, or something else entirely?

I don't know the motivation behind it, but assume it's something like what I was describing above, where the counter underlying the system clock can be turned off to save power.

My architectural concepts were intended as a first step of defining what "correct system uptime" means. It requires a monotonic uptime count, but allows for a discontinuous one (where passage of Zephyr time is not in one-to-one correspondence with wall clock time). This continues the original Zephyr time model where all time is based on ticks that are from a counter, and can (with CONFIG_TICKLESS_KERNEL=n) advance at a non-constant rate triggered by system events that may not be related to passage of time.

Applications must be able to determine whether the implemented uptime behavior is monotonic discontinuous or continuous; if it's discontinuous an external RTC may need to be used with infrastructure like #28977 to produce a continuous clock.

noelleclement commented 3 years ago

@pabigot thanks for the clarifications! What you're essentially proposing is to split up the (kernel system clock) timer drivers and timeout infrastructure, right? To allow for more specific applications, especially regarding power management?

Thinking out loud: Essentially this would make a use-case possible where the high-frequency clock is still used as the kernel system clock (becoming discontinuous when power management is used, if its clock source shuts off during the low-power mode), right? A wake-up interrupt source would still need to be configured, but it doesn't have to be the kernel system clock source. Is that a desired (unintended) infrastructural side-effect?

My architectural concepts were intended as a first step of defining what "correct system uptime" means. It requires a monotonic uptime count, but allows for a discontinuous one (where passage of Zephyr time is not in one-to-one correspondence with wall clock time). This continues the original Zephyr time model where all time is based on ticks that are from a counter, and can (with CONFIG_TICKLESS_KERNEL=n) advance at a non-constant rate triggered by system events that may not be related to passage of time.

What if it is required / assumed that the kernel timing is linear wall clock time? There's many applications where this is required, but would this mean that that would not be possible if power management is also enabled? I would say a solution is also required where either the kernel system clock is either discontinuous (and thread scheduling can't be guaranteed to reflect wall clock time), or continuous (but where the proper timer infrastructure needs to be provided to 'track' time during sleep), regardless of the kernel being tickless. The latter essentially is what currently is allowed/possible, right?

pabigot commented 3 years ago

What you're essentially proposing is to split up the (kernel system clock) timer drivers and timeout infrastructure, right?

I'm not sure exactly what you mean. The timeout infrastructure (k_uptime_ticks(), struct _timeout) should always be directly driven by the state of the system clock driver (<drivers/timer/system_timer.h>) selected as the system clock.

What I'm proposing is that the system clock driver API be augmented with features to support it being turned off during periods where struct _timeout time-triggered events will not be processed. Time-triggered events from other sources (external or SoC RTCs, for example) should still work in that state. Then when the clock gets turned on again the effect of time that passed while it was off being must be applied to to the timeout infrastructure, either directly or as a result of adjusting the system clock's internal counter.

Thinking out loud: Essentially this would make a use-case possible where the high-frequency clock is still used as the kernel system clock (becoming discontinuous when power management is used, if its clock source shuts off during the low-power mode), right? A wake-up interrupt source would still need to be configured, but it doesn't have to be the kernel system clock source. Is that a desired (unintended) infrastructural side-effect?

Yes! That's what I've been suggesting all along as a solution to your problem: keep the standard STM32 clock while the processor is running in a normal mode, and adjust its state using an always-running lower-frequency lower-power clock at power management events.

(What I don't know about your use case is whether you expect the CPU to do anything during the periods while the system clock is off except respond to external interrupts in order to restore full functionality, including restoring the system clock. I expect the system clock to run whenever the power management state is PM_STATE_ACTIVE or PM_STATE_RUNTIME_IDLE, but perhaps not in PM_STATE_STANDBY, and probably not in PM_STATE_SUSPEND. Is that consistent with your expectation?)

What if it is required / assumed that the kernel timing is linear wall clock time? ... I would say a solution is also required where either the kernel system clock is either discontinuous (and thread scheduling can't be guaranteed to reflect wall clock time), or continuous (but where the proper timer infrastructure needs to be provided to 'track' time during sleep), regardless of the kernel being tickless. The latter essentially is what currently is allowed/possible, right?

That's sort of what tickless/non-tickless gives. I want it to be supported better for tickless with the clock being turned off in certain power modes. Two gaps are API to advance the tick counter to cover time that passed while the clock was off %, and #28977 to support translating between kernel timestamps and an external accurate wall-clock source.

% it may be as simple as a single call to z_clock_announce() to provide the effect of the elapsed time, in which case no special handling is required.

noelleclement commented 3 years ago

Yes! That's what I've been suggesting all along as a solution to your problem: keep the standard STM32 clock while the processor is running in a normal mode, and adjust its state using an always-running lower-frequency lower-power clock at power management events.

Haha okay YES we're now on the same page then :)

(What I don't know about your use case is whether you expect the CPU to do anything during the periods while the system clock is off except respond to external interrupts in order to restore full functionality, including restoring the system clock. I expect the system clock to run whenever the power management state is PM_STATE_ACTIVE or PM_STATE_RUNTIME_IDLE, but perhaps not in PM_STATE_STANDBY, and probably not in PM_STATE_SUSPEND. Is that consistent with your expectation?)

In our case the CPU isn't expected to do anything while the system clock is off. The RTC generates the external interrupt, and has a different source that is still available during the used sleep mode.
So yes, I would say that that is consistent with our needs.

% it may be as simple as a single call to z_clock_announce() to provide the effect of the elapsed time, in which case no special handling is required.

What I find interesting is that this is similar to what I'm currently doing (see #30790; announcing the elapsed ticks during sleep in the IRQ handler), and the kernel seems to handle it fine. However, I probably haven't tested extensive enough to weed out the edge-cases where this wouldn't work properly.

zephyrbot commented 9 months ago

Hi @andyross, @peter-mitsis,

This issue, marked as an Enhancement, was opened a while ago and did not get any traction. It was just assigned to you based on the labels. If you don't consider yourself the right person to address this issue, please re-assing it to the right person.

Please take a moment to review if the issue is still relevant to the project. If it is, please provide feedback and direction on how to move forward. If it is not, has already been addressed, is a duplicate, or is no longer relevant, please close it with a short comment explaining the reason.

@noelleclement you are also encouraged to help moving this issue forward by providing additional information and confirming this request/issue is still relevant to you.

Thanks!

andyross commented 9 months ago

Long-winded discussion might not be helpful. But it's true that the timer driver API still shows some historical warts from the original/long-forgotten "tickless idle" architecture, and this is one of them (another is that set_timeout() can be called with INT_MAX to mean "forever", but only if SLOPPY_IDLE=y). Not sure if either is really worth this level of discussion, but some cleanup wouldn't hurt either.