alexpaschoaletto opened 2 months ago
Took a very quick look: FWIW this seems more like "lib" code than "kernel" code from a code organization perspective. It's built on existing kernel primitives (e.g. k_timer/msgq as I gather from above), not really a new primitive of its own. See for example the p4wq queue which speaks to a vaguely similar problem area and lives happily in ./lib/os/.
As far as the code itself, can you clarify exactly what the semantics are? I'm gathering that it's a work queue that implements EDF-like semantics, but it doesn't really spell out how: like, what happens when someone blows a deadline? What happens when the queue thread gets preempted (I mean, it probably fails, but how, and does/can it detect that situation)? Can you suspend the thread in a worker callback (surely not, but if you do what happens)?
Broadly I don't see any reason to reject code like this. Tools are tools and we already have lots of options, no harm in a few more. But it's hard to give a firm +1 without code; my vote is to submit and get the arguments started earlier rather than later.
Also: do look at lib/os/p4wq.c to see if that addresses some or all of the problems you're working with. It's possible that we can augment an existing tool rather than add a new one from scratch.
Hi, Andy! Thanks a lot for your comment.
Took a very quick look: FWIW this seems more like "lib" code than "kernel" code from a code organization perspective. It's built on existing kernel primitives (e.g. k_timer/msgq as I gather from above), not really a new primitive of its own. See for example the p4wq queue which speaks to a vaguely similar problem area and lives happily in ./lib/os/.
I agree that it looks more like "lib" code, and that's why I implemented it as a subsystem rather than directly in the kernel files. From inspecting the codebase, I assumed a subsystem would be the best place, as my code needs "privileges" similar to those of the tracing code, about which you've seen me open previous issues (my code does not use the tracing API, though, just to clarify).
As far as the code itself, can you clarify exactly what the semantics are? I'm gathering that it's a work queue that implements EDF-like semantics, but it doesn't really spell out how: like, what happens when someone blows a deadline?
Yes, you can put it like that: a work queue with a deadline and a "period" of its own (quotes because it is not really periodic; the period is actually used to calculate the new deadline when needed. I don't like this naming convention, to be honest).
As for the semantics, the behavior on deadline misses is usually left undefined as far as I know; it is up to each particular application to decide what to do if/when that occurs. I don't think Zephyr has any protocol on this subject either. The CBS is only concerned with updating its deadline on certain events (e.g. jobs being pushed to an empty queue) and leaving the EDF scheduler to do its trick.
What happens when the queue thread gets preempted (I mean, it probably fails, but how, and does/can it detect that situation)? Can you suspend the thread in a worker callback (surely not, but if you do what happens)?
I'm not sure I understand your question, but preemption can happen just as it would with any other thread, and critical sections of the CBS are protected with sched locks. The CBS thread is not meant to be accessed directly, just as _current is not meant to be referenced in user code. That can still happen, but it is bad practice and considered a problem of the user implementation. Thus, if a worker callback suspends the CBS thread or loops forever inside it, the CBS simply stops working as intended (deadlines are still updated correctly, as their updates are tied to a timer, not to the thread). In fact, the documentation written so far already gives some advice about it:
[...]
Jobs are regular C functions executed within a thread context that
should **not** loop indefinitely (as that would prevent further jobs
on the queue from ever being reached), return nothing and accept
one void pointer as the argument.
[...]
one can call :c:func:`k_cbs_push_job` from an ISR or a thread context,
but ISRs should **not** attempt to wait trying to push a job on the CBS
queue. Moreover, jobs are executed within a system thread context but
should **not** attempt to directly interact with the CBS thread (e.g.
suspend it or change its static priority on the fly), as it would likely
lead to unexpected behavior.
I hope that answers your questions. I will take a look at the p4wq, but I would like to stick with the idea of a new tool for now.
Just bikeshedding here: instead of calling it a "server", which to me carries the wrong kind of connotation, could it be called a "service" instead?
Just bikeshedding here: instead of calling it a "server", which to me carries the wrong kind of connotation, could it be called a "service" instead?
I agree the naming convention is somewhat confusing. However, I don't think I should change it, because that is how it was originally proposed by the algorithm's author and how it has been implemented in other operating systems such as RTEMS and Linux.
Swinging back by: FWIW, what I'm really asking for is a clean description of what the algorithm actually does, what promises it makes, and what demands it makes on the application. Like a bullet list of statements like:
Basically most people don't know what this is, don't want to read a paper describing it, and just want an executive summary of why they should use this.
Swinging back by: FWIW, what I'm really asking for is a clean description of what the algorithm actually does, what promises it makes, and what demands it makes on the application. [...] Basically most people don't know what this is, don't want to read a paper describing it, and just want an executive summary of why they should use this.
Fair enough, sir. I myself am also resistant to the idea of having to read whole papers to grasp a new concept. Let me try to explain it the easiest way I can, then:
Regular EDF tasks have a period in which they regularly activate and execute. At every new period, they invoke k_thread_deadline_set, do their job and sleep until the next period comes. Then repeat. And repeat. And repeat.
void edf_thread(void)
{
	while (true) {
		k_thread_deadline_set(this_thread, deadline); /* renew deadline */
		do_my_work();                                 /* pre-defined job */
		sleep_until(next_period);                     /* wait for next activation */
	}
}
The CBS is essentially the same thing, with two major differences: 1) there is no pre-defined work it must do, as all the work is assigned to it at runtime, and 2) the call to k_thread_deadline_set is triggered by specific events, NOT by the next period.
void cbs_thread(void)
{
	while (true) {
		wait_for_new_job(); /* CBS remains inactive while it has no jobs to execute */
		k_timer_start();    /* budget starts being consumed */
		do_the_job();
		k_timer_stop();     /* job finished before the budget ran out */
	}
}
The CBS only calls k_thread_deadline_set at two moments: 1) when a job is pushed to it and 2) when the timer expires.
void k_cbs_push_job(void *job, cbs_t *cbs)
{
	...
	if (condition_is_met(cbs)) {     /* when a job is pushed, a condition is checked */
		deadline = now + period; /* if it holds, the deadline is updated */
		k_thread_deadline_set(cbs, deadline);
	}
}
void cbs_timer_expiry_function(void)
{
	...
	deadline += period;
	k_thread_deadline_set(cbs, deadline); /* the deadline is always updated on expiry */
}
The timer of cbs_thread is therefore a protective mechanism that keeps updating the CBS deadline when a job takes too long to finish. The successively postponed deadlines mean the CBS progressively lowers its priority for the EDF scheduler, which guarantees that the other tasks in the EDF set will eventually be able to take the CPU back.
The timer duration is known as the CBS budget. The value used to calculate the CBS deadline is known as the CBS period. This is admittedly confusing, since the CBS is not periodic (it sleeps when there are no jobs to execute); I'd rather call it the relative deadline instead.
The CBS is a wrapper for jobs that have unknown execution times and no deadlines but still need to be executed within an EDF task set. Since EDF requires every task to have a deadline, the CBS does the favor of holding one for these jobs.
Because the CBS guarantees that no hard real-time thread will ever miss a deadline just because some random function took too long to execute, while at the same time guaranteeing that these random functions get as much execution time as they can.
- After the work item is added, it will run when $CONDITION1 until $CONDITION2
- The deadline is specified in units of $UNIT and measured by $MECHANISM
- The CBS mechanism guarantees $SOME_GOOD_BEHAVIOR_APPS_WANT, which the core scheduler does not, because $REASON
- The work item $WILL_or_WILL_NOT be preempted by the CBS code if it exceeds its deadline
- If a work item runs too long $FAILURE1 will occur
- The CBS server requires that it not be preempted by excessive ISR or high-priority thread usage; if it is, then $FAILURE2 will occur
As for your requested format, here it goes:
[...] k_sched_lock and k_sched_unlock. It is up to the application developer to design a system with no deadlocks or excessive ISRs, as usual.
Hi, any developments here? Should I explain anything else? Is it good to be submitted as a PR?
It is with great pleasure that I come before you on behalf of my research laboratory (SoftCPS - ISEP) to submit our first contribution to this amazing open-source RTOS: the Constant Bandwidth Server, or CBS.
Introduction
The CBS is a kernel feature meant to allow aperiodic, non-real-time tasks (i.e. jobs) to be executed within a periodic real-time task set running under a dynamic scheduling environment such as EDF. It is called a server because its sole purpose is to serve (execute) jobs, effectively acting as a wrapper for them.
Problem description
A key step in the analysis of hard real-time systems is knowing whether or not a given set of tasks, each with its estimated execution time and period, is guaranteed to run without missing any deadlines. This analysis depends on the scheduling algorithm and the aforementioned task properties.
However, in actual real-world systems it is quite common to find tasks which are event-driven instead of just time-driven. They have no predictable activation instants or periods, and often no hard deadlines either, but must be served anyway among the periodic tasks. So how do we make sure that the system will not miss any critical task deadline when this level of uncertainty is put on the table?
Well, there are several ways to do that, and the CBS is one of them. What it does is hold its own deadline and period values so the aperiodic jobs given to it don't have to. It then stays inactive until a job arrives; when one does, it enters the run queue for execution, updating its own deadline when certain conditions are met.
Proposed change
We propose to bring this feature to the Zephyr kernel. As far as we know, Zephyr already has a feature called workqueue whose methods and goals are similar to those of the CBS. However, workqueues seem to support only static priorities, and there is no guarantee on their execution "boundaries" other than the configurable thread time slices and the yield behavior when a new item is submitted. Moreover, a workqueue is typically assigned a single handler function (i.e. a single job) upon k_work_init and is expected only to receive new arguments to execute with it via k_work_submit. The CBS, on the other hand, has been designed to accept different jobs and arguments altogether at each new submission with k_cbs_push_job. Thus, the CBS would pose as the perfect complement to the workqueue rather than its substitute: it is meant to work alongside the EDF scheduler instead of static priorities, and to support a more volatile set of job functions instead of defining a single function from the start and only changing the arguments passed to it.
Detailed RFC
The CBS has already been implemented and extensively tested on various targets (including physical and emulated ARM, Xtensa, and RISC-V) with no problems detected so far, but I'm doing an RFC before submitting a pull request 1) to follow the guidelines and 2) to allow the community to inspect the work done so far and give tips on changes. The changes made to Zephyr can be found here, on branch cbs-as-subsystem.
Proposed change (Detailed)
The work conducted on this feature was additive; it does not modify or remove any existing functionality and is therefore unlikely to break any current features. The public API consists of just two elements, really: a K_CBS_DEFINE macro and a k_cbs_push_job function.
The proposed implementation also offers a logging feature to allow users to keep track of CBS events. When enabled, the CBS will use the logging API to register key points of its execution, such as a job being pushed to the queue, the budget being replenished, or the CBS thread entering the CPU to serve the jobs given to it. Logs will show up on screen like this:
If looking for more specific details, feel free to ask or inspect the aforementioned fork with the proposed implementation.
Dependencies
The CBS implementation was developed with the goal of reusing as many existing kernel features as possible so as to minimize doing things from scratch. Therefore, each CBS is essentially composed of one thread, one message queue and one timer. The CBS structure holds a reference to those and a few extra pieces of metadata used internally:
I don't know what else I should include here, so if anything is missing please let me know. But as I said before, nearly nothing of Zephyr's current implementation is excluded or modified by the proposed code, so I don't think anything is likely to break or show unexpected behavior. In my rounds of testing, everything seems to be working fine.
Concerns and Unresolved Questions
However, there is one thing that should be kept in mind. To work as expected, the CBS needs to know when its thread enters/leaves the CPU for the execution of the jobs given to it. Therefore it directly makes use of the z_thread_mark_switched_in and z_thread_mark_switched_out kernel functions, which are apparently not working properly on all devices. Work is being done in this regard and more details can be seen in issue #76057.