Closed alexpaschoaletto closed 3 weeks ago
Hi @alexpaschoaletto! We appreciate you submitting your first issue for our open-source project. 🌟
Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙
This comment only explains a bit of the actual feature I'm working on and how I found the bug in the first place.
I'm developing a kernel feature which needs to know when a particular thread switches in/out of the CPU in order to properly work. My feature therefore directly uses the z_thread_mark_switched_in
and z_thread_mark_switched_out
functions for that purpose, since it runs within the kernel and doesn't need the intermediation of the tracing API. So far, so good.
The feature should start a timer whenever the concerned thread enters the CPU and stop the timer if the thread leaves it before expiring. All the timer expiry function does is redefine its EDF priority by recalculating its absolute deadline. Therefore a preemption can happen depending on the other threads.
The following showcases a normal execution of the feature - the only relevant logs for this analysis are:
SWT_TO
: the thread enters the CPU for execution.SWT_AY
: the thread leaves the CPU.B_ROUT
: the timer expired.The expected output when the thread finishes before the timer expiring is:
...
[00:00:19.110,000] <inf> CBS: cbs_1 SWT_TO 200000 //the number is the time left (in cycles)
[00:00:19.120,000] <inf> CBS: cbs_1 B_STAT 34173
[00:00:19.120,000] <inf> CBS: cbs_1 SWT_AY 34173
...
And the expected output if the timer expires before the thread finishes is:
...
[00:00:05.040,000] <inf> CBS: cbs_1 SWT_TO 200000
[00:00:05.060,000] <inf> CBS: cbs_1 B_STAT 9257
[00:00:05.070,000] <inf> CBS: cbs_1 B_ROUT 200000 // logged by timer expiry function
[00:00:05.070,000] <inf> CBS: cbs_1 B_STAT 191420
[00:00:05.070,000] <inf> CBS: cbs_1 SWT_AY 191420
...
However, the actual output frequently observed when the timer expires is:
...
[00:00:05.040,000] <inf> CBS: cbs_1 SWT_TO 200000 // switch to
[00:00:05.050,000] <inf> CBS: cbs_1 B_STAT 21493
[00:00:05.060,000] <inf> CBS: cbs_1 B_ROUT 200000
[00:00:05.060,000] <inf> CBS: cbs_1 B_STAT 191045
[00:00:05.060,000] <inf> CBS: cbs_1 SWT_AY 191045 // switch away
[00:00:05.070,000] <inf> CBS: cbs_1 SWT_AY 191045 // switch away
...
I assumed it could be my implementation and decided to test a simple code as a sanity check. And that's when I ended up with the example code of #75406.
Quick update, I think line 879 in kernel/sched.c
might be part of the issue, but not all of it:
static inline void set_current(struct k_thread *new_thread)
{
z_thread_mark_switched_out(); // here - only the switched_out is called
_current_cpu->current = new_thread;
}
This is the only moment in sched.c
in which any of the tracing functions are called, and it only calls the switched_out
function. As an experiment I decided to comment this line and see what happens. This is the result for xiao_esp32c3
for the simple example:
xiao_esp32c3/esp32c3
[00:00:13.091,000] <inf> tracing: SWT_TO
[00:00:13.091,000] <inf> tracing: SWT_AY
[00:00:14.092,000] <inf> tracing: SWT_TO
[00:00:14.095,000] <inf> tracing: SWT_AY
xiao_esp32c3/esp32c3
xiao_esp32c3/esp32c3
[00:00:15.096,000] <inf> tracing: SWT_TO
[00:00:15.098,000] <inf> tracing: SWT_AY
[00:00:16.099,000] <inf> tracing: SWT_TO
[00:00:16.103,000] <inf> tracing: SWT_AY
and this is the output for the kernel feature I'm developing:
...
[00:00:13.080,000] <inf> CBS: cbs_1 SWT_TO 200000 // switch to
[00:00:13.100,000] <inf> CBS: cbs_1 B_STAT 10069
[00:00:13.110,000] <inf> CBS: cbs_1 B_ROUT 200000
[00:00:13.110,000] <inf> CBS: cbs_1 B_STAT 170632
[00:00:13.110,000] <inf> CBS: cbs_1 SWT_AY 170632 // switch away
...
As expected, results with targets that did not output SWT_TO
events were not affected. But the case for xiao_esp32c3
seems definetely solved with this tweak.
FWIW:
I'm developing a kernel feature which needs to know when a particular thread switches in/out of the CPU in order to properly work.
Then you really don't want to be doing that on top of the tracing layer, which is (as you've discovered) somewhat fragile. Trace points move/break/get-evaded all the time, we aren't (and don't want to be!) rigorous about low level ordering details. The kernel promises the behavior specified in its API. Tracing is a debug feature, not an API.
Can you elaborate on what you're trying to achieve, either here or in a separate issue (or Discord, etc...). I suspect there are better options. Basically if you really need to be whiteboxing the scheduler then you want to be writing code in the scheduler.
That's not to say there isn't a bug with the tracing hooks on context switch, mind you. Just that it's the kind of bug I'd expect to see.
Suck in @nashif @teburd for more expertise and remove me as assignee, as tracing isn't really my area and as you can tell I'm not really a fan of how we do it currently. :)
Then you really don't want to be doing that on top of the tracing layer, which is (as you've discovered) somewhat fragile. Trace points move/break/get-evaded all the time, we aren't (and don't want to be!) rigorous about low level ordering details. The kernel promises the behavior specified in its API. Tracing is a debug feature, not an API.
Can you elaborate on what you're trying to achieve, either here or in a separate issue (or Discord, etc...). I suspect there are better options. Basically if you really need to be whiteboxing the scheduler then you want to be writing code in the scheduler.
Thanks for the tip! I'm still new on Zephyr kernel implementing and would love to hear more suggestions.
The feature I'm working on is the Constant Bandwidth Server (CBS). It is basically a soft real-time task wrapper that has a time budget to execute the jobs assigned to it. It is also meant to work alongside EDF, and its priority (i.e. deadline) is recalculated every time the budget expires or an internal condition is met. This budget can only be consumed when the jobs are being executed (i.e. in the CPU) which is why I need to know when the CBS thread enters and leaves the processor to trigger/stop the budget timer accordingly.
Tracing is a debug feature, not an API.
I agree! And that's why I'm no longer using that. However, analyzing the kernel code so far the best places to put my cbs_switched_in
and cbs_switched_out
hooks were on the z_thread_mark_switched_in
and z_thread_mark_switched_out
functions, which looked like the proper entry points for those. If you have better suggestions please let me know!
Suck in @nashif @teburd for more expertise and remove me as assignee, as tracing isn't really my area and as you can tell I'm not really a fan of how we do it currently. :)
Please suggest something better then
Personally I don't see kprobe/ftrace like tracing being answers that make any sense here.
What we have today works, and works well with the tools people are familiar with in this space (systemview, tracealyzer) so its unclear to me what the problem is exactly other than it requires manually ensuring the hooks are called at the right time... which really doesn't appear to be much of a burden at all.
so on mips, we had one call to the switched_out hook missing, something like this, though still not sure if this the right spot makes it work:
diff --git a/arch/mips/core/isr.S b/arch/mips/core/isr.S
index 86d05d19833..f2f2f342053 100644
--- a/arch/mips/core/isr.S
+++ b/arch/mips/core/isr.S
@@ -70,6 +70,7 @@ GTEXT(_Fault)
GTEXT(_k_neg_eagain)
GTEXT(z_thread_mark_switched_in)
+GTEXT(z_thread_mark_switched_out)
/* exports */
GTEXT(__isr_vec)
@@ -209,6 +210,9 @@ on_thread_stack:
#endif /* CONFIG_PREEMPT_ENABLED */
reschedule:
+#ifdef CONFIG_INSTRUMENT_THREAD_SWITCHING
+ jal z_thread_mark_switched_out
+#endif
For xtensa, there is something weird going on, the switched_in call is called but without any output, investigating still.
so on mips, we had one call to the switched_out hook missing, something like this, though still not sure if this the right spot makes it work: ... For xtensa, there is something weird going on, the switched_in call is called but without any output, investigating still.
Thank you for your help! If I can be useful for any further testing please let me know.
Personally I don't see kprobe/ftrace like tracing being answers that make any sense here.
What about syscalls? We already have a mechanism for detecting/emitting/indirecting around defined functions with lightweight/inlined stubs based on build time configuration (it's even smart enough to understand how to inspect the arguments and return values!). Seems like all the same tricks would apply to emit tracing entry/exit hooks as you'd need to emit a un/marshalling, they could share probably 80% of their code. Imagine by doing tracing by putting a "__trace
"
annotation on your function and then maybe having a build time list of which functions get instrumented that could be controlled by some new build API.
I mean, it doesn't have to be super gold plated. My complaint here is just things like this bug. Tracing in Zephyr is, kinda, "always broken". We get it working once and then look away and some yahoo has come in and moved stuff around and the trace points aren't there any more, or there's a new code path that skips them, or in cases like this one the entry/exits don't match (to be fair context switching is highly nontrivial and arch-specific, so this specific bug is not amenable to obvious automation).
Hey there, I'm back after a few weeks off. I don't really think so, but any progress here?
as for the specific case of the switched_in
/switched_out
events, I thought about adding the following on my contribution:
/* Just a wrapper around _current = xxx with tracing */
static inline void set_current(struct k_thread *new_thread)
{
/*
FIXME this code appears to be inducing false switched_out events.
When the switched_out function is commented, targets seem to work as expected.
https://github.com/zephyrproject-rtos/zephyr/issues/76057#issuecomment-2236545540
z_thread_mark_switched_out() below is proposed to
be left commented until a proper solution arises.
*/
/* z_thread_mark_switched_out(); */
_current_cpu->current = new_thread;
}
I've inspected a great part of the codebase already but haven't yet gained enough system-wide understanding of what's really going on, and I'm not sure I will anytime soon. I'm aware this proposal might seem a bad idea, though, so feel free to reject it completely if you will.
No complete yet, will come back to this ASAP
I'm sorry @mmahadevan108 but isn't this issue supposed to remain open as long as other architectures still need fixes? @nashif only solved for MIPS as far as I understood.
Describe the bug
The
z_thread_mark_switched_in
andz_thread_mark_switched_out
functions seem to be getting called assimetrically across different targets, with (sometimes way) moreout
events being called thanin
events. I have first mentioned this problem in discussion #75406, but after a bit of digging and a more comprehensive testing round I am now almost sure it's an actual bug and not a problem of my implementation (although that can always be the case).The example below uses the user-defined tracing API, which internally is directly called by the aforementioned functions. I am also developing a kernel feature which is called directly by
z_thread_mark_switched_in
andout
, and the problem continues, which leads me to think they are the souce of the issue and not the user-defined tracing API.To Reproduce
Set the following configs in
prj.conf
as well as whichever needed for your device to log outputs.Run the following example code written in
main.c
.Expected behavior
The code should showcase an equal amount of
SWT_TO
andSWT_AY
events. The number itself is not relevant, but they shoud match. Like this:The exact shape of the bug relies on the target, please refer to #75406 for more examples. But the worst-case scenario for me was certainly with the
xiao_esp32c3
board:Impact
Although it doesn't prevent my feature to run, it certainly cuts off the ability to adapt to preemptions. Since I can't know reliably in all targets when a thread enters/leaves the CPU. Therefore the functionality can only be assured when no higher priority thread shows up amidst execution.
Environment (please complete the following information):