Asymmetry between `switched_in` and `switched_out` events

zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.

https://docs.zephyrproject.org

Apache License 2.0

10.96k stars 6.68k forks source link

Asymmetry between `switched_in` and `switched_out` events #76057

Closed alexpaschoaletto closed 3 weeks ago

alexpaschoaletto commented 4 months ago

Describe the bug

The z_thread_mark_switched_in and z_thread_mark_switched_out functions seem to be getting called assimetrically across different targets, with (sometimes way) more out events being called than in events. I have first mentioned this problem in discussion #75406, but after a bit of digging and a more comprehensive testing round I am now almost sure it's an actual bug and not a problem of my implementation (although that can always be the case).

The example below uses the user-defined tracing API, which internally is directly called by the aforementioned functions. I am also developing a kernel feature which is called directly by z_thread_mark_switched_in and out, and the problem continues, which leads me to think they are the souce of the issue and not the user-defined tracing API.

To Reproduce

Set the following configs in prj.conf as well as whichever needed for your device to log outputs.

CONFIG_LOG=y
CONFIG_SCHED_DEADLINE=y
CONFIG_TRACING=y
CONFIG_TRACING_USER=y

Run the following example code written in main.c.

#include <zephyr/logging/log.h>
#include <zephyr/kernel.h>
#include <string.h>

LOG_MODULE_REGISTER(tracing);

static void thread_function(void *a1, void *a2, void *a3){                 // thread: prints the target name every second
    for(;;){
        printf("\n%s\n\n", CONFIG_BOARD_TARGET);
        k_sleep(K_SECONDS(1));
    }
}

K_THREAD_DEFINE(
    test_thread, 1024, thread_function,
    NULL, NULL, NULL, 5, 0, 0
);

void sys_trace_thread_switched_in_user(){                                        //should log every time the test thread enters the CPU
    if(k_sched_current_thread_query() == test_thread) LOG_INF("SWT_TO"); 
}

void sys_trace_thread_switched_out_user(){                                    //should log every time the test thread leaves the CPU
    if(k_sched_current_thread_query() == test_thread) LOG_INF("SWT_AY");
}

Expected behavior

The code should showcase an equal amount of SWT_TO and SWT_AY events. The number itself is not relevant, but they shoud match. Like this:

expected_behavior

The exact shape of the bug relies on the target, please refer to #75406 for more examples. But the worst-case scenario for me was certainly with the xiao_esp32c3 board:

Impact

Although it doesn't prevent my feature to run, it certainly cuts off the ability to adapt to preemptions. Since I can't know reliably in all targets when a thread enters/leaves the CPU. Therefore the functionality can only be assured when no higher priority thread shows up amidst execution.

Environment (please complete the following information):

OS: Windows 11 23H2
Toolchain: SDK 0.16.8
Version tested on both 3.7.0-rc2 and 2.7.0-rc3

github-actions[bot] commented 4 months ago

Hi @alexpaschoaletto! We appreciate you submitting your first issue for our open-source project. 🌟

Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙

alexpaschoaletto commented 4 months ago

Context

This comment only explains a bit of the actual feature I'm working on and how I found the bug in the first place.

I'm developing a kernel feature which needs to know when a particular thread switches in/out of the CPU in order to properly work. My feature therefore directly uses the z_thread_mark_switched_in and z_thread_mark_switched_out functions for that purpose, since it runs within the kernel and doesn't need the intermediation of the tracing API. So far, so good.

The feature should start a timer whenever the concerned thread enters the CPU and stop the timer if the thread leaves it before expiring. All the timer expiry function does is redefine its EDF priority by recalculating its absolute deadline. Therefore a preemption can happen depending on the other threads.

The following showcases a normal execution of the feature - the only relevant logs for this analysis are:

SWT_TO: the thread enters the CPU for execution.
SWT_AY: the thread leaves the CPU.
B_ROUT: the timer expired.

The expected output when the thread finishes before the timer expiring is:

...
[00:00:19.110,000] <inf> CBS: cbs_1     SWT_TO  200000        //the number is the time left (in cycles)
[00:00:19.120,000] <inf> CBS: cbs_1     B_STAT  34173
[00:00:19.120,000] <inf> CBS: cbs_1     SWT_AY  34173
...

And the expected output if the timer expires before the thread finishes is:

...
[00:00:05.040,000] <inf> CBS: cbs_1     SWT_TO  200000
[00:00:05.060,000] <inf> CBS: cbs_1     B_STAT  9257
[00:00:05.070,000] <inf> CBS: cbs_1     B_ROUT  200000      // logged by timer expiry function
[00:00:05.070,000] <inf> CBS: cbs_1     B_STAT  191420
[00:00:05.070,000] <inf> CBS: cbs_1     SWT_AY  191420
...

However, the actual output frequently observed when the timer expires is:

...
[00:00:05.040,000] <inf> CBS: cbs_1     SWT_TO  200000         // switch to
[00:00:05.050,000] <inf> CBS: cbs_1     B_STAT  21493
[00:00:05.060,000] <inf> CBS: cbs_1     B_ROUT  200000
[00:00:05.060,000] <inf> CBS: cbs_1     B_STAT  191045
[00:00:05.060,000] <inf> CBS: cbs_1     SWT_AY  191045         // switch away
[00:00:05.070,000] <inf> CBS: cbs_1     SWT_AY  191045         // switch away
...

I assumed it could be my implementation and decided to test a simple code as a sanity check. And that's when I ended up with the example code of #75406.

alexpaschoaletto commented 4 months ago

Quick update, I think line 879 in kernel/sched.c might be part of the issue, but not all of it:

static inline void set_current(struct k_thread *new_thread)
{
    z_thread_mark_switched_out();                 // here - only the switched_out is called
    _current_cpu->current = new_thread;
}

This is the only moment in sched.c in which any of the tracing functions are called, and it only calls the switched_out function. As an experiment I decided to comment this line and see what happens. This is the result for xiao_esp32c3 for the simple example:


xiao_esp32c3/esp32c3

[00:00:13.091,000] <inf> tracing: SWT_TO
[00:00:13.091,000] <inf> tracing: SWT_AY
[00:00:14.092,000] <inf> tracing: SWT_TO
[00:00:14.095,000] <inf> tracing: SWT_AY

xiao_esp32c3/esp32c3

xiao_esp32c3/esp32c3

[00:00:15.096,000] <inf> tracing: SWT_TO
[00:00:15.098,000] <inf> tracing: SWT_AY
[00:00:16.099,000] <inf> tracing: SWT_TO
[00:00:16.103,000] <inf> tracing: SWT_AY

and this is the output for the kernel feature I'm developing:

...
[00:00:13.080,000] <inf> CBS: cbs_1     SWT_TO  200000           // switch to
[00:00:13.100,000] <inf> CBS: cbs_1     B_STAT  10069
[00:00:13.110,000] <inf> CBS: cbs_1     B_ROUT  200000
[00:00:13.110,000] <inf> CBS: cbs_1     B_STAT  170632
[00:00:13.110,000] <inf> CBS: cbs_1     SWT_AY  170632          // switch away
...

As expected, results with targets that did not output SWT_TO events were not affected. But the case for xiao_esp32c3 seems definetely solved with this tweak.

andyross commented 4 months ago

FWIW:

I'm developing a kernel feature which needs to know when a particular thread switches in/out of the CPU in order to properly work.

Then you really don't want to be doing that on top of the tracing layer, which is (as you've discovered) somewhat fragile. Trace points move/break/get-evaded all the time, we aren't (and don't want to be!) rigorous about low level ordering details. The kernel promises the behavior specified in its API. Tracing is a debug feature, not an API.

Can you elaborate on what you're trying to achieve, either here or in a separate issue (or Discord, etc...). I suspect there are better options. Basically if you really need to be whiteboxing the scheduler then you want to be writing code in the scheduler.

That's not to say there isn't a bug with the tracing hooks on context switch, mind you. Just that it's the kind of bug I'd expect to see.

andyross commented 4 months ago

Suck in @nashif @teburd for more expertise and remove me as assignee, as tracing isn't really my area and as you can tell I'm not really a fan of how we do it currently. :)

alexpaschoaletto commented 4 months ago

Then you really don't want to be doing that on top of the tracing layer, which is (as you've discovered) somewhat fragile. Trace points move/break/get-evaded all the time, we aren't (and don't want to be!) rigorous about low level ordering details. The kernel promises the behavior specified in its API. Tracing is a debug feature, not an API.

Can you elaborate on what you're trying to achieve, either here or in a separate issue (or Discord, etc...). I suspect there are better options. Basically if you really need to be whiteboxing the scheduler then you want to be writing code in the scheduler.

Thanks for the tip! I'm still new on Zephyr kernel implementing and would love to hear more suggestions.

The feature I'm working on is the Constant Bandwidth Server (CBS). It is basically a soft real-time task wrapper that has a time budget to execute the jobs assigned to it. It is also meant to work alongside EDF, and its priority (i.e. deadline) is recalculated every time the budget expires or an internal condition is met. This budget can only be consumed when the jobs are being executed (i.e. in the CPU) which is why I need to know when the CBS thread enters and leaves the processor to trigger/stop the budget timer accordingly.

Tracing is a debug feature, not an API.

I agree! And that's why I'm no longer using that. However, analyzing the kernel code so far the best places to put my cbs_switched_in and cbs_switched_out hooks were on the z_thread_mark_switched_in and z_thread_mark_switched_out functions, which looked like the proper entry points for those. If you have better suggestions please let me know!

teburd commented 4 months ago

Suck in @nashif @teburd for more expertise and remove me as assignee, as tracing isn't really my area and as you can tell I'm not really a fan of how we do it currently. :)

Please suggest something better then

Personally I don't see kprobe/ftrace like tracing being answers that make any sense here.

kprobe like tracing would be a regression performance wise, would require opcode padding presumably (to insert the break/debug interrupt instruction), and require some form of management protocol to set/list/clear kprobes. We don't have a consistent FS API layer like Linux here.
ftrace like tracing would, presumably, require some compiler help (-finstrument-functions) and some method of filtering which functions are decorated with these to avoid overhead of entry/exit on all symbols. It would also require a way of mapping functions to consistent event ids. It would also not work with any inlined functions, which Zephyr employs.

What we have today works, and works well with the tools people are familiar with in this space (systemview, tracealyzer) so its unclear to me what the problem is exactly other than it requires manually ensuring the hooks are called at the right time... which really doesn't appear to be much of a burden at all.

nashif commented 4 months ago

so on mips, we had one call to the switched_out hook missing, something like this, though still not sure if this the right spot makes it work:

diff --git a/arch/mips/core/isr.S b/arch/mips/core/isr.S
index 86d05d19833..f2f2f342053 100644
--- a/arch/mips/core/isr.S
+++ b/arch/mips/core/isr.S
@@ -70,6 +70,7 @@ GTEXT(_Fault)

 GTEXT(_k_neg_eagain)
 GTEXT(z_thread_mark_switched_in)
+GTEXT(z_thread_mark_switched_out)

 /* exports */
 GTEXT(__isr_vec)
@@ -209,6 +210,9 @@ on_thread_stack:
 #endif /* CONFIG_PREEMPT_ENABLED */

 reschedule:
+#ifdef CONFIG_INSTRUMENT_THREAD_SWITCHING
+       jal z_thread_mark_switched_out
+#endif

For xtensa, there is something weird going on, the switched_in call is called but without any output, investigating still.

alexpaschoaletto commented 4 months ago

so on mips, we had one call to the switched_out hook missing, something like this, though still not sure if this the right spot makes it work: ... For xtensa, there is something weird going on, the switched_in call is called but without any output, investigating still.

Thank you for your help! If I can be useful for any further testing please let me know.

andyross commented 4 months ago

Personally I don't see kprobe/ftrace like tracing being answers that make any sense here.

What about syscalls? We already have a mechanism for detecting/emitting/indirecting around defined functions with lightweight/inlined stubs based on build time configuration (it's even smart enough to understand how to inspect the arguments and return values!). Seems like all the same tricks would apply to emit tracing entry/exit hooks as you'd need to emit a un/marshalling, they could share probably 80% of their code. Imagine by doing tracing by putting a "__trace" annotation on your function and then maybe having a build time list of which functions get instrumented that could be controlled by some new build API.

I mean, it doesn't have to be super gold plated. My complaint here is just things like this bug. Tracing in Zephyr is, kinda, "always broken". We get it working once and then look away and some yahoo has come in and moved stuff around and the trace points aren't there any more, or there's a new code path that skips them, or in cases like this one the entry/exits don't match (to be fair context switching is highly nontrivial and arch-specific, so this specific bug is not amenable to obvious automation).

alexpaschoaletto commented 3 months ago

Hey there, I'm back after a few weeks off. I don't really think so, but any progress here?

as for the specific case of the switched_in/switched_out events, I thought about adding the following on my contribution:

/* Just a wrapper around _current = xxx with tracing */
static inline void set_current(struct k_thread *new_thread)
{
    /*
        FIXME this code appears to be inducing false switched_out events.
        When the switched_out function is commented, targets seem to work as expected.
        https://github.com/zephyrproject-rtos/zephyr/issues/76057#issuecomment-2236545540

        z_thread_mark_switched_out() below is proposed to
        be left commented until a proper solution arises.
    */

    /* z_thread_mark_switched_out(); */
    _current_cpu->current = new_thread;
}

I've inspected a great part of the codebase already but haven't yet gained enough system-wide understanding of what's really going on, and I'm not sure I will anytime soon. I'm aware this proposal might seem a bad idea, though, so feel free to reject it completely if you will.

nashif commented 3 months ago

No complete yet, will come back to this ASAP

alexpaschoaletto commented 3 weeks ago

I'm sorry @mmahadevan108 but isn't this issue supposed to remain open as long as other architectures still need fixes? @nashif only solved for MIPS as far as I understood.