Zephyr instrumentation subsystem

gromero commented 1 year ago

TL:DR;

New Zephyr subsystem for profiling and tracing, proof of concept: https://github.com/zephyrproject-rtos/zephyr/pull/70253

CLI tool demo: https://asciinema.org/a/581092

Perfetto output example:

Introduction

This RFC proposes a new Zephyr subsystem named instrumentation meant for profiling and tracing. The new 'instrumentation' subsys leverages the compiler code instrumentation feature available on compilers like GCC and LLVM, enabled via the flag -finstrument-functions, to collect profile data, function entry/exit traces, and thread scheduling info in a non-intrusive way.

Problem description

Most profiling and tracing features in Zephyr today either depend on proprietary and expensive tooling, or have significant platform limitations. For instance, Segger's SystemView, which requires J-Link, or gprof, which only supports a few platforms, often emulated, like native_posix, and native_posix_64 platforms, with no timestamp data associated with events. Also, although tracing subsys has numerous hooks available in its APIs, these hooks are static and so can't be easily set to trace arbitrary functions in kernel or application code.

By leveraging the compiler code instrumentation feature it is possible to have a platform, tooling, and arch-neutral tracing and profiling subsystem which allows collecting traces at arbitrary code locations, both at compile time and runtime, and at the same time, using the very same mechanism, collect profiling data for the traced code.

Once the traces and profiling data are collected, various transport layers can be used to transfer data from the target to the host, similary to the transport backends available in the tracing subsys (UART, USB, etc).

Data can then the viewed in the host using a CLI tool, which can display traces and profing statistics in the text terminal or export it to various formats so data can be visualized on GUI tools, like Perfetto TraceViewer. The same CLI tool allows for setting the functions to be traced at runtime (it can also be set via Kconfig, at compile time).

Proposed change

Implement a new Zephyr subsystem named instrumentation and an associated CLI tool that allows setting, collecting, displaying, and exporting tracing and profiling data captured by this new subsystem.

Detailed RFC

Compilers like GCC and LLVM can generate instrumentation calls for entry and exit to functions if flag -finstrument-functions is passed to the compiler. When this feature is enabled, just after function entry and just before function exit, the following functions, respectively, are called with the address of the current function and its call site:

void __cyg_profile_func_enter(void *callee, void *call_site);
void __cyg_profile_func_exit(void *callee, void *call_site);

The instrumentation is also done for functions expanded inline in other functions. A function may be given the attribute no_instrument_function, in which case this instrumentation is not done for the function. It's also possible to exclude from instrumentation all functions in a compilation unit (source file) by setting -finstrument-functions-exclude-file-list=file,file,..., or also exclude functions by their name setting -finstrument-functions-exclude-function-list=sym,sym,....

All these options take effect at compile time and so the instrumentation subsys has yet another way to disable the instrumentation at runtime: calling instr_disable(); and instr_enable(), to enable it back. This is an interesting mechanism because it allows using any function inside the instrumentation code and avoid having to exclude that function from instrumentation at compile time, which would prevent tracing and profiling that function outside the instrumentation code, i.e. elsewhere in kernel or in the application code.

Because the compiler instruments the entry and exit to functions, it's possible to have a single handler called by __cyg_profile_func_enter() and __cyg_profile_func_exit() that collects profiling and tracing data for the whole system. The instr_event_handler() is this handler.

Conceptually, the only events that can trigger the handler are function entries and exits, which provide good function trace information. However once the handler inspects the context and the name of the function (callee) the handler can promote these events to others. For instance, if it's possible to get the context when a function is called, a function entry event with additional context (CPU ID, priority, thread name, etc) is generated; another example is if the callee is part of the thread switching code, the event can be promoted to a thread switched in/out scheduler event (e.g. z_thread_mark_switched_in/out()). Other events can be easily added the same way.

tracing subsys needs to enabled (CONFIG_TRACING=y) to give access to hooks like z_thread_mark_switched_in/out().

The instrumentation subsys relies upon a monotonic timer which is initialized to zero at subsys initialization. Different platforms can set the timer backend to the most suitable timer source to be used. For example, on some Arm Cortex-M the DWT (Data Watchpoint and Trace) unit is available and DWT_CYCCNT can be used as a monotonic clock source. This timer is used by the handler to acquire a timestamp for each event processed.

The handler collects profile data for each function in an aggregated way, where the total amount of time spent in the function (delta) is calculated per entry/exit and accumulated. The accumulators are kept in an array. Similarly to the way a function stack is created and destroyed in the function's prologue and epilogue, a timestamp (S0) is pushed to an alternative stack when a function is called and another timestamp is obtained just before the function returns (S1). The delta is calculated as delta = S1 - S0. Delta is then accumulated per function. The depth of the timestamp stack can be configured (e.g to 16) and the number of functions for which delta can be accumulated can also be configured (e.g. 256 as the size of accumulator array).

The handler also collects traces, notabily, function entry/exit events but also scheduling events. The events are recorded into a configurable size buffer, which can be configured to behave as a ring buffer or as a fixed buffer, depending on the use case.

The ring buffer API used by the instrumentation subsys to keep the traces is similar to the Zephyr Ring Buffers API but allows putting variable size elements (event records) and also can work in overwrite mode. In the overwrite mode when the buffer fills up newer events will overwrite the older ones. The buffer can also be configured to be in fixed mode, in this mode once the buffer fills up no more traces are recorded. The ring buffer API consists in:

void init_instr_ringbuffer(void);

struct instr_record* put_item_claim(enum instr_event_types type);
void put_item_finish(enum instr_event_types type);
int put_item_wrapping(struct instr_record* record);

struct instr_record* get_item_claim(void);
void get_item_finish(struct instr_record* record);
int get_item_wrapping(struct instr_record* record);
struct instr_record* get_item(struct instr_record* record);

When the instrumentation handler needs to write a record it calls put_item_claim and informs the type of the record it wants to write. If there is enough contiguous space (buffer doesn't wrap) put_item_claim returns a pointer that will be used to write the record directly to the buffer. If the buffer wraps, the instrumentation handler will need to fill up first a record structure and pass it to put_item_wrapping, which will take care of copying the record to the buffer, wrapping data accordingly to the start of the buffer.

When the instrumentation handler needs to read a record, usually for dumping it to the host, the ring buffer get_* functions are used instead and they work similarly to put_* ones. One additional function is available for convenience tho: get_item. get_item will returned the next available item at the tail of the ring buffer, until there is no more items to get from the ring buffer.

Event records are defined as follows, where the header is essentially used to specify the event type:

struct instr_record {
        struct instr_header header;
        void* callee;
        void* caller;
        uint64_t timestamp;
        union {
                struct instr_payload0 payload0;     // Metadata.
                struct instr_payload1 payload1;     // Entry/Exit, no context.
                struct instr_event_context context; // Entry/Exit, with context.
                // ...
                /* Add more payloads here */
        };
} __packed;

The following basic event types are proposed:

enum __attribute__ ((__packed__)) instr_event_types {
        INSTR_EVENT_META = 0,   /* Metadata record, defined in TLV format. */
        INSTR_EVENT_ENTRY,  /* Callee entry event record */
        INSTR_EVENT_EXIT,   /* Callee exit event record */
        INSTR_EVENT_ENTRY_CTXT, /* Entry event record, but with context */
        INSTR_EVENT_EXIT_CTXT,  /* Exit event record, but with context */
        INSTR_EVENT_PROFILE,    /* Profile events */
        INSTR_EVENT_SCHED_IN,   /* Thread switched in scheduler event */
        INSTR_EVENT_SCHED_OUT,  /* Thread switched out scheduler event */
        INSTR_EVENT_NUM
};

Subsystem initialization and activation

Because the compiler can instrument any function, the instrumentation handler can be called quite early in the boot process, even as soon as the first kernel function is called. This can be problematic because RAM might be not properly initialized yet.

One way to guarantee the instrumentation subsystem has the proper context to be initialized is to check if an initialized C variable in RAM is already properly initialized with a magic number; in other words, check an in-RAM value against a magic value in the flash memory. This check is done by instr_fundamentals_initialized() and guarantees essential Zephyr RAM initialization functions were executed (like z_early_memset & friends).

In the instrumentation initialization the tracing buffer will be initialized. Also, the trigger and stopper function will be set accordingly. The trigger and stopper function addresses control when, after the subsys is enabled, the tracing and profiling will start and stop, respectively. The trigger and stopper addresses are set to a default value at compile time based on Kconfig, but since they can be set at runtime and need to survive between reboots, they can be set also from the RAM, via the retained_mem driver. Hence, if values for trigger and stopper are available from retained memory, they will be set accordingly, overwritting the defaults from compile time.

After proper initialization the instrumentation subsystem will be enabled but turned off.

In the instrumentation subsys terminology enabled means that tracing and profiling can be performed at any time but will start only when instrumentation is turned on, and will stop when it is turned off. Thus there are two stages to effetively activate tracing and profiling. This allows disabling the instrumentation in the instrumentation code itself at runtime, so in critical sections, avoiding infinite recursions (instrumentation subsys functions, like the handler, are excluded at compile time from instrumentation but can call other functions, like strncpy, etc, which are instrumented, leading to infinite recursion).

Tracing and profiling are turned on when the trigger function is called and are turned off when the stopper function returns. This lets a fine-grained control over which code location tracing and profiling will be active.

Transport

Much like in the tracing subsys, the instrumentation subsys might support various backends for transporting the tracing and profiling data collected in the buffers. The focus must be on popular, low-cost, and widely available interfaces, like UART and USB.

The transport controller supports the following commands:

status: Get status on tracing and profiling (enabled / disabled) reboot: Reboot target ping: Ping transport controller dump_trace: Dump traces from buffer dump_profiler: Dump profile deltas from accumulator array trigger: Set trigger function stopper: Set stopper function

The dump commands dump the event records as a binary stream via the transport channel. The binary stream is decoded using the CTF (Common Trace Format) v1.8 and libbabeltrace2.

Currently only the asynchronous transfer mode is considered.

CLI tool

To interface with the transport controller an easy to use CLI tool is proposed, which provides the subcommands and options below:

~/zephyrproject/zephyr$ ./scripts/zaru --help
usage: zaru [-h] [--serial SERIAL] [--elf ELF] {reboot,status,trace,profile} ...

Zephyr compiler-managed system profiling and tracing tool.

positional arguments:
  {reboot,status,trace,profile}
    reboot              reboot target.
    status              get instrumentation status from target.
    trace               get traces from target.
    profile             get profile info from target.

options:
  -h, --help            show this help message and exit
  --serial SERIAL       default to '/dev/ttyACM0'.
  --elf ELF             default to 'build/zephyr/zephyr.elf'.

~/zephyrproject/zephyr$ ./scripts/zaru trace --help
usage: zaru trace [-h] [--verbose] [--reboot] [--list] [--trigger FUNC_NAME] [--stopper FUNC_NAME] [--couple FUNC_NAME] [--perfetto] [--output OUTPUT] [--annotation]

options:
  -h, --help            show this help message and exit
  --verbose, -v         verbose mode.
  --reboot, -r          reboot target before getting traces.
  --list, -l            list trigger and stopper functions set in the target.
  --trigger FUNC_NAME, -t FUNC_NAME
                        set FUNC_NAME as the trigger function. When called it turns on instrumentation.
  --stopper FUNC_NAME, -s FUNC_NAME
                        set FUNC_NAME as the stopper function. When returned it turns off instrumentation.
  --couple FUNC_NAME, -c FUNC_NAME
                        set both trigger and stopper to FUNC_NAME.
  --perfetto, -p        export traces to Perfetto TraceViewer UI (Event Trace Format).
  --output OUTPUT, -o OUTPUT
                        output filename. Default to './perfetto.json'.
  --annotation, -a      disable function name annotation in function returns.

~/zephyrproject/zephyr$ ./scripts/zaru profile --help
usage: zaru profile [-h] [--verbose] [--reboot] [-n [N]]

options:
  -h, --help     show this help message and exit
  --verbose, -v  verbose mode.
  --reboot, -r   reboot target before profiling.
  -n [N]         show first N most expensive functions.

Changes in Zephyr core

No changes are necessary in the existing Zephyr core code or in the tracing hooks since the instrumentation subsys is called from instrumented code generated by the compiler.

Use cases

The instrumentation subsys and its CLI tool, zaru, can be used to trace and profile specific code location.

The user will set trigger and stopper function names, usually set to main. If they are changed at runtime, then --reboot must be passed to reboot the target before data is collected at new location specified.

Profiling data is collected always when tracing is active.

The collected data can be viewed using zaru trace or zaru profile subcommands. The size of the ring buffer can be configured accordingly to the memory constraints of the target. The more buffer size, the more traces it's possible to collect.

If the ring buffer is in write mode, the most recent events can be observed. Otherwise, once the buffer gets full data collection stops and so only the first (and older) events are kept recorded.

The tracing data can be exported to various formats, like the Event Trace Format, that can be imported by Perfetto Trace Viewer.

Dependencies

The instrumentation subsys depends on the tracing subsys, i.e. CONFIG_TRACING=y, for the thread scheduling hooks. It also depends on the retained_mem driver for retaining the trigger and stopper addresses between reboots, i.e. CONFIG_RETAINED_MEM=y and CONFIG_RETAINED_MEM_MUTEXES=n. And it also depends on the reboot capability (CONFIG_REBOOT=y).

The CLI tool, zaru, depends on Python 3 Babeltrace2 module (bt2) for decoding the binary stream (event records) via the Common Trace Format's Trace Stream Description Language (TSDL).

Concerns and Unresolved Questions

Calculating delta in profiling for recursive calls can easily lead to wrong delta values because the number of recursive calls exceeds the number of slots in the timestamp stack (usually 16 slots). Maybe some heuristics can be used to avoid pushing further timestamps in recursion.
If the number of functions profiled exceeds the max number of functions for which delta can be accumulated, some kind of heuristic can be used to discard the least relevant ones, for instance, function with low delta values.
About 50% (20 bytes) of the event record size is spend in the thread name. It can be compacted further by using a lookup table and just recording the lookup index.
Due to limitations in the CTF 1.8 a variant id can't be used to select a field type, so it's difficult to represent 32-bit vs 64-bit binary formats in a single metadata file. The same happens for expressing big-endian vs little-endian formats. Hence multiple metadata files would exist in CTF 1.8. This problem is solved in CTF 2.0, but currently there isn't a library implementation for it (it's a WIP).
trigger and stopper mechanism can be enhanced, for instance, be activated only a certain amount of times and when callee is called from specific callers.
It's possible to enhance the tracing by providing additional statistics. For example: if the trace buffer isn't enough to capture all the events for the location selected, print how much memory it would be necessary to capture all the events ideally, so the user can adjust the buffer size based on this estimation.

github-actions[bot] commented 1 year ago

Hi @gromero! We appreciate you submitting your first issue for our open-source project. 🌟

Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙

cfriedt commented 1 year ago

Also, although tracing subsys has numerous hooks available in its APIs, these hooks are static and so can't be easily set to trace arbitrary functions in kernel or application code.

💯 Very much support adding this upstream. I'll see if I can get some interested parties to comment here.

Once the traces and profiling data are collected, various transport layers can be used to transfer data from the target to the host, similary to the transport backends available in the tracing subsys (UART, USB, etc).

Perfect - this is exactly what I was looking for!

Thanks for putting this together @gromero and @microbuilder

vlotech commented 1 year ago

This is looking great! We used to debug and trace the same with eCos and it's internal tools which solved issues in hours instead of weeks.

microbuilder commented 7 months ago

@gromero Any update on getting this over the finish line?

TaiJuWu commented 6 months ago

Is it ready to merge?

gromero commented 5 months ago

Is it ready to merge?

@TaiJuWu @microbuilder Unfortunately, not yet. I need to remove the variable length support in the ring buffer (use Zephyr API instead) and test it on a VM as requested by @carlocaione in Prague. I didn't give up. It's been just hard to find the cycles I need to address it. Otherwise, I deem it done.

zephyrproject-rtos / zephyr