zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.8k stars 6.58k forks source link

Define and document expected behavior of API calls in various contexts #18970

Open pabigot opened 5 years ago

pabigot commented 5 years ago

This issue is intended to cover the concerns raised in #1960, #3694, #6184, #17014, and related lack of clarity and consistency.

Proposal: We should have:

Also are k_is_in_isr() and k_is_preempt_thread() sufficient to determine whether it's safe to proceed, or do we also need k_is_atomic(), #17301, or other technologies?

Example policies might be:

If we can define the normal behavior at either a system or driver-specific level, only the exceptions need be documented at the individual function level.

mbolivar commented 4 years ago
  • thread-safe: (TBD) a function is thread-safe if its behavior is correct when invoked from multiple threads at the same time

  • reentrant: (TBD) a function is reentrant if its behavior is correct when it is invoked by (indirect) recursion from the same thread

+1 to these.

  • interrupt-safe: (TBD) a function is interrupt-safe if its behavior is not affected by concurrent access from interrupts

No objections, but can you provide an example API function where you might want to use this term? I would expect all the driver APIs I can think of offhand to be interrupt-safe, at least when called from thread context.

* block: (TBD) a call blocks if it can suspend the invoking thread
  while it waits for something to happen

I would suggest "might sleep" (or "may sleep") instead.

This is more familiar from Linux kernel source code (648 matches for "might sleep" in v5.3), and has additional advantages beyond that.

In particular, as discussed for clock_control_on() at today's API meeting, some people (including me) thought "block" might simply mean "don't return until the operation is complete". This would include functions which busy-wait until the job is done (like k_spin_lock() might want to do on SMP).

By contrast, "might sleep" makes it clearer that the caller's thread might get scheduled out, and thus further hints the caller must be running in thread context.

At the individual function level we should clearly document:

Can a call block?

So this would become e.g.:

/**
 * @brief Enable a clock controlled by the device
 *
 * On success, the clock is enabled and ready when this function
 * returns.
 *
 * This function :ref:`might-sleep`.
 *
 * @param dev Device structure whose driver controls the clock.
 * @param sys Opaque data representing the clock.
 * @return 0 on success, negative errno on failure.
 */
static inline int clock_control_on(struct device *dev,
                   clock_control_subsys_t sys)

Where the Sphinx might-sleep ref would link directly to a common definition for the term.

Also are k_is_in_isr() and k_is_preempt_thread() sufficient to determine whether it's safe to proceed, or do we also need k_is_atomic(), #17301, or other technologies?

+1 to k_is_atomic() instead of k_is_in_isr().

mbolivar commented 4 years ago

+1 to k_is_atomic() instead of k_is_in_isr().

Never mind, I take it back. https://lwn.net/Articles/274695/

Once these mistakes are cleared up, there is still the question of just how kernel code should decide whether it is running in an atomic context or not. The real answer is that it just can't do that. Quoting Andrew Morton again:

The consistent pattern we use in the kernel is that callers keep track of whether they are running in a schedulable context and, if necessary, they will inform callees about that. Callees don't work it out for themselves.

pabigot commented 4 years ago
  • interrupt-safe: (TBD) a function is interrupt-safe if its behavior is not affected by concurrent access from interrupts

No objections, but can you provide an example API function where you might want to use this term? I would expect all the driver APIs I can think of offhand to be interrupt-safe, at least when called from thread context.

We need to be able to say succinctly "Unless otherwise specified all API functions are interrupt-safe" and expect people to know what that means. A specific example would be the GPIO API, where it is not clear to implementers that, because GPIO write functions may be invoked from ISRs, read-modify-write code like:

u32_t out = gpio->OUT;
gpio->OUT ^= (out & ~mask) | (value & mask);

must be wrapped in a spin-lock to be interrupt-safe. Most implementations don't do that, even in the ongoing GPIO rewrite, in part because we couldn't get agreement in #19252 on what "interrupt-safe" meant and whether it was important to document that expectation before people started implementing drivers.

* block: (TBD) a call blocks if it can suspend the invoking thread
  while it waits for something to happen

I would suggest "might sleep" (or "may sleep") instead. This is more familiar from Linux kernel source code (648 matches for "might sleep" in v5.3), and has additional advantages beyond that.

In particular, as discussed for clock_control_on() at today's API meeting, some people (including me) thought "block" might simply mean "don't return until the operation is complete". This would include functions which busy-wait until the job is done (like k_spin_lock() might want to do on SMP).

My intent was to identify functions that will not transfer control from a cooperative thread. "blocking" is a poorly chosen term. Scrap it and start over.

Zephyr uses several terms including "Waiting", "Suspended", and "Ready" to identify started non-terminated non-running threads. The transitions we're interested in are "suspend" and "wait". Let's use "yield" for "suspends or waits".

There are three cases:

So we can do:

It's non-yielding that we care about, because those functions guarantee, when invoked by a cooperative thread, that no other cooperative thread will be allowed to run before the function returns. (Absent bugs like #20255.)

I'm not sure exactly how functions should be annotated with respect to their behavior, but I would want it to be terse and consistent.

+1 to k_is_atomic() instead of k_is_in_isr().

Never mind, I take it back. https://lwn.net/Articles/274695/

Yes, but that relies on having API to provide information about context to callees, which we probably don't have. To be considered later.

mbolivar commented 4 years ago

A specific example would be the GPIO API

Excellent example, thank you.

Summarizing the status:

There are three cases:

* The thread invoking a given function will not yield;
* The thread invoking a given function may yield;
* The thread invoking a given function will yield.

Did you mean "The function invoked by a given thread" instead of "The thread invoking a given function"?

  • yielding (function): A function is yielding when it invokes (directly or indirectly) an operation that causes the executing thread to yield.
  • non-yielding (function): A function is non-yielding when it is not yielding.

With the given definition of "yield" ("suspends or waits"), I agree these are the desired semantics.

I also agree that defining terms in accordance with documented thread states and related APIs is a great approach.

However, I'm still bullish on "might sleep" terminology-wise. I worry "yielding" is too easy to confuse with "may call k_yield()".

pabigot commented 4 years ago

I've integrated the discussion so far into this document which is available for review through #20959 so people can comment on specific concerns in context. I think that'll be a bit more convenient then chained comments on this issue.

There are three cases:


* The thread invoking a given function will not yield;
* The thread invoking a given function may yield;
* The thread invoking a given function will yield.

Did you mean "The function invoked by a given thread" instead of "The thread invoking a given function"?

No. The operation of yielding is something that directly affects threads. The impact on functions is derived.

However, I'm still bullish on "might sleep" terminology-wise. I worry "yielding" is too easy to confuse with "may call k_yield()".

A problem is that traditionally Zephyr has used "sleep" to specifically mean "wait for a timeout to occur". I would rather that "suspend" was more general, but currently it specifically references waiting without an event.

carlescufi commented 4 years ago

API meeting 19th May:

Plan of action to document the expected behavior of public API functions in Zephyr:

During the 2.4 development cycle, begin each API meeting with a status update on the function properties.

pabigot commented 4 years ago

Here's my assessment of k_mutex and k_pipe:

function syscall reschedule sleep no-wait isr-ok pre-kernel-ok async supervisor
k_mutex_init YES no no yes yes
k_mutex_lock YES yes YES YES NO*
k_mutex_unlock YES YES no yes NO*
k_pipe_init no no no yes yes
k_pipe_cleanup no no no yes yes
k_pipe_alloc_init YES no no yes yes
k_pipe_put YES yes YES YES yes
k_pipe_get YES yes YES YES yes
k_pipe_block_put no yes YES no no YES
k_pipe_read_avail YES yes YES no no
k_pipe_write_avail YES yes YES no no

Some notes on the table

Determining the behavior of the API is not trivial. Some of the above may be wrong. Some were surprises, like the functions to read the number of spaces available for read and write in k_pipe, which can block for an unbounded period due to an internal call to an allocation function implemented as k_stack_pop() with a K_FOREVER wait.

I've identified 181 functions in kernel.h that could be categorized. There are some interesting inconsistencies where some functions are syscalls but other seemingly parallel ones are not. Some of the convenience wrappers like k_fifo and k_lifo are inconsistently implemented.

This is gonna be really unpleasant.

carlescufi commented 3 years ago

@andyross and @andrewboie could you please review the contents of this table? The terminology is further explained here: https://docs.zephyrproject.org/latest/reference/terminology.html

nashif commented 3 years ago

@pabigot was reading through all of this and was wondering why we use "sleep" (https://docs.zephyrproject.org/latest/reference/terminology.html#sleep). Sleeping is not a documented thread state. Is this the same as _THREAD_PENDING where a thread is waiting on an object?

andrewboie commented 3 years ago

@pabigot was reading through all of this and was wondering why we use "sleep" (docs.zephyrproject.org/latest/reference/terminology.html#sleep). Sleeping is not a documented thread state. Is this the same as _THREAD_PENDING where a thread is waiting on an object?

AFAIK yes

pabigot commented 3 years ago

The terminology was introduced in #21678 based on analysis in #20969. The term "sleep" ultimately came from here.

To precisely describe what "put the caller to sleep" means to a user it became "a voluntary transition to a non-running active thread state".

The documented thread states relevant to "sleep" are transition to either "waiting" and "suspended". Technically it would also cover "ready" but AFAIK there is no API that allows voluntary direct transition from running to ready.

So the terminology does not reflect any internal kernel state names that implement the thread state machine. It's perhaps not too late to change, but I suspect "sleeps" will be more clear to people than "pends".

pabigot commented 3 years ago

Shorter: "sleeps" was intended to also describe a function that invokes k_thread_suspend(_current); which would instead be _THREAD_SUSPENDED.