Define and document expected behavior of API calls in various contexts

zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.

https://docs.zephyrproject.org

Apache License 2.0

10.8k stars 6.58k forks source link

Define and document expected behavior of API calls in various contexts #18970

Open pabigot opened 5 years ago

pabigot commented 5 years ago

This issue is intended to cover the concerns raised in #1960, #3694, #6184, #17014, and related lack of clarity and consistency.

Proposal: We should have:

Definitions of terms; see #20959
A policy for when (kernel and driver) API functions are thread-safe/reentrant. Some are (to varying degrees), some are not.
At the individual function level we should clearly document:
- Can a call block?
- Any situation where the call doesn't block but also will not return for a very long time (e.g. chip-level flash erase)
- Can the function be invoked from an ISR? What behavior is guaranteed in that situation?
- Are the operations of the function atomic?
- The context in which any callback is invoked.
- Any conditions that must be satisfied before the function is invoked, e.g. holding a lock.

Also are k_is_in_isr() and k_is_preempt_thread() sufficient to determine whether it's safe to proceed, or do we also need k_is_atomic(), #17301, or other technologies?

Example policies might be:

All driver functions must be invokable from ISRs, but-EWOULDBLOCK shall be returned in situations where the requested operation cannot be successfully completed in that context.
All GPIO manipulation functions must be atomic.

If we can define the normal behavior at either a system or driver-specific level, only the exceptions need be documented at the individual function level.

mbolivar commented 4 years ago

thread-safe: (TBD) a function is thread-safe if its behavior is correct when invoked from multiple threads at the same time

reentrant: (TBD) a function is reentrant if its behavior is correct when it is invoked by (indirect) recursion from the same thread

+1 to these.

interrupt-safe: (TBD) a function is interrupt-safe if its behavior is not affected by concurrent access from interrupts

No objections, but can you provide an example API function where you might want to use this term? I would expect all the driver APIs I can think of offhand to be interrupt-safe, at least when called from thread context.

* block: (TBD) a call blocks if it can suspend the invoking thread
  while it waits for something to happen

I would suggest "might sleep" (or "may sleep") instead.

This is more familiar from Linux kernel source code (648 matches for "might sleep" in v5.3), and has additional advantages beyond that.

In particular, as discussed for clock_control_on() at today's API meeting, some people (including me) thought "block" might simply mean "don't return until the operation is complete". This would include functions which busy-wait until the job is done (like k_spin_lock() might want to do on SMP).

By contrast, "might sleep" makes it clearer that the caller's thread might get scheduled out, and thus further hints the caller must be running in thread context.

At the individual function level we should clearly document:

Can a call block?

So this would become e.g.:

/**
 * @brief Enable a clock controlled by the device
 *
 * On success, the clock is enabled and ready when this function
 * returns.
 *
 * This function :ref:`might-sleep`.
 *
 * @param dev Device structure whose driver controls the clock.
 * @param sys Opaque data representing the clock.
 * @return 0 on success, negative errno on failure.
 */
static inline int clock_control_on(struct device *dev,
                   clock_control_subsys_t sys)

Where the Sphinx might-sleep ref would link directly to a common definition for the term.

Also are k_is_in_isr() and k_is_preempt_thread() sufficient to determine whether it's safe to proceed, or do we also need k_is_atomic(), #17301, or other technologies?

+1 to k_is_atomic() instead of k_is_in_isr().

mbolivar commented 4 years ago

+1 to k_is_atomic() instead of k_is_in_isr().

Never mind, I take it back. https://lwn.net/Articles/274695/

Once these mistakes are cleared up, there is still the question of just how kernel code should decide whether it is running in an atomic context or not. The real answer is that it just can't do that. Quoting Andrew Morton again:

The consistent pattern we use in the kernel is that callers keep track of whether they are running in a schedulable context and, if necessary, they will inform callees about that. Callees don't work it out for themselves.

pabigot commented 4 years ago

interrupt-safe: (TBD) a function is interrupt-safe if its behavior is not affected by concurrent access from interrupts

No objections, but can you provide an example API function where you might want to use this term? I would expect all the driver APIs I can think of offhand to be interrupt-safe, at least when called from thread context.

We need to be able to say succinctly "Unless otherwise specified all API functions are interrupt-safe" and expect people to know what that means. A specific example would be the GPIO API, where it is not clear to implementers that, because GPIO write functions may be invoked from ISRs, read-modify-write code like:

u32_t out = gpio->OUT;
gpio->OUT ^= (out & ~mask) | (value & mask);

must be wrapped in a spin-lock to be interrupt-safe. Most implementations don't do that, even in the ongoing GPIO rewrite, in part because we couldn't get agreement in #19252 on what "interrupt-safe" meant and whether it was important to document that expectation before people started implementing drivers.

* block: (TBD) a call blocks if it can suspend the invoking thread
  while it waits for something to happen
I would suggest "might sleep" (or "may sleep") instead. This is more familiar from Linux kernel source code (648 matches for "might sleep" in v5.3), and has additional advantages beyond that.

In particular, as discussed for clock_control_on() at today's API meeting, some people (including me) thought "block" might simply mean "don't return until the operation is complete". This would include functions which busy-wait until the job is done (like k_spin_lock() might want to do on SMP).

My intent was to identify functions that will not transfer control from a cooperative thread. "blocking" is a poorly chosen term. Scrap it and start over.

Zephyr uses several terms including "Waiting", "Suspended", and "Ready" to identify started non-terminated non-running threads. The transitions we're interested in are "suspend" and "wait". Let's use "yield" for "suspends or waits".

There are three cases:

The thread invoking a given function will not yield;
The thread invoking a given function may yield;
The thread invoking a given function will yield.

So we can do:

yields (thread): A thread yields when it invokes an operation that transitions it to Suspended or Waiting.
pre-empted: A thread is pre-empted when it transitions to Ready without having yielded.
yielding (function): A function is yielding when it invokes (directly or indirectly) an operation that causes the executing thread to yield.
non-yielding (function): A function is non-yielding when it is not yielding.

It's non-yielding that we care about, because those functions guarantee, when invoked by a cooperative thread, that no other cooperative thread will be allowed to run before the function returns. (Absent bugs like #20255.)

I'm not sure exactly how functions should be annotated with respect to their behavior, but I would want it to be terse and consistent.

+1 to k_is_atomic() instead of k_is_in_isr().

Never mind, I take it back. https://lwn.net/Articles/274695/

Yes, but that relies on having API to provide information about context to callees, which we probably don't have. To be considered later.

mbolivar commented 4 years ago

A specific example would be the GPIO API

Excellent example, thank you.

Summarizing the status:

thread-safe: OK as-is
re-entrant: OK as-is
interrupt-safe: OK as-is
atomic: OK as-is
blocking: scrap it, need a different term
k_is_atomic(): no, not yet (and to clarify: in my mind, not ever, being convinced by the LWN article)

There are three cases:

* The thread invoking a given function will not yield;
* The thread invoking a given function may yield;
* The thread invoking a given function will yield.

Did you mean "The function invoked by a given thread" instead of "The thread invoking a given function"?

yielding (function): A function is yielding when it invokes (directly or indirectly) an operation that causes the executing thread to yield.

non-yielding (function): A function is non-yielding when it is not yielding.

With the given definition of "yield" ("suspends or waits"), I agree these are the desired semantics.

I also agree that defining terms in accordance with documented thread states and related APIs is a great approach.

However, I'm still bullish on "might sleep" terminology-wise. I worry "yielding" is too easy to confuse with "may call k_yield()".

pabigot commented 4 years ago

I've integrated the discussion so far into this document which is available for review through #20959 so people can comment on specific concerns in context. I think that'll be a bit more convenient then chained comments on this issue.

There are three cases:
* The thread invoking a given function will not yield;
* The thread invoking a given function may yield;
* The thread invoking a given function will yield.
Did you mean "The function invoked by a given thread" instead of "The thread invoking a given function"?

No. The operation of yielding is something that directly affects threads. The impact on functions is derived.

However, I'm still bullish on "might sleep" terminology-wise. I worry "yielding" is too easy to confuse with "may call k_yield()".

A problem is that traditionally Zephyr has used "sleep" to specifically mean "wait for a timeout to occur". I would rather that "suspend" was more general, but currently it specifically references waiting without an event.

carlescufi commented 4 years ago

API meeting 19th May:

Plan of action to document the expected behavior of public API functions in Zephyr:

[x] Define the format to be used in the header files (#25411)
[x] Begin with a slice of the kernel, for example k_pipe and k_mutex (TBD). @pabigot volunteers to do this and then have it reviewed by all.
[ ] Evaluate any changes required to the property list (or their definitions)
[ ] Complete the kernel
[ ] Drivers and thread-safety discussion. Where should this be documented (in a single spot for all drivers minus exceptions or function-by-function using funcprops).
[ ] Assign each subsystem to their respective maintainers, keep a table that is updated with the current state of the process

During the 2.4 development cycle, begin each API meeting with a status update on the function properties.

pabigot commented 4 years ago

Here's my assessment of k_mutex and k_pipe:

function	syscall	reschedule	sleep	no-wait	isr-ok	async
k_mutex_init	YES	no	no	yes	yes
k_mutex_lock	YES	yes	YES	YES	NO*
k_mutex_unlock	YES	YES	no	yes	NO*
k_pipe_init	no	no	no	yes	yes
k_pipe_cleanup	no	no	no	yes	yes
k_pipe_alloc_init	YES	no	no	yes	yes
k_pipe_put	YES	yes	YES	YES	yes
k_pipe_get	YES	yes	YES	YES	yes
k_pipe_block_put	no	yes	YES	no	no	YES
k_pipe_read_avail	YES	yes	YES	no	no
k_pipe_write_avail	YES	yes	YES	no	no

Some notes on the table

Upper case indicates a primary attribute setting. Lower case indicates a derived setting. Derived values include:
- reschedule is yes if sleep is YES; it is YES only in some cases where sleep is no
- sleep defaults to no. Whether suspending in a function like k_sem_give() that is reschedule is something that should be sleep is open for clarification.
- no-wait is yes if not sleep. It may be YES or no if sleep
- isr-ok is yes if either no-wait or not sleep. The two NO are an inconsistency that may be removed via #25678
- async is rare and only one of these functions has it.
syscall is just informational, not one of the standard function attributes
I've ignored pre-kernel-ok which is probably generally equivalent to isr-ok, but deep inspection is required to ensure it doesn't do something that isn't available until the main thread has started.
I've ignored supervisor which also requires inspection. I assume anything that's syscall is not supervisor, but the others may or may not be.

Determining the behavior of the API is not trivial. Some of the above may be wrong. Some were surprises, like the functions to read the number of spaces available for read and write in k_pipe, which can block for an unbounded period due to an internal call to an allocation function implemented as k_stack_pop() with a K_FOREVER wait.

I've identified 181 functions in kernel.h that could be categorized. There are some interesting inconsistencies where some functions are syscalls but other seemingly parallel ones are not. Some of the convenience wrappers like k_fifo and k_lifo are inconsistently implemented.

This is gonna be really unpleasant.

carlescufi commented 3 years ago

@andyross and @andrewboie could you please review the contents of this table? The terminology is further explained here: https://docs.zephyrproject.org/latest/reference/terminology.html

nashif commented 3 years ago

@pabigot was reading through all of this and was wondering why we use "sleep" (https://docs.zephyrproject.org/latest/reference/terminology.html#sleep). Sleeping is not a documented thread state. Is this the same as _THREAD_PENDING where a thread is waiting on an object?

andrewboie commented 3 years ago

@pabigot was reading through all of this and was wondering why we use "sleep" (docs.zephyrproject.org/latest/reference/terminology.html#sleep). Sleeping is not a documented thread state. Is this the same as _THREAD_PENDING where a thread is waiting on an object?

AFAIK yes

pabigot commented 3 years ago

The terminology was introduced in #21678 based on analysis in #20969. The term "sleep" ultimately came from here.

To precisely describe what "put the caller to sleep" means to a user it became "a voluntary transition to a non-running active thread state".

The documented thread states relevant to "sleep" are transition to either "waiting" and "suspended". Technically it would also cover "ready" but AFAIK there is no API that allows voluntary direct transition from running to ready.

So the terminology does not reflect any internal kernel state names that implement the thread state machine. It's perhaps not too late to change, but I suspect "sleeps" will be more clear to people than "pends".

pabigot commented 3 years ago

Shorter: "sleeps" was intended to also describe a function that invokes k_thread_suspend(_current); which would instead be _THREAD_SUSPENDED.