Discussion: Assertions, error values, runtime checks

andrewboie commented 5 years ago

Introduction

Purpose of this ticket is to discuss changes to our policies for kernel APIs as they relate to returning error values.

Problem description

Currently, we have the following:

Base kernel APIs assume well-formed parameters, with little to no error checking. At most, we are using __ASSERT() which will generate a fatal error with no opportunity for recovery. In production assertions will be disabled, and the system will essentially be running with a false sense of optimism that nothing will go wrong. This does have the advantage of saving code size.
System calls were implemented on top of the existing APIs. Runtime checks that were previously assertions will be unconditionally evaluated, but typically if they fail will invoke Z_OOPS() and kill the calling thread.

Proposed change

With Zephyr moving towards functional safety certifications, we should allow for robust error checking, with APIs being augmented to return error codes.

1) APIs that return void will be converted to return an integer indicating success or failure, with failure returning a negative errno code. 2) APIs that return a pointer, should return NULL on an error condition. We might consider populating errno with a more detailed error code. 3) All current assertions in API implementation functions should be converted to return an error condition instead. 4) System call checks which duplicate implementation function assertions can be removed since the implementation function is always doing the checking now. 5) Additional system call checks should propagate error values instead of invoking Z_OOPS().

Some random thoughts for discussion

We need to decide if there are any valid uses of Z_OOPS() in system calls. Do we keep using Z_OOPS() for memory access issues, or kernel object validation checks?
We will need consistency on what error values are returned. For example, if we decide to return an error if memory access fails, every API should return -EACCES.
Returning errors instead of failing assertions/Z_OOPS() adds additional responsibility to the end user.
Additional code size due to adding runtime checks and propagation of return values
Is there a way we can add flexibility on what runtime checks do? Consider the following:

    if (!__ASSERT(foo == bar, "wrong foo!")) {
        return -EINVAL;
    }

We could define __ASSERT three ways, selectable by Kconfig:

1) The default case has __ASSERT() evaluate to the expression in its first argument, cast to a boolean. So if foo != bar it evaluates to false and we return -EINVAL. This is the robust configuration. The string could either be logged or discarded by the compiler.

2) The second case would be for constrained applications which will enable assertions in the debug phase, but otherwise not do return value propagation. __ASSERT() will work like it used to and immediately generate a fatal error =, printing the string, if the check fails.

3) The third case will be for constrained applications where code size must be optimized at all costs. Here __ASSERT() will simply be defined as 'true'. The compiler will see this and not compile anything at all, it will be as if it wasn't there.

andyross commented 5 years ago

One problem with making an assertion into an early return is cleanup. Consider ("normal" C for clarity, not Zephyr code):

int foo(int arg)
{
    struct whatever *result = malloc(sizeof(struct whatever));

    // ...

    __ASSERT(arg != INVALID_ARG, "");
}

Now the code is going to leak the resource it acquired because it bailed too soon.

It's not insoluble, but it's messy. We could require that code never assert in a context where a synchronous return would be illegal (too limiting and too difficult a rule to follow). We could have an ASSERT_CLEANUP macro or something that worked like a FOREACH and had a block at the end (weird, but IMHO kinda clever).

Farther afield there are GCC extensions that allow you to define code to run at block exit we could try to leverage (sort of a hook to the C++ destructor invocation logic that the compiler has), and of course in the C++ world there are exceptions which speak to this need too.

But really I think my biggest worry here is that most assertions don't reflect things we want to check at runtime at all. They're bug catchers describing an invariant in the code. Right now we all feel free to sprinkle these around because we know they'll never end up in production and it makes us feel safer to have the extra layer of validation when the code changes. But if we knew we had to be careful with them... would we just stop using them? I think that would make the code worse, not better.

ceolin commented 5 years ago

One problem with making an assertion into an early return is cleanup. Consider ("normal" C for clarity, not Zephyr code):
int foo(int arg)
{
    struct whatever *result = malloc(sizeof(struct whatever));

    // ...

    __ASSERT(arg != INVALID_ARG, "");
}
Now the code is going to leak the resource it acquired because it bailed too soon.

It's not insoluble, but it's messy. We could require that code never assert in a context where a synchronous return would be illegal (too limiting and too difficult a rule to follow). We could have an ASSERT_CLEANUP macro or something that worked like a FOREACH and had a block at the end (weird, but IMHO kinda clever).

We should not use "ASSERT", we should create a new macro, e.g Z_CHECK_FAIL, Z_CHECK_FAIL_GOTO ... assert is a well macro defined in POSIX and in C standard change its behaviour will make people confuse.

Farther afield there are GCC extensions that allow you to define code to run at block exit we could try to leverage (sort of a hook to the C++ destructor invocation logic that the compiler has), and of course in the C++ world there are exceptions which speak to this need too.

cleanup extension is cool but we should avoid it IMHO, we have to be compiler agnostic. And even if we do some black magic to have it portable this will probably trigger a lot of violations to C code standard.

But really I think my biggest worry here is that most assertions don't reflect things we want to check at runtime at all. They're bug catchers describing an invariant in the code. Right now we all feel free to sprinkle these around because we know they'll never end up in production and it makes us feel safer to have the extra layer of validation when the code changes. But if we knew we had to be careful with them... would we just stop using them? I think that would make the code worse, not better.

ceolin commented 5 years ago

4. System call checks which duplicate implementation function assertions can be removed since the implementation function is always doing the checking now.

That was my initial concern, where to do the checks, in syscalls or in the implementation itself

andrewboie commented 5 years ago

That was my initial concern, where to do the checks, in syscalls or in the implementation itself

I think they should be in the implementation where possible.

b0661 commented 5 years ago

But really I think my biggest worry here is that most assertions don't reflect things we want to check at runtime at all. They're bug catchers describing an invariant in the code.

Invariant checks as currently be done by _ASSERT are an improvement for functional safety, they should not be removed. They may be triggered because of RAM/ROM bit flips or some unforeseen computation sequence. The problem is how to react on them. From my experience in functional safety development - if an invariant check triggers there is a real problem in the system. Usually the only option is to bring the system to the safe state. For availibility reasons and in case the invariant check triggered in a non safety thread you may decide just to restart or stop the thread.

pabigot commented 5 years ago

It does me no good if a device ends up in la-la-land because of an undetected/unhandled error, or in a halted state after printing an error message to a disconnected console, so I agree with this:

From my experience in functional safety development - if an invariant check triggers there is a real problem in the system. Usually the only option is to bring the system to the safe state.

To some degree the determination of safe state is application-specific. In one framework I've built a detected failure enters a path that records where the error was detected along with information about the error (e.g. the type of failure). This is placed in no-init memory along with other metadata; then the system resets and during startup this area is checked for recorded faults which are then interpreted and logged, and the application proceeds to whatever it wants as a "safe state" (which may include proceeding with normal functionality if recovery could be completed during startup).

This API is available for both system and application failures. A framework for something like this would be nice to have in Zephyr.

yashi commented 5 years ago

What we want is not ASSERT but as @ceolin described a variant of "Design by Contract" from Eiffel.

And "What to do when something goes wrong" is application specific, meaning that in some case we want to abort with backtrace so that it allows us to see what's wrong In other case, we want the system to run as much as possible recovering from the error (safety applications). We might want to log what's going on in some system and might not for another. In yet another use case, we might disable all together for performance.

So multiple implementations and selectable by Kconfig is good. But what we should provide is a pluggable system so that users can implement their own version of it according to their requirements.

My question is: does anybody know a good practice to implement DbC in C?

aescolar commented 5 years ago

@andrewboie There are issues with this proposal. You mentioned yourself one quite critical: code size. But the overall issue is trying to define a one-size-fits-all approach for this. As Zephyr targets a wide range of applications and platforms constraints, just adding unconditionally compiled runtime checks would be a killer for some of them. Similarly not having asserts is also not an option for others.

What I think we need is to:

Provide sane defaults but:
- Let the application developer decide if and what asserts or runtime checks are there.
- Let the application developer decide what to do in case of an assert (we do that today)
Categorize asserts by type, and have build time config. options to select which are compiled in.
Add run time checks (which will overlap with some of these asserts), which return error codes, but which can also be configured out at build time.

This will, of course, add extra paths to test, but I certainly think it will be worth it given Zephyr’s targets.

Silly pseudo code snippet as example of what I mean:

int k_foo(void *ptr)
{
  ASSERT(ptr == NULL, ASSERT_INPUT_PARAM_CHECKS_ON);
  if (RUNTIME_INPUT_PARAM_CHECKS_ON && ptr == NULL) {
    return EINVAL;
  }
  // whatever else
  return 0;
}
//the ASSERT and runtime check could be merged into a single macro, this is just an example

Stating what will be obvious to many:

Adding runtime checks adds extra code and cycles.
Zephyr targets all the way from a few KBs to MBs of RAM
Many of those runtime checks will be repeatedly testing paths which are not going to ever trigger in the field because the app has been properly designed and tested (save for memory corruption issues)
A returned error code in many cases
- will only be handled by the app by raising an ASSERT anyhow (certainly for cases where such an error code is in no way expected)
- In other cases they would be just not handled at all
Many assert are added just as development aids (as inbuilt tests for regressions)
Some apps do not care at all about some types of ASSERTs, some others will just want to log them, some others will need to immediately bring the device into a safe state and maybe remotely log the fault. Only the app developer knows what he needs.
Constrained applications may not want strings with error descriptions, but may be much happier with the program counter where the assert triggered (and for some use cases the assert type).

andrewboie commented 5 years ago

@andrewboie There are issues with this proposal. You mentioned yourself one quite critical: code size.

@aescolar let's get some data first before we start worrying about code size. @nashif did some initial investigation and found that the code size penalty to propagating return values was trivial

marc-hb commented 5 years ago

The string could either be logged or discarded by the compiler.

A message catalog is a common alternative that saves memory while preserving human-readable information, example:

https://en.wikipedia.org/wiki/MIPI_Debug_Architecture#Visibility_of_SoC-internal_operations http://debugger.iul.intel.com/documentation/npk-sdk/doc/sventx/html/a00034.html

andyross commented 9 months ago

This is still live. We still lack a framework for "recoverable runtime errors". In the modern world that's mostly a userspace thing, so maybe @dcpleung would be a better owner?

zephyrproject-rtos / zephyr