Introduce `errno.h`-compatible ranges for subsystems

carlescufi commented 1 year ago

Error numbers in Zephyr depend on the C library being used, but a script ensures that the minimal libc errno.h always matches the corresponding newlib errno.h. In fact, newlib defines a starting point for a user-defined range.

The proposal here would be to allow subsystems to reserve ranges of the int returned on almost every Zephyr function so errno.h in order to provide subsystem-specific error codes, like so:

#define ERRNO_BT_BEGIN 0x1000
#define ERR_BT_FAILED_TO_PAIR ERRNO_BT_BEGIN+1

#define ERRNO_WIFI_BEGIN 0x2000
#define ERR_WIFI_WPA_ERROR ERRNO_WIFI_BEGIN+1

...

carlescufi commented 1 year ago

CC @stephanosio @keith-packard

galak commented 1 year ago

Can you clarify if there would be one range of values that all subsystems would be using, or if it would be:

SUBSYSTEM_A: 1000..1999 SUBSYSTEM_B: 2000..2999 etc..

carlescufi commented 1 year ago

Can you clarify if there would be one range of values that all subsystems would be using, or if it would be:

SUBSYSTEM_A: 1000..1999 SUBSYSTEM_B: 2000..2999 etc..

My proposal is that each subsystem has a range assigned, like in your example with SUBSYSTEM_A and SUBSYSTEM_B. I will edit the description to clarify that.

stephanosio commented 1 year ago

Can you explain why this is necessary?

As far as I can see, subsystem-specific error numbers do not need to be globally unique -- they just need to be not in conflict with the "standard" error numbers defined in the errno.h.

If a subsystem needs to return a non-standard error number (i.e. not part of errno.h), it is free to define its own E... macros in a custom header file and set the errno variable -- application can read that value after calling the subsystem function, at which point it knows what to expect.

keith-packard commented 1 year ago

It is somewhat helpful to make them globally unique; it makes debugging code paths that expect errno to be set when it isn't (and errno is left with some previous value). And, you can implement strerror too. However, that latter is a lot easier to implement if error numbers are dense and managed in a single place. A modest process for allocating new errno values across the whole system that encouraged unifying common definitions across subsystems could work here?

I'm not sure it matters if errno values are the same across C libraries, as long as the whole system uses the C library errno.h file and not a Zephyr errno file.

carlescufi commented 1 year ago

Can you explain why this is necessary?

This issue is basically here to see if others agree with the concept of subsystems defining their own E... macros and to make sure that different subsystems' errors do not overlap.

The reasons I think they should not overlap are:

Simpler debugging: You can see the error code and identify its meaning without needing to know what function had been called
Reuse of a subsystem's error codes by another subsystem. This can be very useful in some cases, like for example Bluetooth's IP bridge returning network-specific error codes

As far as I can see, subsystem-specific error numbers do not need to be globally unique -- they just need to be not in conflict with the "standard" error numbers defined in the errno.h.

Well, and they must not overlap with other subsystem's error numbers as well, right?

If a subsystem needs to return a non-standard error number (i.e. not part of errno.h), it is free to define its own E... macros in a custom header file and set the errno variable -- application can read that value after calling the subsystem function, at which point it knows what to expect.

I am really proposing this except that there would be ranges defined in a common header file to ensure that the errors do not overlap across subsystems.

carlescufi commented 1 year ago

It is somewhat helpful to make them globally unique; it makes debugging code paths that expect errno to be set when it isn't (and errno is left with some previous value). And, you can implement strerror too.

Exactly I just wrote that above before seeing your comment, completely agree.

However, that latter is a lot easier to implement if error numbers are dense and managed in a single place. A modest process for allocating new errno values across the whole system that encouraged unifying common definitions across subsystems could work here?

Indeed, I would suggest some include/errno_ranges.h or something similar to that.

I'm not sure it matters if errno values are the same across C libraries, as long as the whole system uses the C library errno.h file and not a Zephyr errno file.

As long as our Zephyr-specific ones are well above the highest one in each of the C libraries we support, I agree.

stephanosio commented 1 year ago

Ok, so the proposal is not just to reserve the error number ranges for subsystems, but also to add subsystem-specific error numbers to the global errno.h (and potentially handle them in strerror as well).

Then I agree it makes sense; but, I am failing to see how we are going to do that for third-party toolchains.

keith-packard commented 1 year ago

Then I agree it makes sense; but, I am failing to see how we are going to do that for third-party toolchains.

Agreed. Maybe have two solutions, one for integrated toolchains and one for external toolchains? External toolchains might not have a way to make strerror provide useful messages for Zephyr additions, and that might require picking a starting errno using internal knowledge about the external toolchain errno range.

carlescufi commented 1 year ago

Ok, so the proposal is not just to reserve the error number ranges for subsystems, but also to add subsystem-specific error numbers to the global errno.h (and potentially handle them in strerror as well).

Exactly!

Then I agree it makes sense; but, I am failing to see how we are going to do that for third-party toolchains.

Well the error ranges will be defined in Zephyr, so it will be a matter of verifying that other toolchains' C library's highest errno doesn't collide with our lowest range?

henrikbrixandersen commented 1 year ago

What would our policy on using subsystem-specific error numbers be? Are all subsystem functions required to use subsystem-specific error numbers or can they continue to use standard errno values?

Do we need to take support for out-of-tree error numbers into account?

fabiobaltieri commented 1 year ago

Architecture WG:

@stephanosio: this may not be sustainable, unclear what we really gain from that, can't have those in third party toolchains
@henrikbrixandersen the idea is clear but what would be our policy for this? it would be hard to not clutter the implementation with this, some vendor used have something similar and it does not seem to be useful (where is this error code coming from?)
@nashif unclear what problem are we solving here? It may require a completly differnt solution, how would we check the range and maintain it?

AI: waiting for @carlescufi to give more details on this

gregshue commented 1 year ago

Do we need to take support for out-of-tree error numbers into account?

Absolutely. Since the Zephyr ecosystem is extensible, we must consider support for out-of-tree-defined error numbers. Ensuring the number ranges do not overlap leads to a global registry/assignment of unique subsystem IDs (used in the first part of the errno). I expect that is a non-starter for proprietary subsystems and interfaces. Things get more complicated when we consider forks of existing subsystems (e.g., ncs fork of the entire Zephyr repo).

unclear what problem are we solving here?

Agreed. Objectively stating the "user need" and benefit (i.e., the problem) will help greatly.

keith-packard commented 1 year ago

Would it be best to leave errno to libc and encourage a different error reporting mechanism in Zephyr APIs?

gregshue commented 1 year ago

The mechanism is to provide failure/success indication in the function return value. The de-facto standard for this is to return an int. Doing otherwise may complicate making certifiable safety-critical code (as well as interfacing with lots of existing internal and external code, and having to adjust the thread-unique errno/error code storage). Errno is a convenient set of existing, reserved meanings to start with.

In practice, OOT modules may have contradictory meanings for the same result value. Avoiding that requires global management of the result values. :-(

keith-packard commented 1 year ago

ok, so you aren't actually touching errno here, just using a convenient set of values which are supposed to be globally unique. I'm not sure what to suggest, other than it seems like you might be happier using a completely disjoint set of numbers from those provided by the C library unless you're willing to insist that the C library conform to your errno instead of trying to wedge zephyr errno values around the existing (and capricious) C library values. That's certainly how it works for other operating systems ...

gregshue commented 1 year ago

AFAICT the proposal has nothing to do with the usage of the global (or thread-unique) variable errno, or with touching the public header errno.h. It even mentions that newlib's errno.h identifies a starting value for user-defined values. Rather, the proposal is to partition the range of user-defined values into sub-ranges for each subsystem. That seems to be the extent of it.

allow subsystems to reserve ranges of errno

This part of the proposal actually doesn't fit well with the implicit architectural pattern of Zephyr. The implicit requirements on the Zephyr Project are to have an architecture that is at least:

composable (substitutable subsystems/components & toolchains)
extensible (e.g., OOT modules)
scalable
configurable
reusable

According to the Software Engineering Body of Knowledge (SWEBoK), v3.0 (also published as ISO/IEC TR 19759), one approach to fulfilling these characteristics is through creating a "software product line", and specifically mentions OO programming as related. This approach is consistent with the Zephyr API Design Guidelines. So it appears the implicit architectural pattern of Zephyr is OO.

When following the OO pattern, the clients need to know about the API and exposed behavior behind it and know nothing about any specific implementation of it. The client must not care which subsystem implements an API, just that the API is provided. This means the extended error codes need to be specified at the API rather than by a specific component.

Because we still must allow for OOT APIs to co-exist in the executable, we probably still cannot guarantee uniqueness AND consistency of extended error code values/meanings.

zephyrproject-rtos / zephyr

Introduce `errno.h`-compatible ranges for subsystems #55858