zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.5k stars 6.43k forks source link

How not to break "out of tree" users #48887

Open nashif opened 2 years ago

nashif commented 2 years ago

Provide some guarantees, guidelines and a process keeping out of tree users operational while the zephyr project code advances with new technologies, code cleanups and other major code and API changes.

Out of tree users are not limited to only drivers, we have users with their own subsystems, architectures, toolchains, SoCs, boards, drivers, driver subsystems etc. Any change in zephyr might break such users if changes are not following a deprecation process, announcements and a grace period given (deprecation period in many cases) to those users to adapt to the new interfaces or upstream code.

The process should find the sweet spot which allows the project to advance with its agenda and roadmap while allowing users to adapt to change.

gmarull commented 1 year ago

Here we a good example of possible issues: #61413

  • A change removes the need for one kconfig needed for something, but it seems someone else was relying on this Kconfig
  • A call to a configurable hook that has the z_ namespace (internal functions in zephyr). The hook z_arm_on_enter_cpuidle in nrfconnect/sdk-nrf@bd581bc is one of those that should have its own namespace and should not be z_ prefixed. Letting it appear internal will make it easy for someone to just kill it, rename it etc and thus break downstreams without any further notice. Additionally, in that source file ('subsys/debug/etb_trace/etb_trace_lp.c') it appears as an internal function , i..e no indication this is actually a hook defined and called somewhere else (by architecture code in Zephyr)

I think this is precisely not an example. Implementation was relying on Kernel internals (_kernel.idle field), something that should not be done. If people are relying on Zephyr internals, it's really their problem if things break.

edit: also, for reference, the action on our side will be to propose a public, documented interface upstream that allows to access this information. I think it's the right way to go, the Kernel is free to do whatever it wants with its internal structs, and nobody should expect any kind of stability in there.

nashif commented 1 year ago

I think this is precisely not an example. Implementation was relying on Kernel internals (_kernel.idle field), something that should not be done. If people are relying on Zephyr internals, it's really their problem if things break.

edit: also, for reference, the action on our side will be to propose a public, documented interface upstream that allows to access this information. I think it's the right way to go, the Kernel is free to do whatever it wants with its internal structs, and nobody should expect any kind of stability in there.

Using _kernel.idle, is definietly a problem, but I am not specifically talking about that here. This is about the z_arm_on_enter_cpuidle hook itself, not its contents

Refering to the few discussion points raised here:

@nashif - architecture features are defined by API (irq enable/disable, sys IO). Architecture APIs are internal, but need a life cycle/stability process. What's the policy for extending arch interfaces? @nashif - need to mark internal APIs clearly so users no not to use them

gmarull commented 1 year ago

My five 5 cents before I go back to idle on this topic:

I think it's nice to be kind to developers, e.g. offering deprecation and so on. But I think the concept of stability is going too far, in a way that could shoot ourselves back. Let me take an example. Imagine we declared the regulators API stable. If you look into more detail, you'll find that the API has a "public" layer, and a secondary layer only used by drivers (aka the "driver/common layer"). To me, the driver layer should never become stable. Yes, this means we can break out of tree drivers (not clients). The maintainer should be free to decide whether it's worth breaking internals or not, there are many variables to evaluate when doing changes. If you/your company decided to create a regulator driver downstream, it should assume its costs and not try to push them upstream. As an API client, though, you'll have better guarantees. Another example would be out of tree SoCs. It's a nice feature, e.g. when developing new chips, but IMHO this should not be part of any stability contract. Simply upstream your code if things are public, or deal with the cost. Finally, I also think that offering too many downstream guarantees deincentivates upstreaming, i.e. bad for the project.

gmarull commented 1 year ago

I think this is precisely not an example. Implementation was relying on Kernel internals (_kernel.idle field), something that should not be done. If people are relying on Zephyr internals, it's really their problem if things break. edit: also, for reference, the action on our side will be to propose a public, documented interface upstream that allows to access this information. I think it's the right way to go, the Kernel is free to do whatever it wants with its internal structs, and nobody should expect any kind of stability in there.

Using _kernel.idle, is definietly a problem, but I am not specifically talking about that here. This is about the z_arm_on_enter_cpuidle hook itself, not its contents

This is still an internal architecture hook which should be reserved for SoC code (again a problem in that implementation, e.g. if run on nRF53 with an idle errata mitigation). I think expecting stability on these things is going too far.

nashif commented 1 year ago

I think it's nice to be kind to developers, e.g. offering deprecation and so on. But I think the concept of stability is going too far, in a way that could shoot ourselves back. Let me take an example. Imagine we declared the regulators API stable. If you look into more detail, you'll find that the API has a "public" layer, and a secondary layer only used by drivers (aka the "driver/common layer"). To me, the driver layer should never become stable. Yes, this means we can break out of tree drivers (not clients). The maintainer should be free to decide whether it's worth breaking internals or not, there are many variables to evaluate when doing changes. If you/your company decided to create a regulator driver downstream, it should assume its costs and not try to push them upstream. As an API client, though, you'll have better guarantees. Another example would be out of tree SoCs. It's a nice feature, e.g. when developing new chips, but IMHO this should not be part of any stability contract. Simply upstream your code if things are public, or deal with the cost. Finally, I also think that offering too many downstream guarantees deincentivates upstreaming, i.e. bad for the project.

You are trivilaizing the issue by limiting it to device driver and hardware support, the discussion here is about APIs in general, so let's first agree this is a not a one size fits all problem. So looking into the hardware support below, putting other APIs aside for now.

Often I see this being quoted: https://www.kernel.org/doc/Documentation/process/stable-api-nonsense.rst.

Executive Summary

You think you want a stable kernel interface, but you really do not, and you don't even know it. What you want is a stable running driver, and you get that only if your driver is in the main kernel tree. You also get lots of other good benefits if your driver is in the main kernel tree, all of which has made Linux into such a strong, stable, and mature operating system which is the reason you are using it in the first place.

This basically resembles what you are saying above and this works great for the linux Kernel I guess and I fully support that in the context of Linux. For Zephyr however, and given the nature of Zephyr as a FW OS, when it comes to hardware support and drivers this becomes a liability. We want vendors to upstream their SoCs and drivers, however, we also want Zephyr to be used where someone will not upstream any of their SoCs or drivers. Many users will have custom IPs in addition to what a vendor provides as part of upstream Zephyr, so it is not all or nothing when it comes to hardware support.

The balance here is that we need to maintain some level of stability to the hardware facing APIs, so that such out of tree users stick to using Zephyr and keep moving with upstream and ideally contribute back. If we keep breaking hardware support for out of tree users beause of the "stable APIs nonsense", users will just look the other way and end up using their own APIs and probably some other OS or even go back to baremetal and use HALs directly.

The executive summary here is that we need as a project to decide which way we want to go and who our users are and put this in the mission statement somewhere so there is no ambiguity about our goals and who our users are. The assumption that a user of Zephyr has to be "upstream" is wrong IMO and this is first issue we need to deal with and address.

nashif commented 1 year ago

I think this is precisely not an example.

still a good example because:

gmarull commented 1 year ago

I think this is precisely not an example.

still a good example because:

  • a change in upstream zephyr did break something downstream (the topic of this issue)

as expected, because it used non-public Kernel interfaces

  • incorrect usage of internal APIs (namespacing and misuse of internal functions)

Probably because there's a lack of proper docs. Sometimes is just developers taking shortcuts, even if they know it's not allowed. Shortcuts downstream are... the daily fare.

  • namespacing mess (_kernel is internal, z_arm_on_enter_cpuidle is also internal, so it is Ok to use z_arm_on_enter_cpuidle but not Ok to use _kernel, how more confusing can it get for a users/developer)?

Agree, we have a problem here. I'd say though the _kernel doesn't have a z_, so it's not technically internal but private. The ARM hook is internal, not sure if well documented on where/how it can be used.

nashif commented 1 year ago

Agree, we have a problem here. I'd say though the _kernel doesn't have a z_, so it's not technically internal but private. The ARM hook is internal, not sure if well documented on where/how it can be used.

z_ was introduced to replace _, so z_ in zephyr means private. (I have been using internal and private as the same)

gmarull commented 1 year ago

I think it's nice to be kind to developers, e.g. offering deprecation and so on. But I think the concept of stability is going too far, in a way that could shoot ourselves back. Let me take an example. Imagine we declared the regulators API stable. If you look into more detail, you'll find that the API has a "public" layer, and a secondary layer only used by drivers (aka the "driver/common layer"). To me, the driver layer should never become stable. Yes, this means we can break out of tree drivers (not clients). The maintainer should be free to decide whether it's worth breaking internals or not, there are many variables to evaluate when doing changes. If you/your company decided to create a regulator driver downstream, it should assume its costs and not try to push them upstream. As an API client, though, you'll have better guarantees. Another example would be out of tree SoCs. It's a nice feature, e.g. when developing new chips, but IMHO this should not be part of any stability contract. Simply upstream your code if things are public, or deal with the cost. Finally, I also think that offering too many downstream guarantees deincentivates upstreaming, i.e. bad for the project.

You are trivilaizing the issue by limiting it to device driver and hardware support, the discussion here is about APIs in general, so let's first agree this is a not a one size fits all problem. So looking into the hardware support below, putting other APIs aside for now.

Often I see this being quoted: https://www.kernel.org/doc/Documentation/process/stable-api-nonsense.rst.

Executive Summary

You think you want a stable kernel interface, but you really do not, and you don't even know it. What you want is a stable running driver, and you get that only if your driver is in the main kernel tree. You also get lots of other good benefits if your driver is in the main kernel tree, all of which has made Linux into such a strong, stable, and mature operating system which is the reason you are using it in the first place.

This basically resembles what you are saying above and this works great for the linux Kernel I guess and I fully support that in the context of Linux. For Zephyr however, and given the nature of Zephyr as a FW OS, when it comes to hardware support and drivers this becomes a liability. We want vendors to upstream their SoCs and drivers, however, we also want Zephyr to be used where someone will not upstream any of their SoCs or drivers. Many users will have custom IPs in addition to what a vendor provides as part of upstream Zephyr, so it is not all or nothing when it comes to hardware support.

The balance here is that we need to maintain some level of stability to the hardware facing APIs, so that such out of tree users stick to using Zephyr and keep moving with upstream and ideally contribute back. If we keep breaking hardware support for out of tree users beause of the "stable APIs nonsense", users will just look the other way and end up using their own APIs and probably some other OS or even go back to baremetal and use HALs directly.

The executive summary here is that we need as a project to decide which way we want to go and who our users are and put this in the mission statement somewhere so there is no ambiguity about our goals and who our users are. The assumption that a user of Zephyr has to be "upstream" is wrong IMO and this is first issue we need to deal with and address.

Well, I think working with Zephyr requires some minimal upstream connection, specially in complex scenarios (ie, not only application code). Of course, not everything needs to be upstream (neither on Linux, look at all the vendor linux BSP repos). But downstream has associated costs that can't be avoided, and pushing them upstream is not a good idea IMHO. Upstreaming culture is probably new in the fw embedded world, but I think we should encourage it. It's true that in Zephyr there's no strict boundary between user/kernel space, but still, you can architect code in different ways that can improve maintainability. An example I've seen many times is the usage of I2C/SPI APIs directly in applications. Doing that will just lead to troubles long term, you should be into driver development so that you expose more application oriented APIs to the application (e.g. sensors). We should probably have documentation on recommended patterns to avoid the common mistakes people do when coming from bare-metal or freertos+hal.

Btw, a really biased example, but that can be useful. I've kept updating https://github.com/teslabs/spinner on each Zephyr release as an experiment. The application strictly (or tries to) follow all Zephyr patterns (ie, example-application, custom driver classes, twister, etc.). All migrations have been pretty smooth, no big dramas. Of course, it's a tiny application.

gmarull commented 1 year ago

Agree, we have a problem here. I'd say though the _kernel doesn't have a z_, so it's not technically internal but private. The ARM hook is internal, not sure if well documented on where/how it can be used.

z was introduced to replace , so z_ in zephyr means private. (I have been using internal and private as the same)

so then... we need a way to distinguish private from internal, as it's not the same thing...

nashif commented 1 year ago

so then... we need a way to distinguish private from internal, as it's not the same thing...

right, z_arm_foo() or _arm_foo() are private, arch_foo() is internal.

gregshue commented 1 year ago

@nashif wrote (reformatted):

The executive summary here is that we need as a project to decide which way we want to go and who our users are and put this in the mission statement somewhere so there is no ambiguity about our goals and who our users are.

I definitely agree with the last part. there needs to be no ambiguity about our goals and who our users are. AFAICT, the mission statement is already clear about our users, and other statements in the charter are clear about the goals. Consider:

  1. The mission statement is clear that the primary target users are everyone/anyone using a 32b/64b mcu (no address remapping), particularly in a safety-critical or secure connected devices domains.
  2. Other statements in the charter are clear about the production of associated certification artifacts for safety (security) domains. This already dictates what the project must achieve: 2.1. A Zephyr release (repository SHA) and associated certification artifacts must be reusable by a product manufacturer, leading to supporting an OOT usage model. 2.2. Fail-safe and fail-secure are not always the same behavior, so a common configuration mechanism is necessary as well as a common aggregate architecture. 2.3. Certification can only be done on an executable, so providing certification artifacts depends on a qualified configuration/build system, a qualified toolchain, and one or more representative "application(s)". 2.4. Certifiably safe (secure) code must be developed under a safe (secure) SDLC that meets the constraints of the safety (security) standard. AFAICT, this means requirements-driven development for all code, tools, toolchains in that scope.

Also, at a practical level providing a secure device means manufacturers must deliver security updates in a timely manner (e.g., <= 90 days). This will drive Zephyr Project towards automated full verification of the solution (e.g., test suites, test runner, requirements tracing, etc). This automated verification must be repeatable by manufacturers ... in their environment (additional repos, OOT code, product settings,

This is a very broad scope, but not at all ambiguous.

fabiobaltieri commented 1 year ago

Process WG notes:

@wbober: spinning out a new ticket for internal APIs based on the comment, let's consider this an umbrella issue

wbober commented 1 year ago

Summary of the state as of 7/9/2023.

There is a consensus regarding the following points:

The process WG has agreed to prioritize defining policy for introducing changes in the following Zephyr programming interfaces:

In order to keep the discussion organized discussions on specific interfaces are tracked in separate tasks.

The following aspects has been raised but are parked until we address the prioritized items:

wbober commented 1 year ago

Updated the state in https://github.com/zephyrproject-rtos/zephyr/issues/48887#issuecomment-1681921914

mbolivar-ampere commented 1 year ago
* Devicetree bindings - no subtask yet

@wbober when you get around to this, let me know -- I've lost it now, but I had a prototype script that could detect a large class of "backwards-incompatible" changes to bindings which might be useful for automation. I can make time to resurrect it once the community has reached consensus on requirements.

wbober commented 10 months ago

@mbolivar-ampere We're starting to look into this. Are you able to dedicate a bit of time to get the script back?