zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

How not to break "out of tree" users #48887

Open nashif opened 2 years ago

nashif commented 2 years ago

Provide some guarantees, guidelines and a process keeping out of tree users operational while the zephyr project code advances with new technologies, code cleanups and other major code and API changes.

Out of tree users are not limited to drivers; we have users with their own subsystems, architectures, toolchains, SoCs, boards, drivers, driver subsystems, etc. Any change in Zephyr might break such users if it does not follow a deprecation process with announcements and a grace period (a deprecation period in many cases) that lets those users adapt to the new interfaces or upstream code.

The process should find the sweet spot which allows the project to advance with its agenda and roadmap while allowing users to adapt to change.

galak commented 2 years ago

I think part of the question here is what is considered part of the Zephyr "interface" beyond APIs? Is the build system, Kconfig, devicetree/devicetree bindings? Each of these areas could break an 'out of tree' user due to a change.

nashif commented 2 years ago

I think part of the question here is what is considered part of the Zephyr "interface" beyond APIs? Is the build system, Kconfig, devicetree/devicetree bindings? Each of these areas could break an 'out of tree' user due to a change.

IMO, all of the above.

If we decide for whatever reason to drop, let's say, a CMake macro (for example, zephyr_library_sources_ifdef), out of tree code using this macro will break. The same goes for Kconfig and devicetree. The severity of the breakage might vary, and we will agree that not everything we support needs deprecation; however, we need to be aware of and cautious about changes in general. The fact that a PR removing an interface passes CI is not a green light that it can be removed without consequences. We probably need to introduce some categories of changes that need more attention than others.

galak commented 2 years ago

Various thoughts/comments:

mbolivar-nordic commented 2 years ago

Process WG:

  1. Start the discussion by tackling APIs, with agreement that we need to consider other programming interfaces like Kconfig symbols and devicetree bindings, and keep them in the vision
  2. Make include/zephyr contain only public, user-facing APIs, move internal code out of there (new use of the treewide process!) -- definition of API tbd
  3. Make sure all APIs that remain have documentation and changes against them are checked, so no breaking changes to stable APIs, etc.
  4. Continue discussion in the issue, revisit next week

galak commented 2 years ago

@carlescufi mentioned this, but does API mean application facing or broader?

nashif commented 2 years ago

broader, for example the arch_ interface which is not application facing needs to be in scope as well, this is being used by OOT architectures for example. There are a few other interfaces beside that to consider.

gmarull commented 2 years ago

My five cents: I'd limit this to stable public APIs. Other changes should just be written in the release notes. In any case, if a deprecation is cheap, just do it, your users will appreciate it.

Working out of tree is convenient, but it comes with a maintenance cost.

Good read: https://www.kernel.org/doc/html/latest/process/stable-api-nonsense.html#what-to-do

marc-hb commented 2 years ago

@mmahadevan108 so cmake warnings don't break CI? @carlescufi pretty sure no, because we build all the time with asserts enabled

FWIW I saw these two warnings every day for about a year and none ever stopped anything:

warning: the int symbol CORE_COUNT (defined at src/platform/Kconfig:299) has a
non-int default MP_NUM_CPUS (undefined)
CMake Warning (dev) at CMakeLists.txt:12 (zephyr_library_include_directories):
  uninitialized variable 'sof_module'
This warning is for project developers.  Use -Wno-dev to suppress it.

gregshue commented 2 years ago

broader, for example the arch_ interface which is not application facing needs to be in scope as well, this is being used by OOT architectures for example. There are a few other interfaces beside that to consider.

Agreed. Downstream users may need to add their own module with architectures, SOC definitions, drivers, boards, subsystems, tests, samples, etc. It may even need to contain alternate implementations of existing subsystems. Perhaps the collective set of "public APIs" needs to include whatever could be seen or replaced across modules.

So, what does it mean for something to be "application facing", especially if the product-unique source is only an empty main(){}?

Working out of tree is convenient, but it comes with a maintenance cost.

Working out of tree is strategic and supported. It is also essential for some of us and for an extensible platform.

mbolivar-nordic commented 2 years ago

Process WG:

nashif commented 2 years ago

My feedback offline as I was not able to join the major part of the meeting

  • @gregshue what wasn't clear is why the name zephyr was chosen
  • @fabiobaltieri what would be the alternative to zephyr?
  • @gregshue mcuboot has a native zephyr port; you're expecting the headers it defines in the include/zephyr namespace?
  • @mbolivar-nordic no; mcuboot's library interface (bootutil) is not a zephyr library, and its zephyr port is an application, not a library
  • @mbolivar-nordic I think we're saying that #include <zephyr/foo...> is something we'd like to reserve for APIs defined by the zephyr project
  • @mbolivar-nordic are we talking about module/module collisions or zephyr/module collisions?
  • @gregshue both

@gregshue It is really frustrating to see this type of discussion always go into modules/mcuboot and things that you might be passionate about, derailing the discussion from the actual topic.

  • @MaureenHelm that's an odd indirection -- why not do that at the API level? We could do something at the doxygen tagging level. The declaration of whether an API is stable or not belongs with the API. We should be able to scrape the tree and get this table. The golden source of truth should be with the API itself.

+1
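The idea of keeping the stability information next to the API and scraping it from the tree could look roughly like the sketch below. It assumes, purely for illustration, that each API group's Doxygen comment carries a `@defgroup` name and a `@version` tag with a three-part version; the function name and the sample header are hypothetical and do not describe any existing Zephyr tooling.

```python
import re
from typing import Dict

# Match Doxygen comment blocks and the two tags this sketch assumes are present.
_BLOCK = re.compile(r"/\*\*(.*?)\*/", re.DOTALL)
_GROUP = re.compile(r"@defgroup\s+(\w+)")
_VERSION = re.compile(r"@version\s+(\d+\.\d+\.\d+)")


def scrape_api_versions(header_text: str) -> Dict[str, str]:
    """Return {group name: version} for every comment block that has both tags."""
    table: Dict[str, str] = {}
    for block in _BLOCK.findall(header_text):
        group = _GROUP.search(block)
        version = _VERSION.search(block)
        if group and version:
            table[group.group(1)] = version.group(1)
    return table


# Hypothetical header content, used only to exercise the function.
SAMPLE_HEADER = """
/**
 * @defgroup example_driver_interface Example Driver
 * @version 1.2.0
 */
int example_read(void);

/** @defgroup undocumented_interface No version tag here */
"""

if __name__ == "__main__":
    print(scrape_api_versions(SAMPLE_HEADER))
```

Running the same function over every header in the tree would give the table directly from the source, with no separately maintained list to drift out of date.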

  • @mbolivar-nordic what are we looking for out of semantic versions of APIs? Don't think we can sell semantic versioning of APIs project wide

I think this would be a very good replacement for how we manage API stability right now, which is based on when some API was introduced or modified. Having a versioning scheme in place will help with making changes to stable APIs and marking those changes as non-breaking using semantic versioning. By just looking at the version it will be possible to see if you are still compatible without having to look at git logs or implementation changes in drivers. Tests would also need to continue working. IMO it is worth looking into, bringing up as a proposal, and getting more feedback on.
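To make "by just looking at the version" concrete, here is a minimal sketch of a compatibility check under the usual semantic versioning convention: the same major version and an equal or newer minor/patch means code written against the older version should keep working. The type and function names are hypothetical; nothing here is an existing Zephyr interface.

```python
from typing import NamedTuple


class ApiVersion(NamedTuple):
    """Hypothetical per-API semantic version."""
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> ApiVersion:
    """Parse a 'MAJOR.MINOR.PATCH' string."""
    major, minor, patch = (int(part) for part in text.split("."))
    return ApiVersion(major, minor, patch)


def is_compatible(required: str, provided: str) -> bool:
    """True if code written against `required` should work with `provided`."""
    req, prov = parse_version(required), parse_version(provided)
    if req.major != prov.major:
        # A major bump signals a breaking change.
        return False
    # Within one major version, newer minor/patch releases only add or fix.
    return (prov.minor, prov.patch) >= (req.minor, req.patch)


if __name__ == "__main__":
    print(is_compatible("1.2.0", "1.4.1"))
    print(is_compatible("1.2.0", "2.0.0"))
```

An out of tree driver could record the API version it was written against and run such a check when moving to a newer release, before anyone has to read git logs.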

yes, maybe drop this in favor of some versioning scheme maintained within the API using doxygen like @MaureenHelm suggested.

  • @gregshue I like the question of 'scope'. I want to be able to write a module that works on multiple zephyr forks.

the only thing of significance here is Zephyr project and its code base, any forks of zephyr are completely irrelevant to this discussion.

  • @gregshue if Zephyr is going to say we're not going to support the following use case, we should be explicit: As a downstream module owner, I need to be able to have one version of my module that builds and integrates with zephyrproject-rtos/zephyr's version of the ecosystem, nrfconnect's, and possibly other versions.

This is implicit and obvious and does not require any statements. We as the zephyr project are not responsible for content maintained in forks of Zephyr.

  • @gregshue is the zephyr project going to put patterns in place to support this use case? I want to be able to identify what the fork is, what the version of the fork is, and get clues in each API about what the semantic version is.

again, I am not sure why we are talking about forks. This is a distraction from the actual topic. Zephyr has 1000s of forks, why do we want this?

gregshue commented 2 years ago

It is really frustrating to see this type of discussion always go into modules/mcuboot

@nashif It is also really frustrating to see the Zephyr Project not actually support the needs of users trying to comply with the development models it claims to support. module.yml provides an extension of the build system. As an integrator of multiple modules where I cannot control the consolidation or separation of the modules, I need everything done at a higher layer to be independent of which module the source exists in, unless it is scoped to apply to a specific directory subtree (e.g., .clang-format).

derailing the discussion from the actual topic.

As an "out-of-tree" user I am trying to rescope this discussion to meet my needs related to breakage. I'm sorry you think of it as derailing. Perhaps you need to clarify which users you are not trying to address.

Zephyr has 1000s of forks, why do we want this?

I'll assume most of the forks are tracking Zephyr main. I know some long term forks are not, and are introducing incompatibilities. Not having a common way to identify the latter is the problem faced by end users. This is the same type of problem that led protocol specs to identify a field (or value range) for vendor-extension commands. If the Zephyr Project defines a common mechanism for forks to be identified, then end users can avoid conflicting identification solutions invented by each of the fork maintainers.

marc-hb commented 2 years ago

I believe this effort is trying to help people identify API and other incompatible changes between say upstream Zephyr version 42 and upstream Zephyr 45. If you spot one specific place/tool or identification technique that does not help with zephyr forks version 43-gregshue and 44-marc-hb, then offer a more flexible alternative there when discussing implementation details. If you can find such an alternative, chances are it will be better for upstream Zephyr branches too (cause branching and forking are the same thing). If you cannot find such an alternative, then the problem couldn't be solved anyway and no one wasted any time in abstract discussions.

All this without mentioning forks once! Magic :-)

We reject: kings, presidents, and voting. We believe in: rough consensus and running code.

gregshue commented 2 years ago

All this without mentioning forks once! Magic :-)

Almost ... "that does not help with zephyr forks" ... "cause branching and forking"

This is a present user need, not an abstract discussion. Other users independent of me have already indicated on Discord they are building one set of source reusable on both zephyrproject-rtos and nrfconnect ecosystems.

I am not concerned with the forks tracking Zephyr main. I am concerned about identifying interface changes introduced in the nrfConnect fork of Zephyr ecosystem. It would be less of a concern if nrfConnect maintained backwards compatibility at the SHA level, but it didn't.

One solution is to mix an identifier into the API semantic numbering indicating the organization defining the interface.

We reject: kings, presidents, and voting. We believe in: rough consensus and running code.

I hate to tell ya, but that ain't gonna cut it for certifiable code. ;-)

marc-hb commented 2 years ago

This is a present user need, not an abstract discussion

I was referring to the (lack of) solutions, not to the problem.

I am concerned about identifying interface changes introduced in the nrfConnect fork of Zephyr ecosystem.

What makes you think some (good and useful) API change(s) in nrfConnect won't be found in some future upstream Zephyr version? Remember: forking and branching are the same thing.

I hate to tell ya, but that ain't gonna cut it for certifiable code. ;-)

Off-topic again?

gregshue commented 2 years ago

What makes you think some (good and useful) API change(s) in nrfConnect won't be found in some future upstream Zephyr version?

I never thought that. Rather, I thought that an API change upstreamed into Zephyr would now be managed by Zephyr rather than nrfConnect. I hope nRFConnect would then deprecate/remove their implementation and align with the upstream (just like I do with my local patches to Zephyr when I integrate with a fixed version of Zephyr).

I also know that nrfConnect has rewritten Git history, so this fork isn't really the same as a branch.

Off-topic again?

Not really, and definitely on the topic of not breaking "out of tree" users. API definitions are specifications that will need to be traced back to requirements for certifiable executables. They are not runnable code. (An inline implementation is distinct from the specification.)

marc-hb commented 2 years ago

I never thought that.

OK then why would API changes in nrfConnect not be manageable using the same processes and tools as API changes across upstream Zephyr branches? Considering these processes and tools don't exist yet, it sounds like you're complaining about a problem that does not exist yet.

I also know that nrfConnect has rewritten Git history, so this fork isn't really the same as a branch.

Pretty sure doxygen does not care about git history. If some other solution or tool ever relies on git history then it will be time to highlight this and discuss pros and cons.

The long story short is that Zephyr has a virtually infinite number of forks so a blanket and super vague request to "support forks" cannot possibly make sense. Only specific requests make sense; for instance: "Can this solution/tool be made compatible with rewritten git histories, pretty please?"

PS: making 1000s of random Zephyr forks "certifiable" sounds... fun! Whatever that means.

gregshue commented 2 years ago

OK then why would API changes in nrfConnect not be manageable using the same processes and tools as API changes across upstream Zephyr branches?

nrfConnect could use the same processes as API changes across upstream Zephyr branches - but it cannot assign different meanings to the same version identifiers unless some other mechanism exists to tell them apart. I look to the Zephyr Project to specify one mechanism for all forks to use.

In order for downstream developers to create a module that works with either upstream Zephyr or an incompatible nrfConnect interface, they must be able to know at build time which interface definition to call.
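One way to picture this request is an interface identifier that carries both the defining organization and a version, so a module can pick the matching code path. The sketch below is only an illustration of that idea: the vendor strings, the `InterfaceId` type and the adapter functions are all hypothetical, and no such identification mechanism is defined by the Zephyr Project in this thread.

```python
from typing import Callable, Dict, NamedTuple


class InterfaceId(NamedTuple):
    """Hypothetical identifier: who defines the interface, and its major version."""
    vendor: str
    major: int


def _use_upstream_v1() -> str:
    return "upstream v1 code path"


def _use_fork_v1() -> str:
    return "fork v1 code path"


# One adapter per (vendor, major) pair the module knows how to talk to.
ADAPTERS: Dict[InterfaceId, Callable[[], str]] = {
    InterfaceId("zephyr", 1): _use_upstream_v1,
    InterfaceId("examplefork", 1): _use_fork_v1,
}


def select_adapter(vendor: str, major: int) -> Callable[[], str]:
    """Return the adapter for the given interface, or fail loudly if unknown."""
    try:
        return ADAPTERS[InterfaceId(vendor, major)]
    except KeyError:
        raise LookupError(f"no adapter for interface {vendor} v{major}") from None


if __name__ == "__main__":
    print(select_adapter("zephyr", 1)())
```

The point of the vendor field is that the same version number can then mean different things in different forks without being ambiguous to the module.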

it sounds like you're complaining about a problem that does not exist yet.

It exists already. I've just had bigger issues to tackle.

aborisovich commented 2 years ago

Hi everyone, I'm also thinking about solutions to the problems you describe (but don't worry, I won't be interfering much in your exchange of opinions as I don't have much knowledge about processes in Zephyr). We should think about solutions for all the tools/interfaces Zephyr has, one by one. I'll start with Kconfig because it seems an easier problem than the other ones.

Kconfig maintenance proposition

The problem:

  1. An out of tree Zephyr application sets CONFIG_EXAMPLE_ZEPHYR_DRIVER=y, a symbol defined in Zephyr.
  2. The Zephyr project renames CONFIG_EXAMPLE_ZEPHYR_DRIVER to CONFIG_DAI_EXAMPLE_DRIVER.
  3. The out of tree Zephyr application can adjust itself when rebasing to the next Zephyr revision, but we also wish to somehow test compatibility from the Zephyr perspective and introduce changes to end users smoothly.

Solution: Using Kconfig aliases and obsolete warnings generation using https://www.kernel.org/doc/html/latest/kbuild/kconfig-macro-language.html#built-in-functions $(warning-if,condition,text) function. Example:

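# Assumption: macros are expanded while Kconfig files are parsed, so the condition
# must be a preprocessor or environment variable, not the final value of a symbol.
$(warning-if,$(EXAMPLE_ZEPHYR_DRIVER),Kconfig option EXAMPLE_ZEPHYR_DRIVER is obsolete, please use DAI_EXAMPLE_DRIVER)
config EXAMPLE_ZEPHYR_DRIVER
    bool "Example driver (obsolete alias)"
    default n
    select DAI_EXAMPLE_DRIVER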

Result: The out of tree application will receive a Kconfig warning that the config value it set is obsolete... The only painful part is tracking on the Zephyr side when to remove each of those obsolete symbols (here we need a robust process solution). The same goes for devicetree, which has an aliases feature (however, I do not see any option for printing obsolescence messages there)...
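The same kind of warning can also be produced outside Kconfig, for example by a small check run over an application's configuration fragment. The sketch below is a hedged illustration of that alternative, not the proposal above and not an existing Zephyr tool: the rename table reuses the two example symbol names from the problem statement, and `find_deprecated` is a hypothetical helper.

```python
import re
from typing import Dict, List

# Hypothetical rename table (old symbol -> new symbol), without the CONFIG_ prefix.
DEPRECATED_SYMBOLS: Dict[str, str] = {
    "EXAMPLE_ZEPHYR_DRIVER": "DAI_EXAMPLE_DRIVER",
}

# Matches assignments such as "CONFIG_FOO=y"; comment lines do not match.
_ASSIGNMENT = re.compile(r"^\s*CONFIG_([A-Za-z0-9_]+)\s*=")


def find_deprecated(conf_text: str) -> List[str]:
    """Return one warning per assignment that uses a deprecated symbol."""
    warnings: List[str] = []
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        match = _ASSIGNMENT.match(line)
        if match and match.group(1) in DEPRECATED_SYMBOLS:
            old = match.group(1)
            new = DEPRECATED_SYMBOLS[old]
            warnings.append(
                f"line {lineno}: CONFIG_{old} is deprecated, use CONFIG_{new}"
            )
    return warnings


if __name__ == "__main__":
    sample = "CONFIG_GPIO=y\nCONFIG_EXAMPLE_ZEPHYR_DRIVER=y\n"
    for message in find_deprecated(sample):
        print(message)
```

Keeping the renames in one table also gives the project a single place to review when deciding which obsolete symbols can finally be removed.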

marc-hb commented 2 years ago

but it cannot assign different meanings to the same version identifiers unless some other mechanism exists to tell them apart

Right, different forks and branches must use different identifiers to signal that they are indeed different versions. Not breaking new ground.

I look to the Zephyr Project to specify one mechanism for all forks to use.

An upstream project cannot anticipate all the potentially crazy ways it will be forked and create a versioning scheme that will be compatible with everything and anything. You can offer and recommend a "fork-friendly" versioning scheme; that seems reasonable. Can't wait to see your research and proposition.

nashif commented 2 years ago

@nashif It is also really frustrating to see the Zephyr Project not actually support the needs of users trying to comply with the development models it claims to support.

Funny, I wonder what this issue is about and why it was created in the first place. And what are those development models you are referring to exactly? Please be specific

module.yml provides an extension of the build system. As an integrator of multiple modules where I cannot control the consolidation or separation of the modules, I need everything done at a higher layer to be independent of which module the source exists in, unless it is scoped to apply to a specific directory subtree (e.g., .clang-format).

This is the most vague description of a problem I have seen in a while. I am not sure what are you asking for.

If your module has drivers, boards and anything that is supported out of tree in Zephyr and you are interested in keeping those working with Zephyr, then this issue is for you. If I understand the above correctly and you are asking us to make your module work with upstream Zephyr and other forks the same way, then you are in the wrong place; this is not something we have ever promised, not something that we are interested in, and to be honest a very strange request/expectation.

As an "out-of-tree" user I am trying to rescope this discussion to meet my needs related to breakage. I'm sorry you think of it as derailing. Perhaps you need to clarify which users you are not trying to address.

See above. If that is not clear, then I am not sure how else I would be able to clarify it.

That's it from me. I have spent way too much time on this already.

gregshue commented 2 years ago

And what are those development models you are referring to exactly? Please be specific

In the "T2: Star topology, application is the manifest repository" workspace, I am strategically reusing the repositories from zephyrproject-rtos, extending forks of other OSS projects to also be Zephyr modules, and putting licensed source into separate modules from my proprietary "applications"/boards/drivers/subsystems/tests/etc. All of my extended/proprietary repositories have the Zephyr glue in the module itself, kept in a Zephyr directory structure in the module-level zephyr/ subdirectory (next to module.yml). This relocation is necessary on some repositories due to name collisions with preexisting subdirectories. Other than the location of the Zephyr directory structure, this follows the pattern in the Zephyr Project's example-application module.

IIRC, the Zephyr documentation does not indicate any difference in support between the topologies or with developing as a Zephyr repository application, so I expect all the capabilities/tools/etc that work for a Zephyr repository application to also work with a Zephyr workspace application module. Maintaining this support requires all issues related to the Zephyr repository explicitly consider how the issue also may apply to any code in modules. Many (most?) issues will be independent of the module degree-of-freedom. Many issues will be impacted by the module degree-of-freedom.

It is really frustrating to see this type of discussion always go into modules/mcuboot

I am not sure what are you asking for.

Fundamentally, I am asking the TSC Chair in particular, and voting members/maintainers/collaborators in general, to internalize that:

  1. Modules are a supported mechanism for organizing Zephyr workspaces. Technical discussions MUST provide a solution that also applies to Zephyr-specific source in them.
  2. The Zephyr Project already has created and owns an application module downstream of the Zephyr repository that is frequently recommended on the Discord channels as a pattern for adding proprietary content into a workspace.
  3. Not explicitly considering how an issue or PR is impacted by the module degree-of-freedom is inconsistent with the support Zephyr Project documents.

I shouldn't have to repeatedly raise the unavoidable question about the impact of modules. But apparently I do because raising the question itself is causing frustrations rather than being accepted as necessary consideration.

marc-hb commented 2 years ago

Technical discussions MUST provide a solution that also applies to Zephyr-specific source in them

Then participate in these discussions and provide very specific, technical solutions that address your problems if/when possible. This is just getting started.

But apparently I do because raising the question itself is causing frustrations rather than being accepted as necessary consideration.

I suspect the frustration does not come from any particular topic but from the extent, vagueness and verbosity of the requests combined with the expectation level, e.g.: "MUST" support forks/modules without any technical detail. A fork and a module can be literally anything. Be specific: describe something that does not work and how it can be fixed. Except you can't yet because there's no solution yet.

gregshue commented 2 years ago

Be specific: describe something that does not work and how it can be fixed.

Maintaining support for the topologies cannot depend on any single person being a watchdog on all issues. This is not my job full time. I have been engaged as I have time. (See Global Namespace Management? and Replacing zephyr driver/subsys implementations and Support module.yml in zephyr repo.)

If I understand the above correctly and you are asking us to make your module work with upstream Zephyr and other forks the same way

I am not asking for you to make my module work with upstream Zephyr and other forks the same way.

gregshue commented 2 years ago

Solution: Using Kconfig aliases and obsolete warnings generation using https://www.kernel.org/doc/html/latest/kbuild/kconfig-macro-language.html#built-in-functions $(warning-if,condition,text) function.

This works reasonably well for flagging deprecated symbols when users go from one release to the next. We also need a solution that works for users that go from one LTS to the next.

As Google members identified, I think a bigger need is to tell out-of-tree developers how to transform their code (and verify the transformation) when a rebase is attempted (which may be from one LTS to the next).
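For the LTS-to-LTS case, one possible shape of such a transformation is a per-release rename table that can be composed across several releases and then applied mechanically to a configuration fragment. The sketch below assumes that shape; the tables are invented for illustration (the second rename in particular is made up), and neither function exists in Zephyr.

```python
from typing import Dict, List

# Hypothetical rename tables, one per release, oldest first.
RENAMES_BY_RELEASE: List[Dict[str, str]] = [
    {"CONFIG_EXAMPLE_ZEPHYR_DRIVER": "CONFIG_DAI_EXAMPLE_DRIVER"},
    {"CONFIG_DAI_EXAMPLE_DRIVER": "CONFIG_DAI_EXAMPLE"},  # invented second step
]


def compose_renames(tables: List[Dict[str, str]]) -> Dict[str, str]:
    """Collapse a chain of per-release renames into one old -> newest mapping."""
    combined: Dict[str, str] = {}
    for table in tables:
        # Follow earlier renames whose target was renamed again in this release.
        for old in list(combined):
            if combined[old] in table:
                combined[old] = table[combined[old]]
        for old, new in table.items():
            combined.setdefault(old, new)
    return combined


def migrate(conf_text: str, renames: Dict[str, str]) -> str:
    """Rewrite 'NAME=value' lines whose NAME appears in the rename mapping."""
    migrated: List[str] = []
    for line in conf_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() in renames:
            line = f"{renames[name.strip()]}={value}"
        migrated.append(line)
    return "\n".join(migrated)


if __name__ == "__main__":
    mapping = compose_renames(RENAMES_BY_RELEASE)
    print(migrate("CONFIG_EXAMPLE_ZEPHYR_DRIVER=y\nCONFIG_GPIO=y", mapping))
```

Verifying the transformation would still need a build and test run on the migrated tree; the script only removes the mechanical part of the rebase.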

marc-hb commented 2 years ago

Be specific: describe something that does not work and how it can be fixed.

Maintaining support for the topologies cannot depend on any single person being a watchdog on all issues. This is not my job full time.

This is "only" an open-source project: as long as you're the only one who cares about some feature or request getting done then it is your job full time. Explaining and convincing others may help (exceptional communication skills required).

mbolivar-nordic commented 1 year ago

Process WG: defer until next week since @nashif could not attend today

mbolivar-nordic commented 1 year ago

I shouldn't have to repeatedly raise the unavoidable question about the impact of modules. But apparently I do because raising the question itself is causing frustrations rather than being accepted as necessary consideration.

@gregshue my opinion is that you are missing the point here.

Basically all of us work on out of tree modules and we do care about them; please accept that.

I think what people are trying to tell you is that your attempts to rescope this issue are unwelcome distractions from what we are trying to do first. As has previously been stated we are trying to tackle one thing at a time, and that's not modules or mcuboot. I'm going to try to refocus this discussion in the next meeting in half an hour.

gregshue commented 1 year ago

Provide some guarantees, guidelines and a process keeping out of tree users operational

Perhaps we need to clearly describe the range of out of tree users we are trying to keep operational. I assumed it included end users doing freestanding and workspace application modules as well as downstream module developers. Are they not part of the scope trying to be addressed?

marc-hb commented 1 year ago

Are they not part of the scope trying to be addressed?

I think it depends what code they write. Yes for users who follow the guidelines TO BE DEFINED (that's the entire purpose of this issue, read its description again). Others, probably not.

Types of users will be defined by which rules they follow (the rules do not exist yet)

Perhaps we need to clearly describe the range of out of tree users we are trying to keep operational.

A formal definition of this range will be the conclusion of this work, not the starting point.

So no one can tell yet where your personal, current use case(s) will be. Terrifying? That's what this issue wants to address in the future.

mbolivar-nordic commented 1 year ago

Process WG:

mbolivar-nordic commented 1 year ago

Process WG will address this next week with a discussion from @galak on documenting changes to things that are not stable.

mbolivar-nordic commented 1 year ago

Process WG:


PerMac commented 1 year ago

As a tester and a person responsible for internal CIs I have another observation, which I believe was not raised in this topic.

Background: We (Nordic) have an SDK that extends Zephyr. To allow for proper integration, we have a fork of Zephyr, where some extra patches have to be added on top of upstream code. Several times per year we synchronize this Zephyr fork with the current upstream. It is a rather demanding process due to the amount of changes.

Issue: During almost every such synchronization we had to promptly fix issues related to twister, which were blocking the whole testing. Most times the command we use in CI was not working any more. Most of these issues were caused by rather minor changes (changes in names of twister arguments, some argument becoming default, etc.). Since we have multiple independent CI plans involved in such a process, many of them using twister, even such minor issues escalate quickly. IMO the reason is that twister development is very tightly connected with Zephyr. Of course, there are reasons behind it: twister is a tool created to support Zephyr's development. However, twister is a very useful tool not only in the scope of pure upstream Zephyr. Other projects based on Zephyr can benefit from using it as well (as we do).

Idea: Make twister development decoupled from zephyr, having its own versioning. E.g. by moving it to its own repo. IMO this could help projects like ours, where we could delay an update to a newer version of twister if there are issues there instead of promptly fixing/reverting commits/finding workarounds during the demanding process of the whole zephyr synchronization. It could also result in speeding the update, if some feature is needed, without waiting for the whole zephyr sync. I think it could also be beneficial for the upstream. E.g. we could think of some "staging" environment: twister updates could first be added to a "staging" branch. Some cross-checks with the "main" version could be running there. If everything is fine, "staging" will be merged to "main".
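Wherever the tool lives, breakage from renamed command-line arguments can be softened by keeping the old name as a deprecated alias for a while. The sketch below shows that pattern with Python's standard `argparse`; it is not twister's code, and the option names `--results-dir` and `--resultsdir` as well as the `runner` program name are hypothetical.

```python
import argparse
import warnings


class DeprecatedAlias(argparse.Action):
    """Accept an old option name, warn, and store the value under the new dest."""

    def __init__(self, option_strings, dest, new_name, **kwargs):
        self.new_name = new_name
        super().__init__(option_strings, dest, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        warnings.warn(
            f"{option_string} is deprecated, use {self.new_name}",
            DeprecationWarning,
            stacklevel=2,
        )
        setattr(namespace, self.dest, values)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="runner")
    # Current spelling of the option.
    parser.add_argument("--results-dir", dest="results_dir", default="out")
    # Old spelling, kept working but hidden from --help.
    parser.add_argument(
        "--resultsdir",
        dest="results_dir",
        action=DeprecatedAlias,
        new_name="--results-dir",
        default=argparse.SUPPRESS,
        help=argparse.SUPPRESS,
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args(["--resultsdir", "legacy"]).results_dir)
```

A downstream CI command written against the old name keeps working and gets a visible hint, so the rename can be handled on the downstream schedule instead of in the middle of a synchronization.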

keith-zephyr commented 1 year ago

Related issue: Provide consistent deprecated behavior and reduce downstream breakage #49708

keith-zephyr commented 1 year ago

@wbober - Summary of issues discussed at F2F in Prague

  • @nashif - Only APIs have a life cycle. Devicetree and Kconfig don't yet have a defined life cycle, but one is needed so deprecation policies can be consistent.
  • @nashif - Architecture features are defined by API (irq enable/disable, sys IO). Architecture APIs are internal, but need a life cycle/stability process. What's the policy for extending arch interfaces?
  • @nashif - Need to mark internal APIs clearly so users know not to use them.
  • @gregshue - Namespace management is an issue: include paths, names of boards, names of HW blocks used in samples. What about downstream users that have an out of tree SoC? Documentation tags are another area.
  • @wbober - First pass is to keep scope limited to the core issues only, and then do another iteration to broaden to more public interfaces.
  • @gregshue - Header file include path.
  • @gregshue - Need a policy on how to reference files in modules.
  • @fabiobaltieri - Not all Kconfig and devicetree bindings are public. Not practical to make all public.
  • @galak - Doesn't agree the step is "amend policy". Enforcement should be first. Need to figure out how it will be managed.
  • @nashif - Need to prioritize this list (APIs first, and namespaces).
  • @wbober - Nordic can allocate resources to help with tooling.
  • @wbober - As of today Kconfig, devicetree and CMake are considered unstable. APIs have a life cycle.
  • @wbober - There is a need to define a policy for the missing area. But we need enforcement to actually make progress.
  • @wbober - Need to fill in gaps. Define which Kconfig symbols are public, for example. Same for DT bindings.
  • @wbober - Once public APIs are defined, Nordic will start creating tooling.
  • @nashif - Premature to define tooling without knowing the scope of the Kconfig and DT public areas.
  • @wbober - Agrees: the policy needs to be defined along with the items to watch.
  • @nashif - First step is to define the guidelines and communicate this to maintainers. Make sure reviewers/maintainers raise issues on PRs that change a public interface.
  • @nashif - Treewide policy provides some guidance that can be leveraged.
  • @gregshue - A policy that isn't enforceable isn't really a policy, so need to consider "enforceability" of specific policies.
  • @wbober - Enough manual labor can make anything enforceable, but that is not desired. We can start with manual (code reviews) and then add tooling later.
  • @wbober - Proposed life cycle: Experimental, Unstable, Stable.
  • @gregshue - When integrating Zephyr with other repositories, has run into namespace conflicts in the Kconfig space. This namespace is flat. Zephyr's Kconfig namespace isn't well managed right now. Minimizing risk of Kconfig conflicts.
  • @nashif - Completely avoiding Kconfig conflicts is separate from the stable public interface issues.
  • @nashif - Internal/external for Kconfig: helper symbols by definition are internal (symbol can only be set by another symbol).
  • @gregshue - Has considered helper symbols for the internal/external split, but even internal symbols can generate conflicts with out of tree symbols.
  • @nashif, @keith-zephyr - Agree that namespace issues are important, but out of scope for defining public interface lifecycle.
  • @keith-zephyr - Helper symbols can in some cases be considered public, or at least part of the architecture API interface.
  • @wbober - Need to define rules to partition internal/external for header files, Kconfig, and DT bindings to start.
  • @nashif, @keith-zephyr - Agree with this prioritization.
  • @nashif - Namespace will need to be dealt with later. CI (twister tooling) also needs to be handled at later steps or as a separate issue.
  • @dleach02 - Suggests that @gregshue enumerate specific risks to his downstream project.
  • @gregshue - Downstream CI isn't as much of a problem. But the risk is integrating multiple projects that generate Kconfig conflicts. Suggests creating a prefix for Zephyr Kconfigs.
  • @nashif - That might be too broad a change.
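The rule of thumb from these notes, that a helper symbol with no prompt can only be set by another symbol and is therefore internal, can be approximated with a few lines of text processing. The sketch below is a rough illustration only: it uses simple regular expressions rather than a real Kconfig parser, handles just the common `config`/`menuconfig` layout, and the `classify_symbols` name and sample symbols are hypothetical. As noted above, some promptless symbols may still count as public, so the output is a starting point for review rather than a verdict.

```python
import re
from typing import Dict, Optional

_CONFIG = re.compile(r"^\s*(?:menu)?config\s+(\w+)\s*$")
# A prompt is either given inline with the type or via a separate "prompt" line.
_TYPE_WITH_PROMPT = re.compile(r'^\s*(?:bool|tristate|int|hex|string)\s+"')
_PROMPT = re.compile(r'^\s*prompt\s+"')


def classify_symbols(kconfig_text: str) -> Dict[str, str]:
    """Map each symbol to 'public' (has a prompt) or 'internal' (no prompt)."""
    visibility: Dict[str, str] = {}
    current: Optional[str] = None
    for line in kconfig_text.splitlines():
        match = _CONFIG.match(line)
        if match:
            current = match.group(1)
            visibility.setdefault(current, "internal")
        elif current and (_TYPE_WITH_PROMPT.match(line) or _PROMPT.match(line)):
            visibility[current] = "public"
    return visibility


# Hypothetical Kconfig content, used only to exercise the function.
SAMPLE_KCONFIG = """
config EXAMPLE_FEATURE
    bool "Enable the example feature"
    select EXAMPLE_FEATURE_HELPER

config EXAMPLE_FEATURE_HELPER
    bool
"""

if __name__ == "__main__":
    print(classify_symbols(SAMPLE_KCONFIG))
```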

gregshue commented 1 year ago

During today's PWG meeting @nashif asked which modules (in zephyrproject-rtos) were defining Kconfigs. A quick find of at SHA 74c4d1c52 (June 5, 2023) shows at least the following:

nashif commented 1 year ago

As a tester and a person responsible for internal CIs I have another observation, which I believe was not raised in this topic.

Because this is not an out-of-tree user case. CI and tooling used in CI is a different category and many of the policies and discussions here do not apply, i.e. currently there is no intention to maintain APIs backward compatible or define some deprecation for features etc in twister and other tooling. What you are talking about and correct me if I am wrong, are mostly bugs, not because someone intentionally changed some interface or API. Most major changes are usually discussed and reviewed when it comes to tooling, so unless I am missing something that is more serious than implementation bugs, please list those in a new issue ("how not to break downstream CI" maybe)

Where twister is maintained, in-tree or a separate tool will not solve the problem. Interfaces and documented features can be maintained anywhere. When a change goes into twister, usually it deals with some issue, such issue fixes will need to be integrated with the main tree sooner than later, doing this from out of tree twister will just make things more complicated and given this is CI, every will require pulling external twister. There is also the need to run twister with old code, which is currently possible only because of twister being in the tree.

nashif commented 1 year ago

During today's PWG meeting @nashif asked which modules (in zephyrproject-rtos) were defining Kconfigs. A quick find of at SHA 74c4d1c52 (June 5, 2023) shows at least the following:

  • modules/hal/silabs/zephyr/Kconfig (implicit reference)
  • modules/hal/espressif/zephyr/Kconfig (referenced for espressif/zephyr/module.yml)
  • modules/lib/picolibc/zephyr/Kconfig (referenced from picolibc/zephyr/module.yml)
  • modules/lib/zscilib/Kconfig.zscilib (referenced from zscilib/zephyr/module.yml)
  • modules/lib/chre/platform/zephyr/Kconfig (referenced from chre/zephyr/module.yml)
  • modules/lib/gui/lvgl/zephyr/Kconfig (implicit reference)
  • modules/audio/sof/zephyr/Kconfig (implicit reference)
  • bootloader/mcuboot/boot/zephyr/Kconfig (referenced from mcuboot/zephyr/module.yml)

nice list, but most of those are actually Zephyr Kconfigs, i.e. they are driven by Zephyr and are not part of how configuration of the standalone code of the module works. The only Kconfig user that might conflict is SOF AFAIK, but this is already constrained and we can keep the namespace sane given that most SOF developers work on Zephyr already.

gregshue commented 1 year ago

AFAICT, we have been using "out-of-tree" to mean content defined/controlled outside the zephyr repository. This term doesn't seem to be in the Glossary of Terms. Is there an actual definition somewhere?

are not part of how configuration of the standalone code of the module works

Look again at zscilib, chre, lvgl, and mcuboot. Each has module build files or standalone code that is controlled by their locally defined Kconfigs.

Note that we cannot control the symbols encountered by an out-of-tree user. The best we can do is recommend a pattern that scales well and live within it ourselves.
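
To make the flat-namespace concern concrete, here is a minimal sketch that scans Kconfig text from several sources and reports symbols defined by more than one of them. The regex-based scan, the source names `rtos` and `module`, and the sample symbols are all assumptions for illustration; this is not how Zephyr's own tooling processes Kconfig.

```python
import re
from collections import defaultdict

# Matches "config FOO" / "menuconfig FOO" definitions, one per line.
SYMBOL_DEF = re.compile(r"^[ \t]*(?:menu)?config[ \t]+(\w+)[ \t]*$", re.MULTILINE)


def find_conflicts(kconfig_sources):
    """Map each symbol defined by more than one source to the sources defining it.

    kconfig_sources: dict of {source name: Kconfig text}.
    """
    owners = defaultdict(set)
    for name, text in kconfig_sources.items():
        for symbol in SYMBOL_DEF.findall(text):
            owners[symbol].add(name)
    return {sym: sorted(srcs) for sym, srcs in owners.items() if len(srcs) > 1}


# Hypothetical inputs: two code bases that both define LOG.
sources = {
    "rtos": """
config LOG
    bool "Logging"

config RTOS_SHELL
    bool "Shell"
""",
    "module": """
config LOG
    bool "Module logging"

config MODULE_TRACE
    bool "Trace"
""",
}
conflicts = find_conflicts(sources)
print(conflicts)
```

Prefixed symbols such as `RTOS_SHELL` and `MODULE_TRACE` never collide, while the unprefixed `LOG` does. Kconfig itself allows a symbol to be defined in more than one place, so a report like this flags candidates for review rather than hard errors.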

nashif commented 1 year ago

ok, zscilib, mcuboot are both tightly coupled with zephyr, in the case of zscilib, the kconfigs are already namespaced and the kconfig in there is primarily part of the integration with zephyr. Same thing for chre which has all Kconfig usage in platform/zephyr, i.e. it was added there as part of the porting to zephyr, it is also namespaced. If the ask here is about having modules use some prefix and namespace for the integration with zephyr, that is nice and we see that most already do that.

SOF for example is another class of user: it did use kconfig before it started using zephyr, and there we had some issues with configs, given that it implemented the same things we had in zephyr. This is all going away as more zephyr integration happens. Better namespacing would have made certain things easier, but this is an exception; we do not have this type of usage very often, where two similar systems integrate with each other (where CONFIG_LOG can become a contested string).

Having said all of that, I do not see how namespacing is going to solve the problem we are dealing with here. It is a different issue, an important one, but whether I call an API zephyr_blah() or just blah() does not really matter if I change the signature or change the behaviour and do not provide backward compatibility.

gregshue commented 1 year ago

If the ask here is about having modules use some prefix and namespace for the integration with zephyr, that is nice and we see that most already do that.

That should be the recommendation from the Zephyr Project to all repositories being integrated with Zephyr. The Zephyr Project has some control over every repository within zephyrproject-rtos, so this should be required of all repositories under zephyrproject-rtos.

I do not see how namespacing is going to solve the problem we are dealing with here.

Breaking changes will inevitably happen. The aggregate architecture must be able to evolve (e.g., pinctrl). We cannot eliminate it, so we must reduce:

nashif commented 1 year ago

Thoughts about Kconfig:

PerMac commented 1 year ago

Because this is not an out-of-tree user case.

We have hundreds of tests and samples in our repo, which is not part of zephyr tree. We are using tooling from the zephyr tree to execute them. Very often synchronization with zephyr is blocked due to changes in the tooling. Why doesn't this count as an out-of-tree user case?

currently there is no intention to maintain APIs backward compatible or define some deprecation for features etc in twister and other tooling.

Why? Is it set in stone and out of a discussion?

What you are talking about and correct me if I am wrong, are mostly bugs, not because someone intentionally changed some interface or API.

Not really. Most of those are intentional changes, e.g. --testcase-root -> --testsuite-root. Or the recent one, when --board-root is loaded by default for out-of-tree modules. Some issues were introduced by myself as well, where I tried to unify how test IDs are handled for in-tree and out-of-tree tests, ending in changes needed in downstream CIs twice: when it was added and then when it was reverted after a while.

Most major changes are usually discussed and reviewed when it comes to tooling, so unless I am missing something that is more serious than implementation bugs, please list those in a new issue ("how not to break downstream CI" maybe)

Indeed. But as you pointed out, there is no intention for backward compatibility. And I think it is generally not evaluated how the changes can affect out-of-tree usage when reviewing new features/fixes. Definitely, we will think about what can be done on "how not to break downstream CI". I wanted to share my POV here, since I see this as falling within a topic as broad as "How not to break "out of tree" users".

Where twister is maintained, in-tree or a separate tool will not solve the problem. Interfaces and documented features can be maintained anywhere. When a change goes into twister, usually it deals with some issue, such issue fixes will need to be integrated with the main tree sooner than later, doing this from out of tree twister will just make things more complicated and given this is CI, every will require pulling external twister.

As already mentioned, I believe having a separate place for development than zephyr's main tree can benefit the tooling. I know that we have internal teams that need a custom (patched) version of twister, e.g. to support testing of features which are not public yet. Since twister comes as a part of a big package such as zephyr, it requires more effort to work on the tool itself. Using twister in an out-of-tree project requires synchronization with the whole zephyr tree, or rather cherry-picking certain commits. Sure, updating/fixing twister will require an extra step, e.g. changing the version in the manifest. However, not every change to twister requires an immediate update in zephyr.

There is also the need to run twister with old code, which is currently possible only because of twister being in the tree.

I don't follow this. I am not proposing developing twister as an independent python package installed e.g. with pip, as west is (although personally I think it can be beneficial). If twister version is controlled through west manifest then running twister with old code is as easy as checking out old zephyr and doing west update, to get twister which was used back then. What's more, if twister repo is independent from zephyr's main tree, one can test old zephyr with new twister and vice versa, which is now not that easy. E.g. right now I cannot check if proposed changes to twister won't break our internal usage, since we are not using main zephyr. With twister as a separate repo it will be as easy as referencing a twister PR in the project's manifest. I am aware that some changes in twister are coupled with stuff happening in zephyr and obviously not every version of twister will work with every version of zephyr. But I think the amount of such couplings is limited and during most of development 1:1 coupling is not a must.

I know that my issue is not in line with the ongoing discussion, however, I don't find it off-topic. I agree we can move the discussion to a separate issue, as you proposed. Nevertheless, I wanted to share it with a broader audience, since this issue is becoming more and more present in our development as more and more of our teams are starting to use twister in their verification plans and updating zephyr literally breaks their work.

nashif commented 1 year ago

We have hundreds of tests and samples in our repo, which is not part of zephyr tree. We are using tooling from the zephyr tree to execute them. Very often synchronization with zephyr is blocked due to changes in the tooling. Why doesn't this count as an out-of-tree user case?

This issue is about end users and API compatibility and IMO we should keep it this way. This also includes tests and samples. This is why we deprecated ztest for example and will only remove old ztest once the deprecation period has passed.

I agree there needs to be some level of control and some assurance and attention paid to how our tooling moves forward to avoid breaking downstream CI, but this will need to be discussed in a completely different context. CI environments, test environments and the approach to testing and CI in general vary from one organisation to the next, none of that should impact the upstream CI activities.

currently there is no intention to maintain APIs backward compatible or define some deprecation for features etc in twister and other tooling.

Why? Is it set in stone and out of a discussion?

No, it is not set in stone. But from your initial comment it is not clear exactly what the problem is and how severe the problem is. We have been trying as much as we can to keep old options working and backward compatible, but things get missed.

Indeed. But as you pointed out, there is no intention for backward compatibility.

I am talking about backward compatibility on a different layer. I think the external interface (command line options) should be backward compatible and we should not drop options randomly. However, we can't keep for example the history of how we generate results or reports and how we deal with tests in general backward compatible. If we decide that some tests should be marked differently at some point, that does not mean we will have to maintain the old behavior while we implement the new one.
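
As an illustration of keeping a renamed command-line option working during a deprecation period, here is a minimal `argparse` sketch. It only borrows the option names `--testcase-root` and `--testsuite-root` from the discussion above; the `DeprecatedAlias` class and the `runner` program name are hypothetical, and this is not twister's actual implementation.

```python
import argparse
import warnings


class DeprecatedAlias(argparse.Action):
    """Append the value to the new option's list and warn about the old spelling."""

    def __init__(self, option_strings, dest, replacement, **kwargs):
        self.replacement = replacement
        super().__init__(option_strings, dest, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        warnings.warn(
            f"{option_string} is deprecated, use {self.replacement}",
            DeprecationWarning,
            stacklevel=2,
        )
        items = list(getattr(namespace, self.dest) or [])
        items.append(values)
        setattr(namespace, self.dest, items)


def build_parser():
    parser = argparse.ArgumentParser(prog="runner")  # hypothetical tool name
    parser.add_argument("--testsuite-root", action="append", default=[],
                        dest="testsuite_root")
    # Old spelling: hidden from --help, stored under the same destination.
    parser.add_argument("--testcase-root", action=DeprecatedAlias,
                        dest="testsuite_root", replacement="--testsuite-root",
                        help=argparse.SUPPRESS)
    return parser


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    args = build_parser().parse_args(
        ["--testcase-root", "tests/old", "--testsuite-root", "tests/new"]
    )
print(args.testsuite_root)
```

Both spellings feed the same list, so downstream scripts keep working while the warning gives them time to migrate. The alias can then be removed once an agreed deprecation period has passed.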

and I think it is generally not evaluated how the changes can affect out-of-tree usage when reviewing new features/fixes.

That is the thing, CI environments are different and there is no way for us to track the various ways of running CI environments and testing. The only defense you will have is, upstream your code, participate in review and try to be close to upstream as much as possible.

I don't follow this. I am not proposing developing twister as an independent python package installed e.g. with pip, as west is (although personally I think it can be beneficial). If twister version is controlled through west manifest then running twister with old code is as easy as checking out old zephyr and doing west update, to get twister which was used back then. What's more, if twister repo is independent from zephyr's main tree, one can test old zephyr with new twister and vice versa, which is now not that easy. E.g. right now I cannot check if proposed changes to twister won't break our internal usage, since we are not using main zephyr. With twister as a separate repo it will be as easy as referencing a twister PR in the project's manifest. I am aware that some changes in twister are coupled with stuff happening in zephyr and obviously not every version of twister will work with every version of zephyr. But I think the amount of such couplings is limited and during most of development 1:1 coupling is not a must.

I think all of this can also be resolved with in-tree twister. Before going any other way, I would like to see what are the "interfaces" and assets we want to protect to avoid breakage, address those with additional testing and documentation and some guidelines etc. Moving twister to a separate repo will not provide any immediate results if we do not define the interface and get more input etc.

nashif commented 1 year ago

I think all of this can also be resolved with in-tree twister. Before going any other way, I would like to see what are the "interfaces" and assets we want to protect to avoid breakage, address those with additional testing and documentation and some guidelines etc. Moving twister to a separate repo will not provide any immediate results if we do not define the interface and get more input etc.

in other words, let's not jump into solution space before we have evaluated the problem first.

gregshue commented 1 year ago

This issue is about end users and API compatibility

Here are a few points to consider:

  • Manufacturers of ETSI 303645 compliant devices are recommended to act on disclosed vulnerabilities in a timely manner (Provision 5.2-2). Conventionally the process is completed within 90 days for a software solution. In order for a manufacturer to roll that solution out it has to propagate through and be verified in intermediate forks/projects in a much shorter timeframe. Problems for Nordic become a problem for some of their customers (e.g., me).

nashif commented 1 year ago

¯\_(ツ)_/¯

No idea where all of this is going, so I am just going to take a break from this issue.

fabiobaltieri commented 1 year ago

Hey, few thoughts on my side now that I had a bit of time thinking about it:

nashif commented 1 year ago

Here we have a good example of possible issues: https://github.com/zephyrproject-rtos/zephyr/issues/61413