zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

[RFC] CI scopes #78403

Open PerMac opened 5 days ago

PerMac commented 5 days ago

Introduction

This RFC is meant to serve as a place to discuss how our twister-based CI should work: what the scope (which platforms) should be for testing PRs, on-push builds, and scheduled (weekly) builds. It also proposes a solution to optimize the platform scope based on integration_platforms and platform_key.

Problem description

The CIs are running too long. This reaches an extreme with the weekly build, which tries to build everything on every platform and lasts ~20h https://github.com/zephyrproject-rtos/zephyr/actions/workflows/twister.yaml?query=event%3Aschedule. Much of the time seems to be spent on redundant work, in particular building the same tests on multiple (even hundreds of) platforms, where the majority of builds don’t cover any new area. Moreover, if a scope is broad (the extreme being the nightly with all platforms), a lot of time is spent on cmake-based filtration (the “filter” entry in the yaml). Twister has to run a partial cmake (with the package helpers) for each platform in scope (several seconds per check) in order to figure out whether the board should be filtered out or not.

Proposed change

weekly run:

on PR run:

on push:

Detailed RFC

Using the “filter” keyword is taxing for twister performance, but it also allows for a quite robust filtering mechanism based on the actual devicetree/Kconfig state, in contrast to e.g. “depends_on”, which is based on ad hoc strings placed by developers in board descriptions (or not placed at all, if forgotten or if the developer is not aware that such a string has any relation to anything else in the tree https://github.com/zephyrproject-rtos/zephyr/pull/77823#issuecomment-2343822428).
That’s why I believe we shouldn’t be getting rid of “filter”, but should instead focus on how to run twister in a more targeted way.
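To make the contrast concrete, here is a hypothetical testcase.yaml fragment (the compatible string and tag are made up for illustration) showing both mechanisms side by side:

```yaml
tests:
  drivers.foo.api.filtered:
    # "filter" is evaluated against the resolved devicetree/Kconfig of every
    # candidate platform: robust, but costs a partial cmake run per platform.
    filter: dt_compat_enabled("vnd,foo") and CONFIG_SPI
  drivers.foo.api.tagged:
    # "depends_on" only matches tags that board maintainers remembered to
    # list under "supported:" in the board yaml: cheap, but ad hoc.
    depends_on: spi
```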

Testing with twister has leaned more toward a “fishing with a net” workflow: we drop a net and afterwards check which platforms were caught. Being more targeted can be achieved with “integration_platforms” and the integration mode, where the test itself tells twister where it should be verified. What’s more, skips are turned into errors on integration platforms, making sure such tests won’t slip through the cracks (e.g. due to typos causing skips). Another great, indirect feature of integration_platforms is that it shifts the responsibility of guessing the proper testing scope away from twister (and hence CI/twister maintainers), giving agency to the maintainers of a given area.
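A minimal sketch of what this looks like in a testcase.yaml (test name and platform choices are illustrative):

```yaml
# The test owner declares the platforms on which the test must be verified;
# in --integration mode twister builds only these, and a skip on any of them
# is reported as an error rather than silently ignored.
tests:
  kernel.foo.api:
    integration_platforms:
      - native_sim
      - qemu_cortex_m3
```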

As for the weekly build, IMO the main question is “what is its purpose?”. The current approach was to try “all on all” (all tests and samples in the tree on all platforms) to verify as much as possible. As we see, this is no longer feasible. Anas made a valid observation that, from the zephyr project perspective, we should focus on “verifying that zephyr is working”, and not on “whether all platforms in the tree are working with all zephyr tests/samples”. With this in mind, maybe we shouldn’t run “all on all” but only a minimal subset of platforms that provides the best coverage? Do we know how to achieve this?

My proposal is to find some combination of platform properties (categories) that nicely fills the space of possible variability that could affect the result of a test, and then add a global-level “platform_key” functionality to twister (arguments passed in the twister call, affecting all tests in a session, instead of individual keys defined in individual yamls for individual tests).

With the current yamls, nothing more than “architecture” can be used as a key, which is rather coarse-grained. The goal would be to get better granularity, e.g. to build on a single platform per SoC used. SoCs should be easy to get from board.yaml, and it is already recognized that we want to use those (or some mutual variant of the yamls used by twister and the build system) with twister https://github.com/zephyrproject-rtos/zephyr/issues/74142. To limit blind spots we could make the selection of platforms random (with a known seed).
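For reference, the existing per-test mechanism that this would generalize looks roughly like this (test name is illustrative); the global, command-line variant proposed above does not exist yet:

```yaml
# Existing per-test usage of platform_key: out of the otherwise eligible
# platforms, twister keeps only one per unique combination of the listed
# keys (here: one platform per arch/simulation pair).
tests:
  drivers.bar.api:
    platform_key:
      - arch
      - simulation
```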

On PRs, the scope should be more targeted. This is already achieved by the test_plan.py script, which dynamically selects which tests to run based on the changes in the PR, combined with the use of integration mode. I believe there is still a lot to gain if integration_platforms are directly added to every single test and sample.
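In practice, the targeted part of such a run boils down to an invocation along these lines (the test path is only an example; the exact CI wiring lives in the twister workflow and test_plan.py):

```sh
# Minimal sketch: build/run only one test directory, restricted to the
# integration_platforms declared in its testcase.yaml.
./scripts/twister --integration -T tests/kernel/events -v
```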

Dependencies

Add a section to the docs explaining the CI workflows. Also add a section on how contributors should use twister to get meaningful results. Codify the rules for integration_platforms and add validation checks, e.g. no more than 2 integration platforms per test, use a simulator/emulator where possible, etc.

Concerns and Unresolved Questions

Alternatives

Use platform_allow to limit the scope, as in https://github.com/zephyrproject-rtos/zephyr/pull/78024/files#diff-0b73ca0ba59ca0865fef3b2346e9911f6bf2610237eb9ec6619e50271cd18862R77-R78 and in #77839.

I think using platform_allow to limit sample execution is a problematic way to go. It would move us back to where we were 4 years ago: https://github.com/zephyrproject-rtos/zephyr/issues/25104 - people will not be able to run samples on their boards with twister unless those boards are in platform_allow (without using --force-platforms, which would introduce further issues). integration_platforms was introduced to resolve that issue https://github.com/zephyrproject-rtos/zephyr/pull/26382

Another option is to analyze coverage reports and select the combination of tests/samples/boards that provides the greatest total coverage. This solution would require more effort to collect and analyze the coverage data.

nashif commented 5 days ago

The focus of this RFC is CI, which needs to be addressed, but any solution needs to deal with the fact that developers should start using twister before pushing to CI, and should be able to run twister without having to wait for hours and hog the system. So, this needs to go beyond CI and address the developer experience of running twister on a local machine, and encourage people to use twister locally more often, thus reducing the load on CI....

Putting CI aside and thinking about coverage levels that can be invoked on the command line with increasing coverage is a better way to approach this; CI will just benefit from that. Also, it is very important to highlight the goal of testing and running twister for the zephyr project and as part of the zephyr project. The goal is feature testing and testing zephyr itself; it should never be seen as platform validation or quality control. That part needs to be supported, but it should mostly be about feature testing. If I can verify a feature or test a change on 1 target, there is no need for me to run the same test on the remaining targets. I still need to be able to run the same tests on other simulation and hw targets if applicable, but the benefit of "building" hello world on 1000 targets without actually running anything is close to zero, especially if we validate it on 1 or a few in CI through emulators. If a board is broken or does not build, this can be validated with a test (we have build_on_all to catch exactly this type of problem).

> - Adding integration_platforms to every existing test.

We do not need this, and it does not really fix the problem as described. We already have most tests performing as expected using the default platform coverage, which can be improved further using platform_key, for example. If you use integration_platforms to solve this issue, then integration_platforms loses the value it has right now, and many tests will have to fill it with most of the simulators we have; how is that going to help?

What will a test like this look like? https://github.com/zephyrproject-rtos/zephyr/blob/fd4f3ce246c9ce1ddbd04b20054ebca3b9bbdc6f/tests/kernel/events/event_api/testcase.yaml#L1

if all tests/samples are going to end up like this:

https://github.com/zephyrproject-rtos/zephyr/blob/86af9c4b0766842aa97b5b9bead12c71ca5d69ed/samples/sensor/accel_polling/sample.yaml#L13

we did not solve any problem.

The main problem driving any change in this area, as we can see from the linked PRs above, is that everyone will want their architecture or platform to be listed, and you end up with a huge and unmanageable list of platforms. A kernel test like the above is interesting because of how minimal it is...

We see many problems in CI (on push) when minimal testing is done on core features; unfortunately there are variations in architectures and levels of support, and something that passes CI in a PR might fail when you expand coverage on push. So, switching those (the majority of tests) to use integration platforms might be a bad idea if not done on a test-by-test basis and with a lot of consideration.

> Using the “filter” keyword is taxing for twister performance, but it also allows for a quite robust filtering mechanism based on the actual devicetree/Kconfig state, in contrast to e.g. “depends_on”, which is based on ad hoc strings placed by developers in board descriptions (or not placed at all, if forgotten or if the developer is not aware that such a string has any relation to anything else in the tree #77823 (comment)).

There is no intention to get rid of filter; however, filter does not make sense in some cases, especially if the filter ends up delivering the same platform over and over again (or, in some cases, the filter is on something enabled in the test/sample itself, which is funny). See an example:

https://github.com/zephyrproject-rtos/zephyr/blob/86af9c4b0766842aa97b5b9bead12c71ca5d69ed/tests/drivers/fuel_gauge/bq27z746/testcase.yaml#L5

Here we have a filter and platform_allow; the filter matches native_sim through the overlay defined in the same test. Now, this is not that bad because the platform_allow already limits the coverage, but this is a pattern I have seen in many other places without the platform_allow, sometimes resulting in the filter yielding no results.
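The pattern, sketched with made-up identifiers rather than the exact contents of the linked file, looks something like this:

```yaml
tests:
  drivers.fuel_gauge.foo:
    # The filter keys off a devicetree node that the test's own overlay adds,
    # so it can only ever match the platform the overlay was written for.
    filter: dt_compat_enabled("vnd,foo-fuel-gauge")
    platform_allow:
      - native_sim
```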

Filters are not perfect and will not guarantee that a platform matching a filter will actually build/run a sample or pass the test. We need some other mechanism, and while you do not like it, supported and depends_on make sense in the context of testing and targeted coverage. If someone is not keeping their platform yaml file up to date, that is their problem. Everyone would have to incur the penalty if we depended on filters everywhere. So, from a zephyr perspective, we have tests and samples, and we should be able to be selective about which platforms we want to use to validate the feature and the codebase; if a board does support feature X but does not directly advertise it, it will not be used as a target.
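For context, this is roughly where those tags live; a board-side twister metadata fragment with an illustrative board name (not a real board in the tree):

```yaml
# depends_on in a testcase only selects boards that advertise the matching
# tag under "supported:" in their board yaml.
identifier: vnd_board
name: Vendor Board
type: mcu
arch: arm
toolchain:
  - zephyr
supported:
  - spi
  - i2c
```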

The purpose of the linked PRs was to limit coverage to only those platforms a sample was written for and on which it is documented. I know platform_allow is not great for that, and we will want to allow any board to run a sample/test using twister (whether that makes sense when the sample does not have a runnable test is another discussion), hence we need a solution that covers all scenarios:

Two options:

1. new key: introduce something new, let's call it `foo_platforms` (don't want to deal with naming here), which is a combo of `platform_allow` and `integration_platforms`, meaning that if a platform is not listed there, it will still run if filters apply (see the sketch after this list).
2. Use integration_platforms but change the behaviour: where it makes sense, add integration_platforms, enable it by default, and basically fulfil what is described in (1), with the main difference that with --integration we limit the number of platforms we pick from the list; this can be configurable, the default can be 1, but you can do all if you have the time and resources.
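A sketch of option (1), using the placeholder name from above; the key is hypothetical and not implemented in twister:

```yaml
tests:
  samples.foo.basic:
    # Hypothetical key: lists the preferred verification platforms, but unlike
    # platform_allow it does not exclude other platforms that pass the filters.
    foo_platforms:
      - native_sim
      - nrf52840dk/nrf52840
```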

integration_platforms makes sense for a category of tests, and probably for some individual tests as well, especially for additional scenarios within a testsuite that exercise variations, but I think what we already have, using the default platforms as the primary targets with additional narrowing of the selection using platform_key, does not need to change.

> My proposal is to find some combination of platform properties (categories) that nicely fill the space of possible variability that could affect the result of a test.

yeah, that is exactly what supported and depends_on already do. Also see

https://github.com/zephyrproject-rtos/zephyr/blob/86af9c4b0766842aa97b5b9bead12c71ca5d69ed/doc/develop/test/twister.rst?plain=1#L507

Using samples as tests is the other issue we have. We now have guidelines that try to limit the abuse and also address some of the issues above; this needs to be taken into consideration.