Proprietary toolchain support sustainability

fabiobaltieri commented 1 year ago

Introduction

Problem description

Hi, I recently bumped into few compiler quirks:

the first priority check implementation broke ARCMWDT due to differences in section naming
the second implementation broke ARMClang due to missing symbols
the troubleshooting of the second implementation was slowed down because ARMClang support was already broken
a recent build time optimization added a warning in clang builds for pretty much anything but native_posix that turned out to be an llvm bug
we are now blocking a Kconfig option removal for unclear compiler limitations (may not be the case, but it adds to the point)

None of the above were detectable in CI. On paper the project supports a handful of different compilers, but in practice the only ones tested in CI are

Zephyr SDK at the current stable version
Clang/LLVM at the version in the CI image, but really only builds for native_posix

Any breakage that does not hit any of these is going to go undetected and has to get caught manually by a reviewer or post-submit, potentially post-release, so it's regression and potential backport.

More importantly, many of those toolchains are proprietary, most developer have no access to them, so one has to rely on someone else to test their code, but at the same time there's not toolchain maintainers or a known list of point of contacts for validating changes against a specific toolchain.

Proposed change

Supported proprietary toolchains should have an area in the MAINTAINER file listing users who can test potentially breaking changes. Ideally there should be a way for Zephyr contributors to access proprietary toolchains for troubleshooting, or the support team should provide help in running the changes on the toolchain and providing build artifacts, or contributing fixes and workarounds directly.
Supported open source toolchains should have some level of testing in CI (Clang/LLVM beyond native_posix is what I'm thinking)

fabiobaltieri commented 1 year ago

cc @cfriedt @evgeniy-paltsev @galak @gmarull @karben @keith-zephyr @kokas-a @nashif @tejlmand

cfriedt commented 1 year ago

At one point, we spoke about reporting test results from trusted test runners outside of Zephyr CI such as SoC vendors like ST or NXP.

The toolchain spec should be an integral part of that. Possibly an enumeration. Custom LLVM and custom GCC should also be a thing. We have for example CROSS_COMPILE, but it would be great to capture more metadata than that for reporting purposes.

Like external modules, external toolchains would need to run a subset of tests (often proprietary toolchains only support a limited number of platforms).

Maybe we could reuse some of that machinery

fabiobaltieri commented 1 year ago

At one point, we spoke about reporting test results from trusted test runners outside of Zephyr CI such as SoC vendors like ST or NXP.

Would it make sense to investigate the possibility of having some custom runners with those toolchains available running a limited set of core tests? Something on the order of magnitude of the current Clang/LLVM one.

tejlmand commented 1 year ago

Would it make sense to investigate the possibility of having some custom runners with those toolchains available running a limited set of core tests? Something on the order of magnitude of the current Clang/LLVM one.

it makes a lot of sense, and has been discussed in Toolchain WG, but guess it mostly has to do with priorities. So if anyone have the skills and bandwidth to carry out such work, then it will be highly appreciated.

tejlmand commented 1 year ago

please find the following proposal as a starting point: https://github.com/zephyrproject-rtos/zephyr/pull/63759

Hopefully this will provide more visible information on whom to contact in case of issues related to those toolchains. Not the final solution, but a step in a better direction.

abrodkin commented 1 year ago

I do support that discussion here as we (Synopsys ARC maintainers) and our customers are suffering from exactly that problem: even though we do quite extensive in-house testing with our proprietary tools, simulators and boards, we only do it post-merge upstream. In that sense we'd prefer to enable 3rd-party developers to see problems which are introduced by their changes before merging these changes in the upstream main tree.

But then we start seeing logistical problems:

From one point of view it's hard to make proprietary tools and even more so HW boards freely available to the wide audience. There're usually legal and availability/scalability problems.
From another point of view all the above difficulties could be solved in one or another way, but that requires some work to be done. For example, come-up with extensible Zephyr CI pipeline architecture so that external (vendor managed) runners could be involved. And on top of the initial implementation we'll get problems with stability (i.e. we'll need troubleshoot things in those external runners) and, again, scalability (even if we may allocate 10 boards, it will be nowhere close to required bandwidth).

That said I fully agree with @tejlmand here: if there's somebody who's willing to bite that bullet and work on that, we'll provide all the support with reviewing, testing etc.

fabiobaltieri commented 1 year ago

@tejlmand @abrodkin do you know if the licensing model of those toolchains would even allow usage in CI for a public project? I imagine we could not just make a public docker image with those, but if someone were to put the work to set it up, would it be possible licensing wise to have custom runners with proprietary compilers installed?

I'm not thinking board testing or anything particularly extensive, just something to build few core samples to just test the core stuff, at least as a start, but if the licensing does not allow that then it's not even worth discussing it. :-)

tejlmand commented 1 year ago

@tejlmand @abrodkin do you know if the licensing model of those toolchains would even allow usage in CI for a public project?

Past talks with ARM regarding Arm compiler 6 licenses has been positive, so should be possible, but we haven't gotten to the point where we have a concrete way of doing the implementation with CI. I would say we're lacking that step before having a final 👍 / 👎 to the solution.

cfriedt commented 1 year ago

@tejlmand GitHub also fully supports Remote Test Runners as shown in a recent Tech Talk by @kartben featuring @szczys from Golioth. https://www.youtube.com/live/940O1CUgh4Q?si=zJlRCSqFqO73tHla

https://blog.golioth.io/golioth-hil-testing-part1/

Technically speaking, Synopsys (or another partner) could easily instantiate the existing public Zephyr container image (on-premesis for hardware-in-the-loop testing or in a private cloud for emulation), add a volume on top of that for the proprietary toolchains, add a secret for a toolchain license key, and lastly add another secret for AWS storage (or elsewhere) to provide test reports.

Very doable. My only concerns are the volume of traffic and of course security (it would be best to host the runner in an isolated environment).

fabiobaltieri commented 1 year ago

Very doable. My only concerns are the volume of traffic and of course security (it would be best to host the runner in an isolated environment).

Yeah that would be the main concern, there's some documentation here about that: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners#self-hosted-runner-security, it's not trivial to set up properly but it can certainly be done.

abrodkin commented 1 year ago

From discussion with @stephanosio a couple of months ago I understood that idea of using a private/internal CI image in the main Zephyr CI was rejected by the TSC in the past. That's because it limits user's ability to easily reproduce the issues seen in the CI locally (i.e. users won't be able to simply download the CI docker image and run the tests locally).

And indeed, that's a good question: how to react on failures reproduced by vendor runners?

I mean, what if your PR triggers some issue in a vendor CI. At best you see a build issue and from compiler error message may get enough information for the fix. But what if a test fails during execution: how are you going to debug it? I see only 2 options:

Get creative and re-spin your PR until the problem is gone.
Get hold off the vendor's maintainers and "co-debug" with the person.

If we're OK with such limitations, let's try to prototype some solution for "external" vendor Zephyr CI runners.

Also, speaking of Synopsys runner I'd say we don't need to have on-premises runners which will build tests and run on SDK's QEMU or on our proprietary nSIM simulator. Instead, the same cloud infrastructure which is used by the main Zephyr CI will do for us. I mean the same cloud provider like AWS or whatever is used now.

fabiobaltieri commented 1 year ago

Yeah ideally the toolchain would be generally available, but having it in CI only seems like a better compromise than not testing at all.

Instead, the same cloud infrastructure which is used by the main Zephyr CI will do for us. I mean the same cloud provider like AWS or whatever is used now.

Sounds ideal, private runners seems like a bit of a logistic nightmare. I've no idea how the docker setup works though, @cfriedt could you expand a bit on that?

I'm also wondering if there's some other open source project on GitHub running such a setup that we could look into.

stephanosio commented 1 year ago

As @abrodkin mentioned, the problem has more to do with the project policy (i.e. to ensure that any problems in the Zephyr main CI are locally reproducible) than technical hurdles.

In fact, I have already tested the "private" CI Docker image idea for supporting ARM FVP in our CI in https://github.com/zephyrproject-rtos/zephyr/pull/45099; but, it was rejected for the aforementioned reason.

Yeah ideally the toolchain would be generally available, but having it in CI only seems like a better compromise than not testing at all.

@fabiobaltieri The compromise suggested by the TSC at the time was to create standalone workflows (separate from the Zephyr main CI twister workflow) that run with proprietary tools; but, this comes with some logistical issues (see https://github.com/zephyrproject-rtos/zephyr/pull/45099#issuecomment-1347777632).

By the way, in terms of the CI infrastructure, we already have a very flexible Kubernetes-based self-hosted runner deployment where we can use private Docker images for a specific set of runners or even mount an NFS volume containing proprietary tools as needed. With the current infrastructure, we do not need externally hosted runners, which would truly be a logistical nightmare.

marc-hb commented 9 months ago

I mean, what if your PR triggers some issue in a vendor CI. At best you see a build issue and from compiler error message may get enough information for the fix. But what if a test fails during execution: how are you going to debug it?

Good point. Also, trying to somewhat "centralize" the CI of multiple, proprietary toolchains may not be very "scalable"[1]

FWIW SOF CI (tries to) test the Zephyr main branch daily. The feedback loop is obviously slower than testing PRs before merge but it's not too bad and has exposed many toolchain issues fairly quickly. Post-merge but very soon after merge.

All in all this has so far proved to be a decent trade-off and this obviously scales to any number of proprietary tools because there is zero centralized involvement. The lack of permissions management and other administrative tasks like licenses is especially nice (I saw scary keywords like "security" and "logistical" lurking above already...)

Testing daily means the workload is constant, which in turn means more tests can be run for longer. Testing daily and pre-merge are of course not mutually exclusive; but trying to test every PR with some proprietary toolchain would be putting the cart before the horses if that toolchain does not already have a good track record with some independent automation able to test the Zephyr main branch regularly and engineers actually looking at failures and trying to fix them. In my experience, the second most common CI sin is trying to automate something that isn't maintained much in the first place [2]

My 2 SOF cents.

[1] An approach where some sort of "publisher" workflow notifies a list of distributed and independent CIs would help. [2] The most common CI sin is regularly testing configurations that developers don't use.

fabiobaltieri commented 9 months ago

Good point. Also, trying to somewhat "centralize" the CI of multiple, proprietary toolchains may not be very "scalable"[1]

I would more interested in a plan to make such toolchain available to any Zephyr developer as needed somehow.

marc-hb commented 9 months ago

I would more interested in a plan to make such toolchain available to any Zephyr developer as needed somehow.

Anyone can change sof/west.yml today, point it at their (unmerged) Zephyr commit and submit it to SOF CI.

I agree a more direct method would be extremely useful. It's also more complicated to set up.

stephanosio commented 9 months ago

I would more interested in a plan to make such toolchain available to any Zephyr developer as needed somehow.

The problem is more on the legal (licensing) side than the technical side -- obviously, you cannot just let anyone use a proprietary toolchain for which Zephyr Project has been licensed, without getting a permission from the company that licensed it.

On the technical side, it is actually quite simple -- just provide a one-time SSH access to an ephemeral runner (mxschmitt/action-tmate) with the proprietary toolchain.

marc-hb commented 9 months ago

You mean it's technically easy to provide access to a proprietary toolchain if it is... not proprietary? :-)

cfriedt commented 9 months ago

Sounds ideal, private runners seems like a bit of a logistic nightmare. I've no idea how the docker setup works though, @cfriedt could you expand a bit on that?

I remember using private runners via gitlab a long time ago. It was surprisingly easy. Not entirely sure how it's done with GitHub, but I did linked to a video about it up above.

TL;DR - private runners get authenticated with a key and communicate via a server API as if they were regular runners.

marc-hb commented 3 months ago

XTensa toolchains raise this to a new level because everyone has a different one. Right now we can't even get UINT32_C() to work across the field:

https://github.com/thesofproject/sof/pull/9413

zephyrproject-rtos / zephyr