Closed rettichschnidi closed 6 months ago
EDIT: please keep this bug open until the questions below have been answered, at the very least with a justified "WONTFIX"
Post-mortem: some facts observed while trying to understand what was going on (in https://github.com/zephyrproject-rtos/zephyr/actions/runs/8268746414/job/22622576598?pr=70000 https://github.com/zephyrproject-rtos/zephyr/pull/70000/checks?check_run_id=22625908529 and elsewhere) and some suggestions to make things more robust and less time-consuming.
In typical and very unfortunate fashion, CMake does not "fail fast" but ignores file(DOWNLOAD ...
failures by default and keeps trying to build like nothing happens which results in super cryptic errors down the road. This is crazy, why does CMake keep doing stuff like this? There is a STATUS
output variable but you have to manually inspect STATUS for every download. Maybe someone should implement some SAFE_DOWNLOAD, "fail fast" CMake wrapper and use that instead in https://github.com/zephyrproject-rtos/trusted-firmware-m/blob/main/platform/ext/target/nxp/lpcxpresso55s69/pull_drivers.cmake
I understand the convenience of having a single build command but if updating these files is not very frequent then maybe it could/should require a different script / command / target and NOT be part of the default build target. Downloading files at build time is just really bad practice, silently and slowly downloading files at build time is much worse.
When downloads hit some rate limits and randomly failed, cryptic C compilation errors were observed. EXC_RETURN_S undefined
and more cryptic ones. #include "something.h" not found
would have been so much clearer. It looks like the build was falling back on the wrong .h files: could this be fixed? Using different .h
filenames maybe?
ninja -j1
is one of the most important build debug steps to avoid being drowned in irrelevant logs. But this does not propagate to other, ninja "children". So on a fast system with many cores, the TFM module build command (3a3e3c041b97f) still runs with dozens of threads and logs are unreadable again. One suggestion in https://discourse.cmake.org/t/external-project-does-not-use-parallel-jobs/3359/4 is to have some sort of "debug flag" that would set PARALLEL_JOBS to 1. @tejlmand any idea about this?
This was really a "perfect storm" and cascade of errors. ANY of the suggestions above would have saved a ton of time.
cc: @SebastianBoe , @henrikbrixandersen , @dleach02 , @d3zd3z , @tejlmand , @stephanosio, @dbaluta
FYI - https://github.com/zephyrproject-rtos/zephyr/pull/69881/checks?check_run_id=22634560072 may be related to this issue
I checked some of the log files and it does look like a very good example, thanks for sharing.
One point to clarify:
HALs must not modify their code during build
The NXP HAL is NOT modifying its code. It is not downloading files. This is the trusted-firmware-m module that is modifying its code and downloading files which proper Zephyr build system normally prevents it from doing. It is a bug with the HWMv2 transition that exposes it.
See:
Some history (as I understand it). Many years ago when Zephyr Project pulled in TFM, the folks responsible (Kumar and @microbuilder and maybe others) discovered that the way the LPC55S69 is integrated in the upstream TFM repo is via this connection to the NXP GitHub. Because of the rules of Zephyr, this was not going to work for them so they modified the fork to include the relevant files directly in the location they would have existed if pulling. With each update of TFM, the maintainers would update the LPC55S69 files if needed to keep things working.
I've had conversations with the Trusted Firmware organization about their lack of guidance on how to integrate support into TFM. From the TFM space, there is nothing wrong with doing what the LPC55S69 did. No guidance so nothing wrong with any approach as long as it works. But we know that Zephyr wants everything under one roof.
The TFM maintainers made the changes to the fork and to Zephyr's integration of TFM to do the following:
The bug in the HWMV2 transition was the patch that fixed it (#70218) where we missed setting the proper board target in the modules/trusted-firmware-m/CMakeLists.txt. Since the actual build was going to try to build TFM for LPC, by missing this fixup to the TFM cmake it triggered the existing upstream flow of grabbing files from the NXP GitHub.
I'm un-assigning me from this ticket because the root problem has been addressed and fixed with #70218. If the TFM maintainers wish to pick this up and do something else that is their prerogative.
I think some of the items @marc-hb brings up might be pulled into a standalone issue/enhancement request ticket instead of being tucked into this ticket where the root problem was just a HWMv2 transition bug.
One observation for the TFM maintainers/collaborators for the next time they need to update TFM and then patch the LPC55S69 files (assuming a better solution is not put into place).
ping @d3zd3z, @SebastianBoe, @butok
Fixed with #70218
@dleach02,
Just to clarify: I still see the issue on the latest main branch, when running:
twister -p lpcxpresso55s69/lpc55s69/cpu0/ns -s samples/subsys/logging/logger/sample.logger.basic
Here's the twister build output seen build.log
I'm not sure if this is the Issue to keep track of it or if I should open another one.
EDIT: As stated below, I hadn't properly tested the fix. Issuing west update
after checking out to the latest/greatest fixes the issue for me. Thanks.
Thanks @dleach02
I think some of the items @marc-hb brings up might be pulled into a standalone issue/enhancement request ticket instead of being tucked into this ticket where the root problem was just a HWMv2 transition bug.
The HWMv2 transition bug was fixed before the discussion in this issue even started and does not need to be discussed any more so please let me call "dibs" and use this issue for the other, secondary build issues it exposed beyond the HWMv2 transition bug.
Mistakes have happened and they will happen again. The purpose of the build system is not just to build stuff when all planets have aligned. It's also supposed to "fail fast", not bury problems under many screens of confusing and irrelevant error messages and give decent clues when something goes wrong. This issue was named "Build errors when building for lpcxpresso55s69/lpc55s69/cpu0/ns" and I think this title is still appropriate.
To be honest I don't have very high expectations and I expect most my suggestions to be closed as "WONTFIX" for one reason or another (and you already started some of these answers). But I'd like to give these suggestions a little bit of a chance and to see these answers in the record. Any suggestion would save a lot of time.
Just to clarify: I still see the issue on the latest main branch, when running:
twister -p lpcxpresso55s69/lpc55s69/cpu0/ns -s samples/subsys/logging/logger/sample.logger.basic
That's not my experience. The fix 2d9a2cec8d14 works 100% for me.
After adding the line 0.0.0.0 raw.githubusercontent.com
in [c:/Windows/system32/drivers]/etc/hosts
, the twister command above passes 100% of the time. When I revert 2d9a2cec8d14, it fails 100% of the time.
Tests performed on recent commit 67d7c13da9d0
The HWMv2 transition bug was fixed before the discussion in this issue even started and does not need to be discussed any more
Just trying to prove me wrong? :-)
@marc-hb Ok. I'm testing on a more recent commit: ddb147d0a45, and I'm still experiencing the build failure.
I'll create a separate issue to follow-up on my case, then.
EDIT: I apologize - I hadn't issued the west update
command after checking out to the newer commit. This issue is fixed for me as well. Thanks!
After adding the line 0.0.0.0 raw.githubusercontent.com in [c:/Windows/system32/drivers]/etc/hosts,
That's not enough actually. Even with this line and before the fix/workaround 2d9a2cec8d14ecf, the evil trusted-firmware-m
build manages to modify files in platform/ext/target/nxp/
and pollute git diff
anyway.
I hadn't issued the west update command after checking out to the newer commit
You definitely want to do that but that's not enough. Before every time you test you also want to run west status
and git -C ../modules/tee/tf-m/trusted-firmware-m/ checkout HEAD .
if needed.
Whether a download is involved or not, silently modifying files checked in git when invoking the default build target is pure evil.
But we know that Zephyr wants everything under one roof.
It's not just Zephyr. "Software Bills of Materials" are becoming legal requirements for obvious security reasons
Super ironic to see this in the build of "trusted" firmware.
After adding the line 0.0.0.0 raw.githubusercontent.com in [c:/Windows/system32/drivers]/etc/hosts,
That's not enough actually. Even with this line and before the fix/workaround 2d9a2ce, the evil
trusted-firmware-m
build manages to modify files inplatform/ext/target/nxp/
and pollutegit diff
anyway.I hadn't issued the west update command after checking out to the newer commit
You definitely want to do that but that's not enough. Before every time you test you also want to run
west status
andgit -C ../modules/tee/tf-m/trusted-firmware-m/ checkout HEAD .
if needed.Whether a download is involved or not, silently modifying files checked in git when invoking the default build target is pure evil.
If the TFM_PLATFORM_NXP_HAL_FILE_PATH is set to a path, no download is done. This is the common Upstream TFM approach to download external libraries or use them without downloading if the path is set (for Mbedcrypto, MCUBoot, t_cose, qcbor, ethos_u_core_core_drivers etc.) BUT: TFM also does auto-generation of many files (linker files, and even sources and header files). What exactly modified files do you see?
@butok, it was initially the HWMV2 didn't set a flag right so TFM was pulling the mcu-sdk files from the NXP github. Passed our internal testing because functionally things worked and built and the build system for Zephyr didn't complain about this.
Fixing the flag should stop the tfm build from doing this but I can't say anything else about other aspects of TFM build. This is where the maintainers need to chime in if there is an issue.
@d3zd3z can you please take a look?
Fixing the flag should stop the tfm build from doing this but I can't say anything else about other aspects of TFM build. This is where the maintainers need to chime in if there is an issue.
Just checked again. I cannot see any modified repository files after building TFM examples. We can close this issue.
Just checked again. I cannot see any modified repository files after building TFM examples.
Yes, but only thanks to a super fragile, one-line if(CONFIG_...)
line. This is not good enough. If you think it is, just search the internet for "JiaT75".
Also, I raised a number of other, different but related "build robustness issues" in my comment https://github.com/zephyrproject-rtos/zephyr/issues/70215#issuecomment-1998121524
Here's a very quick reproduction of the issue; much simpler than what was shared previously:
cd zephyr; git checkout d9fd292f4b9a # today's main branch
west update
west diff
--- zephyr/modules/trusted-firmware-m/CMakeLists.txt
+++ zephyr/modules/trusted-firmware-m/CMakeLists.txt
@@ -226,7 +226,7 @@ if (CONFIG_BUILD_WITH_TFM)
string(REPLACE "toolchain" "toolchain_ns" TFM_TOOLCHAIN_NS_FILE ${TFM_TOOLCHAIN_FILE})
- if(CONFIG_BOARD_LPCXPRESSO55S69_LPC55S69_CPU0_NS)
+ if(xxxCONFIG_BOARD_LPCXPRESSO55S69_LPC55S69_CPU0_NS)
# Supply path to NXP HAL sources used for TF-M build
set(TFM_PLATFORM_NXP_HAL_FILE_PATH ${ZEPHYR_TRUSTED_FIRMWARE_M_MODULE_DIR}/platform/ext/target/nxp/)
list(APPEND TFM_CMAKE_ARGS -DTFM_PLATFORM_NXP_HAL_FILE_PATH=${TFM_PLATFORM_NXP_HAL_FILE_PATH})
=== diff for trusted-firmware-m (modules/tee/tf-m/trusted-firmware-m):
diff --git modules/tee/tf-m/trusted-firmware-m/platform/ext/target/nxp/lpcxpresso55s69/pull_drivers.cmake modules/tee/tf-m/trusted-firmware-m/platform/ext/target/nxp/lpcxpresso55s69/pull_drivers.cmake
index fbb84a699e7d..6160f5df982b 100644
--- modules/tee/tf-m/trusted-firmware-m/platform/ext/target/nxp/lpcxpresso55s69/pull_drivers.cmake
+++ modules/tee/tf-m/trusted-firmware-m/platform/ext/target/nxp/lpcxpresso55s69/pull_drivers.cmake
@@ -6,6 +6,10 @@
#
#-------------------------------------------------------------------------------
+# "Trusted" firmware? Except for supply chain attacks.
+# Even "better": CMake silently ignores failed DOWNLOAD
+message(FATAL_ERROR "Very sneaky and unreliable, build-time overwrite of sources should
+ never happen for the default build target")
+
#========================= Pull MCUxpresso NXP SDK drivers from https://github.com/NXPmicro/mcux-sdk =========================#
file(DOWNLOAD https://raw.githubusercontent.com/NXPmicro/mcux-sdk/${NXP_SDK_GIT_TAG}/drivers/casper/fsl_casper.c ${NXP_HAL_FILE_PATH}/common/Native_Driver/drivers/fsl_casper.c)
file(DOWNLOAD https://raw.githubusercontent.com/NXPmicro/mcux-sdk/${NXP_SDK_GIT_TAG}/drivers/casper/fsl_casper.h ${NXP_HAL_FILE_PATH}/common/Native_Driver/drivers/fsl_casper.h)
west build --board lpcxpresso55s69/lpc55s69/cpu0/ns samples/hello_world/
CMake Warning at cmake/version.cmake:22 (message):
Actual TF-M version is not available from Git repository. Settled to
v2.0.0
Call Stack (most recent call first):
CMakeLists.txt:22 (include)
-- [Crypto service] Using P256M software driver in PSA Crypto backend
-- Pulling MCUxpresso NXP SDK drivers from https://github.com/NXPmicro/mcux-sdk
CMake Error at platform/ext/target/nxp/lpcxpresso55s69/pull_drivers.cmake:11 (message):
Very sneaky and unreliable, build-time overwrite of sources should never
happen for the default build target
Call Stack (most recent call first):
platform/ext/target/nxp/lpcxpresso55s69/CMakeLists.txt:18 (include)
Hi @marc-hb This is the TFM Project approach to pull external code. TFM is able to fetch all external libraries based on "super fragile" parameters. They can be found in \trusted-firmware-m\lib\ext directory. Are you able to convince TFM project to change it first?
There has been no updates for the last 2 weeks on this issue. @marc-hb is this something that you are or want to work on, and do you think it is a high prio bug or should this be closed or downgraded?
@aescolar I think the reason this was re-opened was based on a misunderstanding from my end:
The issue (TF-M module pulls in external code for LPC55S69) has been fixed in #70218, not because the TFM project has changed the way to tie their dependencies, but because the typo that caused the Build system to trigger that sync-process has been addressed.
IMO - This was a high priority bug because CI Twister failures were holding back multiple PRs. That is no longer the case.
As mentioned on the Bug Triage & Release Meeting, I'll close this issue. Feel free to re-open it again if you disagree but, in any case, I don't think is a High-Priority bug for Zephyr anymore.
Broken again?
Problem
Running Twister on tests in tests/kernel/common, targeting platform lpcxpresso55s69/lpc55s69/cpu0/ns, results in build errors.
To Reproduce
Steps to reproduce the behavior:
cd zephyr
twister -p lpcxpresso55s69/lpc55s69/cpu0/ns --inline-logs -s tests/kernel/common/kernel.common -s tests/kernel/common/kernel.common.misra
(rarely works)This however works always:
twister -p lpcxpresso55s69/lpc55s69/cpu0/ns --inline-logs -s tests/kernel/common/kernel.common
--no-clean
.Expected behavior
HALs must not modify their code during build.I got this wrong, it is the trusted-firmware-mImpact
It prevents my PRs from passing CI.
Logs and console output
Failing build
Environment (please complete the following information):
Hints
modules/tee/tf-m/trusted-firmware-m/platform/ext/target/nxp/common/Native_Driver/drivers/
get modified after each run.tfm
subfolder within the build directory has the same effect - it touches the files in the HAL:-- [Crypto service] Using P256M software driver in PSA Crypto backend -- Pulling MCUxpresso NXP SDK drivers from https://github.com/NXPmicro/mcux-sdk -- Configuring done -- Generating done -- Build files have been written to: /home/reto/code/3rd-party/zephyrproject/zephyr/twister-out/lpcxpresso55s69_lpc55s69_cpu0_ns/tests/kernel/common/kernel.common/tfm