zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.9k stars 6.64k forks source link

TFM build modifies platform files in TFM module #70215

Closed rettichschnidi closed 6 months ago

rettichschnidi commented 8 months ago

Problem

Running Twister on tests in tests/kernel/common, targeting platform lpcxpresso55s69/lpc55s69/cpu0/ns, results in build errors.

To Reproduce

Steps to reproduce the behavior:

  1. cd zephyr
  2. twister -p lpcxpresso55s69/lpc55s69/cpu0/ns --inline-logs -s tests/kernel/common/kernel.common -s tests/kernel/common/kernel.common.misra (rarely works)

This however works always:

Expected behavior

  1. Build works reliable, independent of other targets getting built at the same time.
  2. HALs must not modify their code during build. I got this wrong, it is the trusted-firmware-m
  3. Targets which violate 2), must not be part of CI

Impact

It prevents my PRs from passing CI.

Logs and console output

Failing build

Environment (please complete the following information):

Hints

-- [Crypto service] Using P256M software driver in PSA Crypto backend -- Pulling MCUxpresso NXP SDK drivers from https://github.com/NXPmicro/mcux-sdk -- Configuring done -- Generating done -- Build files have been written to: /home/reto/code/3rd-party/zephyrproject/zephyr/twister-out/lpcxpresso55s69_lpc55s69_cpu0_ns/tests/kernel/common/kernel.common/tfm

marc-hb commented 8 months ago

EDIT: please keep this bug open until the questions below have been answered, at the very least with a justified "WONTFIX"

Post-mortem: some facts observed while trying to understand what was going on (in https://github.com/zephyrproject-rtos/zephyr/actions/runs/8268746414/job/22622576598?pr=70000 https://github.com/zephyrproject-rtos/zephyr/pull/70000/checks?check_run_id=22625908529 and elsewhere) and some suggestions to make things more robust and less time-consuming.

This was really a "perfect storm" and cascade of errors. ANY of the suggestions above would have saved a ton of time.

cc: @SebastianBoe , @henrikbrixandersen , @dleach02 , @d3zd3z , @tejlmand , @stephanosio, @dbaluta

ubieda commented 8 months ago

FYI - https://github.com/zephyrproject-rtos/zephyr/pull/69881/checks?check_run_id=22634560072 may be related to this issue

marc-hb commented 8 months ago

I checked some of the log files and it does look like a very good example, thanks for sharing.

dleach02 commented 8 months ago

One point to clarify:

HALs must not modify their code during build

The NXP HAL is NOT modifying its code. It is not downloading files. This is the trusted-firmware-m module that is modifying its code and downloading files which proper Zephyr build system normally prevents it from doing. It is a bug with the HWMv2 transition that exposes it.

See:

https://github.com/zephyrproject-rtos/trusted-firmware-m/blob/main/platform/ext/target/nxp/lpcxpresso55s69/pull_drivers.cmake

Some history (as I understand it). Many years ago when Zephyr Project pulled in TFM, the folks responsible (Kumar and @microbuilder and maybe others) discovered that the way the LPC55S69 is integrated in the upstream TFM repo is via this connection to the NXP GitHub. Because of the rules of Zephyr, this was not going to work for them so they modified the fork to include the relevant files directly in the location they would have existed if pulling. With each update of TFM, the maintainers would update the LPC55S69 files if needed to keep things working.

I've had conversations with the Trusted Firmware organization about their lack of guidance on how to integrate support into TFM. From the TFM space, there is nothing wrong with doing what the LPC55S69 did. No guidance so nothing wrong with any approach as long as it works. But we know that Zephyr wants everything under one roof.

The TFM maintainers made the changes to the fork and to Zephyr's integration of TFM to do the following:

The bug in the HWMV2 transition was the patch that fixed it (#70218) where we missed setting the proper board target in the modules/trusted-firmware-m/CMakeLists.txt. Since the actual build was going to try to build TFM for LPC, by missing this fixup to the TFM cmake it triggered the existing upstream flow of grabbing files from the NXP GitHub.

I'm un-assigning me from this ticket because the root problem has been addressed and fixed with #70218. If the TFM maintainers wish to pick this up and do something else that is their prerogative.

I think some of the items @marc-hb brings up might be pulled into a standalone issue/enhancement request ticket instead of being tucked into this ticket where the root problem was just a HWMv2 transition bug.

dleach02 commented 8 months ago

One observation for the TFM maintainers/collaborators for the next time they need to update TFM and then patch the LPC55S69 files (assuming a better solution is not put into place).

ping @d3zd3z, @SebastianBoe, @butok

dleach02 commented 8 months ago

Fixed with #70218

ubieda commented 8 months ago

@dleach02,

Just to clarify: I still see the issue on the latest main branch, when running:

twister -p lpcxpresso55s69/lpc55s69/cpu0/ns -s samples/subsys/logging/logger/sample.logger.basic

Here's the twister build output seen build.log

I'm not sure if this is the Issue to keep track of it or if I should open another one.

EDIT: As stated below, I hadn't properly tested the fix. Issuing west update after checking out to the latest/greatest fixes the issue for me. Thanks.

marc-hb commented 8 months ago

Thanks @dleach02

I think some of the items @marc-hb brings up might be pulled into a standalone issue/enhancement request ticket instead of being tucked into this ticket where the root problem was just a HWMv2 transition bug.

The HWMv2 transition bug was fixed before the discussion in this issue even started and does not need to be discussed any more so please let me call "dibs" and use this issue for the other, secondary build issues it exposed beyond the HWMv2 transition bug.

Mistakes have happened and they will happen again. The purpose of the build system is not just to build stuff when all planets have aligned. It's also supposed to "fail fast", not bury problems under many screens of confusing and irrelevant error messages and give decent clues when something goes wrong. This issue was named "Build errors when building for lpcxpresso55s69/lpc55s69/cpu0/ns" and I think this title is still appropriate.

To be honest I don't have very high expectations and I expect most my suggestions to be closed as "WONTFIX" for one reason or another (and you already started some of these answers). But I'd like to give these suggestions a little bit of a chance and to see these answers in the record. Any suggestion would save a lot of time.

marc-hb commented 8 months ago

Just to clarify: I still see the issue on the latest main branch, when running:

twister -p lpcxpresso55s69/lpc55s69/cpu0/ns -s samples/subsys/logging/logger/sample.logger.basic

That's not my experience. The fix 2d9a2cec8d14 works 100% for me.

After adding the line 0.0.0.0 raw.githubusercontent.com in [c:/Windows/system32/drivers]/etc/hosts, the twister command above passes 100% of the time. When I revert 2d9a2cec8d14, it fails 100% of the time.

Tests performed on recent commit 67d7c13da9d0

The HWMv2 transition bug was fixed before the discussion in this issue even started and does not need to be discussed any more

Just trying to prove me wrong? :-)

ubieda commented 8 months ago

@marc-hb Ok. I'm testing on a more recent commit: ddb147d0a45, and I'm still experiencing the build failure.

I'll create a separate issue to follow-up on my case, then.

EDIT: I apologize - I hadn't issued the west update command after checking out to the newer commit. This issue is fixed for me as well. Thanks!

marc-hb commented 8 months ago

After adding the line 0.0.0.0 raw.githubusercontent.com in [c:/Windows/system32/drivers]/etc/hosts,

That's not enough actually. Even with this line and before the fix/workaround 2d9a2cec8d14ecf, the evil trusted-firmware-m build manages to modify files in platform/ext/target/nxp/ and pollute git diff anyway.

I hadn't issued the west update command after checking out to the newer commit

You definitely want to do that but that's not enough. Before every time you test you also want to run west status and git -C ../modules/tee/tf-m/trusted-firmware-m/ checkout HEAD . if needed.

Whether a download is involved or not, silently modifying files checked in git when invoking the default build target is pure evil.

marc-hb commented 8 months ago

But we know that Zephyr wants everything under one roof.

It's not just Zephyr. "Software Bills of Materials" are becoming legal requirements for obvious security reasons

https://www.cisa.gov/sbom

Super ironic to see this in the build of "trusted" firmware.

butok commented 8 months ago

After adding the line 0.0.0.0 raw.githubusercontent.com in [c:/Windows/system32/drivers]/etc/hosts,

That's not enough actually. Even with this line and before the fix/workaround 2d9a2ce, the evil trusted-firmware-m build manages to modify files in platform/ext/target/nxp/ and pollute git diff anyway.

I hadn't issued the west update command after checking out to the newer commit

You definitely want to do that but that's not enough. Before every time you test you also want to run west status and git -C ../modules/tee/tf-m/trusted-firmware-m/ checkout HEAD . if needed.

Whether a download is involved or not, silently modifying files checked in git when invoking the default build target is pure evil.

If the TFM_PLATFORM_NXP_HAL_FILE_PATH is set to a path, no download is done. This is the common Upstream TFM approach to download external libraries or use them without downloading if the path is set (for Mbedcrypto, MCUBoot, t_cose, qcbor, ethos_u_core_core_drivers etc.) BUT: TFM also does auto-generation of many files (linker files, and even sources and header files). What exactly modified files do you see?

dleach02 commented 8 months ago

@butok, it was initially the HWMV2 didn't set a flag right so TFM was pulling the mcu-sdk files from the NXP github. Passed our internal testing because functionally things worked and built and the build system for Zephyr didn't complain about this.

Fixing the flag should stop the tfm build from doing this but I can't say anything else about other aspects of TFM build. This is where the maintainers need to chime in if there is an issue.

nashif commented 7 months ago

@d3zd3z can you please take a look?

butok commented 7 months ago

Fixing the flag should stop the tfm build from doing this but I can't say anything else about other aspects of TFM build. This is where the maintainers need to chime in if there is an issue.

Just checked again. I cannot see any modified repository files after building TFM examples. We can close this issue.

marc-hb commented 7 months ago

Just checked again. I cannot see any modified repository files after building TFM examples.

Yes, but only thanks to a super fragile, one-line if(CONFIG_...) line. This is not good enough. If you think it is, just search the internet for "JiaT75".

Also, I raised a number of other, different but related "build robustness issues" in my comment https://github.com/zephyrproject-rtos/zephyr/issues/70215#issuecomment-1998121524

Here's a very quick reproduction of the issue; much simpler than what was shared previously:

cd zephyr; git checkout d9fd292f4b9a # today's main branch
west update
west diff
--- zephyr/modules/trusted-firmware-m/CMakeLists.txt
+++ zephyr/modules/trusted-firmware-m/CMakeLists.txt
@@ -226,7 +226,7 @@ if (CONFIG_BUILD_WITH_TFM)

   string(REPLACE "toolchain" "toolchain_ns" TFM_TOOLCHAIN_NS_FILE ${TFM_TOOLCHAIN_FILE})

-  if(CONFIG_BOARD_LPCXPRESSO55S69_LPC55S69_CPU0_NS)
+  if(xxxCONFIG_BOARD_LPCXPRESSO55S69_LPC55S69_CPU0_NS)
     # Supply path to NXP HAL sources used for TF-M build
     set(TFM_PLATFORM_NXP_HAL_FILE_PATH ${ZEPHYR_TRUSTED_FIRMWARE_M_MODULE_DIR}/platform/ext/target/nxp/)
     list(APPEND TFM_CMAKE_ARGS -DTFM_PLATFORM_NXP_HAL_FILE_PATH=${TFM_PLATFORM_NXP_HAL_FILE_PATH})

=== diff for trusted-firmware-m (modules/tee/tf-m/trusted-firmware-m):
diff --git modules/tee/tf-m/trusted-firmware-m/platform/ext/target/nxp/lpcxpresso55s69/pull_drivers.cmake modules/tee/tf-m/trusted-firmware-m/platform/ext/target/nxp/lpcxpresso55s69/pull_drivers.cmake
index fbb84a699e7d..6160f5df982b 100644
--- modules/tee/tf-m/trusted-firmware-m/platform/ext/target/nxp/lpcxpresso55s69/pull_drivers.cmake
+++ modules/tee/tf-m/trusted-firmware-m/platform/ext/target/nxp/lpcxpresso55s69/pull_drivers.cmake
@@ -6,6 +6,10 @@
 #
 #-------------------------------------------------------------------------------

+# "Trusted" firmware? Except for supply chain attacks.
+# Even "better": CMake silently ignores failed DOWNLOAD
+message(FATAL_ERROR "Very sneaky and unreliable, build-time overwrite of sources should
+        never happen for the default build target")
+
 #========================= Pull MCUxpresso NXP SDK drivers from https://github.com/NXPmicro/mcux-sdk =========================#
 file(DOWNLOAD https://raw.githubusercontent.com/NXPmicro/mcux-sdk/${NXP_SDK_GIT_TAG}/drivers/casper/fsl_casper.c  ${NXP_HAL_FILE_PATH}/common/Native_Driver/drivers/fsl_casper.c)
 file(DOWNLOAD https://raw.githubusercontent.com/NXPmicro/mcux-sdk/${NXP_SDK_GIT_TAG}/drivers/casper/fsl_casper.h  ${NXP_HAL_FILE_PATH}/common/Native_Driver/drivers/fsl_casper.h)
west build --board lpcxpresso55s69/lpc55s69/cpu0/ns samples/hello_world/

CMake Warning at cmake/version.cmake:22 (message):
  Actual TF-M version is not available from Git repository.  Settled to
  v2.0.0
Call Stack (most recent call first):
  CMakeLists.txt:22 (include)

-- [Crypto service] Using P256M software driver in PSA Crypto backend
-- Pulling MCUxpresso NXP SDK drivers from https://github.com/NXPmicro/mcux-sdk
CMake Error at platform/ext/target/nxp/lpcxpresso55s69/pull_drivers.cmake:11 (message):
  Very sneaky and unreliable, build-time overwrite of sources should never
  happen for the default build target
Call Stack (most recent call first):
  platform/ext/target/nxp/lpcxpresso55s69/CMakeLists.txt:18 (include)
butok commented 7 months ago

Hi @marc-hb This is the TFM Project approach to pull external code. TFM is able to fetch all external libraries based on "super fragile" parameters. They can be found in \trusted-firmware-m\lib\ext directory. Are you able to convince TFM project to change it first?

aescolar commented 6 months ago

There has been no updates for the last 2 weeks on this issue. @marc-hb is this something that you are or want to work on, and do you think it is a high prio bug or should this be closed or downgraded?

ubieda commented 6 months ago

@aescolar I think the reason this was re-opened was based on a misunderstanding from my end:

The issue (TF-M module pulls in external code for LPC55S69) has been fixed in #70218, not because the TFM project has changed the way to tie their dependencies, but because the typo that caused the Build system to trigger that sync-process has been addressed.

IMO - This was a high priority bug because CI Twister failures were holding back multiple PRs. That is no longer the case.

As mentioned on the Bug Triage & Release Meeting, I'll close this issue. Feel free to re-open it again if you disagree but, in any case, I don't think is a High-Priority bug for Zephyr anymore.

marc-hb commented 3 months ago

Broken again?