wolfi-dev / os

Main package repository for production Wolfi images

openssf-compiler-options: `-Wl,-z,now` causes nvidia-device-plugin to fail to load #34568

Open dannf opened 1 day ago

dannf commented 1 day ago

I'm reporting this per https://github.com/orgs/wolfi-dev/discussions/33052.

I found that while a rebuild of nvidia-device-plugin with openssf-compiler-options succeeds, the tests fail:

2024/11/19 00:38:19 WARN + nvidia-device-plugin --version
2024/11/19 00:38:19 WARN nvidia-device-plugin: symbol lookup error: nvidia-device-plugin: undefined symbol: nvmlGpuInstanceGetComputeInstanceProfileInfoV

nvidia-device-plugin-build-and-test-fail.txt nvidia-device-plugin-no-rebuild-test-ok.txt

tuananh commented 1 day ago

You could just change it to `-Wl,-z,lazy` and it should work.

I guess this package needs to link against libnvidia-ml on the host, which is why lazy binding is used.

dannf commented 1 day ago

> You could just change it to `-Wl,-z,lazy` and it should work.

Yeah, that did work when I hacked it onto the end of the options in openssf.spec - but I didn't identify a way to do it with build commands/environment variables. Neither LDFLAGS nor CGO_LDFLAGS did the trick.

> I guess this package needs to link against libnvidia-ml on the host, which is why lazy binding is used.

That's right.
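Since appending `-Wl,-z,lazy` to the end of the spec options worked while `LDFLAGS` did not, option order on the final link line looks like the deciding factor. The helper below is a minimal sketch for checking that, assuming the linker honors the last `-z now` / `-z lazy` it sees (consistent with the behavior reported above, but not verified against the linker source). `effective_binding` is a hypothetical name; it reads a link command line on stdin and only recognizes the separated `-z now` form, not `-znow`.

```shell
#!/bin/sh
# Print the last "-z now" or "-z lazy" found in a link command line on stdin.
# Assumption: the last occurrence is the one the linker applies.
effective_binding() {
  # Split on spaces and commas so "-Wl,-z,lazy" becomes "-Wl", "-z", "lazy",
  # then drop the double quotes that `gcc -###` puts around each argument.
  tr -s ' ,' '\n\n' | tr -d '"' | awk '
    prev == "-z" && ($0 == "now" || $0 == "lazy") { mode = $0 }
    { prev = $0 }
    END { if (mode != "") print mode }'
}
```

One way to use it is to pipe in the commands printed by `gcc -###` (which writes them to stderr), for example `gcc -### -Wl,-z,lazy -o /dev/null hello.c 2>&1 | effective_binding`, where `hello.c` stands for any small C file. If that prints `now`, something is being placed after the user-supplied flag.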

tuananh commented 1 day ago

Ah, I think it's because we use make to build (https://github.com/NVIDIA/k8s-device-plugin/blob/main/Makefile). Maybe convert this to use go/build, and then we can use the ldflags override in the pipeline:

https://github.com/chainguard-dev/melange/blob/main/pkg/build/pipelines/go/build.yaml

dannf commented 15 hours ago

> Ah, I think it's because we use make to build (https://github.com/NVIDIA/k8s-device-plugin/blob/main/Makefile). Maybe convert this to use go/build, and then we can use the ldflags override in the pipeline:
>
> https://github.com/chainguard-dev/melange/blob/main/pkg/build/pipelines/go/build.yaml

Thanks @tuananh. That would provide a hook for passing a clean -ldflags, but the problem persists:

[...]
  - uses: go/build
    with:
      packages: ./cmd/nvidia-device-plugin
      ldflags: -extldflags="-Wl,-z,lazy"
      output: test
  - runs: |
      exit 1
$ make debug/nvidia-device-plugin
[...]
2024/11/19 21:42:52 INFO running step "go/build"
2024/11/19 21:43:03 ERRO Step failed: exit status 1
/bin/sh -c set -e 
[ -d '/home/build' ] || mkdir -p '/home/build'
cd '/home/build'
exit 1

exit 0
2024/11/19 21:43:03 INFO Execing into pod "" to debug interactively. workdir=/home/build
2024/11/19 21:43:03 INFO Type 'exit 0' to continue the next pipeline step or 'exit 1' to abort.
~ $ ./melange-out/nvidia-device-plugin/usr/bin/test 
./melange-out/nvidia-device-plugin/usr/bin/test: symbol lookup error: ./melange-out/nvidia-device-plugin/usr/bin/test: undefined symbol: nvmlGpuInstanceGetComputeInstanceProfileInfoV

The spec-defined options seem to be very sticky: they still take effect even when `-Wl,-z,lazy` is passed through `-extldflags`.
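To confirm which binding mode a built binary actually ended up with, without having to run it, the dynamic section can be inspected. This is a minimal sketch: `binding_mode` is a hypothetical helper that reads `readelf -d` output on stdin and looks for the `BIND_NOW` flag or the `NOW` entry under `FLAGS_1`, which is how binutils normally reports `-z now`.

```shell
#!/bin/sh
# Classify `readelf -d` output (on stdin) as "now" or "lazy".
# A binary linked with -z now carries BIND_NOW in its dynamic section
# and/or NOW in FLAGS_1; if neither is present, binding is lazy.
binding_mode() {
  if grep -Eq 'BIND_NOW|\(FLAGS_1\).*[[:space:]]NOW'; then
    echo now
  else
    echo lazy
  fi
}
```

For the debug session above that would be `readelf -d ./melange-out/nvidia-device-plugin/usr/bin/test | binding_mode`, assuming `readelf` is available in the build environment. A result of `now` would mean the `-z now` from the spec still won over the `-z lazy` passed in `-extldflags`.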

tuananh commented 10 hours ago

Yeah, I tried it too and it didn't work.