xuhuisheng / rocm-build

build scripts for ROCm
Apache License 2.0

Tensorflow, gfx803, `hipErrorNoBinaryForGpu: Unable to find code object for all current devices` #16

Closed riaqn closed 2 years ago

riaqn commented 3 years ago

Environment

| Hardware | Description |
| --- | --- |
| GPU | RX580 |
| CPU | 3700X |

| Software | Version |
| --- | --- |
| OS | Linux 5.13.10 |
| ROCm | 4.3.0 |
| Python | 3.9.6 |
| Tensorflow-rocm | 2.5.0 |

What is the expected behavior

Tensorflow should run correctly.

What actually happens

root@darkbox ~# python -m deeporn.fit
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.summary API due to missing TensorBoard installation.
WARNING:deeporn.model:test_run=True
2021-08-21 08:57:36.488884: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libamdhip64.so
/home/yay/.cache/yay/hip-rocclr/src/HIP-rocm-4.3.0/rocclr/hip_code_object.cpp:486: "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
fish: Job 1, 'python -m deeporn.fit' terminated by signal SIGABRT (Abort)
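Because the process dies with SIGABRT, the failure cannot be caught with a Python `try`/`except` in the same process. A minimal sketch of one way to detect it, assuming a POSIX system and that the HIP message ends up on stdout or stderr: run the workload in a child process and inspect its exit status and output. The function names here are illustrative and not part of any ROCm or TensorFlow tooling.

```python
import signal
import subprocess
import sys

# Error marker copied from the log above.
NO_BINARY_MARKER = "hipErrorNoBinaryForGpu"


def classify_run(returncode, output_text):
    """Classify a finished child process from its exit status and output."""
    if NO_BINARY_MARKER in output_text:
        return "no-code-object"
    # On POSIX, subprocess reports death by signal as a negative return code.
    if returncode == -signal.SIGABRT:
        return "aborted"
    return "ok" if returncode == 0 else "failed"


def run_and_classify(argv):
    """Run argv in a child process so an abort does not kill the caller."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return classify_run(proc.returncode, proc.stdout + proc.stderr)


if __name__ == "__main__":
    # Harmless demo command; replace it with the real workload.
    print(run_and_classify([sys.executable, "-c", "pass"]))
```

A result of `no-code-object` suggests the installed binaries carry no code object for the local GPU, which is the situation reported in this issue.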

How to reproduce

Do I have to recompile tensorflow?

xuhuisheng commented 3 years ago

I was not aware that tensorflow-rocm 2.5.0 had been released. I will test it and verify whether it supports gfx803.

I also see that you are using Linux kernel 5.13 and Python 3.9. You didn't test under Ubuntu 20.04, right?

Update: I verified that tensorflow-rocm 2.5.0 does not currently support gfx803. Sad news. The workaround is to use tensorflow-rocm 2.4.3: `pip3 install tensorflow-rocm==2.4.3`. I will try to find a way to recompile tensorflow-rocm 2.5.0.
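As a rough sketch of how to check which side of this workaround an environment is on, the snippet below reads the installed `tensorflow-rocm` version through the standard-library `importlib.metadata` and compares it with the two versions discussed in this thread. The version sets only record what was reported here for gfx803; they are not an official support matrix, and the function names are hypothetical.

```python
from importlib import metadata

# Versions as reported in this thread for gfx803 (assumption: nothing beyond
# these two data points is known).
KNOWN_BROKEN_ON_GFX803 = {"2.5.0"}
KNOWN_WORKING_ON_GFX803 = {"2.4.3"}


def gfx803_status(version):
    """Return 'broken', 'working', or 'unknown' for a tensorflow-rocm version."""
    if version in KNOWN_BROKEN_ON_GFX803:
        return "broken"
    if version in KNOWN_WORKING_ON_GFX803:
        return "working"
    return "unknown"


def installed_tensorflow_rocm():
    """Return the installed tensorflow-rocm version, or None if it is absent."""
    try:
        return metadata.version("tensorflow-rocm")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_tensorflow_rocm()
    if version is None:
        print("tensorflow-rocm is not installed")
    else:
        print(version, gfx803_status(version))
```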

riaqn commented 3 years ago

Thanks for the quick reply! I'm using Arch Linux. Does the Linux distribution matter?

Unfortunately, only 2.5.0 is available from PyPI as a binary package.

I'm now trying to recompile tensorflow-rocm 2.5.0 using this AUR build script, which supports gfx803: https://github.com/rocm-arch/tensorflow-rocm

However, I ran into an issue: https://github.com/rocm-arch/tensorflow-rocm/issues/31

xuhuisheng commented 3 years ago

I recompiled https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/tree/r2.5-rocm-enhanced and did not hit your compile error.

The MNIST example runs properly.

I guess it may be caused by gcc-10. The gcc used in Ubuntu 20.04.2 is gcc-9.

BTW, I just installed bazel-3.7.2, executed build_rocm_python3, and waited about 3 hours; tensorflow-2.5.0-cp38-cp38-linux_x86_64.whl was built successfully. I used the ubuntu:20.04 Docker image; just remember to install the dependency packages.

For now I cannot tell whether the tensorflow-rocm build picked up the local GPU configuration or whether we need to set something like AMDGPU_TARGETS=gfx803.
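One way to see what "local GPU config" a build could pick up is to list the gfx targets that the ROCm `rocminfo` tool reports. The sketch below assumes `rocminfo` is on `PATH` and that its output mentions targets as tokens such as `gfx803`; it does not show how the TensorFlow build itself selects targets, which this thread leaves open.

```python
import re
import shutil
import subprocess

# Matches tokens such as gfx803, gfx900 or gfx90a.
GFX_PATTERN = re.compile(r"\bgfx[0-9a-f]+\b")


def parse_gfx_targets(rocminfo_text):
    """Return the distinct gfx target names in order of first appearance."""
    seen = []
    for name in GFX_PATTERN.findall(rocminfo_text):
        if name not in seen:
            seen.append(name)
    return seen


def local_gfx_targets():
    """Run rocminfo if it is available and return the targets it reports."""
    exe = shutil.which("rocminfo")
    if exe is None:
        return []  # ROCm tools not installed or not on PATH
    proc = subprocess.run([exe], capture_output=True, text=True)
    return parse_gfx_targets(proc.stdout)


if __name__ == "__main__":
    print(local_gfx_targets())
```

On an RX580 the expected target is gfx803, matching the issue title.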

riaqn commented 3 years ago

I'm actually using gcc-11. Let me try gcc-9.

xuhuisheng commented 3 years ago

It seems that only tensorflow-rocm 2.5.0 provides a Python 3.9 wheel. Maybe you can try Python 3.8 with tensorflow-rocm 2.4.3: https://pypi.org/project/tensorflow-rocm/2.4.3/#files
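Whether a given wheel fits the running interpreter can be read off its file name. The following is a simplified sketch that pulls the Python tag (for example `cp38`) out of a wheel name such as the one built earlier in this thread and compares it with the current CPython version. It ignores compressed tags like `py2.py3` and does not check the ABI or platform tags.

```python
import sys


def wheel_python_tag(filename):
    """Return the Python tag from a wheel file name.

    Wheel names end in <python tag>-<abi tag>-<platform tag>.whl, so the
    Python tag is the third field from the end.
    """
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    if len(parts) < 5:
        raise ValueError(f"not a wheel file name: {filename!r}")
    return parts[-3]


def interpreter_tag(version_info=None):
    """Return the CPython tag (e.g. 'cp38') for a (major, minor) version."""
    v = sys.version_info if version_info is None else version_info
    return f"cp{v[0]}{v[1]}"


def wheel_matches(filename, version_info=None):
    """True if the wheel's Python tag equals the interpreter's tag."""
    return wheel_python_tag(filename) == interpreter_tag(version_info)


if __name__ == "__main__":
    name = "tensorflow-2.5.0-cp38-cp38-linux_x86_64.whl"
    print(wheel_python_tag(name), wheel_matches(name))
```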

riaqn commented 3 years ago

OK, after some research, the problem is that TF 2.5.0 references an outdated version of ruy. The issue in ruy was fixed in a later commit, and TF 2.6.0 references a later version of ruy, which is fine.

It seems that TF doesn't backport fixes, meaning we can only wait for tensorflow-rocm 2.6.0, make a patch according to this issue, or compile with GCC 10 (but I'm having some problems with this too).

riaqn commented 3 years ago

@xuhuisheng I just realized that you mentioned patching rocBLAS (removing library/src/blas3/Tensile/Logic/asm_full/r9nano_*.yaml). Do we still need to do that for ROCm 4.3.0?
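For reference, a small dry-run sketch that only lists the files the patch would remove from a rocBLAS source checkout, using the path pattern quoted above. The checkout location is a hypothetical argument, nothing is deleted, and whether the removal is still required for ROCm 4.3.0 is not settled in this thread.

```python
from pathlib import Path

# Directory and pattern taken from the comment above.
R9NANO_SUBDIR = Path("library/src/blas3/Tensile/Logic/asm_full")
R9NANO_PATTERN = "r9nano_*.yaml"


def find_r9nano_logic_files(rocblas_root):
    """Return the sorted r9nano logic files under a rocBLAS checkout."""
    logic_dir = Path(rocblas_root) / R9NANO_SUBDIR
    if not logic_dir.is_dir():
        return []  # not a rocBLAS checkout, or the layout differs
    return sorted(logic_dir.glob(R9NANO_PATTERN))


if __name__ == "__main__":
    # "rocBLAS" is a placeholder path for a local source checkout.
    for path in find_r9nano_logic_files("rocBLAS"):
        print(path)
```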

xuhuisheng commented 3 years ago

@riaqn Please see the document for what this patch does: https://github.com/xuhuisheng/rocm-build/tree/master/gfx803#rocm-37-broke-on-gfx803

xuhuisheng commented 2 years ago

It has been 3 months since the last post, so I will close this issue. Please reopen it if there are any updates.