xuhuisheng / rocm-build

build scripts for ROCm
Apache License 2.0

3.9 crashes during building on gfx803 with me but 3.10 does not crash. #3

Closed crypt0miester closed 3 years ago

crypt0miester commented 3 years ago

hey man, firstly, thanks for the work.

it has been two days for me trying to build rocm for tensorflow.

I got to the point of despair and raising this issue.

my setup:
GPU: Sapphire Radeon RX570 4GB
CPU: Intel Celeron
RAM: 8GB

my question is, do you have the 3.10 rocSPARSE version which would work on gfx803?

I tried building your version but it was for 3.9 right?

I still get the hipErrorNoBinaryForGpu error even after rebuilding your version of rocSPARSE.

anything would be helpful. Thanks

xuhuisheng commented 3 years ago

ROCm-3.10 is the same as ROCm-3.9 here. You can clone https://github.com/ROCmSoftwarePlatform/rocSPARSE, check out 3.10.x, move AMDGPU_TARGETS before the include, then rebuild rocSPARSE. The same applies to ROCm-4.0.
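
To illustrate what "move AMDGPU_TARGETS before the include" means, here is a minimal Python sketch that reorders lines in a CMake file's text. The sample file content, the `include(cmake/Dependencies.cmake)` line, and the helper name `move_line_before` are all hypothetical and made up for illustration; the real rocSPARSE file will look different, so treat this as a picture of the edit rather than a ready-made patch.

```python
def move_line_before(text: str, needle: str, anchor: str) -> str:
    """Move the first line containing `needle` to just before the
    first line containing `anchor`. Returns the text unchanged if
    the needle line already comes first."""
    lines = text.splitlines()
    src = next((i for i, l in enumerate(lines) if needle in l), None)
    dst = next((i for i, l in enumerate(lines) if anchor in l), None)
    if src is None or dst is None:
        raise ValueError("needle or anchor line not found")
    if src < dst:
        return text  # already in the desired order
    moved = lines.pop(src)
    lines.insert(dst, moved)
    return "\n".join(lines) + "\n"


# Hypothetical stand-in for a CMake file; NOT the real rocSPARSE content.
SAMPLE_CMAKE = """\
cmake_minimum_required(VERSION 3.5)
include(cmake/Dependencies.cmake)
set(AMDGPU_TARGETS gfx803 CACHE STRING "GPU targets")
"""

print(move_line_before(SAMPLE_CMAKE, "AMDGPU_TARGETS", "include("))
```

In practice the same edit can simply be done by hand in a text editor before rebuilding.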

crypt0miester commented 3 years ago

Excellent. Will try that and get back to you. I tried your check.sh and the rocBLAS check is "core dumped". Have you encountered this issue before?

btw, should I do a full reinstallation after these errors? or just rebuild rocSPARSE?

crypt0miester commented 3 years ago

so I got rocSPARSE to work, but the rocBLAS issue didn't resolve itself. lol

/rocm-build/check $ sudo bash check.sh 
check.sh: line 9:  2204 Illegal instruction     (core dumped) ./build/hello_rocblas
[rocFFT]    1.0.8.966-rocm-rel-3.10-27-2d35fd6
[rocPRIM]   201005
[rocRAND]   201006
[rocSPARSE] 101800
[rccl]      2708
check.sh: line 33:  2459 Illegal instruction     (core dumped) ./build/hello_miopen
check.sh: line 37:  2500 Illegal instruction     (core dumped) ./build/hello_rocsolver
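
A small Python sketch for summarizing output like the above: it separates the checks that died with "Illegal instruction" from the ones that printed a version. The function name `summarize_check_log` is hypothetical, and the regular expressions assume the exact line shapes shown in this thread.

```python
import re

# Matches lines such as:
# "check.sh: line 9:  2204 Illegal instruction     (core dumped) ./build/hello_rocblas"
CRASH_RE = re.compile(r"Illegal instruction\s+\(core dumped\)\s+(\S+)")
# Matches lines such as: "[rocSPARSE] 101800"
VERSION_RE = re.compile(r"^\[(\w+)\]\s+(\S+)")

# Output copied from the comment above.
SAMPLE_LOG = """\
check.sh: line 9:  2204 Illegal instruction     (core dumped) ./build/hello_rocblas
[rocFFT]    1.0.8.966-rocm-rel-3.10-27-2d35fd6
[rocPRIM]   201005
[rocRAND]   201006
[rocSPARSE] 101800
[rccl]      2708
check.sh: line 33:  2459 Illegal instruction     (core dumped) ./build/hello_miopen
check.sh: line 37:  2500 Illegal instruction     (core dumped) ./build/hello_rocsolver
"""


def summarize_check_log(log: str):
    """Return (crashed_binaries, versions_by_library) for a check.sh log."""
    crashed, versions = [], {}
    for line in log.splitlines():
        crash = CRASH_RE.search(line)
        if crash:
            crashed.append(crash.group(1))
            continue
        version = VERSION_RE.match(line)
        if version:
            versions[version.group(1)] = version.group(2)
    return crashed, versions


crashed, versions = summarize_check_log(SAMPLE_LOG)
print("crashed:", crashed)
print("reported a version:", sorted(versions))
```

For this log the crashing binaries are the rocBLAS, MIOpen and rocSOLVER checks, while rocFFT, rocPRIM, rocRAND, rocSPARSE and rccl report versions.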

crypt0miester commented 3 years ago

managed to solve a lot of issues. now tensorflow just fails with "Illegal instruction (core dumped)"

is it because of rocBLAS?

Python 3.8.5 (default, Jul 28 2020, 12:59:40) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.add(3,5)
2021-02-17 18:09:54.024738: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libamdhip64.so
2021-02-17 18:09:54.554061: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]     ROCm AMD GPU ISA: gfx803
coreClock: 1.34GHz coreCount: 32 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 104.31GiB/s
Illegal instruction (core dumped)

xuhuisheng commented 3 years ago

Actually, I haven't met this Illegal instruction error on my RX580 8G.

But I still suggest rebuilding rocBLAS with -DBUILD_WITH_TENSILE_HOST=OFF. I uploaded https://github.com/xuhuisheng/rocm-build/blob/rocm-4.1.x/gfx803/22.rocblas.sh, please try it.

BUILD_WITH_TENSILE_HOST=OFF disables the asm scripts and uses the old C-language implementation of GEMM instead. I believe the new asm GEMM has issues on gfx803, but right now I cannot pinpoint the cause.

crypt0miester commented 3 years ago

I think I bricked my gpu. I will try with another gpu. (atiflash/amdvbflash -i failed to test it)

I'll close this issue. I will reopen it if I find the issue is still unresolved.

Thanks mate.

xuhuisheng commented 3 years ago

Actually, before I rebuilt rocBLAS, I often crashed my RX580 by running the language model example from pytorch examples: https://github.com/pytorch/examples/tree/master/word_language_model . My solution is to wait a while and reset the compute, then the GPU will wake up.

crypt0miester commented 3 years ago

how do I reset the compute?

xuhuisheng commented 3 years ago

I mean shut down the power and reboot.

crypt0miester commented 3 years ago

alright. will get back to you.

crypt0miester commented 3 years ago

I was able to fix the GPU issue.

you are correct, perhaps it is a rocBLAS issue.

I tried using your fix, but while patching I got:

error: patch failed: library/src/blas_ex/rocblas_gemm_ext2.hpp:4
error: library/src/blas_ex/rocblas_gemm_ext2.hpp: patch does not apply

I used

repo init -u https://github.com/RadeonOpenCompute/ROCm.git -b roc-4.0.x
repo sync

because with 4.1.x I get:

manifests:
fatal: couldn't find remote ref refs/heads/roc-4.1.x

any solutions?

xuhuisheng commented 3 years ago

OK. It seems this patch, which targets the upcoming ROCm-4.1, does not apply to ROCm-4.0. Which version do you want? I will make a patch for that version. Or you can just modify library/src/blas_ex/rocblas_gemm_ext2.hpp and move #include "rocblas_gemm_ex.hpp" out of the #ifdef USE_TENSILE_HOST block.

This allows us to use USE_TENSILE_HOST=OFF; otherwise the build reports an error that some functions cannot be found.

I will reopen this issue.

crypt0miester commented 3 years ago

modified and removed rm -rf $ROCM_GIT_DIR/rocBLAS/library/src/blas3/Tensile/Logic/asm_full/r9nano*

let's see, wish me luck. :)

crypt0miester commented 3 years ago

still getting the same thing after building. the build was successful too.

maybe this is a kernel issue?

which kernel version are you using?

I am using 5.4.0-65-generic

this is

xuhuisheng commented 3 years ago

It's weird, because hello_rocblas does nothing but load librocblas.so and print a version string. My environment is ubuntu-20.04.1 with linux-5.4.0-64.

You can verify whether it is a kernel problem by running the hip sample: https://github.com/xuhuisheng/rocm-build/blob/rocm-4.1.x/check/run-hip.sh. The hip square sample does not use any rocm-libs component; it just runs a simple kernel function. If the hip sample does not throw errors, we can tell that the kernel and hip levels are correct.

After some searching, I found that Illegal instruction may be caused by toolchain cross-compiling. I suggest using docker to prepare a clean ubuntu:20.04 and installing ROCm there. check.sh should report the rocBLAS version correctly, even without a rebuild.

crypt0miester commented 3 years ago

alright. I will try linux-5.4.0-64 and get back to you.

crypt0miester commented 3 years ago

I got this:

$ dmesg | grep amd
[    0.000000] Linux version 5.4.0-64-generic (buildd@lcy01-amd64-021) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #72-Ubuntu SMP Fri Jan 15 10:27:54 UTC 2021 (Ubuntu 5.4.0-64.72-generic 5.4.78)
[    2.859037] amdkcl: loading out-of-tree module taints kernel.
[    2.859062] amdkcl: module verification failed: signature and/or required key missing - tainting kernel
[    3.072508] amdgpu: Unknown symbol amd_iommu_bind_pasid (err -2)
[    3.072678] amdgpu: Unknown symbol amd_iommu_set_invalidate_ctx_cb (err -2)
[    3.072796] amdgpu: Unknown symbol amd_iommu_free_device (err -2)
[    3.080007] amdgpu: Unknown symbol amd_iommu_unbind_pasid (err -2)
[    3.080037] amdgpu: Unknown symbol amd_iommu_init_device (err -2)
[    3.080279] amdgpu: Unknown symbol amd_iommu_set_invalid_ppr_cb (err -2)

on linux-5.4.0-64

I guess I should move on. it took me a week doing this.

you can close the issue.

cheers xuhu.

crypt0miester commented 3 years ago

the issue seems to be in MIOpen.

when I do sh run-miopen.sh, I get:

Illegal instruction (core dumped)

crypt0miester commented 3 years ago

which python version are you using?

xuhuisheng commented 3 years ago

I am using Python-3.8.5, which is the default Python version of ubuntu-20.04.1.

Do you have an APU in this computer? Somebody said there is a bug in environments which have both an APU and a GPU. Please refer to this issue: https://github.com/RadeonOpenCompute/ROCm/issues/1306#issuecomment-743404462

Try /opt/rocm/bin/rocminfo to check whether there are both an APU and a GPU.
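
A hedged Python sketch of that check: it scans rocminfo text for agent names of the form gfxNNN. It assumes each agent is printed with a `Name:` field and that GPU agents (including an APU's integrated GPU) have names starting with `gfx`. The sample text below is a hypothetical, shortened stand-in for real rocminfo output, and the helper names are made up for illustration.

```python
import re
import subprocess

# A "Name:" field whose value is a gfx target, e.g. "  Name:   gfx803".
GFX_NAME_RE = re.compile(r"^\s*Name:\s+(gfx[0-9a-f]+)\s*$", re.MULTILINE)


def gfx_agents(rocminfo_text: str):
    """Return the gfx agent names found in rocminfo output, in order."""
    return GFX_NAME_RE.findall(rocminfo_text)


def read_rocminfo(path: str = "/opt/rocm/bin/rocminfo") -> str:
    """Run rocminfo and return its stdout (needs a working ROCm install)."""
    result = subprocess.run([path], capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical, abbreviated sample; real output has many more fields.
SAMPLE_ROCMINFO = """\
Agent 1
  Name:                    Intel(R) Celeron(R) CPU
  Device Type:             CPU
Agent 2
  Name:                    gfx803
  Marketing Name:          Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
  Device Type:             GPU
"""

agents = gfx_agents(SAMPLE_ROCMINFO)
print(agents)
if len(agents) > 1:
    print("more than one gfx agent: possibly an APU plus a discrete GPU")
```

On a real machine, `gfx_agents(read_rocminfo())` would replace the sample text. More than one gfx agent only hints at an APU plus GPU setup; two discrete GPUs would look the same to this check.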

crypt0miester commented 3 years ago

no APUs. I will try to build with rocm-4.

if it doesn't work I guess I'll have to find a way to make it work with 3.5 and below.

xuhuisheng commented 3 years ago

I suggest installing ROCm-4.0 and running check.sh. If you still get Illegal instruction, there is no need to rebuild rocBLAS.

The rebuild only solves the gfx803 issue; Illegal instruction could have another cause.

crypt0miester commented 3 years ago

I still got Illegal instruction with ROCm-4.0 :disappointed:

trying with 3.3 now. everything is working but I am trying to figure out which tensorflow to use. tensorflow-rocm==2.2.0 and 2.3.0 did not work.
ImportError: "libamdhip64.so.3": cannot open shared object file: No such file or directory
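
The ImportError names the exact library file the tensorflow-rocm wheel wants to load. A minimal Python sketch for comparing that name with what is actually installed follows. The directory `/opt/rocm/lib` is an assumption based on the `/opt/rocm` prefix used earlier in this thread, and both helper names are hypothetical.

```python
from pathlib import Path


def list_hip_runtimes(lib_dir: str = "/opt/rocm/lib",
                      stem: str = "libamdhip64.so"):
    """Return the sorted file names in lib_dir that start with stem."""
    return sorted(p.name for p in Path(lib_dir).glob(stem + "*"))


def is_available(soname: str, lib_dir: str = "/opt/rocm/lib") -> bool:
    """True if a file with exactly this name exists in lib_dir."""
    return (Path(lib_dir) / soname).exists()


wanted = "libamdhip64.so.3"  # name taken from the ImportError above
print("installed:", list_hip_runtimes())
print(wanted, "available:", is_available(wanted))
```

If the wanted name is missing from the listing, the installed ROCm release and the tensorflow-rocm wheel were likely built against different HIP runtime versions, and trying another wheel version is one option.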

xuhuisheng commented 3 years ago

I tested tensorflow-rocm==2.2.0rc5 locally with ROCm-3.3 and it worked.

And when you have time, could you use docker to set up an ubuntu:20.04 image and test ROCm-4.0 with check.sh? Thank you.

crypt0miester commented 3 years ago

I got it working with ROCm-3.3 and tensorflow-rocm==2.2.0

will do that when I have time. Thanks xuhu.