TheTrustedComputer opened 9 months ago
Building ROCm 5.4(.3) indeed fixes the MIOpen compiler error users faced with 5.2.3. The checks passed with flying colors, except that ws_size is 0 rather than 576.
MIOPEN_VERSION_MAJOR:2
MIOPEN_VERSION_MINOR:19
MIOPEN_VERSION_PATCH:0
ws_size = 0
find conv algo
time : 0.01444
[0] = 0
[1] = 3
[2] = 8
[3] = 13
[4] = 18
[5] = 8
[6] = 15
[7] = 29
[8] = 35
[9] = 41
[10] = 47
[11] = 18
[12] = 35
[13] = 59
[14] = 65
[15] = 71
[16] = 77
[17] = 28
[18] = 55
[19] = 89
[20] = 95
[21] = 101
[22] = 107
[23] = 38
[24] = 75
[25] = 119
[26] = 125
[27] = 131
[28] = 137
[29] = 48
[30] = 20
[31] = 21
[32] = 22
[33] = 23
[34] = 24
[35] = 0
run-miopen-img.sh: produces the exact same image as the reference, further confirming its functionality.
handle conv start
out shape 1 3 728 410
ws_size = 0
find conv algo
time : 0.270198
save bmp start
save bmp end
free mem start
free mem end
Below is my ROCm 5.4.3 build log against the gfx1012 target. This time, AMD MIGraphX couldn't compile, whereas it could with 5.2.3. I wasn't able to figure out how to resolve this, but it's probably not necessary for PyTorch anyway.
ROCm 5.4.3 gfx1012 Ubuntu 22.04 Docker build log
00.rocm-core.sh: PASS
11.rocm-llvm.sh: PASS
12.roct-thunk-interface.sh: PASS
13.rocm-cmake.sh: PASS
14.rocm-device-libs.sh: PASS
15.rocr-runtime.sh: PASS
* need xxd (apt install xxd)
16.rocminfo.sh: PASS
* need kmod (apt install kmod)
17.rocm-compilersupport.sh: PASS
18.hip.sh: PASS
* need dot (apt install graphviz)
21.rocfft.sh: PASS
* may need GPU exposure in container (/dev/dri; /dev/kfd)
navi14/22.rocblas.sh: PASS
23.rocprim.sh: PASS
24.rocrand.sh: PASS
navi14/25.rocsparse.sh: PASS
* comment out N/A patch
26.hipsparse.sh: PASS
27.rocm_smi_lib.sh: PASS
28.rccl.sh: PASS
* apply patch in issue #44
29.hipfft.sh: PASS
31.rocm-opencl-runtime.sh: PASS
32.clang-ocl.sh: PASS
33.rocprofiler.sh: PASS
34.roctracer.sh: PASS
35.half.sh: PASS
36.miopen.sh: PASS
* need Niels Lohmann's JSON fork (apt install nlohmann-json3-dev)
* patch Boost 1.74.0 to resolve linker error, see https://github.com/boostorg/spirit/commit/f3998fb2bbbcd29aacfc1b27d92af570d154fb9b; build it with -fPIC
* set -DCMAKE_PREFIX_PATH to path of patched Boost to cmake args
* add -DMIOPEN_USE_COMPOSABLEKERNEL=0 to cmake args to disable composable kernels
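The MIOpen notes above might look like the following in practice. This is only a sketch: the install prefix, source paths, and exact b2 invocation are my assumptions, not the precise commands from the build script.

```shell
# Build the patched Boost 1.74.0 with -fPIC (after applying the
# boostorg/spirit fix linked above) into a hypothetical prefix.
cd boost_1_74_0
./bootstrap.sh --prefix=/opt/boost-1.74-fpic
./b2 cxxflags=-fPIC cflags=-fPIC link=static install

# Configure MIOpen against the patched Boost, with composable kernels disabled.
cd /path/to/MIOpen && mkdir -p build && cd build
cmake .. \
  -DCMAKE_PREFIX_PATH=/opt/boost-1.74-fpic \
  -DMIOPEN_USE_COMPOSABLEKERNEL=0 \
  -DCMAKE_INSTALL_PREFIX=/opt/rocm
make -j"$(nproc)" && make install
```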
37.rocm-utils.sh: PASS
41.rocdbgapi.sh: PASS
42.rocgdb.sh: PASS
* need GMP (apt install libgmp-dev)
* remove line "--disable-shared" from script
43.rocm-dev.sh: PASS
51.rocsolver.sh: PASS
52.rocthrust.sh: PASS
53.hipblas.sh: PASS
54.rocalution.sh: PASS
55.hipcub.sh: PASS
56.hipsolver.sh: PASS
57.rocm-libs.sh: PASS
61.amdmigraphx.sh: FAIL
* CMake Error at /usr/local/share/cmake/cmakeget/CMakeGet.cmake:430 (foreach):
* Unknown argument:
*
* NO
*
* Call Stack (most recent call first):
* /root/rocm-test/rocm-5.4/AMDMIGraphX/install_deps.cmake:85 (cmake_get_from)
62.rock-dkms.sh: PASS
* change permission masks to what dpkg-deb expects
71.rocm_bandwidth_test.sh: PASS
72.hipfort.sh: PASS
73.rocmvalidationsuite.sh: PASS
74.rocr_debug_agent.sh: PASS
75.hipify.sh: PASS
check.sh:
[HIP] 50422804
[rocBLAS] 2.46.0.24f38911
[rocFFT] 1.0.21.5687cd9
[rocPRIM] 201009
[rocRAND] 201009
[rocSPARSE] 200303
[rccl] 21304
[MIOpen] 2 19 0
[rocSOLVER] 3.20.0.2740dcf
[rocThrust] 101600
[rocALUTION] 20103
[hipCUB] 201012
[hipBLAS] 0 53 0
[hipSPARSE] 200303
[hipRAND] 201009
[hipFFT] 10021
env.sh:
#!/bin/bash
export ROCM_INSTALL_DIR=/opt/rocm
export ROCM_MAJOR_VERSION=5
export ROCM_MINOR_VERSION=4
export ROCM_PATCH_VERSION=3
export ROCM_LIBPATCH_VERSION=50403
export CPACK_DEBIAN_PACKAGE_RELEASE=121~22.04
export ROCM_PKGTYPE=DEB
export ROCM_GIT_DIR=/root/rocm-test/rocm-5.4
export ROCM_BUILD_DIR=/root/rocm-test/rocm-build/build
export ROCM_PATCH_DIR=/root/rocm-test/rocm-build/patch
export AMDGPU_TARGETS="gfx1012"
# export CMAKE_DIR=/home/work/local/cmake-3.18.6-Linux-x86_64
export PATH=$ROCM_INSTALL_DIR/bin:$ROCM_INSTALL_DIR/llvm/bin:$ROCM_INSTALL_DIR/hip/bin:$CMAKE_DIR/bin:$PATH
Furthermore, I didn't have to patch PyTorch since ROCm 5.4.3 contains definitions that were absent in 5.2.3. The MNIST sample training sessions correctly utilize my GPU without the need for the HSA_OVERRIDE_GFX_VERSION environment variable. Since I compiled only for the RX 5500 XT, it won't work with other cards without rebuilding.
However, creating a pip wheel for manylinux_2_35_x86_64 and installing it on my Arch host doesn't quite work. I created a PyTorch diagnosis script to test basic tensor and matrix operations. The script fails when hipMAGMA is involved in the calculation. Interestingly, this doesn't happen in the Ubuntu 22.04 Docker container!
Traceback (most recent call last):
File "/home/thetrustedcomputer/Docker/torch_test.py", line 108, in <module>
test_tensors(dev_ids[i])
File "/home/thetrustedcomputer/Docker/torch_test.py", line 14, in wrapper
funct(*args, **kwargs)
File "/home/thetrustedcomputer/Docker/torch_test.py", line 58, in test_tensors
print(torch.det(matrix_a), end = "\n\n")
RuntimeError: CUDA NVRTC error: HIPRTC_ERROR_INVALID_INPUT
Running the MNIST training session on the host produces a similar error:
Traceback (most recent call last):
File "/home/thetrustedcomputer/Docker/pytorch-examples/mnist/main.py", line 7, in <module>
from torchvision import datasets, transforms
File "/home/thetrustedcomputer/Desktop/venv/lib/python3.10/site-packages/torchvision/__init__.py", line 6, in <module>
from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
File "/home/thetrustedcomputer/Desktop/venv/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
def meta_nms(dets, scores, iou_threshold):
File "/home/thetrustedcomputer/Desktop/venv/lib/python3.10/site-packages/torch/library.py", line 440, in inner
handle = entry.abstract_impl.register(func_to_register, source)
File "/home/thetrustedcomputer/Desktop/venv/lib/python3.10/site-packages/torch/_library/abstract_impl.py", line 30, in register
if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
If you or anyone else has an idea of how to remove these runtime errors, please let me know, and I'll look into it. Much appreciated.
@TheTrustedComputer You can use this docker that already has prebuilt Pytorch with rocm for rx5500 https://hub.docker.com/r/serhiin/rocm_gfx1012_pytorch
UPDATE: I figured out how to resolve the issue when building MIGraphX (ONNX Runtime depends on it as an alternative execution provider) on ROCm 5.4.3.
It turns out that cmake-get has a bug in its CMake parser: it treats the words of the MIT license clause as arguments because it checks for a # character word by word instead of line by line.
Then, I bumped the versions in dev-requirements.txt and requirements.txt to work with glibc 2.34 and later (see the ROCm 5.2.3 build log), removed sqlite3 so the system library (libsqlite3-dev) is used instead, and the rest went smoothly.
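To illustrate the parser bug, here is a minimal shell sketch (not cmake-get's actual code): a line-by-line check correctly discards everything after # on each line, while a word-by-word check only discards tokens that themselves contain #, so the words of a license header leak through as "arguments".

```shell
# Correct: line-by-line -- everything from '#' to end of line is a comment.
linewise() { sed 's/#.*//' | tr ' ' '\n' | grep -v '^$'; }

# Buggy: word-by-word -- only tokens containing '#' are dropped, so the
# remaining words of a comment line survive as arguments.
wordwise() { tr ' ' '\n' | grep -v '#' | grep -v '^$'; }

printf '%s\n' '# THE SOFTWARE IS PROVIDED AS IS' | linewise
# prints nothing: the whole line is a comment
printf '%s\n' '# THE SOFTWARE IS PROVIDED AS IS' | wordwise
# leaks THE, SOFTWARE, IS, PROVIDED, AS, IS as separate "arguments"
```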
Thank you @serhii-nakon for creating a Docker container with prebuilt ROCm and PyTorch for this card and sharing it with everyone. I ended up not needing it.
Environment
What is the Expected Behavior
All build scripts will pass and install their respective packages; unit tests won't raise runtime errors. It should behave exactly like the precompiled wheel packages for PyTorch 1.13.1 stable and 2.0.0 nightly, which are considered ancient by today's rapidly evolving standards.
The latest stable ROCm version that works properly with the RX 5000 series cards is 5.2.x. Since I'm aware that later versions (5.3+) break compatibility with these cards, I'll try my luck compiling PyTorch 2.2.0, the latest stable PyTorch version as of writing, against ROCm 5.2.3 using your build script.
I read that someone created a wheel with PyTorch 2.1.0 and can confirm that it works on my system without crashing.
What Actually Happens
Building rocALUTION failed with an illegal instruction detected, similar to the linked comment on issue #35. I guess it can't be used on this card without hacky workarounds. Fortunately, it's not a requirement for PyTorch. All other toolchains succeed without errors. Here's a simple build log I kept while doing this, including the patches to parts of the code and the additional build dependencies installed along the way.
After all the builds were finished, I ran your check scripts to ensure everything was installed properly. With the exception of rocALUTION, which apparently isn't supported for this family of cards, everything appeared fine. However, I seem to have a partially functional installation: the run-miopen.sh and run-miopen-img.sh check scripts produced compilation errors. The other checks all ran OK without problems; thankfully, their output is virtually identical to the prebuilds'. Below is the output of run-miopen.sh:
I've tried different versions of GCC (11.4, 10.5, and 9.4), all of which resulted in the same error. This is something I cannot fix, sadly. In the systemd journal logs, I see several messages saying "Could not parse number of program headers from core file: invalid `Elf' handle". Investigation shows that this was reported upstream and is somewhat specific to ROCm 5.2.3; it has been fixed in 5.3, as have the illegal instruction messages in rocALUTION.
https://github.com/ROCm/MIOpen/issues/1764 https://github.com/rocm-arch/rocm-arch/issues/857
Nevertheless, I then proceeded to compile PyTorch 2.2.0 with hipMAGMA support, along with torchaudio 2.2.0 and torchvision 0.17. They don't build out of the box due to the use of constants that are missing from ROCm 5.2.3; all of them are present in your target version, 5.4.
After making some modifications to the PyTorch code (see the build log), I was able to make it work. If you have any patches that backport these four hipBLAS and MIOpen constants, please provide them and let me know how to apply them. Thank you very much!
How to Reproduce
Create an Ubuntu 22.04 Docker container with these flags, and perform a
repo init
and
repo sync
on ROCm 5.2.x. You can change the volume mount point to whatever you have on your end. Then, implement those adjustments as indicated in the build log.
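The repo init/sync step might look like this. Treat it as a sketch: the manifest URL, branch name, and checkout directory are my assumptions for a ROCm 5.2.x tree, so verify them against the ROCm build documentation for your exact release.

```shell
# Pull the ROCm 5.2.x source tree with the google-repo tool.
# Manifest URL and branch are assumptions -- verify before use.
mkdir -p ~/rocm-test/rocm-5.2 && cd ~/rocm-test/rocm-5.2
repo init -u https://github.com/RadeonOpenCompute/ROCm.git -b roc-5.2.x
repo sync
```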
For reference, here's my
env.sh
file (see above). Also, do an
apt update && apt install sudo xxd kmod libtinfo5 graphviz libgmp-dev libcjson-dev
beforehand, or your
install-dependency.sh
script and the builds of specific toolchains like ROCR-Runtime, HIP, rocminfo, ROCgdb, and AMD MIGraphX won't run.