xuhuisheng / rocm-build

build scripts for ROCm
Apache License 2.0

Getting the 5500 XT to work with a custom build of PyTorch and ROCm #47

Open TheTrustedComputer opened 9 months ago

TheTrustedComputer commented 9 months ago

Environment

Hardware    Description
GPU         AMD Radeon RX 5500 XT
CPU         AMD Ryzen 7 5800X

Software    Version
OS          Arch (Host); Ubuntu 22.04 (Docker)
ROCm        5.2.3
Python      3.10.12

What is the Expected Behavior

All build scripts should pass and install their respective packages, and the unit tests should not raise runtime errors. The result should behave exactly like the precompiled wheel packages for PyTorch 1.13.1 stable and 2.0.0 nightly, which are considered ancient by today's rapidly evolving standards.

The latest stable ROCm version that works properly with the RX 5000 series cards is 5.2.x. Since I'm aware that later versions (5.3+) break compatibility with these cards, I'll try my luck by compiling PyTorch 2.2.0, the latest stable PyTorch version as of writing, against ROCm 5.2.3 using your build scripts.

I read that someone created a wheel with PyTorch 2.1.0 and can confirm that it works on my system without crashing.

What Actually Happens

Building rocALUTION failed with "illegal instruction detected", similar to the linked comment on issue #35. I guess it can't be used on this card without hacky workarounds. Fortunately, it's not a requirement for PyTorch. All other components built without errors. Here's a simple build log I kept along the way, including the code I had to patch and the additional build dependencies I had to install.

ROCm 5.2.3 gfx1012 Ubuntu 22.04 Docker build log
00.rocm-core.sh: PASS
11.rocm-llvm.sh: PASS
12.roct-thunk-interface.sh: PASS
13.rocm-cmake.sh: PASS
14.rocm-device-libs.sh: PASS
15.rocr-runtime.sh: PASS
* need xxd (apt install xxd)
16.rocminfo.sh: PASS
* need kmod (apt install kmod)
17.rocm-compilersupport.sh: PASS
18.hip.sh: PASS
* need dot (apt install graphviz)
21.rocfft.sh: PASS
* may need GPU exposure in container (/dev/dri; /dev/kfd)
navi14/22.rocblas.sh: PASS
* edit CMakeLists.txt to include Python path
23.rocprim.sh: PASS
24.rocrand.sh: PASS
* download hipRAND sources manually and comment out N/A patch
navi14/25.rocsparse.sh: PASS
26.hipsparse.sh: PASS
27.rocm_smi_lib.sh: PASS
28.rccl.sh: PASS
29.hipfft.sh: PASS
31.rocm-opencl-runtime.sh: PASS
32.clang-ocl.sh: PASS
33.rocprofiler.sh: PASS
* comment out N/A patch
34.roctracer.sh: PASS
35.half.sh: PASS
36.miopen.sh: PASS
* patch Boost 1.74.0 to resolve linker error, see https://github.com/boostorg/spirit/commit/f3998fb2bbbcd29aacfc1b27d92af570d154fb9b
* build it with -fPIC
* add -DCMAKE_CXX_FLAGS="-I${ROCM_INSTALL_DIR}/include/rocblas" and set -DCMAKE_PREFIX_PATH to path of patched Boost to cmake args
37.rocm-utils.sh: PASS
41.rocdbgapi.sh: PASS
42.rocgdb.sh: PASS
* needs GMP (apt install libgmp-dev)
43.rocm-dev.sh: PASS
51.rocsolver.sh: PASS
* add -DCMAKE_CXX_FLAGS="-I${ROCM_INSTALL_DIR}/include/rocblas" to cmake args
52.rocthrust.sh: PASS
53.hipblas.sh: PASS
* add -DCMAKE_CXX_FLAGS="-I${ROCM_INSTALL_DIR}/include/rocblas" to cmake args
54.rocalution.sh: FAIL
* inserting include_directories(${ROCM_PATH}/include/rocblas) inside body of if(SUPPORT_HIP) in CMakeLists.txt errors with illegal instruction detected
55.hipcub.sh: PASS
56.hipsolver.sh: PASS
* add -DCMAKE_CXX_FLAGS="-I${ROCM_INSTALL_DIR}/include/rocblas" to cmake args
57.rocm-libs.sh: PASS
61.amdmigraphx.sh: PASS
* may need cJSON (apt install libcjson-dev)
* apply these changes in dev-requirements.txt for glibc >= 2.34: ccache@v4.1 => ccache@v4.2.1; and in requirements.txt: google => protocolbuffers, json@v3.8.0 => json@v3.10.0
* open Embed.cmake to place "#include <string>" in file(WRITE ...) within generate_embed_source function
* add -DCMAKE_CXX_FLAGS="-I${ROCM_INSTALL_DIR}/include/rocblas" to cmake args
62.rock-dkms.sh: PASS
* set permission mask to 755 on both postinst and prerm
71.rocm_bandwidth_test.sh: PASS
72.hipfort.sh: PASS
73.rocmvalidationsuite.sh: PASS
* add -DCMAKE_CXX_FLAGS="-I${ROCM_INSTALL_DIR}/include/rocblas" to cmake args
* modify ROCBLAS_INC_DIR "${ROCM_PATH}/include" to "${ROCM_PATH}/include/rocblas" in CMakeLists.txt
* bump GIT_TAG from release-1.10.0 to release-1.11.0 to bypass uninitialized variable errors in CMakeGtestDownload.cmake
74.rocr_debug_agent.sh: PASS
75.hipify.sh: PASS

PyTorch 2.2.0 with ROCm 5.2.3: PASS
* backport MIOPEN_CONVOLUTION_ATTRIB_DETERMINISTIC = 0 in miopenConvolutionAttrib_t under MIOpen header (ROCm)
* comment out hipblasCgelsBatched, hipblasDgelsBatched, hipblasSgelsBatched, and hipblasZgelsBatched in pytorch/aten/src/ATen/hip/HIPBlas.cpp
* add "#include <hipsolver/internal/hipsolver-types.h>" to pytorch/aten/src/ATen/native/hip/linalg/BatchLinearAlgebraLib.h
Torchaudio: PASS
Torchvision: PASS

After all the builds finished, I ran your check scripts to make sure everything was installed properly. With the exception of rocALUTION, which apparently isn't supported on this family of cards, everything appeared fine at first. However, the installation turned out to be only partially functional: the run-miopen.sh and run-miopen-img.sh check scripts produced compilation errors. All the other checks ran without problems and, thankfully, the results are virtually identical to the prebuilds. Below is the output of run-miopen.sh:

MIOPEN_VERSION_MAJOR:2
MIOPEN_VERSION_MINOR:17
MIOPEN_VERSION_PATCH:0
ws_size = 576
find conv algo
MIOpen(HIP): Error [Do] 'amd_comgr_do_action(kind, handle, in.GetHandle(), out.GetHandle())' AMD_COMGR_ACTION_COMPILE_SOURCE_TO_BC: ERROR (1)
MIOpen(HIP): Error [BuildHip] comgr status = ERROR (1)
MIOpen(HIP): Warning [BuildHip] In file included from /tmp/comgr-5f42cb/input/naive_conv.cpp:1:
In file included from /tmp/hip_pch.39257/hip_pch.h:1:
In file included from /root/rocm-test/rocm-5.2/HIP/include/hip/hip_runtime.h:54:
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/thread:44:
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/this_thread_sleep.h:38:
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/chrono:666:36: error: no matching conversion for functional-style cast from 'const duration<long, std::ratio<1, 1>>' to '__cd' (aka 'duration<long, ratio<num, den>>')
 return __cd(__cd(__lhs).count() - __cd(__rhs).count());
                                   ^~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/chrono:1036:47: note: in instantiation of function template specialization 'std::chrono::operator-<long, std::ratio<1, 1000000000>, long, std::ratio<1, 1>>' requested here
 return __time_point(__lhs.time_since_epoch() -__rhs);
                                              ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/chrono:3402:47: note: in instantiation of function template specialization 'std::chrono::operator-<std::filesystem::__file_clock, std::chrono::duration<long, std::ratio<1, 1000000000>>, long, std::ratio<1, 1>>' requested here
   return __file_time{__t.time_since_epoch()} - _S_epoch_diff;
                                              ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/chrono:3369:16: note: in instantiation of function template specialization 'std::filesystem::__file_clock::_S_from_sys<std::chrono::duration<long, std::ratio<1, 1000000000>>>' requested here
      { return _S_from_sys(chrono::system_clock::now()); }
               ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/chrono:514:2: note: candidate constructor not viable: no known conversion from 'const duration<[...], ratio<[...], 1>>' to 'const duration<[...], ratio<[...], 1000000000>>' for 1st argument
 duration(const duration&) = default;
 ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/chrono:521:23: note: candidate template ignored: requirement '__and_<std::is_convertible<const std::chrono::duration<long, std::ratio<1, 1>> &, long>, std::__or_<std::chrono::treat_as_floating_point<long>, std::__not_<std::chrono::treat_as_floating_point<std::chrono::duration<long, std::ratio<1, 1>>>>>>::value' was not satisfied [with _Rep2 = std::chrono::duration<long>]
   constexpr explicit duration(const _Rep2& __rep)
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/chrono:529:14: note: candidate template ignored: substitution failure [with _Rep2 = long, _Period2 = std::ratio<1, 1>]: non-type template argument is not a constant expression
   constexpr duration(const duration<_Rep2, _Period2>& __d)
             ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/chrono:512:12: note: candidate constructor not viable: requires 0 arguments, but 1 was provided
 constexpr duration() = default;
           ^
In file included from /tmp/comgr-5f42cb/input/naive_conv.cpp:1:
In file included from /tmp/hip_pch.39257/hip_pch.h:1:
In file included from /root/rocm-test/rocm-5.2/HIP/include/hip/hip_runtime.h:62:
In file included from /root/rocm-test/rocm-5.2/hipamd/include/hip/amd_detail/amd_hip_runtime.h:434:
In file included from /opt/rocm/llvm/lib/clang/14.0.0/include/cuda_wrappers/complex:35:
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/stdexcept:39:
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/string:55:
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6608:
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6620:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<long, int, char, int>' requested here
  { return __gnu_cxx::__stoa<long, int>(&std::strtol, "stoi", __str.c_str(),
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<int, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<int, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6625:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<long, long, char, int>' requested here
  { return __gnu_cxx::__stoa(&std::strtol, "stol", __str.c_str(),
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<long, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<long, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6630:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<unsigned long, unsigned long, char, int>' requested here
  { return __gnu_cxx::__stoa(&std::strtoul, "stoul", __str.c_str(),
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<unsigned long, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<unsigned long, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6635:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<long long, long long, char, int>' requested here
  { return __gnu_cxx::__stoa(&std::strtoll, "stoll", __str.c_str(),
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<long long, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<long long, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6640:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<unsigned long long, unsigned long long, char, int>' requested here
  { return __gnu_cxx::__stoa(&std::strtoull, "stoull", __str.c_str(),
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<unsigned long long, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<unsigned long long, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6646:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<float, float, char>' requested here
  { return __gnu_cxx::__stoa(&std::strtof, "stof", __str.c_str(), __idx); }
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<float, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<float, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6650:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<double, double, char>' requested here
  { return __gnu_cxx::__stoa(&std::strtod, "stod", __str.c_str(), __idx); }
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<double, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<double, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6654:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<long double, long double, char>' requested here
  { return __gnu_cxx::__stoa(&std::strtold, "stold", __str.c_str(), __idx); }
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<long double, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<long double, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6751:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<long, int, wchar_t, int>' requested here
  { return __gnu_cxx::__stoa<long, int>(&std::wcstol, "stoi", __str.c_str(),
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<int, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<int, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6756:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<long, long, wchar_t, int>' requested here
  { return __gnu_cxx::__stoa(&std::wcstol, "stol", __str.c_str(),
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<long, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<long, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6761:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<unsigned long, unsigned long, wchar_t, int>' requested here
  { return __gnu_cxx::__stoa(&std::wcstoul, "stoul", __str.c_str(),
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<unsigned long, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<unsigned long, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6766:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<long long, long long, wchar_t, int>' requested here
  { return __gnu_cxx::__stoa(&std::wcstoll, "stoll", __str.c_str(),
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<long long, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<long long, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6771:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<unsigned long long, unsigned long long, wchar_t, int>' requested here
  { return __gnu_cxx::__stoa(&std::wcstoull, "stoull", __str.c_str(),
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<unsigned long long, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<unsigned long long, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6777:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<float, float, wchar_t>' requested here
  { return __gnu_cxx::__stoa(&std::wcstof, "stof", __str.c_str(), __idx); }
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<float, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<float, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6781:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<double, double, wchar_t>' requested here
  { return __gnu_cxx::__stoa(&std::wcstod, "stod", __str.c_str(), __idx); }
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<double, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<double, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:85:7: error: no matching function for call to '_S_chk'
   || _Range_chk::_S_chk(__tmp, std::is_same<_Ret, int>{}))
      ^~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:6785:23: note: in instantiation of function template specialization '__gnu_cxx::__stoa<long double, long double, wchar_t>' requested here
  { return __gnu_cxx::__stoa(&std::wcstold, "stold", __str.c_str(), __idx); }
                      ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:70:4: note: candidate function not viable: no known conversion from 'std::is_same<long double, int>' to 'std::false_type' (aka 'integral_constant<bool, false>') for 2nd argument
   _S_chk(_TRet, std::false_type) { return false; }
   ^
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/string_conversions.h:73:4: note: candidate function not viable: no known conversion from 'std::is_same<long double, int>' to 'std::true_type' (aka 'integral_constant<bool, true>') for 2nd argument
   _S_chk(_TRet __val, std::true_type)
   ^
17 errors generated when compiling for gfx1012.

terminate called after throwing an instance of 'miopen::Exception'
  what():  /root/rocm-test/rocm-5.2/MIOpen/src/hipoc/hipoc_program.cpp:300: Code object build failed. Source: naive_conv.cpp
run-miopen.sh: line 9: 146789 Aborted                 (core dumped) MIOPEN_ENABLE_LOGGING=0 MIOPEN_LOG_LEVEL=0 ./build/test_miopen

check.sh:

[HIP]        50221153
[rocBLAS]    2.44.0.4a92c6f1
[rocFFT]     1.0.17.d3c798c
[rocPRIM]    201009
[rocRAND]    201009
[rocSPARSE]  200200
[rccl]       21212
[MIOpen]     2 17 0
[rocSOVLER]  3.18.0.d883d5f
[rocThrust]  101500
src/hello_rocalution.cpp:2:10: fatal error: 'rocalution/version.hpp' file not found
#include "rocalution/version.hpp"
         ^~~~~~~~~~~~~~~~~~~~~~~~
1 error generated when compiling for gfx1012.
check.sh: line 38: ./build/hello_rocalution: No such file or directory
[hipCUB]     201012
[hipBLAS]    0 51 0
[hipSPARSE]  200100
[hipRAND]    201009
[hipFFT]     10017

I've tried different versions of GCC (11.4, 10.5, and 9.4), and all resulted in the same error. Sadly, this is something I cannot fix. In the systemd journal logs, I see several messages saying "Could not parse number of program headers from core file: invalid `Elf' handle". Investigation shows that this issue was reported upstream and is somewhat specific to ROCm 5.2.3; it has been fixed in 5.3, along with the illegal instruction messages in rocALUTION.

https://github.com/ROCm/MIOpen/issues/1764
https://github.com/rocm-arch/rocm-arch/issues/857

Nevertheless, I then proceeded to compile PyTorch 2.2.0 along with hipMAGMA support, torchaudio 2.2.0, and torchvision 0.17. It doesn't build out of the box because the code uses constants that are missing in 5.2.3. All of them appear in your target version, 5.4.

After making some modifications to the PyTorch code (see the build log), I was able to make it work. If you have any patches that backport these four hipBLAS and MIOpen constants, please provide them and let me know how to apply them. Thank you very much!

How to Reproduce

Create an Ubuntu 22.04 Docker container with these flags, and perform a repo init and repo sync on ROCm 5.2.x.

docker run -it --device /dev/dri --device /dev/kfd --volume /mnt/ubuntu22.04:/root/rocm-test ubuntu:22.04

You can change the volume mount point to whatever you have on your end. Then apply the adjustments listed in the build log.
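
Because the container only sees the GPU through the two --device nodes, it may help to confirm they exist on the host first. A minimal sketch; the function name missing_devices is hypothetical and not part of rocm-build:

```bash
# Hypothetical pre-flight check: print the device nodes from the argument
# list that do not exist on this host (prints an empty line if all exist).
missing_devices() {
    absent=""
    for dev in "$@"; do
        [ -e "$dev" ] || absent="$absent $dev"
    done
    # Trim the leading space before printing.
    echo "${absent# }"
}

not_found=$(missing_devices /dev/dri /dev/kfd)
if [ -n "$not_found" ]; then
    echo "Missing on host: $not_found"
else
    echo "GPU device nodes found"
fi
```

If either node is reported missing, docker run with those --device flags is unlikely to expose the GPU, which matches the "may need GPU exposure in container" note in the build log.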

For reference, here's my env.sh file:

#!/bin/bash

export ROCM_INSTALL_DIR=/opt/rocm
export ROCM_MAJOR_VERSION=5
export ROCM_MINOR_VERSION=2
export ROCM_PATCH_VERSION=3
export ROCM_LIBPATCH_VERSION=50203
export CPACK_DEBIAN_PACKAGE_RELEASE=109
export ROCM_PKGTYPE=DEB
export ROCM_GIT_DIR=/root/rocm-test/rocm-5.2
export ROCM_BUILD_DIR=/root/rocm-test/rocm-build/build
export ROCM_PATCH_DIR=/root/rocm-test/rocm-build/patch
export AMDGPU_TARGETS="gfx1012"
# export CMAKE_DIR=/root/rocm-test/cmake-3.18.6
# export PATH=$ROCM_INSTALL_DIR/bin:$ROCM_INSTALL_DIR/llvm/bin:$ROCM_INSTALL_DIR/hip/bin:$CMAKE_DIR/bin:$PATH
export PATH=$ROCM_INSTALL_DIR/bin:$ROCM_INSTALL_DIR/llvm/bin:$ROCM_INSTALL_DIR/hip/bin:$PATH
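
The ROCM_LIBPATCH_VERSION value above appears to pack the three version components as major*10000 + minor*100 + patch. A minimal sketch under that assumption, useful as a sanity check when switching ROCm versions; the helper name rocm_libpatch_version is hypothetical:

```bash
# Hypothetical helper: pack MAJOR MINOR PATCH into the 5-digit form.
# Assumes the convention major*10000 + minor*100 + patch.
rocm_libpatch_version() {
    major=$1
    minor=$2
    patch=$3
    echo $(( major * 10000 + minor * 100 + patch ))
}

rocm_libpatch_version 5 2 3   # prints 50203, matching the env.sh above
```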

Also, run this beforehand:

apt update && apt install sudo xxd kmod libtinfo5 graphviz libgmp-dev libcjson-dev

Otherwise, your install-dependency.sh script and the builds of specific components like ROCR-Runtime, HIP, rocminfo, ROCgdb, and AMD MIGraphX won't run.
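
To see which of the command-line tools from the build log notes are still missing before starting a long build, something like the following could be run inside the container. This is a sketch with a hypothetical function name (missing_tools); it only covers executables (xxd, kmod, dot), not the library packages such as libgmp-dev or libcjson-dev:

```bash
# Hypothetical helper: print the commands from the argument list
# that cannot be found on PATH (prints an empty line if all are present).
missing_tools() {
    result=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || result="$result $tool"
    done
    # Trim the leading space before printing.
    echo "${result# }"
}

# xxd (ROCR-Runtime), kmod (rocminfo), dot from graphviz (HIP)
missing_tools xxd kmod dot
```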

TheTrustedComputer commented 9 months ago

Building ROCm 5.4(.3) indeed fixes the MIOpen compiler error users faced with 5.2.3. The checks passed with flying colors, except that ws_size is 0 rather than 576.

MIOPEN_VERSION_MAJOR:2
MIOPEN_VERSION_MINOR:19
MIOPEN_VERSION_PATCH:0
ws_size = 0
find conv algo
time : 0.01444
[0] = 0
[1] = 3
[2] = 8
[3] = 13
[4] = 18
[5] = 8
[6] = 15
[7] = 29
[8] = 35
[9] = 41
[10] = 47
[11] = 18
[12] = 35
[13] = 59
[14] = 65
[15] = 71
[16] = 77
[17] = 28
[18] = 55
[19] = 89
[20] = 95
[21] = 101
[22] = 107
[23] = 38
[24] = 75
[25] = 119
[26] = 125
[27] = 131
[28] = 137
[29] = 48
[30] = 20
[31] = 21
[32] = 22
[33] = 23
[34] = 24
[35] = 0

run-miopen-img.sh produces exactly the same image as the reference, further confirming that MIOpen works:

handle conv start
out shape 1 3 728 410
ws_size = 0
find conv algo
time : 0.270198
save bmp start
save bmp end
free mem start
free mem end

Here's my ROCm 5.4.3 build log for the gfx1012 target. This time, AMD MIGraphX fails to compile, whereas it built fine with 5.2.3. I wasn't able to figure out how to resolve this, but it's probably not necessary for PyTorch anyway.

ROCm 5.4.3 gfx1012 Ubuntu 22.04 Docker build log
00.rocm-core.sh: PASS
11.rocm-llvm.sh: PASS
12.roct-thunk-interface.sh: PASS
13.rocm-cmake.sh: PASS
14.rocm-device-libs.sh: PASS
15.rocr-runtime.sh: PASS
* need xxd (apt install xxd)
16.rocminfo.sh: PASS
* need kmod (apt install kmod)
17.rocm-compilersupport.sh: PASS
18.hip.sh: PASS
* need dot (apt install graphviz)
21.rocfft.sh: PASS
* may need GPU exposure in container (/dev/dri; /dev/kfd)
navi14/22.rocblas.sh: PASS
23.rocprim.sh: PASS
24.rocrand.sh: PASS
navi14/25.rocsparse.sh: PASS
* comment out N/A patch
26.hipsparse.sh: PASS
27.rocm_smi_lib.sh: PASS
28.rccl.sh: PASS
* apply patch in issue #44
29.hipfft.sh: PASS
31.rocm-opencl-runtime.sh: PASS
32.clang-ocl.sh: PASS
33.rocprofiler.sh: PASS
34.roctracer.sh: PASS
35.half.sh: PASS
36.miopen.sh: PASS
* need Niels Lohmann's JSON fork (apt install nlohmann-json3-dev)
* patch Boost 1.74.0 to resolve linker error, see https://github.com/boostorg/spirit/commit/f3998fb2bbbcd29aacfc1b27d92af570d154fb9b; build it with -fPIC
* set -DCMAKE_PREFIX_PATH to path of patched Boost to cmake args
* add -DMIOPEN_USE_COMPOSABLEKERNEL=0 to cmake args to disable composable kernels
37.rocm-utils.sh: PASS
41.rocdbgapi.sh: PASS
42.rocgdb.sh: PASS
* need GMP (apt install libgmp-dev)
* remove line "--disable-shared" from script
43.rocm-dev.sh: PASS
51.rocsolver.sh: PASS
52.rocthrust.sh: PASS
53.hipblas.sh: PASS
54.rocalution.sh: PASS
55.hipcub.sh: PASS
56.hipsolver.sh: PASS
57.rocm-libs.sh: PASS
61.amdmigraphx.sh: FAIL
* CMake Error at /usr/local/share/cmake/cmakeget/CMakeGet.cmake:430 (foreach):
*   Unknown argument:
* 
*     NO
* 
* Call Stack (most recent call first):
*   /root/rocm-test/rocm-5.4/AMDMIGraphX/install_deps.cmake:85 (cmake_get_from)
62.rock-dkms.sh: PASS
* change permission masks to what dpkg-deb expects
71.rocm_bandwidth_test.sh: PASS
72.hipfort.sh: PASS
73.rocmvalidationsuite.sh: PASS
74.rocr_debug_agent.sh: PASS
75.hipify.sh: PASS

check.sh:

[HIP]        50422804
[rocBLAS]    2.46.0.24f38911
[rocFFT]     1.0.21.5687cd9
[rocPRIM]    201009
[rocRAND]    201009
[rocSPARSE]  200303
[rccl]       21304
[MIOpen]     2 19 0
[rocSOVLER]  3.20.0.2740dcf
[rocThrust]  101600
[rocALUTION] 20103
[hipCUB]     201012
[hipBLAS]    0 53 0
[hipSPARSE]  200303
[hipRAND]    201009
[hipFFT]     10021

env.sh:

#!/bin/bash

export ROCM_INSTALL_DIR=/opt/rocm
export ROCM_MAJOR_VERSION=5
export ROCM_MINOR_VERSION=4
export ROCM_PATCH_VERSION=3
export ROCM_LIBPATCH_VERSION=50403
export CPACK_DEBIAN_PACKAGE_RELEASE=121~22.04
export ROCM_PKGTYPE=DEB
export ROCM_GIT_DIR=/root/rocm-test/rocm-5.4
export ROCM_BUILD_DIR=/root/rocm-test/rocm-build/build
export ROCM_PATCH_DIR=/root/rocm-test/rocm-build/patch
export AMDGPU_TARGETS="gfx1012"
# export CMAKE_DIR=/home/work/local/cmake-3.18.6-Linux-x86_64
export PATH=$ROCM_INSTALL_DIR/bin:$ROCM_INSTALL_DIR/llvm/bin:$ROCM_INSTALL_DIR/hip/bin:$CMAKE_DIR/bin:$PATH

Furthermore, I didn't have to patch PyTorch since ROCm 5.4.3 contains definitions that were absent in 5.2.3. The MNIST sample training sessions correctly utilize my GPU without the need for the HSA_OVERRIDE_GFX_VERSION environment variable. Since I compiled only for the RX 5500 XT, the build won't work with other cards without rebuilding.

However, creating a pip wheel for manylinux_2_35_x86_64 and installing it on my Arch host doesn't quite work. I created a PyTorch diagnosis script to test basic tensor and matrix operations. The script fails when hipMAGMA is involved in the calculation. Interestingly, this doesn't happen in the Ubuntu 22.04 Docker container!

Traceback (most recent call last):
  File "/home/thetrustedcomputer/Docker/torch_test.py", line 108, in <module>
    test_tensors(dev_ids[i])
  File "/home/thetrustedcomputer/Docker/torch_test.py", line 14, in wrapper
    funct(*args, **kwargs)
  File "/home/thetrustedcomputer/Docker/torch_test.py", line 58, in test_tensors
    print(torch.det(matrix_a), end = "\n\n")
RuntimeError: CUDA NVRTC error: HIPRTC_ERROR_INVALID_INPUT
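The traceback shows each test going through a small wrapper that calls `funct(*args, **kwargs)`. A variant of such a wrapper that reports a RuntimeError and lets the remaining checks run could look like the sketch below. It is not the actual torch_test.py code; `report_errors` and both example checks are hypothetical stand-ins.

```python
import functools
import traceback


def report_errors(funct):
    """Run a check; print the traceback on RuntimeError instead of aborting."""
    @functools.wraps(funct)
    def wrapper(*args, **kwargs):
        try:
            funct(*args, **kwargs)
        except RuntimeError:
            traceback.print_exc()
            return False
        return True
    return wrapper


@report_errors
def failing_check():
    # Stand-in for a call such as torch.det(...) that raises on the host.
    raise RuntimeError("HIPRTC_ERROR_INVALID_INPUT")


@report_errors
def passing_check():
    # Stand-in for an operation that completes normally.
    pass


if __name__ == "__main__":
    results = {check.__name__: check() for check in (failing_check, passing_check)}
    print(results)
```

With this shape, one failing operation no longer hides whether the later ones work.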

Running the MNIST training session on the host also fails, this time with a different runtime error:

Traceback (most recent call last):
  File "/home/thetrustedcomputer/Docker/pytorch-examples/mnist/main.py", line 7, in <module>
    from torchvision import datasets, transforms
  File "/home/thetrustedcomputer/Desktop/venv/lib/python3.10/site-packages/torchvision/__init__.py", line 6, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
  File "/home/thetrustedcomputer/Desktop/venv/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
    def meta_nms(dets, scores, iou_threshold):
  File "/home/thetrustedcomputer/Desktop/venv/lib/python3.10/site-packages/torch/library.py", line 440, in inner
    handle = entry.abstract_impl.register(func_to_register, source)
  File "/home/thetrustedcomputer/Desktop/venv/lib/python3.10/site-packages/torch/_library/abstract_impl.py", line 30, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
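One thing worth ruling out is a torchvision wheel that was built against a different torch build than the one installed in the venv, since that commonly leaves torchvision's compiled operators unregistered. Whether that is the cause here is an assumption, not something confirmed in this thread. The sketch below reads the installed versions without importing either package; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Reads package metadata only, so it works even when the import fails.
    for name in ("torch", "torchvision"):
        print(name, installed_version(name))
```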

If you or anyone else has an idea of how to remove these runtime errors, please let me know, and I'll look into it. Much appreciated.

serhii-nakon commented 8 months ago

@TheTrustedComputer You can use this Docker image, which already has PyTorch prebuilt with ROCm for the RX 5500: https://hub.docker.com/r/serhiin/rocm_gfx1012_pytorch

TheTrustedComputer commented 3 months ago

UPDATE: I figured out how to resolve the issue when building MIGraphX (ONNX Runtime depends on it as an alternative execution provider) on ROCm 5.4.3.

It turns out that cmake-get has a bug in its CMake parser that processes the MIT license clause as arguments by checking for a # character word by word instead of line by line.
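To illustrate the difference, here is a small sketch of the two ways of stripping comments. It is not cmake-get's actual code; the function names, the sample license line, and the package entry are hypothetical. With the word-by-word check, only the token that is itself `#` gets dropped, so the rest of the license sentence leaks through as arguments, which would be consistent with the stray NO in the error message.

```python
SAMPLE = """\
# THE SOFTWARE IS PROVIDED "AS IS". IN NO EVENT SHALL THE AUTHORS BE LIABLE.
example/package@1.0
"""


def parse_word_by_word(text):
    """Buggy variant: drops only the tokens that themselves start with '#'."""
    return [word for word in text.split() if not word.startswith("#")]


def parse_line_by_line(text):
    """Intended variant: everything after '#' on a line is a comment."""
    args = []
    for line in text.splitlines():
        code = line.split("#", 1)[0]
        args.extend(code.split())
    return args


if __name__ == "__main__":
    print(parse_word_by_word(SAMPLE))
    print(parse_line_by_line(SAMPLE))
```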

Then, I bumped the versions in dev-requirements.txt and requirements.txt to work with glibc 2.34 and later (see the ROCm 5.2.3 build log), removed sqlite3 to use the system library (libsqlite3-dev), and the rest went smoothly.

Thank you @serhii-nakon for creating a Docker container with prebuilt ROCm and PyTorch for this card and sharing it with everyone. I ended up not needing it.