xuhuisheng / rocm-build

build scripts for ROCm
Apache License 2.0
181 stars · 35 forks

Fix didn't fix the problem? #8

Closed · snackfart closed this issue 3 years ago

snackfart commented 3 years ago

Environment

Hardware description
CPU Ryzen 3600
GPU RX 480
Software version
OS Ubuntu 20.04.2
ROCm 4.1.1
Python 3.7

What is the expected behavior

The given fix didn't fix the problem: https://github.com/RadeonOpenCompute/ROCm/issues/1454 Maybe I did something wrong?

What actually happens

-

How to reproduce

-

xuhuisheng commented 3 years ago

The official release of pytorch-1.8.1 doesn't support gfx803. We have to compile pytorch ourselves. For now, you can refer to the navi10 documents: https://github.com/xuhuisheng/rocm-build/tree/master/navi10

I think I can add a pytorch build script for gfx803 later.

snackfart commented 3 years ago

Which combination of ROCm and pytorch works with an RX 480 officially?

xuhuisheng commented 3 years ago

@snackfart Unfortunately, pytorch-1.8.0 is the first official release (even as beta) with ROCm support.

The only way to run pytorch on gfx803 is to compile it ourselves.

xuhuisheng commented 3 years ago

I added scripts for building pytorch from source. https://github.com/xuhuisheng/rocm-build/tree/master/gfx803#pytorch-181-crashed-on-gfx803

snackfart commented 3 years ago

> I added scripts for building pytorch from source. https://github.com/xuhuisheng/rocm-build/tree/master/gfx803#pytorch-181-crashed-on-gfx803

Many thanks. Which version of ROCm is preferable for this, 3.5 or 4.1.1?

xuhuisheng commented 3 years ago

I am using ROCm-4.1.1 now. I have only run pytorch and tensorflow with ROCm-4.1.1 on some small models, like MNIST. I haven't persuaded my colleagues to use ROCm in a bigger environment yet.

Building pytorch takes a lot of time. Maybe I can try building pytorch on ROCm-3.5.1 later.

snackfart commented 3 years ago

> I am using ROCm-4.1.1 now. I have only run pytorch and tensorflow with ROCm-4.1.1 on some small models, like MNIST. I haven't persuaded my colleagues to use ROCm in a bigger environment yet.
>
> Building pytorch takes a lot of time. Maybe I can try building pytorch on ROCm-3.5.1 later.

Okay, thanks. Can you upload your build of pytorch for gfx803?

snackfart commented 3 years ago

> I am using ROCm-4.1.1 now. I have only run pytorch and tensorflow with ROCm-4.1.1 on some small models, like MNIST. I haven't persuaded my colleagues to use ROCm in a bigger environment yet. Building pytorch takes a lot of time. Maybe I can try building pytorch on ROCm-3.5.1 later.

Okay, thanks. Can you upload your build of pytorch for gfx803?

The build process fails at:

```
USE_ROCM=1 USE_NINJA=1 python3 setup.py bdist_wheel
```

Output:

```
Building wheel torch-1.8.0a0+56b43f4
-- Building version 1.8.0a0+56b43f4
cmake -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/buran/pytorch/torch -DCMAKE_PREFIX_PATH=/usr/lib/python3/dist-packages -DNUMPY_INCLUDE_DIR=/home/buran/.local/lib/python3.8/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/usr/bin/python3 -DPYTHON_INCLUDE_DIR=/usr/include/python3.8 -DPYTHON_LIBRARY=/usr/lib/libpython3.8.so.1.0 -DTORCH_BUILD_VERSION=1.8.0a0+56b43f4 -DUSE_NINJA=1 -DUSE_NUMPY=True -DUSE_ROCM=1 /home/buran/pytorch
-- std::exception_ptr is supported.
-- Turning off deprecation warning due to glog.
-- Current compiler supports avx2 extension. Will build perfkernels.
-- Current compiler supports avx512f extension. Will build fbgemm.
-- Building using own protobuf under third_party per request.
-- Use custom protobuf build.
--
-- 3.11.4.0
-- Caffe2 protobuf include directory: $$
-- Trying to find preferred BLAS backend of choice: MKL
-- MKL_THREADING = OMP
-- MKL_THREADING = OMP
CMake Warning at cmake/Dependencies.cmake:152 (message):
  MKL could not be found.  Defaulting to Eigen
Call Stack (most recent call first):
  CMakeLists.txt:564 (include)
CMake Warning at cmake/Dependencies.cmake:175 (message):
  Preferred BLAS (MKL) cannot be found, now searching for a general BLAS library
Call Stack (most recent call first):
  CMakeLists.txt:564 (include)
-- MKL_THREADING = OMP
-- Checking for [mkl_intel_lp64 - mkl_gnu_thread - mkl_core - gomp - pthread - m - dl]
-- Library mkl_intel_lp64: not found
-- Checking for [mkl_intel_lp64 - mkl_intel_thread - mkl_core - gomp - pthread - m - dl]
-- Library mkl_intel_lp64: not found
-- Checking for [mkl_intel - mkl_gnu_thread - mkl_core - gomp - pthread - m - dl]
-- Library mkl_intel: not found
-- Checking for [mkl_intel - mkl_intel_thread - mkl_core - gomp - pthread - m - dl]
-- Library mkl_intel: not found
-- Checking for [mkl_gf_lp64 - mkl_gnu_thread - mkl_core - gomp - pthread - m - dl]
-- Library mkl_gf_lp64: not found
-- Checking for [mkl_gf_lp64 - mkl_intel_thread - mkl_core - gomp - pthread - m - dl]
-- Library mkl_gf_lp64: not found
-- Checking for [mkl_gf - mkl_gnu_thread - mkl_core - gomp - pthread - m - dl]
-- Library mkl_gf: not found
-- Checking for [mkl_gf - mkl_intel_thread - mkl_core - gomp - pthread - m - dl]
-- Library mkl_gf: not found
-- Checking for [mkl_intel_lp64 - mkl_gnu_thread - mkl_core - iomp5 - pthread - m - dl]
-- Library mkl_intel_lp64: not found
-- Checking for [mkl_intel_lp64 - mkl_intel_thread - mkl_core - iomp5 - pthread - m - dl]
-- Library mkl_intel_lp64: not found
-- Checking for [mkl_intel - mkl_gnu_thread - mkl_core - iomp5 - pthread - m - dl]
-- Library mkl_intel: not found
-- Checking for [mkl_intel - mkl_intel_thread - mkl_core - iomp5 - pthread - m - dl]
-- Library mkl_intel: not found
-- Checking for [mkl_gf_lp64 - mkl_gnu_thread - mkl_core - iomp5 - pthread - m - dl]
-- Library mkl_gf_lp64: not found
-- Checking for [mkl_gf_lp64 - mkl_intel_thread - mkl_core - iomp5 - pthread - m - dl]
-- Library mkl_gf_lp64: not found
-- Checking for [mkl_gf - mkl_gnu_thread - mkl_core - iomp5 - pthread - m - dl]
-- Library mkl_gf: not found
-- Checking for [mkl_gf - mkl_intel_thread - mkl_core - iomp5 - pthread - m - dl]
-- Library mkl_gf: not found
-- Checking for [mkl_intel_lp64 - mkl_gnu_thread - mkl_core - pthread - m - dl]
-- Library mkl_intel_lp64: not found
-- Checking for [mkl_intel_lp64 - mkl_intel_thread - mkl_core - pthread - m - dl]
-- Library mkl_intel_lp64: not found
-- Checking for [mkl_intel - mkl_gnu_thread - mkl_core - pthread - m - dl]
-- Library mkl_intel: not found
-- Checking for [mkl_intel - mkl_intel_thread - mkl_core - pthread - m - dl]
-- Library mkl_intel: not found
-- Checking for [mkl_gf_lp64 - mkl_gnu_thread - mkl_core - pthread - m - dl]
-- Library mkl_gf_lp64: not found
-- Checking for [mkl_gf_lp64 - mkl_intel_thread - mkl_core - pthread - m - dl]
-- Library mkl_gf_lp64: not found
-- Checking for [mkl_gf - mkl_gnu_thread - mkl_core - pthread - m - dl]
-- Library mkl_gf: not found
-- Checking for [mkl_gf - mkl_intel_thread - mkl_core - pthread - m - dl]
-- Library mkl_gf: not found
-- Checking for [mkl_intel_lp64 - mkl_sequential - mkl_core - m - dl]
-- Library mkl_intel_lp64: not found
-- Checking for [mkl_intel - mkl_sequential - mkl_core - m - dl]
-- Library mkl_intel: not found
-- Checking for [mkl_gf_lp64 - mkl_sequential - mkl_core - m - dl]
-- Library mkl_gf_lp64: not found
-- Checking for [mkl_gf - mkl_sequential - mkl_core - m - dl]
-- Library mkl_gf: not found
-- Checking for [mkl_intel_lp64 - mkl_core - gomp - pthread - m - dl]
-- Library mkl_intel_lp64: not found
-- Checking for [mkl_intel - mkl_core - gomp - pthread - m - dl]
-- Library mkl_intel: not found
-- Checking for [mkl_gf_lp64 - mkl_core - gomp - pthread - m - dl]
-- Library mkl_gf_lp64: not found
-- Checking for [mkl_gf - mkl_core - gomp - pthread - m - dl]
-- Library mkl_gf: not found
-- Checking for [mkl_intel_lp64 - mkl_core - iomp5 - pthread - m - dl]
-- Library mkl_intel_lp64: not found
-- Checking for [mkl_intel - mkl_core - iomp5 - pthread - m - dl]
-- Library mkl_intel: not found
-- Checking for [mkl_gf_lp64 - mkl_core - iomp5 - pthread - m - dl]
-- Library mkl_gf_lp64: not found
-- Checking for [mkl_gf - mkl_core - iomp5 - pthread - m - dl]
-- Library mkl_gf: not found
-- Checking for [mkl_intel_lp64 - mkl_core - pthread - m - dl]
-- Library mkl_intel_lp64: not found
-- Checking for [mkl_intel - mkl_core - pthread - m - dl]
-- Library mkl_intel: not found
-- Checking for [mkl_gf_lp64 - mkl_core - pthread - m - dl]
-- Library mkl_gf_lp64: not found
-- Checking for [mkl_gf - mkl_core - pthread - m - dl]
-- Library mkl_gf: not found
-- Checking for [mkl - guide - pthread - m]
-- Library mkl: not found
-- MKL library not found
-- Checking for [Accelerate]
-- Library Accelerate: BLAS_Accelerate_LIBRARY-NOTFOUND
-- Checking for [vecLib]
-- Library vecLib: BLAS_vecLib_LIBRARY-NOTFOUND
-- Found OpenBLAS libraries: /usr/lib/x86_64-linux-gnu/libopenblas.so
-- Found OpenBLAS include: /usr/include/x86_64-linux-gnu
-- Found a library with BLAS API (open). Full path: (/usr/lib/x86_64-linux-gnu/libopenblas.so)
-- Brace yourself, we are building NNPACK
-- Found PythonInterp: /usr/bin/python3 (found version "3.8.5")
-- NNPACK backend is x86-64
-- Failed to find LLVM FileCheck
-- git Version: v1.4.0-505be96a
-- Version: 1.4.0
-- Performing Test HAVE_STD_REGEX -- success
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
-- Performing Test HAVE_POSIX_REGEX -- success
-- Performing Test HAVE_STEADY_CLOCK -- success
CMake Warning at third_party/fbgemm/CMakeLists.txt:61 (message):
  OpenMP found! OpenMP_C_INCLUDE_DIRS =
CMake Warning at third_party/fbgemm/CMakeLists.txt:136 (message):
  ==========
CMake Warning at third_party/fbgemm/CMakeLists.txt:137 (message):
  CMAKE_BUILD_TYPE = Release
CMake Warning at third_party/fbgemm/CMakeLists.txt:138 (message):
  CMAKE_CXX_FLAGS_DEBUG is -g
CMake Warning at third_party/fbgemm/CMakeLists.txt:139 (message):
  CMAKE_CXX_FLAGS_RELEASE is -O3 -DNDEBUG
CMake Warning at third_party/fbgemm/CMakeLists.txt:140 (message):
  ==========
** AsmJit Summary **
   ASMJIT_DIR=/home/buran/pytorch/third_party/fbgemm/third_party/asmjit
   ASMJIT_TEST=FALSE
   ASMJIT_TARGET_TYPE=STATIC
   ASMJIT_DEPS=pthread;rt
   ASMJIT_LIBS=asmjit;pthread;rt
   ASMJIT_CFLAGS=-DASMJIT_STATIC
   ASMJIT_PRIVATE_CFLAGS=-Wall;-Wextra;-fno-math-errno;-fno-threadsafe-statics;-fno-semantic-interposition;-DASMJIT_STATIC
   ASMJIT_PRIVATE_CFLAGS_DBG=
   ASMJIT_PRIVATE_CFLAGS_REL=-O2;-fmerge-all-constants
-- Found Numa (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnuma.so)
-- Using third party subdirectory Eigen.
-- Found PythonInterp: /usr/bin/python3 (found suitable version "3.8.5", minimum required is "3.0")
-- Using third_party/pybind11.
-- pybind11 include dirs: /home/buran/pytorch/cmake/../third_party/pybind11/include
-- Adding OpenMP CXX_FLAGS: -fopenmp
-- No OpenMP library needs to be linked against
HIP VERSION: 4.1.21072-c3eb5ccc

***** Library versions from dpkg *****
rocm-dev VERSION: 4.1.0.40100-26
rocm-device-libs VERSION: 1.0.0.40100-26
hsakmt-roct VERSION: 20210118.1.551.40100-26
hsakmt-roct-dev VERSION: 20210118.1.551.40100-26
hsa-rocr-dev VERSION: 1.2.0.40100-26

***** Library versions from cmake find_package *****
-- ROCclr at /opt/rocm/lib/cmake/rocclr
hip VERSION: 4.1.21072
hsa-runtime64 VERSION: 1.2.40100
amd_comgr VERSION: 2.0.0
rocrand VERSION: 2.10.7
hiprand VERSION: 2.10.7
CMake Error at cmake/public/LoadHIP.cmake:131 (find_package):
  By not providing "Findrocblas.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "rocblas",
  but CMake did not find one.

  Could not find a package configuration file provided by "rocblas" with any
  of the following names:

    rocblasConfig.cmake
    rocblas-config.cmake

  Add the installation prefix of "rocblas" to CMAKE_PREFIX_PATH or set
  "rocblas_DIR" to a directory containing one of the above files.  If
  "rocblas" provides a separate development package or SDK, be sure it has
  been installed.
Call Stack (most recent call first):
  cmake/public/LoadHIP.cmake:176 (find_package_and_print_version)
  cmake/Dependencies.cmake:1187 (include)
  CMakeLists.txt:564 (include)
-- Configuring incomplete, errors occurred!
See also "/home/buran/pytorch/build/CMakeFiles/CMakeOutput.log".
See also "/home/buran/pytorch/build/CMakeFiles/CMakeError.log".
Traceback (most recent call last):
  File "setup.py", line 818, in <module>
    build_deps()
  File "setup.py", line 315, in build_deps
    build_caffe2(version=version,
  File "/home/buran/pytorch/tools/build_pytorch_libs.py", line 50, in build_caffe2
    cmake.generate(version,
  File "/home/buran/pytorch/tools/setup_helpers/cmake.py", line 329, in generate
    self.run(args, env=my_env)
  File "/home/buran/pytorch/tools/setup_helpers/cmake.py", line 140, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '-GNinja', '-DBUILD_PYTHON=True', '-DBUILD_TEST=True', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/home/buran/pytorch/torch', '-DCMAKE_PREFIX_PATH=/usr/lib/python3/dist-packages', '-DNUMPY_INCLUDE_DIR=/home/buran/.local/lib/python3.8/site-packages/numpy/core/include', '-DPYTHON_EXECUTABLE=/usr/bin/python3', '-DPYTHON_INCLUDE_DIR=/usr/include/python3.8', '-DPYTHON_LIBRARY=/usr/lib/libpython3.8.so.1.0', '-DTORCH_BUILD_VERSION=1.8.0a0+56b43f4', '-DUSE_NINJA=1', '-DUSE_NUMPY=True', '-DUSE_ROCM=1', '/home/buran/pytorch']' returned non-zero exit status 1.
```
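The CMake error above means the rocblas package config was never installed, so `find_package(rocblas)` fails. A minimal sketch of the environment setup, assuming ROCm was installed to its default `/opt/rocm` prefix (the prefix path is an assumption; rocblas's `rocblasConfig.cmake` normally ships with the ROCm library packages):

```shell
# Sketch: hint CMake at the ROCm install prefix so rocblasConfig.cmake is
# found. Install the libraries first (sudo apt install rocm-dkms rocm-libs),
# then export the prefix before re-running setup.py.
export CMAKE_PREFIX_PATH=/opt/rocm:${CMAKE_PREFIX_PATH}
echo "CMAKE_PREFIX_PATH=${CMAKE_PREFIX_PATH}"
```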

xuhuisheng commented 3 years ago

My pytorch package is built with python-3.8; you are using python-3.7, which is not compatible. And before compiling pytorch, you have to install rocm-dkms: `sudo apt install rocm-dkms rocm-libs`
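The version mismatch is visible in the wheel filename itself: the `cp38` tag means CPython 3.8 only. A small sketch of checking that tag against the running interpreter (the wheel name is from this thread; the helper functions are illustrative, not part of pip):

```python
import sys

def wheel_python_tag(wheel_name: str) -> str:
    """Extract the CPython tag (e.g. 'cp38') from a wheel filename.

    Wheel names follow {dist}-{version}-{python tag}-{abi tag}-{platform}.whl,
    so the python tag is the third field from the end.
    """
    parts = wheel_name[: -len(".whl")].split("-")
    return parts[-3]

def interpreter_tag() -> str:
    """Tag of the currently running interpreter, e.g. 'cp38' on Python 3.8."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"

wheel = "torch-1.8.0a0+56b43f4-cp38-cp38-linux_x86_64.whl"
print(wheel_python_tag(wheel))  # cp38
print("compatible:", wheel_python_tag(wheel) == interpreter_tag())
```

On Python 3.7 the comparison is False, which is why pip refuses the wheel.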

xuhuisheng commented 3 years ago

I uploaded torch, torchvision, rocblas, and rocrand to a Baidu cloud disk; please have a try.

url: https://pan.baidu.com/s/1zV5j9RPehMvKjqIFaHs0jw code: 5jw8

| OS | Python | ROCm | GPU |
| --- | --- | --- | --- |
| Ubuntu-20.04.2 | 3.8 | 4.1.1 | RX580 |
snackfart commented 3 years ago

> I uploaded torch, torchvision, rocblas, and rocrand to a Baidu cloud disk; please have a try.
>
> url: https://pan.baidu.com/s/1zV5j9RPehMvKjqIFaHs0jw code: 5jw8

Can you upload your files somewhere else? I have to download a Baidu program to get your files. E.g. https://easyupload.io/

xuhuisheng commented 3 years ago

I find I cannot access easyupload, Google Drive, or Dropbox. :cry:

snackfart commented 3 years ago

> I find I cannot access easyupload, Google Drive, or Dropbox. 😢

Or upload your files in this repo, e.g. under a folder like builds.

snackfart commented 3 years ago

> I find I cannot access easyupload, Google Drive, or Dropbox. 😢

Or upload your files in this repo, e.g. under a folder like builds.

Or, when your files are <25 MB, you can attach them to your GitHub comment.

xuhuisheng commented 3 years ago

@snackfart Try this https://github.com/xuhuisheng/rocm-gfx803

snackfart commented 3 years ago

> @snackfart Try this https://github.com/xuhuisheng/rocm-gfx803

Works, thanks.

snackfart commented 3 years ago

> @snackfart Try this https://github.com/xuhuisheng/rocm-gfx803

Works, thanks.

gDrive Mirror

xuhuisheng commented 3 years ago

Cannot open Google Drive. :sob:

And I moved the archives from git to the release page. Feels better now. https://github.com/xuhuisheng/rocm-gfx803

snackfart commented 3 years ago

> Cannot open Google Drive. 😭
>
> And I moved the archives from git to the release page. Feels better now. https://github.com/xuhuisheng/rocm-gfx803

Very nice, thanks again.

snackfart commented 3 years ago

@xuhuisheng can you explain this behavior? Maybe I have to reinstall the OS, ROCm, and pytorch to get it working correctly.

```
Python 3.8.5 (default, Jan 27 2021, 15:41:15)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch as tp
>>> tp.add(1,2)
tensor(3)
>>> exit()
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
Aborted (core dumped)
```
xuhuisheng commented 3 years ago

Setting AMD_LOG_LEVEL=6 shows the debug log, like this:

```
AMD_LOG_LEVEL=6 python3 main.py

:1:hip_code_object.cpp      :451 : 3970231024154 us: hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp      :453 : 3970231024169 us:   Devices:
:1:hip_code_object.cpp      :455 : 3970231024175 us:     amdgcn-amd-amdhsa--gfx803 - [Not Found]
:1:hip_code_object.cpp      :460 : 3970231024180 us:   Bundled Code Objects:
:1:hip_code_object.cpp      :477 : 3970231024185 us:     host-x86_64-unknown-linux - [Unsupported]
:1:hip_code_object.cpp      :474 : 3970231024195 us:     hipv4-amdgcn-amd-amdhsa--gfx803:xnack- - [code object v4 is amdgcn-amd-amdhsa--gfx803:xnack-]
/home/work/ROCm/HIP/rocclr/hip_code_object.cpp:481: "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
```
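The log compares the ISA of each visible device against the code objects bundled in the binary; `[Not Found]` next to `amdgcn-amd-amdhsa--gfx803` is the smoking gun. A small sketch of pulling those missing ISAs out of such a log (the sample lines are copied from above; the regex-based parsing is an illustrative assumption, not a HIP API):

```python
import re

# Sample lines in the shape of the HIP debug log above.
log = """
:1:hip_code_object.cpp      :455 : 3970231024175 us:     amdgcn-amd-amdhsa--gfx803 - [Not Found]
:1:hip_code_object.cpp      :477 : 3970231024185 us:     host-x86_64-unknown-linux - [Unsupported]
:1:hip_code_object.cpp      :474 : 3970231024195 us:     hipv4-amdgcn-amd-amdhsa--gfx803:xnack- - [code object v4 is amdgcn-amd-amdhsa--gfx803:xnack-]
"""

def missing_isas(text):
    """Return the ISA triples the runtime reported as [Not Found]."""
    return re.findall(r"(\S+)\s+-\s+\[Not Found\]", text)

print(missing_isas(log))  # ['amdgcn-amd-amdhsa--gfx803']
```

If this list is non-empty, the installed torch build simply contains no kernels for your GPU, which is exactly why a gfx803 rebuild is needed.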

Make sure rocrand and pytorch have been overwritten, like this:

```
sudo apt install rocm-dkms rocm-libs
sudo dpkg -i rocblas_2.36.0-93c82939_amd64.deb
sudo dpkg -i rocrand_2.10.7-c73b16d_amd64.deb
pip3 install torch-1.8.0a0+56b43f4-cp38-cp38-linux_x86_64.whl
pip3 install torchvision-0.9.0a0+8fb5838-cp38-cp38-linux_x86_64.whl
```
xuhuisheng commented 3 years ago

Haha~ I tested pytorch-1.7.0 on ROCm-3.5.1 and gfx803. MNIST runs properly. https://github.com/xuhuisheng/rocm-gfx803

snackfart commented 3 years ago

> Setting AMD_LOG_LEVEL=6 shows the debug log […] Make sure rocrand and pytorch have been overwritten.

After installing all your debs it works, I guess; no more code object error.

snackfart commented 3 years ago

I can't test it today, but I think it should work now. The only error I get in my bigger project is this:

```
2021-04-15 10:08:11.231034: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-04-15 10:08:11.231051: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
```

But I guess this is normal.

How can I specify an AMD GPU in torch? Or will ROCm replace "cpu" with the corresponding AMD device?

```
self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```
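That one-liner is already the portable pattern: ROCm builds of torch expose the AMD GPU through the same `torch.cuda` API, so `"cuda:0"` selects it. A guarded sketch (the ImportError fallback is my addition for machines without torch installed; it is not part of the thread's setup):

```python
def pick_device():
    """Prefer the GPU when torch can see one.

    On a ROCm build of pytorch, torch.cuda.is_available() reports the AMD
    GPU and "cuda:0" addresses it, so no ROCm-specific syntax is needed.
    """
    try:
        import torch
        return torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    except ImportError:
        # Illustrative fallback only: torch is not installed here.
        return "cpu"

print(pick_device())
```

Tensors and models are then moved with `x.to(pick_device())`, unchanged from CUDA code.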
xuhuisheng commented 3 years ago

Make sure to install tensorflow-rocm, not tensorflow. It looks like your tensorflow still tries to use CUDA.

Yes, feel free to use cuda in place of cpu in pytorch. Here is my test script for pytorch: https://github.com/xuhuisheng/rocm-build/blob/master/check/test-pytorch-device.py

And I uploaded pytorch-1.7.0 to https://github.com/xuhuisheng/rocm-gfx803; if you are interested, please have a try.

snackfart commented 3 years ago

> Make sure to install tensorflow-rocm, not tensorflow. It looks like your tensorflow still tries to use CUDA.

Will do.

> Yes, feel free to use cuda in place of cpu in pytorch.

What is the syntax to specify an AMD GPU in torch.device(), rather than a CUDA GPU or a CPU?

> Here is my test script for pytorch: https://github.com/xuhuisheng/rocm-build/blob/master/check/test-pytorch-device.py

Will do, but tomorrow I guess.

> And I uploaded pytorch-1.7.0 to https://github.com/xuhuisheng/rocm-gfx803; if you are interested, please have a try.

you are a machine, my dude

xuhuisheng commented 3 years ago

Emm~, I mean just use device = torch.device("cuda"). ROCm aims to be a drop-in replacement for CUDA, so application code shouldn't need to change. I guess most code can run directly. https://github.com/xuhuisheng/rocm-build/blob/master/check/test-pytorch-fc.py#L22

snackfart commented 3 years ago

Okay, thanks. This was a big mystery for me, but the drop-in replacement makes sense.