xuhuisheng / rocm-gfx803


ROCm 5.3.0 on Ubuntu 22.04.1 LTS with RX580 #19

Closed preet closed 1 year ago

preet commented 1 year ago

Sorry if this is the wrong place for this post. I am trying to install PyTorch, but ROCm does not appear to have installed correctly (I followed the steps in this repo's README, but on the newer Ubuntu version). This is basically a fresh install of Ubuntu 22.04.1 LTS, kernel 5.15.0-50. The amdgpu-install script runs without errors, but rocminfo and clinfo do not show the expected output.

me@astra:~$ sudo /opt/rocm-5.3.0/bin/rocminfo
ROCk module is loaded
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1148
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

me@astra:~$ /opt/rocm-5.3.0/opencl/bin/clinfo
Number of platforms:                             1
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 2.1 AMD-APP (3486.0)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback 

  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               0

I understand that the gfx800 series is not supported by newer ROCm releases, so I'm not sure what output I should expect from rocminfo or clinfo, or whether this output is expected and the install is otherwise functional. Has anyone else tried this? Is the output from those utilities broken on Ubuntu 20.x as well?
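A minimal Python sketch for spotting the zero-device symptom, assuming clinfo prints `Number of devices:` lines in the format quoted above. The helper name `opencl_device_count` is hypothetical and not part of ROCm.

```python
import re


def opencl_device_count(clinfo_text: str) -> int:
    """Sum every 'Number of devices: N' line found in clinfo output."""
    counts = re.findall(
        r"^\s*Number of devices:\s*(\d+)\s*$", clinfo_text, flags=re.MULTILINE
    )
    return sum(int(count) for count in counts)


# Excerpt of the clinfo output quoted in this post.
sample = """Number of platforms:                             1
  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               0
"""

print(opencl_device_count(sample))  # 0: the platform loads but sees no GPU
```

A result of 0 with one platform present matches the situation described here: the OpenCL runtime is installed, but the driver stack does not expose the card.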

xuhuisheng commented 1 year ago

Please don't install amdgpu-dkms for now; just install rocm-dev and rocm-libs and use the upstream amdgpu driver.

There are new bugs involving the ROCm 5 driver and roct-thunk-interface that make gfx803 report no device. An older driver, such as the upstream driver, works fine.

I will update the docs once I reach a conclusion. Until then, we can use the upstream driver.

preet commented 1 year ago

Thanks for the reply. I got a bit further. After removing everything with amdgpu-uninstall, I installed specific packages without DKMS:

amdgpu-install --usecase=rocm,hip,rocmdevtools --no-dkms

Now rocminfo and clinfo print reasonable results.

Then:

me@astra:~/Dev/env/lib$ pip3.8 list | grep torch
torch 1.11.0a0+git503a092
torchvision 0.12.0a0+266279

me@astra:~/Dev/env/lib$ echo $LD_LIBRARY_PATH
/opt/rocm-5.3.0/lib

me@astra:~/Dev/env/lib$ ls /opt/rocm/lib | grep libroctx
libroctx64.so
libroctx64.so.4
libroctx64.so.4.1.0


So was that version of PyTorch built against a different ROCm? Is there a workaround for this?

xuhuisheng commented 1 year ago

The versioned file names of libroctx64.so and libroctracer64.so have changed. We can create symbolic links under the old names:

sudo ln -s /opt/rocm-5.3.0/lib/libroctx64.so.4.1.0 /opt/rocm-5.3.0/lib/libroctx64.so.1
sudo ln -s /opt/rocm-5.3.0/lib/libroctracer64.so.4.1.0 /opt/rocm-5.3.0/lib/libroctracer64.so.1
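The same idea can be sketched in Python so that it is safe to run more than once. This is an illustration only: the helper `link_old_soname` is a hypothetical name, it creates a relative link (the `ln -s` commands above use absolute paths), and the demonstration runs in a temporary directory rather than in /opt/rocm-5.3.0/lib, where root privileges would be needed.

```python
import os
import tempfile


def link_old_soname(lib_dir: str, real_name: str, old_name: str) -> bool:
    """Create lib_dir/old_name -> real_name unless old_name already exists.

    Returns True if a link was created, False if old_name was already there.
    """
    target = os.path.join(lib_dir, real_name)
    link = os.path.join(lib_dir, old_name)
    if not os.path.exists(target):
        raise FileNotFoundError(target)
    if os.path.lexists(link):
        return False  # leave an existing file or link untouched
    os.symlink(real_name, link)  # relative link inside the same directory
    return True


# Demonstration in a throwaway directory with an empty stand-in file.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "libroctx64.so.4.1.0"), "w").close()
    created = link_old_soname(demo_dir, "libroctx64.so.4.1.0", "libroctx64.so.1")
    print(created, os.path.islink(os.path.join(demo_dir, "libroctx64.so.1")))
```

Note that a link like this only satisfies the loader's file lookup; it does not guarantee that the newer library is binary compatible with what the wheel was built against.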

tmpuserx commented 1 year ago

I had the same problem, and those commands fixed it. But now I get another error when I try to import torch. Any suggestions on how to fix this? Thanks.

import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mx/.local/lib/python3.8/site-packages/torch/__init__.py", line 199, in <module>
    from torch._C import * # noqa: F403
ImportError: libMIOpen.so.1: cannot open shared object file: No such file or directory
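Errors like this tend to appear one library at a time, so it can help to pull the file name out of the message before looking for the package that ships it. A small sketch, assuming the loader's usual "cannot open shared object file" wording; `missing_shared_object` is a hypothetical helper name.

```python
import re
from typing import Optional


def missing_shared_object(error_message: str) -> Optional[str]:
    """Return the library name from a 'cannot open shared object file' error."""
    match = re.match(
        r"\s*(\S+?\.so(?:\.\d+)*): cannot open shared object file", error_message
    )
    return match.group(1) if match else None


msg = "libMIOpen.so.1: cannot open shared object file: No such file or directory"
print(missing_shared_object(msg))  # libMIOpen.so.1
```

In practice the message would come from `str(exc)` inside an `except ImportError as exc:` block around `import torch`.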

tmpuserx commented 1 year ago

OK, I was able to fix the missing libMIOpen.so.1 error, and several other missing .so file errors, by installing some packages. Now I can import torch, and torch.cuda.is_available() returns True.

But I have another issue: tf.config.list_physical_devices('GPU') does not return any GPU device, even though rocminfo finds my RX580. Any suggestions? Thanks.


Agent 2
Name: gfx803
Uuid: GPU-XX
Marketing Name: Radeon RX 580 Series
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26591(0x67df)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1430
BDFID: 256
Internal Node ID: 1
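To confirm from a script that the runtime itself sees the card, independent of TensorFlow, the rocminfo text can be scanned for GPU agents. A rough sketch, assuming each agent lists `Name:` before `Device Type:` as in the excerpt above; `gpu_agent_names` is a hypothetical helper and not a ROCm API.

```python
from typing import List


def gpu_agent_names(rocminfo_text: str) -> List[str]:
    """Return the Name of every agent whose Device Type is GPU."""
    names = []
    current_name = None
    for line in rocminfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Name":
            current_name = value  # most recent Name seen so far
        elif key == "Device Type" and value == "GPU" and current_name is not None:
            names.append(current_name)
            current_name = None
    return names


# Shortened copy of the agent block quoted in this comment.
excerpt = """Agent 2
Name: gfx803
Marketing Name: Radeon RX 580 Series
Vendor Name: AMD
Device Type: GPU
"""

print(gpu_agent_names(excerpt))  # ['gfx803']
```

If this list contains gfx803 while a framework still reports no GPU, the gap is more likely in the framework build or its libraries than in the driver.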

tmpuserx commented 1 year ago

I can finally import both tensorflow and torch, and both detect my RX580 as a GPU. I ran a benchmark to test TF, and it seems a bit slower than the one I ran with ROCm 3.5.1. I am not sure whether other issues will occur when running TF or torch code, but I am sharing my setup steps here in the hope that they help someone.

I reimaged my OS with Ubuntu 20.04 LTS and followed these steps for the setup:

sudo apt-get update
mkdir rocm5.3
cd rocm5.3/
wget https://repo.radeon.com/amdgpu-install/5.3/ubuntu/focal/amdgpu-install_5.3.50300-1_all.deb
sudo apt-get install ./amdgpu-install_5.3.50300-1_all.deb
amdgpu-install --usecase=rocm,hip,rocmdevtools,opencl,hiplibsdk,mllib,mlsdk --no-dkms
sudo usermod -a -G video $LOGNAME
sudo usermod -a -G render $LOGNAME
sudo reboot

sudo apt install python3-pip
cd rocm5.3
wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm530/hsa-rocr_1.7.0.50300-63.20.04_amd64.deb
wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm530/rocblas_2.45.0.50300-63.20.04_amd64.deb
wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm500/torch-1.11.0a0+git503a092-cp38-cp38-linux_x86_64.whl
wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm500/torchvision-0.12.0a0+2662797-cp38-cp38-linux_x86_64.whl
wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm500/tensorflow_rocm-2.8.0-cp38-cp38-linux_x86_64.whl

sudo dpkg -i hsa-rocr_1.7.0.50300-63.20.04_amd64.deb
sudo dpkg -i rocblas_2.45.0.50300-63.20.04_amd64.deb
pip3 install torch-1.11.0a0+git503a092-cp38-cp38-linux_x86_64.whl
pip3 install torchvision-0.12.0a0+2662797-cp38-cp38-linux_x86_64.whl
pip3 install tensorflow_rocm-2.8.0-cp38-cp38-linux_x86_64.whl

sudo apt install miopen-hip miopengemm libopenblas-dev hipfft rocrand hipsparse rocfft libopenmpi3

pip3 uninstall protobuf
pip3 install protobuf==3.19.0

sudo ln -s /opt/rocm-5.3.0/lib/libroctx64.so.4.1.0 /opt/rocm-5.3.0/lib/libroctx64.so.1
sudo ln -s /opt/rocm-5.3.0/lib/libroctracer64.so.4.1.0 /opt/rocm-5.3.0/lib/libroctracer64.so.1

export LD_LIBRARY_PATH=/opt/rocm-5.3.0/lib/

Actually, I added LD_LIBRARY_PATH=/opt/rocm-5.3.0/lib/ to my /etc/environment file so that I don't need to run the export after every reboot.
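A quick way to check that the setting actually reached the current session is to look for the directory in the variable. A minimal sketch, assuming a Linux-style colon-separated LD_LIBRARY_PATH; `has_library_dir` is a hypothetical helper name.

```python
import os


def has_library_dir(ld_library_path: str, wanted: str) -> bool:
    """Check whether `wanted` is one of the entries in an LD_LIBRARY_PATH value."""
    wanted_norm = os.path.normpath(wanted)  # drops a trailing slash
    entries = [entry for entry in ld_library_path.split(os.pathsep) if entry]
    return any(os.path.normpath(entry) == wanted_norm for entry in entries)


current_value = os.environ.get("LD_LIBRARY_PATH", "")
print(has_library_dir(current_value, "/opt/rocm-5.3.0/lib"))
```

Paths are normalized before comparison, so the trailing slash used in the export above does not matter.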

Thanks to @xuhuisheng for providing the patches!

preet commented 1 year ago

Just to follow up, I think I was able to get PyTorch working on Ubuntu 22.04 as well. I followed basically all of the steps that @tmpuserx summarized in their post; the only difference is that I built and installed Python 3.8 manually alongside the default 3.10 that ships with Ubuntu 22.04.

I ran the introductory pytorch mnist example which seemed to run fine. I used nvtop to verify the GPU was being used.

redthing1 commented 1 year ago

I had this problem too.

xuhuisheng commented 1 year ago

@redthing1 The latest ROCm-5.4.1 should solve this issue, please have a try.

redthing1 commented 1 year ago

> @redthing1 The latest ROCm-5.4.1 should solve this issue, please have a try.

@xuhuisheng Thank you so much for your work here. It is absolutely invaluable and I am so grateful.

Now I'm able to run pytorch:

❯ python3 -c "import torch; print(torch.cuda.is_available())"
True
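`torch.cuda.is_available()` only shows that a device is visible. A slightly stronger check is to run a small operation on the GPU. The sketch below assumes a ROCm build of PyTorch, which exposes the GPU through the `torch.cuda` API; `torch_gpu_report` is a hypothetical helper name, and it degrades to a message when torch cannot be imported.

```python
def torch_gpu_report() -> str:
    """Describe whether torch imports, sees a GPU, and can run a tiny kernel."""
    try:
        import torch
    except ImportError as exc:  # also raised when a shared library is missing
        return f"torch import failed: {exc}"
    if not torch.cuda.is_available():
        return "torch imported, but no GPU is visible"
    name = torch.cuda.get_device_name(0)
    ones = torch.ones(1024, device="cuda")  # allocate on the GPU
    total = float(ones.sum().item())  # forces the kernel to run and sync
    return f"{name}: sum of 1024 ones = {total}"


print(torch_gpu_report())
```

Passing this check still says nothing about larger workloads; it only confirms that a trivial kernel launches and returns a result.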

xuhuisheng commented 1 year ago

@redthing1

Don't worry.

My goal is only to let us run small samples like MNIST on gfx803. More complex workloads, such as Stable Diffusion, always break on gfx803.