Closed preet closed 1 year ago
Please don't install amdgpu-dkms right now; just install rocm-dev and rocm-libs with the upstream amdgpu driver.
There are new bugs involving the ROCm 5 driver and roct-thunk-interface that make gfx803 report no device. Older drivers, such as the upstream driver, work fine.
I will update the docs when I reach a conclusion. Until then, we can use the upstream driver.
Thanks for the reply. I got a bit further: after removing everything with amdgpu-uninstall, I installed specific packages without DKMS:
amdgpu-install --usecase=rocm,hip,rocmdevtools --no-dkms
Now rocminfo and clinfo print reasonable results.
Then:
me@astra:~/Dev/scratch/install_amdgpu/xuhuisheng_patched$ python3.8
Python 3.8.15 (default, Oct 18 2022, 20:33:33)
[GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/me/Dev/env/lib/python3.8/site-packages/torch/__init__.py", line 199, in <module>
from torch._C import * # noqa: F403
ImportError: libroctx64.so.1: cannot open shared object file: No such file or directory
>>>
me@astra:~/Dev/env/lib$ pip3.8 list | grep torch
torch 1.11.0a0+git503a092
torchvision 0.12.0a0+266279
me@astra:~/Dev/env/lib$ echo $LD_LIBRARY_PATH
/opt/rocm-5.3.0/lib
me@astra:~/Dev/env/lib$ ls /opt/rocm/lib | grep libroctx
libroctx64.so
libroctx64.so.4
libroctx64.so.4.1.0
So that version of pytorch was built against a different rocm? Is there a workaround for this?
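One way to answer that kind of question is to check which versions of the library are actually on the search path. Below is a minimal Python sketch, assuming the loader only looks in the directories listed in LD_LIBRARY_PATH (in reality it also consults the ld.so cache and the default system directories, which this ignores). The helper name `find_soname` is made up for illustration.

```python
import os
import re


def find_soname(soname, search_path):
    """Look for `soname` in a colon-separated list of directories.

    Returns (exact, siblings): full paths whose file name matches the
    requested name exactly, and full paths of the same library under a
    different version suffix.
    """
    # "libroctx64.so.1" -> "libroctx64.so"
    stem = re.sub(r"(\.so)(\.\d+)*$", r"\1", soname)
    exact, siblings = [], []
    for directory in filter(None, search_path.split(":")):
        if not os.path.isdir(directory):
            continue
        for name in sorted(os.listdir(directory)):
            full = os.path.join(directory, name)
            if name == soname:
                exact.append(full)
            elif name == stem or name.startswith(stem + "."):
                siblings.append(full)
    return exact, siblings


if __name__ == "__main__":
    exact, siblings = find_soname(
        "libroctx64.so.1", os.environ.get("LD_LIBRARY_PATH", "")
    )
    print("exact matches:", exact)
    print("other versions:", siblings)
```

An empty "exact matches" list together with a non-empty "other versions" list would suggest the wheel was linked against a different release of the library than the one installed.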
The libroctx64.so and libroctracer64.so libraries were renamed (their version suffixes changed). We can create symbolic links for them:
sudo ln -s /opt/rocm-5.3.0/lib/libroctx64.so.4.1.0 /opt/rocm-5.3.0/lib/libroctx64.so.1
sudo ln -s /opt/rocm-5.3.0/lib/libroctracer64.so.4.1.0 /opt/rocm-5.3.0/lib/libroctracer64.so.1
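If you'd rather script this and avoid clobbering anything that already exists, here is a rough Python sketch of the same workaround. It is not part of the patches in this repo; the function name `link_missing_soname` and its return strings are hypothetical, and it creates a relative link rather than the absolute one used in the commands above. It defaults to a dry run. Keep in mind that a different version suffix normally signals an ABI change, so this is a workaround rather than a guarantee that the library behaves as the wheel expects.

```python
import os


def link_missing_soname(lib_dir, real_name, wanted_name, dry_run=True):
    """Create lib_dir/wanted_name -> real_name if the wanted name is missing.

    Returns "exists", "missing-target", "would-link" or "linked".
    """
    target = os.path.join(lib_dir, real_name)
    link = os.path.join(lib_dir, wanted_name)
    if os.path.lexists(link):
        # Never overwrite an existing file or link.
        return "exists"
    if not os.path.exists(target):
        return "missing-target"
    if dry_run:
        return "would-link"
    # Relative link: it points at a sibling file in the same directory.
    os.symlink(real_name, link)
    return "linked"


if __name__ == "__main__":
    # Directory and versions taken from the commands above; adjust as needed.
    lib_dir = "/opt/rocm-5.3.0/lib"
    for real, wanted in [
        ("libroctx64.so.4.1.0", "libroctx64.so.1"),
        ("libroctracer64.so.4.1.0", "libroctracer64.so.1"),
    ]:
        print(wanted, "->", real, ":", link_missing_soname(lib_dir, real, wanted))
```

Writing into /opt/rocm-5.3.0/lib needs root, so a real run (with `dry_run=False`) would have to go through sudo.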
I had the same problem and these commands fixed it, but now I get another error when I try to import torch. Any suggestions on how to fix this? Thanks.
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/mx/.local/lib/python3.8/site-packages/torch/__init__.py", line 199, in <module>
from torch._C import * # noqa: F403
ImportError: libMIOpen.so.1: cannot open shared object file: No such file or directory
OK, I was able to fix the libMIOpen.so.1 error and several other missing .so file errors by installing some packages. Now I can import torch, and torch.cuda.is_available() returns True.
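Chasing missing .so files one import at a time is tedious, so here is a small Python sketch that pulls the library name out of a loader error like the ones in this thread. It assumes the message has the "<name>: cannot open shared object file: No such file or directory" shape shown above; `missing_library` is a made-up helper name.

```python
import re

# Matches the dynamic loader message seen in the tracebacks in this thread.
_MISSING_LIB = re.compile(
    r"(?P<lib>\S+?): cannot open shared object file: No such file or directory"
)


def missing_library(error_text):
    """Return the shared-library name from a loader error message, or None."""
    match = _MISSING_LIB.search(error_text)
    return match.group("lib") if match else None


if __name__ == "__main__":
    try:
        import torch  # noqa: F401
    except ImportError as exc:
        print("missing library:", missing_library(str(exc)))
    else:
        print("torch imported without loader errors")
```

Each run only reports the first library the loader trips over, so it still has to be repeated after every fix.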
But now I have another issue: tf.config.list_physical_devices('GPU') doesn't return any GPU device, even though rocminfo finds my RX580. Any suggestions? Thanks.
Agent 2
Name: gfx803
Uuid: GPU-XX
Marketing Name: Radeon RX 580 Series
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26591(0x67df)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1430
BDFID: 256
Internal Node ID: 1
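When one framework sees the GPU and another doesn't, it can help to first pin down what the ROCm runtime itself reports. Here is a minimal Python sketch that extracts the GPU agents from rocminfo output like the listing above. It assumes the "Agent N" header and "Key: Value" layout shown here, and the function names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_agents(rocminfo_text):
    """Split rocminfo-style text into one dict of fields per agent."""
    agents = []
    current = None
    for raw in rocminfo_text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"Agent\s+(\d+)", line)
        if header:
            current = {"Agent": header.group(1)}
            agents.append(current)
            continue
        if current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        # Keep the first occurrence; later nested sections may reuse a key.
        current.setdefault(key.strip(), value.strip())
    return agents


def gpu_agents(rocminfo_text):
    """Return only the agents whose Device Type is GPU."""
    return [a for a in parse_agents(rocminfo_text) if a.get("Device Type") == "GPU"]


if __name__ == "__main__":
    if shutil.which("rocminfo") is None:
        print("rocminfo not found on PATH")
    else:
        out = subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
        for agent in gpu_agents(out):
            print(agent.get("Name"), "|", agent.get("Marketing Name"))
```

If this lists gfx803 but a framework still reports no GPU, the problem is more likely in that framework's build or its libraries than in the driver.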
Finally, I am able to import both tensorflow and torch, and both detect my RX580 as a GPU. I ran a benchmark to test TF, and it seems a bit slower than what I got with ROCm 3.5.1. I'm not sure whether other issues will come up when running TF or torch code, but I'm sharing my setup steps here in the hope that they help someone.
I reimaged my OS with Ubuntu 20.04 LTS and followed these steps for the setup:
sudo apt-get update
mkdir rocm5.3
cd rocm5.3/
wget https://repo.radeon.com/amdgpu-install/5.3/ubuntu/focal/amdgpu-install_5.3.50300-1_all.deb
sudo apt-get install ./amdgpu-install_5.3.50300-1_all.deb
amdgpu-install --usecase=rocm,hip,rocmdevtools,opencl,hiplibsdk,mllib,mlsdk --no-dkms
sudo usermod -a -G video $LOGNAME
sudo usermod -a -G render $LOGNAME
sudo reboot

sudo apt install python3-pip
cd rocm5.3
wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm530/hsa-rocr_1.7.0.50300-63.20.04_amd64.deb
wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm530/rocblas_2.45.0.50300-63.20.04_amd64.deb
wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm500/torch-1.11.0a0+git503a092-cp38-cp38-linux_x86_64.whl
wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm500/torchvision-0.12.0a0+2662797-cp38-cp38-linux_x86_64.whl
wget https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm500/tensorflow_rocm-2.8.0-cp38-cp38-linux_x86_64.whl

sudo dpkg -i hsa-rocr_1.7.0.50300-63.20.04_amd64.deb
sudo dpkg -i rocblas_2.45.0.50300-63.20.04_amd64.deb
pip3 install torch-1.11.0a0+git503a092-cp38-cp38-linux_x86_64.whl
pip3 install torchvision-0.12.0a0+2662797-cp38-cp38-linux_x86_64.whl
pip3 install tensorflow_rocm-2.8.0-cp38-cp38-linux_x86_64.whl

sudo apt install miopen-hip miopengemm libopenblas-dev hipfft rocrand hipsparse rocfft libopenmpi3

pip3 uninstall protobuf
pip3 install protobuf==3.19.0

sudo ln -s /opt/rocm-5.3.0/lib/libroctx64.so.4.1.0 /opt/rocm-5.3.0/lib/libroctx64.so.1
sudo ln -s /opt/rocm-5.3.0/lib/libroctracer64.so.4.1.0 /opt/rocm-5.3.0/lib/libroctracer64.so.1

export LD_LIBRARY_PATH=/opt/rocm-5.3.0/lib/
Actually, I added LD_LIBRARY_PATH=/opt/rocm-5.3.0/lib/ to my /etc/environment file so that I don't need to run the export every time I reboot my system.
Thanks to @xuhuisheng for providing the patches!
Just to follow up, I think I was able to get pytorch working as well on Ubuntu 22.04. I followed basically all of the steps that @tmpuserx summarized in their post, with the only difference being that I built and installed python3.8 manually alongside the default 3.10 that ships with Ubuntu 22.04.
I ran the introductory pytorch mnist example which seemed to run fine. I used nvtop to verify the GPU was being used.
I had this problem too.
@redthing1 The latest ROCm-5.4.1 should solve this issue, please have a try.
@xuhuisheng Thank you so much for your work here. It is absolutely invaluable and I am so grateful.
Now I'm able to run pytorch:
❯ python3 -c "import torch; print(torch.cuda.is_available())"
True
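Worth noting that is_available() only tells you the device is visible. Below is a rough sketch of a slightly stronger check, assuming a ROCm build of PyTorch (which exposes the GPU through the usual torch.cuda API): it runs a small matrix multiply on the GPU and compares the result with the CPU. The function name, matrix size and tolerance are arbitrary choices for illustration.

```python
import importlib.util


def torch_gpu_smoke_test():
    """Return None if torch is not installed, False if no GPU is usable,
    otherwise whether a small GPU matmul matches the CPU result."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return False
    a = torch.rand(64, 64, device="cuda")
    b = torch.rand(64, 64, device="cuda")
    on_gpu = (a @ b).cpu()          # computed on the GPU, copied back
    on_cpu = a.cpu() @ b.cpu()      # same inputs, computed on the CPU
    return bool(torch.allclose(on_gpu, on_cpu, atol=1e-3))


if __name__ == "__main__":
    print("GPU smoke test:", torch_gpu_smoke_test())
```

A passing result here is still only a sanity check; it says nothing about larger models.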
@redthing1
Don't worry.
My goal is just to let us run small samples like MNIST on gfx803. More complex workloads like Stable Diffusion always break on gfx803.
Sorry if this is the wrong location for this post: I am trying to install pytorch, but it seems that rocm is not successfully installed (I followed the steps in this repo's README, but on a newer Ubuntu version). It's basically a fresh install of Ubuntu 22.04.1 LTS, kernel v. 5.15.0-50. The amdgpu-install script installs without errors, but rocminfo and clinfo do not seem to show the right output.
I understand that the gfx800s series is not supported by newer rocm releases, so I'm not sure what kind of output I should be getting from rocminfo or clinfo, or whether this is expected and it's otherwise functional. Has anyone else tried? Is the output from those utils broken on Ubuntu 20.x as well?