xuhuisheng / rocm-gfx803


Error 101: hipErrorInvalidDevice (Triggered internally at ../c10/hip/HIPFunctions.cpp:113.) #9

Closed: andreaalf closed this issue 2 years ago

andreaalf commented 2 years ago

I have Ubuntu 20.04.3, kernel 5.11.0-27-generic, Python 3.8.10, GPU: Radeon FirePro S9300 x2 (equivalent to two Radeon R9 Nano).

Hi, I can now import PyTorch successfully, but when I run torch.cuda.is_available() I get this error:

>>> torch.cuda.is_available()
/home/fiss/.local/lib/python3.8/site-packages/torch/cuda/__init__.py:82: UserWarning: HIP initialization: Unexpected error from hipGetDeviceCount(). Did you run some cuda functions before calling NumHipDevices() that might have already set an error? Error 101: hipErrorInvalidDevice (Triggered internally at ../c10/hip/HIPFunctions.cpp:113.)
  return torch._C._cuda_getDeviceCount() > 0
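For reference, a quick sanity check (a sketch, assuming a ROCm build of PyTorch is installed) is to print the HIP version the wheel was built against and how many devices the runtime can see:

$ python3 -c "import torch; print(torch.version.hip, torch.cuda.device_count())"
# torch.version.hip is None on a CPU/CUDA-only build; device_count() is 0 when HIP cannot enumerate the GPU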

Do you have any idea? Thanks a lot for your support!

ligun commented 1 year ago

I have the same issue on my environment.

Additionally, I had to install libopenmpi-dev, miopen-hip, libopenblas-dev, rocm-libs and rocm-dev before I could import PyTorch.
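For anyone hitting the same import error, those packages can be installed in one step (names exactly as listed above; they may differ slightly between ROCm releases):

$ sudo apt install libopenmpi-dev miopen-hip libopenblas-dev rocm-libs rocm-dev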

How did you resolve the issue?

xuhuisheng commented 1 year ago

@ligun It may be a PCIe atomics issue. You can run dmesg | grep kfd to check whether the card was added successfully.
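If PCIe atomics are the problem, the kfd log usually says so; a slightly broader grep (a sketch, the exact message wording varies between kernel versions) is:

$ sudo dmesg | grep -iE "kfd|atomic"
# a line such as "kfd: skipped device ..., PCI rejects atomics" would mean the slot/CPU lacks PCIe atomics support, which gfx803 requires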

ligun commented 1 year ago

@xuhuisheng Thank you for the reply. I checked the log. It seems to be added successfully.

$ dmesg|grep kfd
[    3.098492] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    3.098647] kfd kfd: amdgpu: added device 1002:67df
xuhuisheng commented 1 year ago

You can run rocminfo to check whether ROCm is working properly.

Right now I have a problem where, if I install the latest ROCm 5.2.3 DKMS driver, rocminfo won't run on gfx803. So I uninstalled the DKMS driver and used the upstream kernel's built-in amdgpu driver; after that ROCm ran properly.
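To check whether the DKMS driver is currently installed (a sketch; package names are those used by the ROCm 5.x Ubuntu repositories):

$ dkms status
$ dpkg -l | grep -E "amdgpu-dkms|rock-dkms"
# if neither package shows up, the GPU is being driven by the kernel's built-in amdgpu module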

ligun commented 1 year ago

rocminfo always returned an error.

$ rocminfo 
ROCk module is loaded
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1140
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
xuhuisheng commented 1 year ago

Looks the same to me. My suggestion is to uninstall amdgpu-dkms and amdgpu-dkms-firmware and reboot. This falls back to the upstream Linux kernel's built-in amdgpu driver. Try rocminfo again; it may pass.
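The steps would look roughly like this (a sketch, assuming the packages came from the AMD apt repository):

$ sudo apt purge amdgpu-dkms amdgpu-dkms-firmware
$ sudo reboot
# after rebooting, confirm the in-kernel driver is loaded and ROCm can see the card
$ lsmod | grep amdgpu
$ rocminfo | grep -i gfx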

ligun commented 1 year ago

@xuhuisheng Thank you very much for all your advice. Finally rocminfo ran successfully and torch.cuda.is_available() returned True! I can use ROCm.