xuhuisheng / rocm-gfx803


HSA_STATUS_ERROR_OUT_OF_RESOURCES In rocminfo and no devices in clinfo #8

Open tejasraman opened 2 years ago

tejasraman commented 2 years ago

I get the HSA_STATUS_ERROR_OUT_OF_RESOURCES error when I run rocminfo (ROCm 4.5.2) on my computer. A previous install on another drive worked (ROCm 4.5.0) on the same kernel (5.11). When I run clinfo, 0 devices show up under “AMD Accelerated Parallel Processing”.

I already have libopenblas and libopenmpi installed, and my PCIe slot supports atomics (no kfd errors). I have the patched rocBLAS and your torch and torchvision builds. Torch says that there are no CUDA devices (torch treats HIP as a CUDA device).

OS: Ubuntu 20.04; Kernel: 5.11.0-44-generic; ROCm version: 4.5.2 (I originally said 5.2, which does not exist, sorry)
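For anyone trying to reproduce the report above, here is a minimal sketch of a check that runs `rocminfo` and `clinfo` (if they are on `PATH`) and flags the error string from the title. The helper names (`run_tool`, `has_hsa_out_of_resources`, `summarize`) are hypothetical and not part of ROCm; the script only looks for the literal error text and does not interpret the rest of the output.

```python
import shutil
import subprocess

# Error string quoted in the issue title.
HSA_ERROR = "HSA_STATUS_ERROR_OUT_OF_RESOURCES"


def run_tool(name, timeout=30):
    """Run a diagnostic tool with no arguments.

    Returns (found, output), where output is stdout and stderr combined.
    """
    path = shutil.which(name)
    if path is None:
        return False, ""
    try:
        proc = subprocess.run(
            [path],
            capture_output=True,
            text=True,
            errors="replace",  # tolerate non-UTF-8 bytes in tool output
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError) as exc:
        return True, f"failed to run {name}: {exc}"
    return True, proc.stdout + proc.stderr


def has_hsa_out_of_resources(output):
    """True if the output contains the error named in this issue."""
    return HSA_ERROR in output


def summarize():
    """One status line per tool; assumes only that the tools print text."""
    lines = []
    for tool in ("rocminfo", "clinfo"):
        found, output = run_tool(tool)
        if not found:
            lines.append(f"{tool}: not found on PATH")
        elif has_hsa_out_of_resources(output):
            lines.append(f"{tool}: reported {HSA_ERROR}")
        else:
            lines.append(f"{tool}: ran, {len(output.splitlines())} lines of output")
    return lines


if __name__ == "__main__":
    print("\n".join(summarize()))
```

Pasting the summary lines together with the full `rocminfo` output usually makes a report like this easier to compare across ROCm versions.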

xuhuisheng commented 2 years ago

Do you mean ROCm-5.0.2? ROCm-5.2 and even ROCm-5.1 haven't been released yet.

tejasraman commented 2 years ago

I tripped up; I meant 4.5.2. (In reply to Xu Huisheng's comment of Mar 19, 2022, 2:38 PM -0600.)

tejasraman commented 2 years ago

I did try 5.0.2, but apparently torch has no support for it yet; I got an amdhip error. I installed 4.5.2 and am still having issues with clinfo and rocminfo (the HSA error).
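A quick way to see what the installed torch build expects is to ask torch itself. The sketch below is an assumption-laden helper (`describe_torch_gpu` is a hypothetical name): it relies on `torch.version.hip` being set on ROCm builds and on torch exposing HIP devices through the `torch.cuda` API, as mentioned earlier in the thread. An import failure, such as a missing HIP library, is reported instead of raised.

```python
def describe_torch_gpu():
    """Collect basic facts about the installed torch build.

    Returns a dict; if torch cannot be imported, the import error
    message is included so it can be pasted into a report.
    """
    try:
        import torch
    except (ImportError, OSError) as exc:
        return {"torch_installed": False, "import_error": str(exc)}

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None on builds that were not compiled against ROCm/HIP.
        "hip_version": getattr(torch.version, "hip", None),
        # On ROCm builds, HIP devices are reported through torch.cuda.
        "gpu_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_gpu().items():
        print(f"{key}: {value}")
```

If `hip_version` names a different ROCm release than the one installed, that mismatch is a plausible source of an import-time HIP error, though this sketch cannot confirm it.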

xuhuisheng commented 2 years ago

You can try installing rocm-4.5.0's kernel together with 5.0.2's rocm-dev and rocm-libs.
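When mixing releases like this, it helps to confirm which versions actually ended up installed. Below is a rough sketch for Ubuntu that queries `dpkg-query` for the package names mentioned in this thread (`rocm-dev`, `rocm-libs`, `amdgpu-dkms`). The function names are hypothetical, and the package list is an assumption: the kernel-driver package may have a different name depending on the ROCm release.

```python
import shutil
import subprocess

# Package names taken from this thread; adjust for your ROCm release.
PACKAGES = ("rocm-dev", "rocm-libs", "amdgpu-dkms")


def parse_dpkg_output(text):
    """Parse 'name<TAB>version' lines into a dict, skipping empty versions."""
    versions = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 2 and parts[1]:
            versions[parts[0]] = parts[1]
    return versions


def installed_versions(packages=PACKAGES):
    """Return {package: version} for the packages dpkg knows as installed.

    Returns an empty dict when dpkg-query is unavailable (non-Debian system).
    """
    if shutil.which("dpkg-query") is None:
        return {}
    proc = subprocess.run(
        ["dpkg-query", "-W", "-f=${Package}\t${Version}\n", *packages],
        capture_output=True,
        text=True,
    )
    # Unknown packages only produce a message on stderr, which is ignored here.
    return parse_dpkg_output(proc.stdout)


if __name__ == "__main__":
    found = installed_versions()
    for name in PACKAGES:
        print(f"{name}: {found.get(name, 'not installed')}")
```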

tejasraman commented 2 years ago

So do you mean the 4.5.0 ROCm-core? Does this mean having 2 repos installed? Both require HWE 5.11. (In reply to Xu Huisheng's comment of Mar 20, 2022, 3:25 AM -0600.)

tejasraman commented 1 year ago

Finally got it working with the latest release (5.2), sorry. Reminds me of my messed-up title...

xuhuisheng commented 1 year ago

@tejasraman It's weird: I just hit HSA_STATUS_ERROR_OUT_OF_RESOURCES on my gfx803 with ubuntu-20.04.4 and Haswell on ROCm-5.2. I had to remove the amdgpu-dkms DKMS module; with the upstream kernel's amdgpu module, gfx803 works fine.
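To tell which amdgpu module a system would load, one option is to ask `modinfo` for the module file path. The sketch below assumes the usual Ubuntu layout, where DKMS-built modules are installed under `/lib/modules/<kernel>/updates/dkms/`; that path check is a heuristic rather than something confirmed in this thread, and the helper names are hypothetical.

```python
import shutil
import subprocess


def is_dkms_module_path(path):
    """Heuristic: on Ubuntu, DKMS modules usually live under updates/dkms/."""
    return "/updates/dkms/" in path


def amdgpu_module_path():
    """Return the file path modinfo reports for amdgpu, or None."""
    if shutil.which("modinfo") is None:
        return None
    proc = subprocess.run(
        ["modinfo", "-F", "filename", "amdgpu"],
        capture_output=True,
        text=True,
    )
    path = proc.stdout.strip()
    return path or None


def describe_amdgpu_module():
    """Human-readable guess at where the amdgpu module comes from."""
    path = amdgpu_module_path()
    if path is None:
        return "amdgpu module not found (or modinfo unavailable)"
    origin = "DKMS build" if is_dkms_module_path(path) else "in-tree kernel module"
    return f"{path}: looks like a {origin}"


if __name__ == "__main__":
    print(describe_amdgpu_module())
```

Running it before and after removing amdgpu-dkms should show the reported path move from the DKMS location to the kernel's own driver tree, if the assumption about the layout holds.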

tejasraman commented 1 year ago

That is weird... it worked for me, so maybe I made a mistake.

tejasraman commented 1 year ago

@xuhuisheng I’m still having this issue: [image: 23EF12C6-27AB-42DE-B254-7E65651438A3] Clinfo:

(I’m having the issue again so marked my old post as outdated)