xuhuisheng / rocm-build

build scripts for ROCm
Apache License 2.0

gfx803 still NaN loss even with rocblas patch #17

Closed riaqn closed 2 years ago

riaqn commented 2 years ago

Environment

Hardware description
GPU gfx803 (rx580)
CPU ryzen 3700x
Software version
OS 5.14.2
ROCm 4.3.1
Python 3.9.7
TensorFlow 2.6.0

What is the expected behavior

TensorFlow should train models without producing NaN loss.

What actually happens

NaN loss when using TensorFlow.

How to reproduce

  1. Install ROCm with the rocBLAS patch applied (remove library/src/blas3/Tensile/Logic/asm_full/r9nano_*.yaml).
  2. Run any model.
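
The patch in step 1 amounts to deleting the r9nano Tensile logic files from a rocBLAS source tree. A minimal sketch of that step, assuming a local rocBLAS checkout; the function name `remove_r9nano_logic` and its `dry_run` flag are hypothetical and not part of rocm-build:

```python
from pathlib import Path

# Relative path as quoted in this issue; the checkout location is an assumption.
LOGIC_DIR = Path("library/src/blas3/Tensile/Logic/asm_full")


def remove_r9nano_logic(rocblas_src, dry_run=True):
    """Return the r9nano_*.yaml files under rocblas_src.

    The files are only deleted when dry_run is False.
    """
    matches = sorted((Path(rocblas_src) / LOGIC_DIR).glob("r9nano_*.yaml"))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling it once with the default `dry_run=True` lists what would be removed, which makes it easy to confirm the glob matches before deleting anything.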

xuhuisheng commented 2 years ago

If you are on Ubuntu, you can use apt to compare the version information of the rocblas packages. Note the 8328dcce~dirty part in the output below: it appears because the installed package was compiled locally with a patch that is not committed.

work@ae8ab6747fa6:/opt/rocm/rocblas$ apt search rocblas
Sorting... Done
Full Text Search... Done
rocblas/Ubuntu 2.39.0.40301-59 amd64 [upgradable from: 2.39.0-8328dcce~dirty]
  rocBLAS is AMD's library for BLAS on ROCm. It is implemented in HIP and optimized for AMD GPUs.

rocblas4.3.1/Ubuntu 2.39.0.40301-59 amd64
  rocBLAS is AMD's library for BLAS on ROCm. It is implemented in HIP and optimized for AMD GPUs.

I am not familiar with Arch, but maybe you can find a similar way to compare the versions of the rocblas packages.
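
The version check above can also be scripted. This is a rough sketch that assumes the `apt search` line layout shown in this thread and treats `~dirty` as the marker of a locally patched build; `parse_apt_line` and `is_local_build` are hypothetical helper names:

```python
import re

# Matches lines such as:
#   rocblas/Ubuntu 2.39.0.40301-59 amd64 [upgradable from: 2.39.0-8328dcce~dirty]
APT_LINE = re.compile(
    r"^(?P<name>[^/\s]+)/\S+\s+(?P<available>\S+)\s+\S+"
    r"(?:\s+\[upgradable from: (?P<installed>[^\]]+)\])?"
)


def parse_apt_line(line):
    """Return name, available and installed versions, or None if the line does not match."""
    match = APT_LINE.match(line)
    return match.groupdict() if match else None


def is_local_build(version):
    """True when the version string carries the '~dirty' marker."""
    return version is not None and "~dirty" in version
```

For the first line of the output above, `installed` would be `2.39.0-8328dcce~dirty`, so `is_local_build` reports a locally patched package. On Arch the package manager output differs, so the pattern would need adjusting.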

BTW, I uploaded rocblas-2.39 and pytorch-1.9 builds for gfx803. You can give them a try: https://github.com/xuhuisheng/rocm-gfx803/releases/tag/rocm43

xuhuisheng commented 2 years ago

There are some differences between the official rocblas and the gfx803-patched rocblas.

Go to directory /opt/rocm-4.3.1/rocblas/lib/library

The official rocblas ships more files, covering multiple GPU architectures.

-rw-r--r-- 1 root root  22036088 Aug 21 17:51 Kernels.so-000-gfx1010.hsaco
-rw-r--r-- 1 root root  21278328 Aug 21 17:51 Kernels.so-000-gfx1011.hsaco
-rw-r--r-- 1 root root  21278328 Aug 21 17:51 Kernels.so-000-gfx1012.hsaco
-rw-r--r-- 1 root root  20883320 Aug 21 17:51 Kernels.so-000-gfx1030.hsaco
-rw-r--r-- 1 root root  21766128 Aug 21 17:51 Kernels.so-000-gfx803.hsaco
-rw-r--r-- 1 root root  22330912 Aug 21 17:51 Kernels.so-000-gfx900.hsaco
-rw-r--r-- 1 root root  20614864 Aug 21 17:51 Kernels.so-000-gfx906-xnack-.hsaco
-rw-r--r-- 1 root root  20592048 Aug 21 17:51 Kernels.so-000-gfx908-xnack-.hsaco
-rw-r--r-- 1 root root  20716072 Aug 21 17:51 Kernels.so-000-gfx90a-xnack+.hsaco
-rw-r--r-- 1 root root  20703784 Aug 21 17:51 Kernels.so-000-gfx90a-xnack-.hsaco
-rw-r--r-- 1 root root 230109962 Aug 21 17:51 TensileLibrary.dat
-rw-r--r-- 1 root root 112401360 Aug 21 17:51 TensileLibrary_gfx1030.co
-rw-r--r-- 1 root root   3875552 Aug 21 17:51 TensileLibrary_gfx803.co
-rw-r--r-- 1 root root  49228184 Aug 21 17:51 TensileLibrary_gfx900.co
-rw-r--r-- 1 root root 102949336 Aug 21 17:51 TensileLibrary_gfx906.co
-rw-r--r-- 1 root root 304173904 Aug 21 17:51 TensileLibrary_gfx908.co
-rw-r--r-- 1 root root 233813920 Aug 21 17:51 TensileLibrary_gfx90a.co
-rw-r--r-- 1 root root      1349 Aug 21 16:18 TensileManifest.txt

The gfx803-patched rocblas has only gfx803-related fatbin files.

-rw-r--r-- 1 root root 7722904 May 26 13:49 Kernels.so-000-gfx803.hsaco
-rw-r--r-- 1 root root 3942507 May 26 13:49 TensileLibrary.yaml
-rw-r--r-- 1 root root     152 May 26 13:44 TensileManifest.txt
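
To tell the two builds apart without reading the listing by eye, the GPU targets can be pulled out of the file names. A small sketch, assuming the `gfx...` naming seen in the listings above; `gfx_targets` is a hypothetical helper:

```python
import re

# Architecture names as they appear in the kernel file names, e.g. gfx803 or gfx90a.
GFX = re.compile(r"gfx[0-9a-f]+")


def gfx_targets(filenames):
    """Collect the set of gfx architecture names found in the given file names."""
    targets = set()
    for name in filenames:
        targets.update(GFX.findall(name))
    return targets
```

Passing the names from `os.listdir("/opt/rocm-4.3.1/rocblas/lib/library")` should yield only `gfx803` for the patched build and several architectures for the official one.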

riaqn commented 2 years ago

Never mind! It turns out my own code was wrong. The official quickstart program (https://www.tensorflow.org/tutorials/quickstart/beginner) runs successfully. Thank you very much!

riaqn commented 2 years ago

As a side note: is deleting library/src/blas3/Tensile/Logic/asm_full/r9nano_*.yaml necessary?

xuhuisheng commented 2 years ago

You can try gfx803 without the patch. With ROCm 4.3.1, I got a memory access error on a text classification model.

xuhuisheng commented 2 years ago

It has been 2 months since the last post, so I will close this issue. Please reopen it if there are any updates.