wilicc / gpu-burn

Multi-GPU CUDA stress test
BSD 2-Clause "Simplified" License

AMD GPU support implemented with HIP - Seeking to merge for dual NVIDIA/AMD compatibility #111

Open kerwenwwer opened 2 months ago

kerwenwwer commented 2 months ago

Hello,

I've created an AMD-compatible fork of gpu-burn using HIP (Heterogeneous-Compute Interface for Portability). This version supports both NVIDIA and AMD GPUs, expanding the tool's utility across different hardware platforms.

Key features of the AMD-compatible version:

You can find the AMD-compatible version here: https://github.com/kerwenwwer/amd-gpu-burn

I'm interested in discussing the possibility of merging these changes back into the main repository to provide official support for both NVIDIA and AMD GPUs. I think it's best for the community 😁

wilicc commented 2 months ago

Hi,

This would be very welcome indeed! The reason I used cuBLAS is twofold:

  1. At the time I originally wrote this, there weren't any optimized/tuned BLAS routines that worked on both AMD and NVIDIA. As you showed, this has now changed.
  2. It is not easy to stress a GPU "to the max". cuBLAS is known to be very efficient (and stressful) on NVIDIA cards. Not many compute loads manage to stress the hardware as efficiently.

Now my question is: Do you happen to know whether hipBLAS is as efficient on NVIDIA as cuBLAS is? One way to find out would be to simply benchmark both. If it is not, we might have to maintain two code paths: hipBLAS for AMD and cuBLAS for NVIDIA. It would of course be much cleaner if the same implementation were optimal for both vendors.
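A benchmark along those lines could be sketched roughly as follows. This is only an illustration, not code from gpu-burn: the names `gemm_flops`, `gflops`, `time_gemm`, and `naive_sgemm` are hypothetical, and the CPU matrix multiply is a stand-in so the sketch runs without a GPU. In a real comparison the timed callable would wrap `cublasSgemm` or `hipblasSgemm` followed by a device synchronize, so that the measured time covers the finished work.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// FLOPs for one square N x N GEMM: N^3 multiplies plus N^3 adds.
double gemm_flops(std::size_t n) { return 2.0 * n * n * n; }

// Throughput in GFLOP/s for `iters` GEMMs of size n that took `seconds`.
double gflops(std::size_t n, std::size_t iters, double seconds) {
    assert(seconds > 0.0);
    return gemm_flops(n) * iters / seconds / 1e9;
}

// Wall-clock seconds for `iters` calls of `run_gemm`.
// The callable must block until its work has completed.
double time_gemm(const std::function<void()>& run_gemm, std::size_t iters) {
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i) run_gemm();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Host stand-in (row-major C = A * B) so the sketch runs without a GPU.
void naive_sgemm(std::size_t n, const std::vector<float>& a,
                 const std::vector<float>& b, std::vector<float>& c) {
    assert(a.size() == n * n && b.size() == n * n);
    c.assign(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

Running the same matrix size and iteration count through both libraries on the same NVIDIA card and comparing the two GFLOP/s figures would answer the question directly.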

kerwenwwer commented 2 months ago

Thank you for your reply. I believe that hipBLAS is comparable to cuBLAS in how hard it stresses the GPU. What I've done so far is simply port your benchmark flow from the CUDA API to the HIP API. The main changes I made were:

  1. Changing how temperatures are read (from nvidia-smi to rocm-smi)
  2. Replacing deprecated APIs. For example, we no longer need to call cuParamSetSize before launching a kernel; instead, we use cuLaunchKernel, which takes the kernel parameters directly.
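The first change could look something like the sketch below. It is an assumption-laden illustration rather than what gpu-burn or the fork actually does: `GpuVendor`, `temperature_command`, and `parse_nvidia_temperatures` are hypothetical names, and the command-line flags shown are ones the two tools accept, not necessarily the ones either code base uses.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

enum class GpuVendor { Nvidia, Amd };

// Returns a shell command that prints GPU temperatures for the vendor.
// The flags are illustrative choices, not taken from gpu-burn.
std::string temperature_command(GpuVendor vendor) {
    switch (vendor) {
        case GpuVendor::Nvidia:
            return "nvidia-smi --query-gpu=temperature.gpu "
                   "--format=csv,noheader,nounits";
        case GpuVendor::Amd:
            return "rocm-smi --showtemp";
    }
    assert(false && "unknown vendor");
    return "";
}

// Parses the NVIDIA query output above: one integer per line, one line
// per GPU. Non-numeric lines (for example "N/A") make std::stoi throw.
std::vector<int> parse_nvidia_temperatures(const std::string& output) {
    std::vector<int> temps;
    std::istringstream lines(output);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.empty()) continue;
        temps.push_back(std::stoi(line));
    }
    return temps;
}
```

The rocm-smi output has a different, more verbose layout, so it would need its own parser; that part is left out here.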

My tests so far, using hipBLAS with the same algorithm, show that it can easily fully utilize an AMD MI210. So I don't think using hipBLAS and cuBLAS side by side would be a problem: the code is written in basically the same way, and the only difference is the API names.
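To illustrate how close the two APIs are, here is a minimal sketch of a compile-time alias layer. The `USE_HIP` and `USE_CUDA` build flags and all `GPU_BLAS_*` names are hypothetical, and the hipBLAS header path is an assumption that depends on the ROCm version. With neither flag defined, the sketch compiles on a plain host compiler and reports no backend.

```cpp
#include <cassert>
#include <string>

#if defined(USE_HIP)                 // hypothetical build flag
  #include <hipblas/hipblas.h>       // older ROCm releases use <hipblas.h>
  using gpu_blas_handle_t = hipblasHandle_t;
  #define GPU_BLAS_CREATE   hipblasCreate
  #define GPU_BLAS_DESTROY  hipblasDestroy
  #define GPU_BLAS_SGEMM    hipblasSgemm
  #define GPU_BLAS_OP_N     HIPBLAS_OP_N
  #define GPU_BLAS_OP_T     HIPBLAS_OP_T
  #define GPU_BACKEND_NAME  "hip"
#elif defined(USE_CUDA)              // hypothetical build flag
  #include <cublas_v2.h>
  using gpu_blas_handle_t = cublasHandle_t;
  #define GPU_BLAS_CREATE   cublasCreate
  #define GPU_BLAS_DESTROY  cublasDestroy
  #define GPU_BLAS_SGEMM    cublasSgemm
  #define GPU_BLAS_OP_N     CUBLAS_OP_N
  #define GPU_BLAS_OP_T     CUBLAS_OP_T
  #define GPU_BACKEND_NAME  "cuda"
#else
  #define GPU_BACKEND_NAME  "none"   // host-only build, no GPU library
#endif

// Reports which backend the translation unit was compiled against.
inline std::string gpu_backend_name() { return GPU_BACKEND_NAME; }
```

Because the SGEMM argument lists match between the two libraries, the calling code can use the `GPU_BLAS_*` names and stay identical for both vendors.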

However, there are some considerations for merging the code base:

  1. The current code base integrates control and compute functions in the same .c file.
  2. The compiler used on the AMD ROCm platform is based on Clang.

Given these factors, if you want to merge the code bases, you may need to refactor gpu-burn. A better solution might involve separating the control and compute functions and using a build system that can accommodate both CUDA and HIP compilations.
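One possible shape for that separation is sketched below, under the assumption that the vendor-specific code sits behind a small interface while the control loop stays plain C++. `ComputeBackend`, `BurnReport`, `burn`, and `StubBackend` are hypothetical names, not part of gpu-burn; the stub only counts calls so the control flow can be exercised without a GPU.

```cpp
#include <cassert>
#include <cstddef>

// Compute side: everything vendor-specific lives behind this interface.
class ComputeBackend {
public:
    virtual ~ComputeBackend() = default;
    virtual void initialize() = 0;           // allocate buffers, create handles
    virtual void run_iteration() = 0;        // launch one batch of GEMMs
    virtual std::size_t count_errors() = 0;  // mismatches found in the results
};

struct BurnReport {
    std::size_t iterations;
    std::size_t errors;
};

// Control side: vendor-neutral, can be built with any C++ compiler.
BurnReport burn(ComputeBackend& backend, std::size_t iterations) {
    backend.initialize();
    BurnReport report{0, 0};
    for (std::size_t i = 0; i < iterations; ++i) {
        backend.run_iteration();
        report.errors += backend.count_errors();
        ++report.iterations;
    }
    return report;
}

// Host stand-in; a CUDA or HIP backend would implement the same interface
// in a source file compiled by nvcc or the ROCm Clang-based compiler.
class StubBackend : public ComputeBackend {
public:
    std::size_t launches = 0;
    void initialize() override { launches = 0; }
    void run_iteration() override { ++launches; }
    std::size_t count_errors() override { return 0; }
};
```

With this split, only the backend source files need a vendor compiler, and the build system chooses which one to compile and link.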