wilicc / gpu-burn

Multi-GPU CUDA stress test
BSD 2-Clause "Simplified" License

H100 gpus do not reach 100% #114

Open kk0nrad opened 3 weeks ago

kk0nrad commented 3 weeks ago

I'm trying to stress-test a set of eight H100 GPUs with gpu_burn, but I am unable to reach 100% utilization.

root@ainode01:~# nvidia-smi
Fri Oct 25 11:20:13 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:19:00.0 Off |                    0 |
| N/A   35C    P0            134W /  700W |   72790MiB /  81559MiB |      2%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:3B:00.0 Off |                    0 |
| N/A   30C    P0            109W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:4C:00.0 Off |                    0 |
| N/A   30C    P0            114W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                    0 |
| N/A   35C    P0            113W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9B:00.0 Off |                    0 |
| N/A   37C    P0            113W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:BB:00.0 Off |                    0 |
| N/A   32C    P0            110W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:CB:00.0 Off |                    0 |
| N/A   34C    P0            110W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DB:00.0 Off |                    0 |
| N/A   30C    P0            113W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    272544      C   ./gpu_burn                                  72780MiB |
|    1   N/A  N/A    272720      C   ./gpu_burn                                  72780MiB |
|    2   N/A  N/A    272722      C   ./gpu_burn                                  72780MiB |
|    3   N/A  N/A    272724      C   ./gpu_burn                                  72780MiB |
|    4   N/A  N/A    272726      C   ./gpu_burn                                  72780MiB |
|    5   N/A  N/A    272728      C   ./gpu_burn                                  72780MiB |
|    6   N/A  N/A    272730      C   ./gpu_burn                                  72780MiB |
|    7   N/A  N/A    272732      C   ./gpu_burn                                  72780MiB |
+-----------------------------------------------------------------------------------------+
root@ainode01:~#

About a minute later:

root@ainode01:~# nvidia-smi
Fri Oct 25 11:21:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:19:00.0 Off |                    0 |
| N/A   39C    P0            145W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0            140W /  700W |   72790MiB /  81559MiB |      4%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:4C:00.0 Off |                    0 |
| N/A   34C    P0            140W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                    0 |
| N/A   41C    P0            138W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9B:00.0 Off |                    0 |
| N/A   41C    P0            148W /  700W |   72790MiB /  81559MiB |     11%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:BB:00.0 Off |                    0 |
| N/A   35C    P0            133W /  700W |   72790MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:CB:00.0 Off |                    0 |
| N/A   37C    P0            133W /  700W |   72790MiB /  81559MiB |      2%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DB:00.0 Off |                    0 |
| N/A   32C    P0            135W /  700W |   72790MiB /  81559MiB |     11%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    272544      C   ./gpu_burn                                  72780MiB |
|    1   N/A  N/A    272720      C   ./gpu_burn                                  72780MiB |
|    2   N/A  N/A    272722      C   ./gpu_burn                                  72780MiB |
|    3   N/A  N/A    272724      C   ./gpu_burn                                  72780MiB |
|    4   N/A  N/A    272726      C   ./gpu_burn                                  72780MiB |
|    5   N/A  N/A    272728      C   ./gpu_burn                                  72780MiB |
|    6   N/A  N/A    272730      C   ./gpu_burn                                  72780MiB |
|    7   N/A  N/A    272732      C   ./gpu_burn                                  72780MiB |
+-----------------------------------------------------------------------------------------+
root@ainode01:~#

What am I missing?

yamakenjp commented 1 week ago

I'm getting 100% utilization on the same GPUs:

ubuntu@test:~$ nvidia-smi
Fri Nov  8 17:28:14 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:18:00.0 Off |                    0 |
| N/A   46C    P0            698W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:2A:00.0 Off |                    0 |
| N/A   46C    P0            698W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:3A:00.0 Off |                    0 |
| N/A   51C    P0            702W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                    0 |
| N/A   45C    P0            698W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9A:00.0 Off |                    0 |
| N/A   44C    P0            700W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:AB:00.0 Off |                    0 |
| N/A   47C    P0            699W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:BA:00.0 Off |                    0 |
| N/A   49C    P0            700W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DB:00.0 Off |                    0 |
| N/A   46C    P0            699W /  700W |   72790MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      7445      C   ./gpu_burn                                  72780MiB |
|    1   N/A  N/A      7479      C   ./gpu_burn                                  72780MiB |
|    2   N/A  N/A      7482      C   ./gpu_burn                                  72780MiB |
|    3   N/A  N/A      7484      C   ./gpu_burn                                  72780MiB |
|    4   N/A  N/A      7486      C   ./gpu_burn                                  72780MiB |
|    5   N/A  N/A      7488      C   ./gpu_burn                                  72780MiB |
|    6   N/A  N/A      7490      C   ./gpu_burn                                  72780MiB |
|    7   N/A  N/A      7492      C   ./gpu_burn                                  72780MiB |
+-----------------------------------------------------------------------------------------+

Compiled with: make COMPUTE=90

Run with: ./gpu_burn -d 3600 (doubles, for one hour)
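
For anyone who wants to compare their own run against the numbers above without reading the table by eye, here is a minimal Python sketch. It assumes nvidia-smi is on the PATH and that every queried field comes back as a plain number (a field reported as N/A would need extra handling). The function names and the 90% threshold are illustrative choices, not anything from gpu-burn itself.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; all four are standard --query-gpu names.
QUERY_FIELDS = "index,utilization.gpu,power.draw,power.limit"


def read_gpu_stats():
    """Run nvidia-smi once and return its CSV output as text."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def underloaded_gpus(csv_text, min_util=90.0):
    """Return (index, util %, watts drawn, watt limit) for GPUs below min_util.

    min_util is an arbitrary illustrative threshold, not a gpu-burn setting.
    """
    flagged = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:  # skip blank lines
            continue
        index, util, draw, limit = (field.strip() for field in row)
        if float(util) < min_util:
            flagged.append((int(index), float(util), float(draw), float(limit)))
    return flagged
```

Calling underloaded_gpus(read_gpu_stats()) while gpu_burn is running should return an empty list on a healthy node. On a node showing the symptom from the first post, it would list every GPU with its low utilization and power draw.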

kk0nrad commented 1 week ago

It turned out to be a problem with the license server. Running nvidia-smi -q | grep -i lic pointed me in the right direction: it reported the hardware as unregistered. After a reboot, the GPUs work for a while even if they haven't been registered, and then they stop reaching full utilization. Fixing the issue I had with the license server resolved this for good. BTW: there is no need to compile with COMPUTE=90; it just works as it should now.
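
The same check can be scripted when many nodes need it. The sketch below is a rough Python equivalent of nvidia-smi -q | grep -i lic: it keeps every line of the query output that contains "lic" in any letter case. It makes no assumption about the exact wording nvidia-smi uses for licensing fields, and the function names are hypothetical helpers rather than part of any NVIDIA tool.

```python
import subprocess


def read_smi_query():
    """Return the full text of `nvidia-smi -q` (assumes nvidia-smi is on PATH)."""
    result = subprocess.run(
        ["nvidia-smi", "-q"], check=True, capture_output=True, text=True
    )
    return result.stdout


def license_lines(smi_query_text):
    """Return stripped lines containing 'lic' in any case, like grep -i lic."""
    return [
        line.strip()
        for line in smi_query_text.splitlines()
        if "lic" in line.lower()
    ]
```

Printing license_lines(read_smi_query()) on each node gives a quick view of what the driver reports about licensing. An empty list only means no matching line was printed, so it is worth looking at the full nvidia-smi -q output before drawing conclusions.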