Open kk0nrad opened 3 weeks ago
I'm getting 100% utilization:
ubuntu@test:~$ nvidia-smi
Fri Nov 8 17:28:14 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:18:00.0 Off | 0 |
| N/A 46C P0 698W / 700W | 72790MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:2A:00.0 Off | 0 |
| N/A 46C P0 698W / 700W | 72790MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:3A:00.0 Off | 0 |
| N/A 51C P0 702W / 700W | 72790MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 45C P0 698W / 700W | 72790MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9A:00.0 Off | 0 |
| N/A 44C P0 700W / 700W | 72790MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:AB:00.0 Off | 0 |
| N/A 47C P0 699W / 700W | 72790MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:BA:00.0 Off | 0 |
| N/A 49C P0 700W / 700W | 72790MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 46C P0 699W / 700W | 72790MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 7445 C ./gpu_burn 72780MiB |
| 1 N/A N/A 7479 C ./gpu_burn 72780MiB |
| 2 N/A N/A 7482 C ./gpu_burn 72780MiB |
| 3 N/A N/A 7484 C ./gpu_burn 72780MiB |
| 4 N/A N/A 7486 C ./gpu_burn 72780MiB |
| 5 N/A N/A 7488 C ./gpu_burn 72780MiB |
| 6 N/A N/A 7490 C ./gpu_burn 72780MiB |
| 7 N/A N/A 7492 C ./gpu_burn 72780MiB |
+-----------------------------------------------------------------------------------------+
Compile option: make COMPUTE=90
./gpu_burn -d 3600
It turned out to be a problem with the license server. nvidia-smi -q | grep -i lic pointed me in the right direction: it reported the hardware as unregistered. After a reboot the GPUs work for a while, even if they haven't been registered. Fixing the issue I had with the license server fixed this permanently. BTW: no need to compile with COMPUTE=90; it just works as it should now.
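For anyone who wants to script the same check, here is a minimal sketch in Python. It only mirrors the grep -i lic filter on the text that nvidia-smi -q prints. The helper names license_lines and query_nvidia_smi are made up for this example, and the exact wording of the license fields depends on the driver and licensing setup, so the sketch does not try to interpret them.

```python
import subprocess


def license_lines(query_output: str) -> list[str]:
    """Return the lines that mention "lic" in any letter case.

    This is the same filter as `grep -i lic`, with surrounding
    whitespace stripped from each matching line.
    """
    return [
        line.strip()
        for line in query_output.splitlines()
        if "lic" in line.lower()
    ]


def query_nvidia_smi() -> str:
    """Run `nvidia-smi -q` and return its text output.

    Needs the NVIDIA driver installed; raises if the command
    is missing or exits with a non-zero status.
    """
    result = subprocess.run(
        ["nvidia-smi", "-q"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a GPU node, print("\n".join(license_lines(query_nvidia_smi()))) should show the same lines as the shell pipeline. If they report the hardware as unlicensed or unregistered, the license server is the first thing I would look at.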
I'm trying to test a bunch of H100 GPUs, but I am unable to reach 100% utilization.
root@ainode01:~# nvidia-smi
Fri Oct 25 11:20:13 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 35C P0 134W / 700W | 72790MiB / 81559MiB | 2% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
| N/A 30C P0 109W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 |
| N/A 30C P0 114W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 35C P0 113W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 |
| N/A 37C P0 113W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 |
| N/A 32C P0 110W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 34C P0 110W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 30C P0 113W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 272544 C ./gpu_burn 72780MiB |
| 1 N/A N/A 272720 C ./gpu_burn 72780MiB |
| 2 N/A N/A 272722 C ./gpu_burn 72780MiB |
| 3 N/A N/A 272724 C ./gpu_burn 72780MiB |
| 4 N/A N/A 272726 C ./gpu_burn 72780MiB |
| 5 N/A N/A 272728 C ./gpu_burn 72780MiB |
| 6 N/A N/A 272730 C ./gpu_burn 72780MiB |
| 7 N/A N/A 272732 C ./gpu_burn 72780MiB |
+-----------------------------------------------------------------------------------------+
root@ainode01:~#
Some seconds later:
root@ainode01:~# nvidia-smi
Fri Oct 25 11:21:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 39C P0 145W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
| N/A 32C P0 140W / 700W | 72790MiB / 81559MiB | 4% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 |
| N/A 34C P0 140W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 41C P0 138W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 |
| N/A 41C P0 148W / 700W | 72790MiB / 81559MiB | 11% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 |
| N/A 35C P0 133W / 700W | 72790MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 37C P0 133W / 700W | 72790MiB / 81559MiB | 2% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 32C P0 135W / 700W | 72790MiB / 81559MiB | 11% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 272544 C ./gpu_burn 72780MiB |
| 1 N/A N/A 272720 C ./gpu_burn 72780MiB |
| 2 N/A N/A 272722 C ./gpu_burn 72780MiB |
| 3 N/A N/A 272724 C ./gpu_burn 72780MiB |
| 4 N/A N/A 272726 C ./gpu_burn 72780MiB |
| 5 N/A N/A 272728 C ./gpu_burn 72780MiB |
| 6 N/A N/A 272730 C ./gpu_burn 72780MiB |
| 7 N/A N/A 272732 C ./gpu_burn 72780MiB |
+-----------------------------------------------------------------------------------------+
root@ainode01:~#
What am I missing?