wookayin / gpustat

📊 A simple command-line utility for querying and monitoring GPU status
https://pypi.python.org/pypi/gpustat
MIT License

NVIDIA 555.85 & 555.99 returns garbage data for any nvml queries #170

Closed Gh0stExp10it closed 3 months ago

Gh0stExp10it commented 5 months ago

Describe the bug

I simply ran gpustat and got the following error response:

Error on querying NVIDIA devices. Use --debug flag to see more details.
'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

When executed with --debug option:

Traceback (most recent call last):
  File "home/xyz/.local/lib/python3.10/site-packages/gpustat/cli.py", line 58, in print_gpustat
    gpu_stats = GPUStatCollection.new_query(debug=debug, id=id)
  File "home/xyz/.local/lib/python3.10/site-packages/gpustat/core.py", line 603, in new_query
    gpu_info = get_gpu_info(handle)
  File "home/xyz/.local/lib/python3.10/site-packages/gpustat/core.py", line 456, in get_gpu_info
    name = _decode(N.nvmlDeviceGetName(handle))
  File "home/xyz/.local/lib/python3.10/site-packages/pynvml.py", line 2094, in wrapper
    return res.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

Screenshots or Program Output

gpustat --debug as above.

nvidia-smi: (screenshot: screenshot_nvidia_smi_20240522)

Environment information:

Additional context

Thank you in advance!

wookayin commented 5 months ago

Can you try setting a breakpoint and print the value of res (bytes)?

Gh0stExp10it commented 5 months ago

Here is the output of the value "res":

b'\xf8\x95\xa0\x81\x8e\xf8\x91\x80\x81\x89\xf8\x90\x90\x81\x89\xf8\x91\xb0\x80\xa0\xf8\x91\xa0\x81\xa5\xf8\x9c\xa0\x81\xaf\xf8\x99\x90\x81\xa3\xf8\x94\xa0\x80\xa0\xf8\x96\x80\x81\x94\xf8\x8c\xb0\x80\xa0\xf8\x8e\x80\x80\xb00'

wookayin commented 5 months ago

That's pretty strange; it looks like random junk data. I'm not sure why nvmlDeviceGetName returns that. I think this is a bug in the NVIDIA driver. Can you try downgrading the NVIDIA driver version?

Also related https://stackoverflow.com/questions/78533132/ray-serve-error-serve-run-throws-utf-8-cant-decode-byte-0xf8-in-position-0-inv
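A closer look at the bytes posted above suggests they may not be random. The sketch below assumes that each 5-byte group starting with 0xf8 is a legacy 5-byte UTF-8 sequence (a form that modern UTF-8, per RFC 3629, no longer allows, which is why Python rejects 0xf8 as a start byte), and that each decoded 32-bit value packs two 16-bit characters, low half first. The helper name decode_packed_name is hypothetical and is not part of gpustat or pynvml. This interpretation is not confirmed anywhere in this thread or by NVIDIA.

```python
# The exact bytes reported above as the value of `res`.
RES = (
    b'\xf8\x95\xa0\x81\x8e\xf8\x91\x80\x81\x89\xf8\x90\x90\x81\x89'
    b'\xf8\x91\xb0\x80\xa0\xf8\x91\xa0\x81\xa5\xf8\x9c\xa0\x81\xaf'
    b'\xf8\x99\x90\x81\xa3\xf8\x94\xa0\x80\xa0\xf8\x96\x80\x81\x94'
    b'\xf8\x8c\xb0\x80\xa0\xf8\x8e\x80\x80\xb00'
)


def decode_packed_name(raw: bytes) -> str:
    """Hypothetical helper: decode legacy 5-byte UTF-8 sequences and
    split each resulting 32-bit value into two 16-bit characters."""
    chars = []
    i = 0
    while i < len(raw):
        lead = raw[i]
        if lead < 0x80:
            # Plain ASCII byte.
            value, length = lead, 1
        elif lead & 0xFC == 0xF8:
            # Legacy 5-byte form: 111110xx followed by four 10xxxxxx bytes.
            tail = raw[i + 1:i + 5]
            if len(tail) != 4 or any(b & 0xC0 != 0x80 for b in tail):
                raise ValueError(f"malformed 5-byte sequence at offset {i}")
            value = lead & 0x03
            for b in tail:
                value = (value << 6) | (b & 0x3F)
            length = 5
        else:
            raise ValueError(f"unexpected lead byte {lead:#x} at offset {i}")
        low, high = value & 0xFFFF, value >> 16
        chars.append(chr(low))
        if high:
            chars.append(chr(high))
        i += length
    return "".join(chars)


if __name__ == "__main__":
    print(decode_packed_name(RES))
```

Under these assumptions the bytes decode to "NVIDIA GeForce RTX 3080". That would be consistent with a 16-bit wide-character string being read as 32-bit characters somewhere between the driver and NVML, but this is a guess about the cause, not a diagnosis.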

samrickman commented 5 months ago

@wookayin I am now experiencing this issue too, on WSL2 with Ubuntu 22.04. Everything was fine with my previous CUDA version, but yesterday I updated my Windows drivers (CUDA 12.5), and now when I run gpustat on Ubuntu I get the same error. The output of gpustat and gpustat --debug is:

Error on querying NVIDIA devices. Use --debug flag to see more details.
'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gpustat/cli.py", line 58, in print_gpustat
    gpu_stats = GPUStatCollection.new_query(debug=debug, id=id)
  File "/usr/local/lib/python3.10/dist-packages/gpustat/core.py", line 603, in new_query
    gpu_info = get_gpu_info(handle)
  File "/usr/local/lib/python3.10/dist-packages/gpustat/core.py", line 456, in get_gpu_info
    name = _decode(N.nvmlDeviceGetName(handle))
  File "/usr/local/lib/python3.10/dist-packages/pynvml.py", line 2094, in wrapper
    return res.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

The output of nvidia-smi (on Ubuntu) is:

Wed May 29 09:41:04 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.03              Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 4000 Ada Gene...    On  |   00000000:01:00.0 Off |                  Off |
| 30%   25C    P8              6W /  130W |       0MiB /  20475MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Interestingly, the output of nvcc --version (on Ubuntu) is 12.4:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Here is some more session info:

- Platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- gpustat==1.1.1

I think you're right that it's to do with this version of the drivers. I don't really want to downgrade at the moment. I wonder if it might be fixed by upgrading nvcc to 12.5 on Ubuntu; I'll see if I have the stomach for it at some point. After I upgraded the drivers, torch.cuda.is_available() returned False, and it took some work to get it working again. If the only thing that doesn't work now is gpustat, that's a bit of a loss, as it's a really useful utility, but it's not as bad as spending hours rebuilding my environment.

Gh0stExp10it commented 5 months ago

> That's pretty strange; it looks like random junk data. I'm not sure why nvmlDeviceGetName returns that. I think this is a bug in the NVIDIA driver. Can you try downgrading the NVIDIA driver version?
>
> Also related https://stackoverflow.com/questions/78533132/ray-serve-error-serve-run-throws-utf-8-cant-decode-byte-0xf8-in-position-0-inv

Thanks for the additional info! I will try it out in the next few days to see if a downgrade of the driver works.

I am also hoping for a customized version of the driver from NVIDIA.

adamyhe commented 5 months ago

I'm experiencing the exact same issue with gpustat. Same versions of nvidia-smi (555.42.03), driver (555.85), and CUDA (12.5). Also on WSL2 and Ubuntu LTS 22.04.04 and an RTX 3080. Here's my nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

PyTorch and TF are both running on the GPU just fine for me, as is nvidia-smi. It's just gpustat that's having the UTF-8 issue.
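The tracebacks above show the crash happening at a strict res.decode() call on the device name, which may explain why tools that never decode that string keep working. As a minimal sketch (not gpustat's or pynvml's actual code), a tolerant decode would avoid the exception at the cost of showing an unreadable name. The function name tolerant_decode is hypothetical.

```python
def tolerant_decode(raw: bytes) -> str:
    """Hypothetical helper: decode as UTF-8, but escape invalid bytes
    instead of raising UnicodeDecodeError."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Each undecodable byte becomes a visible \xNN escape.
        return raw.decode("utf-8", errors="backslashreplace")


if __name__ == "__main__":
    print(tolerant_decode(b"NVIDIA RTX 4000"))
    print(tolerant_decode(b"\xf8\x95\xa0\x81\x8e"))
```

This only hides the symptom: the name is still wrong, so the driver update remains the real fix.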

Gh0stExp10it commented 5 months ago

I have not yet found the time to test a downgrade of the driver. However, I can note that the next update, to version 555.99, did not bring any improvement either. Possibly a fix will only come with the next update to ~556.xx.

I will update the issue title once again.

Gh0stExp10it commented 3 months ago

I will now close this issue, as the NVIDIA driver version 560.70 seems to have fixed the problem. Should someone still find an error, a new issue with reference to this one would be the best option. Please check for this update.
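The versions mentioned in this thread can be summarized in a small check. This is a sketch only: the thread reports 555.85 and 555.99 as broken and 560.70 as apparently fixed, and says nothing about other versions. Treating everything from 560.70 onward as fixed is an assumption, and the names classify_driver and parse_driver_version are hypothetical.

```python
# Windows driver versions reported as broken in this thread.
REPORTED_BROKEN = {(555, 85), (555, 99)}
# First version reported as working again in this thread.
FIRST_REPORTED_FIXED = (560, 70)


def parse_driver_version(text: str) -> tuple:
    """Turn a version string such as '555.85' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def classify_driver(text: str) -> str:
    """Classify a driver version using only what this thread reports."""
    version = parse_driver_version(text)
    if version in REPORTED_BROKEN:
        return "reported broken"
    if version >= FIRST_REPORTED_FIXED:
        # Assumption: later versions keep the fix.
        return "reported fixed"
    return "unknown"


if __name__ == "__main__":
    for v in ("555.85", "555.99", "560.70", "552.44"):
        print(v, "->", classify_driver(v))
```

The driver version string to feed in is the "Driver Version" field that nvidia-smi prints, as in the output shown earlier in the thread.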