xai-org / grok-1

Grok open release
Apache License 2.0

Segmentation fault in K8s Pod (8x H100's) #164

Open andy108369 opened 8 months ago

andy108369 commented 8 months ago

Hi, I am trying to run it, but the python3 ./run.py process eventually exits after running for about 10 minutes at 800% CPU usage. I am running it in a K8s pod (with /dev/shm of 640Gi; 58 CPU threads [AMD EPYC 9554]; 1280 Gi RAM) with 8x H100 GPUs.

image

Not much in the logs: image

I can quickly restart the process now as I am in the pod:

pkill gotty
cd /grok-1
gotty -w python3 ./run.py

Ideas?

Commands used to deploy it

        apt-get update ; apt-get upgrade -y ;
        apt-get install pip wget git -y;
        pip install dm_haiku==0.0.12;
        pip install jax[cuda12_pip]==0.4.25 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
        pip install numpy==1.26.4;
        pip install sentencepiece==0.2.0;
        pip install -U "huggingface_hub[cli]";
        git clone https://github.com/xai-org/grok-1;
        wget https://github.com/yudai/gotty/releases/download/v2.0.0-alpha.3/gotty_2.0.0-alpha.3_linux_amd64.tar.gz;
        tar -zxvf gotty_2.0.0-alpha.3_linux_amd64.tar.gz ; chmod +x gotty ; rm -rf gotty_2.0.0-alpha.3_linux_amd64.tar.gz ; mv gotty /usr/local/bin/;
        huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/tensor* --local-dir /grok-1/checkpoints --local-dir-use-symlinks False;
        mv /grok-1/checkpoints/ckpt /grok-1/checkpoints/ckpt-0;
        mkdir /root/shm;
        sed -i "s;/dev/shm/;/root/shm/;g" /grok-1/checkpoint.py;
        cd /grok-1 && gotty -w python3 ./run.py;
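
Not part of the original steps, but a quick sanity check (a sketch, assuming the paths above) that the sed rewrite took effect and that the target locations have room for the roughly 300 GiB checkpoint:

    # verify checkpoint.py now points at /root/shm/ instead of /dev/shm/
    grep -n "/root/shm/" /grok-1/checkpoint.py
    # check free space where the checkpoint and its shared-memory copies will live
    df -h / /dev/shm /root/shm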

Update 1

I'm trying to run it directly with python3 ./run.py (without gotty right now)

andy108369 commented 8 months ago

I can't understand any of this.

Did you really register on GitHub just to write this?

image

robvdl commented 8 months ago

Just an observation and I'm probably stating the obvious here, but you should really deploy from the requirements.txt because it will get updated.

And your link with the traceback is super obvious, it's in the last line:

"No space left on device".

Disk is full.

But reading it again, it seems to be something with /dev/shm

andy108369 commented 8 months ago

but you should really deploy from the requirements.txt because it will get updated.

Thanks, I am well aware of that and did use it eventually. But I had to use the commands someone else pasted, as I'd been asked to review them first; hence those pip installs.

# pip install -r requirements.txt
Requirement already satisfied: dm_haiku==0.0.12 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 1)) (0.0.12)
Requirement already satisfied: jax==0.4.25 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 2)) (0.4.25)
Requirement already satisfied: numpy==1.26.4 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 3)) (1.26.4)
Requirement already satisfied: sentencepiece==0.2.0 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 4)) (0.2.0)
Requirement already satisfied: absl-py>=0.7.1 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (2.1.0)
Requirement already satisfied: jmp>=0.0.2 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.0.4)
Requirement already satisfied: tabulate>=0.8.9 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.9.0)
Requirement already satisfied: flax>=0.7.1 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.8.2)
Requirement already satisfied: ml-dtypes>=0.2.0 in /usr/local/lib/python3.12/site-packages (from jax==0.4.25->-r requirements.txt (line 2)) (0.3.2)
Requirement already satisfied: opt-einsum in /usr/local/lib/python3.12/site-packages (from jax==0.4.25->-r requirements.txt (line 2)) (3.3.0)
Requirement already satisfied: scipy>=1.9 in /usr/local/lib/python3.12/site-packages (from jax==0.4.25->-r requirements.txt (line 2)) (1.12.0)
Requirement already satisfied: msgpack in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (1.0.8)
Requirement already satisfied: optax in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.2.1)
Requirement already satisfied: orbax-checkpoint in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.5.6)
Requirement already satisfied: tensorstore in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.1.56)
Requirement already satisfied: rich>=11.1 in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (13.7.1)
Requirement already satisfied: typing-extensions>=4.2 in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (4.10.0)
Requirement already satisfied: PyYAML>=5.4.1 in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (6.0.1)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/site-packages (from rich>=11.1->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/site-packages (from rich>=11.1->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (2.17.2)
Requirement already satisfied: chex>=0.1.7 in /usr/local/lib/python3.12/site-packages (from optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.1.85)
Requirement already satisfied: jaxlib>=0.1.37 in /root/.local/lib/python3.12/site-packages (from optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.4.25+cuda12.cudnn89)
Requirement already satisfied: etils[epath,epy] in /usr/local/lib/python3.12/site-packages (from orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (1.7.0)
Requirement already satisfied: nest_asyncio in /usr/local/lib/python3.12/site-packages (from orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (1.6.0)
Requirement already satisfied: protobuf in /usr/local/lib/python3.12/site-packages (from orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (5.26.0)
Requirement already satisfied: toolz>=0.9.0 in /usr/local/lib/python3.12/site-packages (from chex>=0.1.7->optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.12.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.12/site-packages (from chex>=0.1.7->optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (69.1.1)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.12/site-packages (from markdown-it-py>=2.2.0->rich>=11.1->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.1.2)
Requirement already satisfied: fsspec in /usr/local/lib/python3.12/site-packages (from etils[epath,epy]->orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (2024.3.0)
Requirement already satisfied: importlib_resources in /usr/local/lib/python3.12/site-packages (from etils[epath,epy]->orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (6.3.1)
Requirement already satisfied: zipp in /usr/local/lib/python3.12/site-packages (from etils[epath,epy]->orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (3.18.1)

And your link with the traceback is super obvious, it's in the last line: "No space left on device".

The / disk has 1TiB of space and only about 300 GiB are used; /dev/shm is set to 640 GiB (tmpfs).

Also, I'm not sure where you found the "link with the traceback" you referred to?

robvdl commented 8 months ago

I just realised the other issue was actually from a linked repository, sorry. It just showed up on this PR but it's a different repo.

andy108369 commented 8 months ago

Segfault

I've re-tried it, and the behavior is the same - it segfaults :/

image

image

I guess it's a similar issue to https://github.com/xai-org/grok-1/issues/152 now.

Additional info

root@grok-1-596d68d5c7-5cq9f:/app# nvidia-smi -L
GPU 0: NVIDIA H100 PCIe (UUID: GPU-50f0ee14-b7a1-f0af-616a-f3bb0825ee7d)
GPU 1: NVIDIA H100 PCIe (UUID: GPU-17201481-5148-0983-539d-10ff0e2cf07f)
GPU 2: NVIDIA H100 PCIe (UUID: GPU-ce315b98-20ff-34fd-307b-fe05646f5913)
GPU 3: NVIDIA H100 PCIe (UUID: GPU-3c330414-9d82-ef1b-65c1-3dad9f294dd1)
GPU 4: NVIDIA H100 PCIe (UUID: GPU-81c9e219-4831-4d68-ccef-badb7f2bc599)
GPU 5: NVIDIA H100 PCIe (UUID: GPU-102d94be-31e5-5809-da4b-a1eeb5fee45b)
GPU 6: NVIDIA H100 PCIe (UUID: GPU-bc4095a5-f436-2dec-af84-44fe954a7e6c)
GPU 7: NVIDIA H100 PCIe (UUID: GPU-b6b73324-7d54-f3c3-a4d9-27fb98f564e9)
root@grok-1-596d68d5c7-5cq9f:/app# nvidia-smi
Mon Mar 18 20:28:22 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               Off |   00000000:00:05.0 Off |                    0 |
| N/A   33C    P0             48W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 PCIe               Off |   00000000:00:06.0 Off |                    0 |
| N/A   31C    P0             46W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 PCIe               Off |   00000000:00:07.0 Off |                    0 |
| N/A   40C    P0             51W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 PCIe               Off |   00000000:00:08.0 Off |                    0 |
| N/A   34C    P0             47W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 PCIe               Off |   00000000:00:09.0 Off |                    0 |
| N/A   31C    P0             47W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 PCIe               Off |   00000000:00:0A.0 Off |                    0 |
| N/A   36C    P0             47W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 PCIe               Off |   00000000:00:0B.0 Off |                    0 |
| N/A   30C    P0             47W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 PCIe               Off |   00000000:00:0C.0 Off |                    0 |
| N/A   30C    P0             49W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@grok-1-596d68d5c7-5cq9f:/app# 

root@grok-1-596d68d5c7-5cq9f:/app# cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-6.5.0-21-generic root=UUID=74dd9370-9caa-470b-a711-0d385161522f ro console=tty1 console=ttyS0

root@grok-1-596d68d5c7-5cq9f:/app# uname -a
Linux grok-1-596d68d5c7-5cq9f 6.5.0-21-generic #21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb  9 13:32:52 UTC 2 x86_64 GNU/Linux
root@grok-1-596d68d5c7-5cq9f:/app# 
yarodevuci commented 8 months ago

wow, so people that even have the right hardware can't even run this normally? haha that is a FAIL!

andy108369 commented 8 months ago

Zblocker64 (on Discord) suggested increasing the stack size (ulimit -s), which is set to 8192 by default. I'll try doubling it to 16384 to see if that helps with Python's segmentation fault.

FWIW: on Windows the default stack size is 1 MB for 32-bit applications and 4 MB for 64-bit applications; macOS typically defaults to 8 MB (but this depends on the macOS version and the way the app was compiled).
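
For reference, a minimal sketch of how the stack-size change can be applied (assuming it is done in the same shell that then launches run.py):

    ulimit -s          # show the current soft limit (8192 KiB by default on most Linux distros)
    ulimit -s 16384    # double it for this shell and its children
    python3 ./run.py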

andy108369 commented 8 months ago

I'll try doubling it to 16384 to see if that helps with Python's segmentation fault.

Doubling the stack size limit to 16384 didn't fix the issue. On the contrary, python run.py would stop writing any output and lock up immediately, and it could not be killed.


It appears the issues with running grok-1 arise mostly when the overlay FS is used in the Pod (the default FS containers use). However, issues occur even when the ext4 FS is directly mounted in the Pod, and even when running on the host directly:

1. On the host directly:

image

2. In a Pod (image: ubuntu:22.04) - grok-1 mounted over ext4 FS:

/root/grok-1 was mounted directly from the host (ext4 FS) instead of the overlay FS (!). I'm going to test with overlayfs as I have a hunch it might be the cause of the issues :bulb:

        volumeMounts:
        - mountPath: /root/grok-1
          name: grok-volume
        - mountPath: /dev/shm
          name: shm
      volumes:
      - name: grok-volume
        hostPath:
          path: /root/grok-1
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "640Gi"

image

3. In a Pod (image: ubuntu:22.04) - overlay FS

It appears that this issue occurs mostly when the overlay FS is used.

        volumeMounts:
        - mountPath: /dev/shm
          name: shm
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "640Gi"

image

The next time, I ran gdb python and (gdb) run run.py; the process would lock up again at 100% CPU usage. I could Ctrl+C the gdb and the process would be gone.

However, running gdb /root/grok-1/venv/bin/python + (gdb) run run.py, the process would lock up again at 100% CPU usage, and this time I could not Ctrl+C it nor kill -9 <PID>; nvidia-smi -L would print the 8 GPUs available on the host and then just hang instead of exiting as normal. Only a host reboot releases the nvidia driver.
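
For reference, a rough sketch of the gdb session used here (the full backtrace is linked in a later comment):

    gdb /root/grok-1/venv/bin/python
    (gdb) run run.py
    # when it hangs at 100% CPU and gdb still responds, interrupt with Ctrl+C and dump all threads:
    (gdb) thread apply all bt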

Versions

Nvidia driver: 550.54.15. Linux: Ubuntu 22.04.4 LTS with the 6.5.0-26-generic kernel.

We are using nvidia runtime -- https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.14.5

Update 1: I've tried k8s-device-plugin version 0.15.0-rc.2 - same issues, except that it doesn't seem to lock the process up. It can be killed, and nvidia-smi works well, i.e. it isn't locking up anymore. Maybe just luck. Will keep monitoring this.


K8s manifest for 3rd case

Pod with overlay FS

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-pod
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      runtimeClassName: nvidia
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - "node1"
      containers:
      - name: app
        image: ubuntu:22.04
        command: ["sleep", "infinity"]
        resources:
          requests:
            cpu: "58"
            ephemeral-storage: "1099511627776"
            memory: "1374389534720"
            nvidia.com/gpu: "8"
          limits:
            cpu: "58"
            ephemeral-storage: "1099511627776"
            memory: "1374389534720"
            nvidia.com/gpu: "8"
        volumeMounts:
        - mountPath: /dev/shm
          name: shm
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "640Gi"
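
Not part of the manifest itself, but for completeness, a sketch of applying it and getting a shell in the pod (assuming the manifest above is saved as gpu-pod.yaml):

    kubectl apply -f gpu-pod.yaml
    kubectl exec -it deployment/gpu-pod -- bash
    # then install the dependencies and run grok-1 inside the pod as in the commands above
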
andy108369 commented 8 months ago

3. In a Pod (image: ubuntu:22.04) - overlay FS

recorded the segfault for this case with asciinema:

asciicast

andy108369 commented 8 months ago

2. In a Pod (image: ubuntu:22.04) - grok-1 mounted over ext4 FS:

This time the python process hung, even with the ext4 FS:

at 14:08 in the recording: asciicast

and backtrace https://gist.githubusercontent.com/andy108369/b42f07265928ac11a161165f82ce026d/raw/878f644277180efcc1a681217c7dc58230b67c67/backtrace.md

AdaptiveStep commented 8 months ago

This model should be called SIGINT, because that's what will happen when you try to run it.

andy108369 commented 8 months ago

This model should be called SIGINT, because that's what will happen when you try to run it

I'm just following the original readme. There is no mention that the model should be called SIGINT, and why do you think it would need to be interrupted anyway?

The first issue is that it exits prematurely, before it finishes printing the complete output (even when running directly on the host, not in the K8s container). Update: figured out that's what max_len is for... increasing it increases the output.

The second issue is that it can't seem to run reliably in a K8s pod; sometimes it runs, sometimes it won't. And it seems to always fail when the grok-1 (and checkpoints) directory is on the overlay FS.

robvdl commented 8 months ago

This model should be called SIGINT, because that's what will happen when you try to run it

I'm just following the original readme. There is no mention that the model should be called SIGINT, and why do you think it would need to be interrupted anyway?

The first issue is that it exits prematurely, before it finishes printing the complete output (even when running directly on the host, not in the K8s container).

The second issue is that it can't seem to run reliably in a K8s pod; sometimes it runs, sometimes it won't. And it seems to always fail when the grok-1 (and checkpoints) directory is on the overlay FS.

The user is pulling our leg when they say the "model should be called SIGINT"; they are just making fun of it crashing for them, not adding anything of value to the ticket.

andy108369 commented 8 months ago

For whoever needs this:

The PyTorch version is working well in a K8s pod (over the container's overlay FS, and without the /dev/shm requirement) - no issues!

How to deploy the PyTorch version is described here: https://github.com/xai-org/grok-1/issues/274#issuecomment-2015415953

image

andy108369 commented 8 months ago

It looks like the culprit for the lockup of the processes (python using the nvidia GPU, and the nvidia-smi CLI) was the nvidia driver.

If you have H100 GPUs and are running your provider with an nvidia driver of version 550.X, make sure you have upgraded to at least version 550.54.15, which fixes the nvidia driver lockup problem (where a process using the nvidia driver would permanently lock up and the nvidia-smi command would permanently hang until a server reboot). Update: still crashes.

Fixed a potential corruption when launching kernels on H100 GPUs, which is more likely to occur when the GPU is shared between multiple processes. This may manifest in XID 13 errors such as Graphics Exception: SKEDCHECK11_TOTAL_THREADS. This issue has no user-controllable workaround and is fixable by updating to driver 550.54.15 or higher. 4537349

Refs. https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-54-15/index.html
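
A quick way to confirm which driver version is actually loaded on a node (a sketch; both commands report the running kernel module, not just the installed package):

    nvidia-smi --query-gpu=driver_version --format=csv,noheader
    cat /proc/driver/nvidia/version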

Todo

andy108369 commented 8 months ago

In a K8s pod (overlay FS) and with the newest driver, xai-org/grok-1 still fails. At least the newest nvidia drivers (550.54.15) don't crash/lock up the processes. Update: still crashes.

Recording: asciicast

Until then, I suggest using the PyTorch-based grok-1 version as described here: https://github.com/xai-org/grok-1/issues/274#issuecomment-2015415953

andy108369 commented 8 months ago

At least the newest nvidia drivers (550.54.15) don't crash/lock up the processes.

Unfortunately, that's still not the case with xai-org's grok-1 :/ It still crashes the nvidia drivers, and only a node reboot fixes this.

image

stack trace:

Different node, but the same problem; this is for the PID of the python process (xai-org's grok-1):

root@obl-node2:~# cat /proc/1483740/stack
[<0>] uvm_spin_loop+0xf0/0x180 [nvidia_uvm]
[<0>] wait_for_entry_with_spin+0x4d/0x1c0 [nvidia_uvm]
[<0>] uvm_tracker_wait_for_entry+0x94/0xd0 [nvidia_uvm]
[<0>] uvm_push_end_and_wait+0x3e/0x60 [nvidia_uvm]
[<0>] channel_pool_add.constprop.0+0xa29/0x11c0 [nvidia_uvm]
[<0>] uvm_channel_manager_create+0x3c1/0xb50 [nvidia_uvm]
[<0>] uvm_gpu_retain_by_uuid+0xf45/0x2b30 [nvidia_uvm]
[<0>] uvm_va_space_register_gpu+0x4a/0x7f0 [nvidia_uvm]
[<0>] uvm_api_register_gpu+0x77/0xc0 [nvidia_uvm]
[<0>] uvm_ioctl+0xdfb/0x1cd0 [nvidia_uvm]
[<0>] uvm_unlocked_ioctl_entry.part.0+0x7b/0xf0 [nvidia_uvm]
[<0>] uvm_unlocked_ioctl_entry+0x6b/0x90 [nvidia_uvm]
[<0>] __x64_sys_ioctl+0xa3/0xf0
[<0>] do_syscall_64+0x5b/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8

and here is the stack trace for pid 8013 of the nvidia-device-plugin process which was kill -9'ed but doesn't disappear:

root@obl-node2:~# cat /proc/8013/stack
[<0>] uvm_va_space_destroy+0x482/0x710 [nvidia_uvm]
[<0>] uvm_release.constprop.0+0xa5/0x140 [nvidia_uvm]
[<0>] uvm_release_entry.part.0.isra.0+0x54/0xa0 [nvidia_uvm]
[<0>] uvm_release_entry+0x2e/0x40 [nvidia_uvm]
[<0>] __fput+0xfc/0x2c0
[<0>] ____fput+0xe/0x20
[<0>] task_work_run+0x61/0xa0
[<0>] do_exit+0x2ac/0x6f0
[<0>] do_group_exit+0x35/0x90
[<0>] get_signal+0x8dc/0x940
[<0>] arch_do_signal_or_restart+0x39/0x120
[<0>] exit_to_user_mode_loop+0x9a/0x130
[<0>] exit_to_user_mode_prepare+0xa5/0xb0
[<0>] syscall_exit_to_user_mode+0x29/0x60
[<0>] do_syscall_64+0x67/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
andy108369 commented 4 months ago

It still crashes the nvidia drivers and only node reboot fixes this.

Still the same with the NVIDIA H100 PCIe and nvidia driver version 535.183.01 (Ubuntu's nvidia-driver-535-server package),

except now there is a stack trace:

(venv) root@grok-1-585644c85-2b7tn:~/grok-1# python run.py
INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
INFO:rank:Initializing mesh for self.local_mesh_config=(1, 8) self.between_hosts_config=(1, 1)...
INFO:rank:Detected 8 devices in mesh
2024-07-28 08:50:02.861309: W external/xla/xla/service/gpu/nvptx_compiler.cc:765] The NVIDIA driver's CUDA version is 12.2 which is older than the ptxas CUDA version (12.5.82). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
INFO:rank:partition rules: <bound method LanguageModelConfig.partition_rules of LanguageModelConfig(model=TransformerConfig(emb_size=6144, key_size=128, num_q_heads=48, num_kv_heads=8, num_layers=64, vocab_size=131072, widening_factor=8, attn_output_multiplier=0.08838834764831845, name=None, num_experts=8, capacity_factor=1.0, num_selected_experts=2, init_scale=1.0, shard_activations=True, data_axis='data', model_axis='model'), vocab_size=131072, pad_token=0, eos_token=2, sequence_len=8192, model_size=6144, embedding_init_scale=1.0, embedding_multiplier_scale=78.38367176906169, output_multiplier_scale=0.5773502691896257, name=None, fprop_dtype=<class 'jax.numpy.bfloat16'>, model_type=None, init_scale_override=None, shard_embeddings=True)>
INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:State sharding type: <class 'model.TrainingState'>
INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:Loading checkpoint at ./checkpoints/ckpt-0
F0728 08:50:36.312760    8365 pjrt_stream_executor_client.cc:452] Check failed: copy_stream->WaitFor(local_device->compute_stream()).ok() 
*** Check failure stack trace: ***
    @     0x70823bfc76f4  absl::lts_20230802::log_internal::LogMessage::SendToLog()
    @     0x70823bfc75f4  absl::lts_20230802::log_internal::LogMessage::Flush()
    @     0x70823bfc7a99  absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x7082378ddb6c  xla::AllocateDestinationBuffer()
    @     0x7082378e1b4b  xla::PjRtStreamExecutorClient::BufferFromHostBuffer()
    @     0x7082378e35bd  xla::PjRtStreamExecutorClient::BufferFromHostBuffer()
    @     0x70823784d5d3  pjrt::PJRT_Client_BufferFromHostBuffer()
    @     0x70824ac919cb  xla::PjRtCApiClient::BufferFromHostBufferInternalImpl()
    @     0x70824ac92603  xla::PjRtCApiClient::BufferFromHostBuffer()
    @     0x70824fc4caaf  xla::ifrt::PjRtClient::MakeArrayFromHostBuffer()
    @     0x70824f4f1a0b  absl::lts_20230802::internal_any_invocable::RemoteInvoker<>()
    @     0x70824f4b4a74  xla::PyArray::BatchedDevicePut()
    @     0x70824ab6cbe6  nanobind::detail::func_create<>()::{lambda()#1}::__invoke()
    @     0x708251010a8c  nanobind::detail::nb_func_vectorcall_complex()
    @     0x5ff998d3059a  _PyEval_EvalFrameDefault
Aborted (core dumped)
(venv) root@grok-1-585644c85-2b7tn:~/grok-1# echo $?
134

reproducer

apt update && apt install -y python3-pip virtualenv git
cd /root
git clone https://github.com/xai-org/grok-1.git
cd grok-1
virtualenv --python=python3 venv
source venv/bin/activate
pip install -r requirements.txt
pip install -U "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
pip install huggingface_hub[hf_transfer]
export HUGGINGFACE_TOKEN=hf_REDACTED
git config --global credential.helper store
huggingface-cli login --token $HUGGINGFACE_TOKEN --add-to-git-credential
huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/* --local-dir checkpoints --local-dir-use-symlinks False > hf-download.log 2>&1
python run.py
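
As an extra check (not part of the reproducer), a one-liner to confirm JAX actually sees all eight GPUs in the pod before it starts loading the checkpoint:

    # expects a list of 8 CUDA devices; anything else points at a driver/CUDA setup problem
    python -c "import jax; print(jax.__version__); print(jax.devices())"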
andy108369 commented 4 months ago

I've tried HBM3 H100s (nvidia 555.42.06) - xai-org's grok-1 is working flawlessly there! I guess this must have something to do with the PCIe H100s ... image

andy108369 commented 4 months ago

Cem from Oblivus suggested addressing the warning to see whether it helps on the PCIe H100 system and, lo and behold, it did! :tada: (*almost - see below)

The warning:

# python run.py
...
2024-07-28 12:44:02.558447: W external/xla/xla/service/gpu/nvptx_compiler.cc:765] The NVIDIA driver's CUDA version is 12.2 which is older than the ptxas CUDA version (12.5.82). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
...

Addressed this way:

export LD_LIBRARY_PATH=/usr/local/cuda-12.5/compat${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}


- Use it:
> Loving how `nvidia-smi` displays a different `CUDA Version` based on the `cuda-compat-12-5` libraries passed via `LD_LIBRARY_PATH`

(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# unset LD_LIBRARY_PATH
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# nvidia-smi | grep CUDA
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01     CUDA Version: 12.2     |
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# export LD_LIBRARY_PATH=/usr/local/cuda-12.5/compat${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# echo $LD_LIBRARY_PATH
/usr/local/cuda-12.5/compat
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# nvidia-smi | grep CUDA
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01     CUDA Version: 12.5     |
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# python run.py


![image](https://github.com/user-attachments/assets/e96cdc00-f3cf-4e5f-9ac2-3c4e855fb3a7)
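
For completeness, a sketch of how the forward-compatibility libraries end up under /usr/local/cuda-12.5/compat in the first place (assuming NVIDIA's CUDA apt repository is already configured in the pod):

    apt-get install -y cuda-compat-12-5
    export LD_LIBRARY_PATH=/usr/local/cuda-12.5/compat${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
    python run.py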

Interestingly, it starts crashing after re-running it more than two times, but at least it does not lock up the nvidia driver:
![image](https://github.com/user-attachments/assets/44c18492-3f9a-4687-a16e-fd357278ce6f)

## asciinema recording
[![asciicast](https://asciinema.org/a/669929.svg)](https://asciinema.org/a/669929)

## Update 1: unhealthy GPU reports

I then noticed the node was reporting a lower available GPU count, despite `nvidia-smi` not reporting any process using the GPU:

> :mag: Notice `Allocatable` GPU count is now `6` instead of `8`

$ kubectl describe node node7
...
Capacity:
  cpu:                252
  ephemeral-storage:  6707082984Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1486158688Ki
  nvidia.com/gpu:     8
  pods:               110
Allocatable:
  cpu:                252
  ephemeral-storage:  6181247667821
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1486056288Ki
  nvidia.com/gpu:     6
  pods:               110
...
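
A quick way (sketch) to compare the GPU count each node reports, without scrolling through `kubectl describe`:

    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'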


and [`nvdp-nvidia-device-plugin`](https://github.com/NVIDIA/k8s-device-plugin) reported `XidCriticalError` errors marking two GPUs unhealthy:

$ kubectl -n nvidia-device-plugin logs nvdp-nvidia-device-plugin-pqzqg I0727 21:12:54.732685 1 main.go:178] Starting FS watcher. I0727 21:12:54.741394 1 main.go:185] Starting OS watcher. I0727 21:12:54.742075 1 main.go:200] Starting Plugins. I0727 21:12:54.742155 1 main.go:257] Loading configuration. I0727 21:12:54.744224 1 main.go:265] Updating config with default resource matching patterns. I0727 21:12:54.744559 1 main.go:276] Running with config: { "version": "v1", "flags": { "migStrategy": "none", "failOnInitError": true, "mpsRoot": "/run/nvidia/mps", "nvidiaDriverRoot": "/", "gdsEnabled": false, "mofedEnabled": false, "useNodeFeatureAPI": null, "plugin": { "passDeviceSpecs": false, "deviceListStrategy": [ "volume-mounts" ], "deviceIDStrategy": "uuid", "cdiAnnotationPrefix": "cdi.k8s.io/", "nvidiaCTKPath": "/usr/bin/nvidia-ctk", "containerDriverRoot": "/driver-root" } }, "resources": { "gpus": [ { "pattern": "*", "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} } } I0727 21:12:54.744574 1 main.go:279] Retrieving plugins. I0727 21:12:54.745619 1 factory.go:104] Detected NVML platform: found NVML library I0727 21:12:54.745689 1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found I0727 21:13:46.616044 1 server.go:216] Starting GRPC server for 'nvidia.com/gpu' I0727 21:13:46.617485 1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock I0727 21:13:46.622755 1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet I0728 16:26:54.645961 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:41 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.646064 1 health.go:185] XidCriticalError: Xid=41 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.646244 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.646331 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.646554 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.646581 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.646930 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.646963 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.647000 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.647221 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.647241 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. 
I0728 16:26:54.647288 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.647501 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.647529 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.647626 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.647811 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.647843 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.648025 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.648117 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.648131 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.648162 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.648380 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.648406 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.648478 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.648682 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.648712 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.648780 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.535732 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:41 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.535814 1 health.go:185] XidCriticalError: Xid=41 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.535903 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.536211 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.536229 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.536239 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.536467 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.536481 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. 
I0728 16:29:24.536541 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.536670 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.536686 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.536704 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.536889 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.536904 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.536976 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.537106 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.537120 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.537180 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.537323 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.537336 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.537367 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.537543 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.537556 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.537578 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.537760 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.537771 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.537813 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.296998 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:41 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.297067 1 health.go:185] XidCriticalError: Xid=41 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.297165 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.297442 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.297471 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. 
I0728 16:30:31.297536 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.297679 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.297690 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.297731 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.297893 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.297904 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.297995 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.298109 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.298119 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.298137 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.298326 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.298337 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.298396 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.298510 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.298519 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.298563 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.298739 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.298749 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.298793 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.298924 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.298934 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.298956 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 20:21:32.064665 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.064802 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. 
I0728 20:21:32.064871 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.065087 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.065104 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.065179 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.065339 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.065354 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.065396 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.065591 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.065605 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.065648 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.065840 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.065858 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.065890 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.066088 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.066105 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.066164 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.066336 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.066351 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.066369 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.066579 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.066595 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.066632 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2


![image](https://github.com/user-attachments/assets/65e906c9-b750-44a1-af91-7b92e1489176)

Restarting `nvdp-nvidia-device-plugin` on that node did not help.
The `nvdp-nvidia-device-plugin` did not report unhealthy (or healthy) GPU devices anymore.

Yet the GPU count decreased from `8` to `6` under `Capacity` after the `nvdp-nvidia-device-plugin` pod restart on that node:

$ kubectl describe node node7
...
Capacity:
  cpu:                252
  ephemeral-storage:  6707082984Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1486158688Ki
  nvidia.com/gpu:     6
  pods:               110
Allocatable:
  cpu:                252
  ephemeral-storage:  6181247667821
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1486056288Ki
  nvidia.com/gpu:     6
  pods:               110


## Update 2: reset GPU

- 1st attempt to reset the GPUs showed only `7` GPUs were successfully reset

root@node7:~# nvidia-smi --gpu-reset
GPU 00000000:00:05.0 was successfully reset.
GPU 00000000:00:06.0 was successfully reset.
Error encountered during reset of GPU 00000000:00:07.0: Driver Not Loaded
GPU 00000000:00:08.0 was successfully reset.
GPU 00000000:00:09.0 was successfully reset.
GPU 00000000:00:0A.0 was successfully reset.
GPU 00000000:00:0B.0 was successfully reset.
GPU 00000000:00:0C.0 was successfully reset.

1 device did not complete reset successfully, and may be in an unstable state. Please reboot your system.


- 2nd attempt to reset the GPUs showed only `6` GPUs were reset (instead of the expected `8`)

root@node7:~# nvidia-smi --gpu-reset
GPU 00000000:00:05.0 was successfully reset.
GPU 00000000:00:06.0 was successfully reset.
GPU 00000000:00:09.0 was successfully reset.
GPU 00000000:00:0A.0 was successfully reset.
GPU 00000000:00:0B.0 was successfully reset.
GPU 00000000:00:0C.0 was successfully reset.
All done.
root@node7:~# echo $?
0


- dmesg logs show `An uncorrectable ECC error detected (possible firmware handling failure)`

[Mon Jul 29 07:54:16 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:17 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:18 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:19 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:19 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404) [Mon Jul 29 07:54:19 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2 [Mon Jul 29 07:54:20 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:21 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:22 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:23 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:23 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404) [Mon Jul 29 07:54:23 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2 [Mon Jul 29 07:54:24 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:25 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:26 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:27 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:27 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404) [Mon Jul 29 07:54:27 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3 [Mon Jul 29 07:54:28 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:29 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:30 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:31 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:31 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! 
(0x62:0xb:2404)
[Mon Jul 29 07:54:31 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:54:52 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:53 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:54 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:55 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:55 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:54:55 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:54:56 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:57 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:58 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:59 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:59 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:54:59 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:55:01 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:02 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:03 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:04 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:04 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:55:04 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:55:05 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:06 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:07 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:08 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:08 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:55:08 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:55:29 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:30 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:31 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:32 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:32 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:55:32 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:55:33 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:34 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:35 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:36 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:36 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:55:36 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:55:37 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:38 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:39 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:40 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:40 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:55:40 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:55:41 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:42 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:42 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:43 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:43 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:55:43 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:56:39 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:40 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:41 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:42 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:42 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:56:42 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:56:43 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:44 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:45 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:46 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:46 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:56:46 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:56:47 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:48 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:49 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:51 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:51 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:56:51 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:56:52 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:53 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:54 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:55 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:55 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:56:55 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:57:15 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:16 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:17 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:18 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:18 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:57:18 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:57:19 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:20 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:22 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:23 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:23 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:57:23 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:57:24 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:25 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:26 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:27 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:27 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:57:27 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:57:28 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:29 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:30 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:31 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:31 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:57:31 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 08:04:28 2024] workqueue: sync_rcu_exp_select_node_cpus hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[Mon Jul 29 08:04:39 2024] workqueue: sync_rcu_exp_select_node_cpus hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[Mon Jul 29 08:05:22 2024] workqueue: free_ioctx hogged CPU for >10000us 32 times, consider switching to WQ_UNBOUND
[Mon Jul 29 08:05:29 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0x55:2404)
[Mon Jul 29 08:05:29 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 08:05:30 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0x55:2404)
[Mon Jul 29 08:05:30 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 08:05:31 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0x55:2404)
[Mon Jul 29 08:05:31 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 08:05:55 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0x55:2404)
[Mon Jul 29 08:05:55 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 08:05:56 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0x55:2404)
[Mon Jul 29 08:05:56 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 08:05:57 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 08:05:58 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 08:05:59 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 08:06:00 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 08:06:00 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 08:06:00 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 08:06:01 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
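
All of these Xid 140 ("uncorrectable ECC error detected") events come from the GPUs at PCI 0000:00:07 and 0000:00:08, so this looks like a hardware/firmware problem on those two cards rather than anything in `run.py`. A minimal way to confirm it on the node itself (a sketch, assuming the driver utilities are available there; the PCI addresses are the ones from the log above):

# count the Xid events per failing GPU in the kernel log
dmesg -T | grep -c 'Xid (PCI:0000:00:07)'
dmesg -T | grep -c 'Xid (PCI:0000:00:08)'

# query the driver's ECC error counters directly
nvidia-smi -q -d ECC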


- `nvdp-nvidia-device-plugin` fails to start now

arno@x1:~$ kubectl get pods -A --sort-by='{.metadata.creationTimestamp}' -o wide | grep nvidia-device-plugin | grep -w node7
nvidia-device-plugin   nvdp-nvidia-device-plugin-r2rdb   0/1   RunContainerError   0   2m1s   10.233.100.140   node7

arno@x1:~$ kubectl -n nvidia-device-plugin describe pod nvdp-nvidia-device-plugin-r2rdb | tail -2
  Warning  Failed  37s  kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: driver rpc error: timed out: unknown
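
To rule out the Kubernetes layer, one check worth trying (a sketch, assuming the NVIDIA container toolkit is installed on node7 and this is run there as root) is to call `nvidia-container-cli` directly and see whether it hits the same driver timeout outside of containerd:

# talk to the driver directly, bypassing containerd and the device plugin
nvidia-container-cli -k -d /dev/tty info

If this also hangs or times out, the failure is below the container stack, in the driver or the GPUs themselves.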


- `nvidia-smi` takes too long to report the GPU stats

root@node7:~# time nvidia-smi -L
GPU 0: NVIDIA H100 PCIe (UUID: GPU-e4cc67c5-b8b0-8362-38e4-b72decfcf87e)
GPU 1: NVIDIA H100 PCIe (UUID: GPU-cfe62ad2-f29c-a156-da1b-2d02847c0dff)
GPU 2: NVIDIA H100 PCIe (UUID: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354)
GPU 3: NVIDIA H100 PCIe (UUID: GPU-3eec7a8d-1d8a-70ea-787a-c21833950688)
GPU 4: NVIDIA H100 PCIe (UUID: GPU-5d2026f6-4a5c-8c7e-ece0-d565d06b86c1)
GPU 5: NVIDIA H100 PCIe (UUID: GPU-46c3f4f2-4b78-c0b6-a45a-09a1694db707)
GPU 6: NVIDIA H100 PCIe (UUID: GPU-f108e1d9-9a22-cd97-c8bf-cef20a1d11fd)

real    0m22.531s
user    0m0.000s
sys     0m20.280s



This would likely explain why the `nvidia-device-plugin` pod cannot start (the `nvidia-container-cli: initialization error: driver rpc error: timed out: unknown` error above).
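
Until the failing GPUs are reset or replaced, it is probably safest to keep new workloads off this node and let a GPU diagnostic confirm which devices are unhealthy. A rough sketch, assuming DCGM (`dcgmi`) is installed on node7:

# stop scheduling new pods onto the degraded node
kubectl cordon node7

# run a medium-length DCGM diagnostic across all visible GPUs
dcgmi diag -r 2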