Open andy108369 opened 8 months ago
I can't understand any of this.
Did you really sign up on GitHub just to post this?
Just an observation and I'm probably stating the obvious here, but you should really deploy from the requirements.txt because it will get updated.
And your link with the traceback is super obvious, it's in the last line:
"No space left on device".
Disk is full.
But reading it again, it looks like something to do with /dev/shm.
but you should really deploy from the requirements.txt because it will get updated.
Thanks, I am well aware of that and eventually used it. But I had to use the code someone else pasted, since I'd been asked to review it first; hence those pip installs:
# pip install -r requirements.txt
Requirement already satisfied: dm_haiku==0.0.12 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 1)) (0.0.12)
Requirement already satisfied: jax==0.4.25 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 2)) (0.4.25)
Requirement already satisfied: numpy==1.26.4 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 3)) (1.26.4)
Requirement already satisfied: sentencepiece==0.2.0 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 4)) (0.2.0)
Requirement already satisfied: absl-py>=0.7.1 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (2.1.0)
Requirement already satisfied: jmp>=0.0.2 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.0.4)
Requirement already satisfied: tabulate>=0.8.9 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.9.0)
Requirement already satisfied: flax>=0.7.1 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.8.2)
Requirement already satisfied: ml-dtypes>=0.2.0 in /usr/local/lib/python3.12/site-packages (from jax==0.4.25->-r requirements.txt (line 2)) (0.3.2)
Requirement already satisfied: opt-einsum in /usr/local/lib/python3.12/site-packages (from jax==0.4.25->-r requirements.txt (line 2)) (3.3.0)
Requirement already satisfied: scipy>=1.9 in /usr/local/lib/python3.12/site-packages (from jax==0.4.25->-r requirements.txt (line 2)) (1.12.0)
Requirement already satisfied: msgpack in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (1.0.8)
Requirement already satisfied: optax in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.2.1)
Requirement already satisfied: orbax-checkpoint in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.5.6)
Requirement already satisfied: tensorstore in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.1.56)
Requirement already satisfied: rich>=11.1 in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (13.7.1)
Requirement already satisfied: typing-extensions>=4.2 in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (4.10.0)
Requirement already satisfied: PyYAML>=5.4.1 in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (6.0.1)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/site-packages (from rich>=11.1->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/site-packages (from rich>=11.1->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (2.17.2)
Requirement already satisfied: chex>=0.1.7 in /usr/local/lib/python3.12/site-packages (from optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.1.85)
Requirement already satisfied: jaxlib>=0.1.37 in /root/.local/lib/python3.12/site-packages (from optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.4.25+cuda12.cudnn89)
Requirement already satisfied: etils[epath,epy] in /usr/local/lib/python3.12/site-packages (from orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (1.7.0)
Requirement already satisfied: nest_asyncio in /usr/local/lib/python3.12/site-packages (from orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (1.6.0)
Requirement already satisfied: protobuf in /usr/local/lib/python3.12/site-packages (from orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (5.26.0)
Requirement already satisfied: toolz>=0.9.0 in /usr/local/lib/python3.12/site-packages (from chex>=0.1.7->optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.12.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.12/site-packages (from chex>=0.1.7->optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (69.1.1)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.12/site-packages (from markdown-it-py>=2.2.0->rich>=11.1->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.1.2)
Requirement already satisfied: fsspec in /usr/local/lib/python3.12/site-packages (from etils[epath,epy]->orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (2024.3.0)
Requirement already satisfied: importlib_resources in /usr/local/lib/python3.12/site-packages (from etils[epath,epy]->orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (6.3.1)
Requirement already satisfied: zipp in /usr/local/lib/python3.12/site-packages (from etils[epath,epy]->orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (3.18.1)
And your link with the traceback is super obvious, it's in the last line: "No space left on device".
The / disk has 1 TiB of space and only about 300 GiB is used; /dev/shm is set to 640 GiB (tmpfs).
Also, I'm not sure where you found the "link with the traceback" you referred to?
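For reference, this is how I checked both (plain df/du, nothing specific to this repo):

```bash
# Free space on the root disk and the size/usage of the /dev/shm tmpfs
df -h / /dev/shm

# What is currently stored in /dev/shm, if anything
du -sh /dev/shm/* 2>/dev/null
```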
I just realised the other issue was actually in a linked repository, sorry. It just showed up on this PR but it's a different repo.
I've retried it, and the behavior is the same: it segfaults :/
I guess it's a similar issue to https://github.com/xai-org/grok-1/issues/152 now.
root@grok-1-596d68d5c7-5cq9f:/app# nvidia-smi -L
GPU 0: NVIDIA H100 PCIe (UUID: GPU-50f0ee14-b7a1-f0af-616a-f3bb0825ee7d)
GPU 1: NVIDIA H100 PCIe (UUID: GPU-17201481-5148-0983-539d-10ff0e2cf07f)
GPU 2: NVIDIA H100 PCIe (UUID: GPU-ce315b98-20ff-34fd-307b-fe05646f5913)
GPU 3: NVIDIA H100 PCIe (UUID: GPU-3c330414-9d82-ef1b-65c1-3dad9f294dd1)
GPU 4: NVIDIA H100 PCIe (UUID: GPU-81c9e219-4831-4d68-ccef-badb7f2bc599)
GPU 5: NVIDIA H100 PCIe (UUID: GPU-102d94be-31e5-5809-da4b-a1eeb5fee45b)
GPU 6: NVIDIA H100 PCIe (UUID: GPU-bc4095a5-f436-2dec-af84-44fe954a7e6c)
GPU 7: NVIDIA H100 PCIe (UUID: GPU-b6b73324-7d54-f3c3-a4d9-27fb98f564e9)
root@grok-1-596d68d5c7-5cq9f:/app# nvidia-smi
Mon Mar 18 20:28:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 PCIe Off | 00000000:00:05.0 Off | 0 |
| N/A 33C P0 48W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 PCIe Off | 00000000:00:06.0 Off | 0 |
| N/A 31C P0 46W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 PCIe Off | 00000000:00:07.0 Off | 0 |
| N/A 40C P0 51W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 PCIe Off | 00000000:00:08.0 Off | 0 |
| N/A 34C P0 47W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 PCIe Off | 00000000:00:09.0 Off | 0 |
| N/A 31C P0 47W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 PCIe Off | 00000000:00:0A.0 Off | 0 |
| N/A 36C P0 47W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 PCIe Off | 00000000:00:0B.0 Off | 0 |
| N/A 30C P0 47W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 PCIe Off | 00000000:00:0C.0 Off | 0 |
| N/A 30C P0 49W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
root@grok-1-596d68d5c7-5cq9f:/app#
root@grok-1-596d68d5c7-5cq9f:/app# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.5.0-21-generic root=UUID=74dd9370-9caa-470b-a711-0d385161522f ro console=tty1 console=ttyS0
root@grok-1-596d68d5c7-5cq9f:/app# uname -a
Linux grok-1-596d68d5c7-5cq9f 6.5.0-21-generic #21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb 9 13:32:52 UTC 2 x86_64 GNU/Linux
root@grok-1-596d68d5c7-5cq9f:/app#
wow, so even people that have the right hardware can't run this normally? haha, that is a FAIL!
Zblocker64 (on Discord) suggested increasing the stack size (ulimit -s), which is set to 8192 by default.
I'll try doubling it to 16384 to see if that helps with the Python segmentation fault.
FWIW: on Windows, the default stack size is 1 MB for 32-bit applications and 4 MB for 64-bit applications. macOS typically defaults to 8 MB (but this depends on the macOS version and the way the app was compiled).
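If it helps anyone reproduce, this is all I mean by it (the values simply mirror the ones discussed above; not a verified fix):

```bash
# Show the current soft stack limit in KiB (8192 KiB = 8 MiB here)
ulimit -s

# Double it for the current shell only, then launch from the same shell
ulimit -s 16384
python run.py
```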
I'll try doubling it to 16384 to see if that helps with the Python segmentation fault.

Doubling the stack size limit to 16384 didn't fix the issue. On the contrary, python run.py would stop writing any output and would lock up immediately; I could not kill it.
It appears the issues with running grok-1 arise mostly when the overlay FS is used in the Pod (the default FS containers use). The issues are:

- Segmentation fault when running python run.py;
- nvidia-smi, or anything else that touches the nvidia driver, gets locked up too (we are using the latest official nvidia drivers & linux kernel provided for Ubuntu 22.04).

However, even when running with the ext4 FS directly mounted in the Pod, or even when running on the Host directly:

- python run.py output doesn't seem to be complete, as you can see in the screenshots & recordings below (despite the exit code being 0).

1. In a Pod (image: ubuntu:22.04) - grok-1 mounted over ext4 FS:

The /root/grok-1 was mounted directly from the host (ext4 FS) instead of the overlay FS (!); I'm going to test with overlayfs as I have a hunch it might be the cause of the issues :bulb:
volumeMounts:
  - mountPath: /root/grok-1
    name: grok-volume
  - mountPath: /dev/shm
    name: shm
volumes:
  - name: grok-volume
    hostPath:
      path: /root/grok-1
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: "640Gi"
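To double-check which filesystem actually backs the checkpoints inside the pod, something like this works (assuming findmnt is available in the image):

```bash
# Should report ext4 when the hostPath mount above is in effect
findmnt -T /root/grok-1

# Reports overlay when the directory lives on the container's default overlayfs
findmnt -T /
```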
2. In a Pod (image: ubuntu:22.04) - overlay FS:

It appears that this issue shows up mostly when the overlay FS is used.
volumeMounts:
  - mountPath: /dev/shm
    name: shm
volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: "640Gi"
Next time, I ran gdb python and (gdb) run run.py; the process would lock up again at 100% CPU usage. I could Ctrl+C gdb and the process would be gone.
However, running gdb /root/grok-1/venv/bin/python + (gdb) run run.py, the process would lock up again at 100% CPU usage, and this time I could not Ctrl+C it nor kill -9 <PID>; nvidia-smi -L would print the 8 GPUs available on the host and would then just hang instead of exiting normally.
Only a host reboot releases the nvidia driver.
Nvidia driver: 550.54.15. Linux: Ubuntu 22.04.4 LTS with the 6.5.0-26-generic kernel.
We are using nvidia runtime -- https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.14.5
Update 1: I've tried k8s-device-plugin version 0.15.0-rc.2 - same issues, except that it doesn't seem to lock the process up anymore. The process can be killed and nvidia-smi works well, i.e. it isn't locking up anymore. Maybe just luck. Will keep monitoring this.
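For reference, the plugin upgrade itself was done roughly like this (the standard NVIDIA Helm chart; our values file is omitted, so treat this as a sketch):

```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --version 0.15.0-rc.2
```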
Pod with overlay FS
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-pod
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      runtimeClassName: nvidia
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - "node1"
      containers:
        - name: app
          image: ubuntu:22.04
          command: ["sleep", "infinity"]
          resources:
            requests:
              cpu: "58"
              ephemeral-storage: "1099511627776"
              memory: "1374389534720"
              nvidia.com/gpu: "8"
            limits:
              cpu: "58"
              ephemeral-storage: "1099511627776"
              memory: "1374389534720"
              nvidia.com/gpu: "8"
          volumeMounts:
            - mountPath: /dev/shm
              name: shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "640Gi"
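Applied and entered like this (assuming the manifest above is saved as `gpu-pod.yaml`):

```bash
kubectl apply -f gpu-pod.yaml
# wait until the pod is scheduled on node1, then open a shell inside it
kubectl -n default exec -it deploy/gpu-pod -- bash
```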
2. In a Pod (image: ubuntu:22.04) - grok-1 mounted over ext4 FS:

This time the python process got hung, even with the ext4 FS:
This model should be called SIGINT, because that's what will happen when you try to run it.
This model should be called SIGINT, because that's what will happen when you try to run it
I'm just following the original readme. There is no mention that the model should be called SIGINT, and why do you think it would need to be interrupted anyway?
The first issue is that it exits prematurely, before it finishes printing the complete output. (Even when running directly on the host, not in the K8s container.)
Update: figured out that's what max_len is for... increasing it increases the output.
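For example (a sketch, assuming the sampling call in run.py passes max_len=100 as in the upstream repo; 512 is an arbitrary value):

```bash
# Find where max_len is set in the sampling call
grep -n "max_len" run.py

# Raise it; adjust the pattern to whatever your copy of run.py actually contains
sed -i 's/max_len=100/max_len=512/' run.py
```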
The second issue is that it can't seem to run well in a K8s pod; sometimes it runs, sometimes it won't. And it seems to always fail when the grok-1 (and checkpoints) directory is on the overlay FS.
This model should be called SIGINT, because that's what will happen when you try to run it
The user is pulling our leg when they say "the model should be called SIGINT"; they are just making fun of it crashing for them, not adding anything of value to the ticket.
For whoever needs this:
The PyTorch version works well in a K8s pod (over the container's overlay FS, and without the /dev/shm requirement); no issues!
How to deploy the PyTorch version is described here: https://github.com/xai-org/grok-1/issues/274#issuecomment-2015415953
It looks like the culprit for the process lockups (python using the nvidia GPUs, and the nvidia-smi CLI) was the nvidia driver.
If you have H100 GPUs and are running with an nvidia driver of version 550.X, make sure you have upgraded to at least 550.54.15, which fixes the nvidia driver lockup problem (where a process using the nvidia driver would permanently lock up and the nvidia-smi command would permanently hang until a server reboot). (Update: it still crashes.)
Fixed a potential corruption when launching kernels on H100 GPUs, which is more likely to occur when the GPU is shared between multiple processes. This may manifest in XID 13 errors such as Graphics Exception: SKEDCHECK11_TOTAL_THREADS. This issue has no user-controllable workaround and is fixable by updating to driver 550.54.15 or higher. 4537349
Refs. https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-54-15/index.html
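To check what a node is actually running before/after the upgrade, the driver version query below works; the install line is only a sketch (the exact package that ships 550.54.15 depends on your distro repo):

```bash
# Report the currently loaded driver version for every GPU on the host
nvidia-smi --query-gpu=driver_version --format=csv,noheader | sort -u

# Ubuntu example (sketch): pull in a 550-series driver, then re-check that
# the point release is >= 550.54.15 before putting the node back into service
apt update && apt install -y nvidia-driver-550-server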
In a K8s pod (overlay FS) and with the newest nvidia drivers, xai-org/grok-1 still crashes.
At least the newest nvidia drivers (550.54.15) don't crash/lock up the processes.
Until then, I suggest using the PyTorch-based grok-1 version as described here: https://github.com/xai-org/grok-1/issues/274#issuecomment-2015415953
At least the newest nvidia drivers (550.54.15) don't crash/lock up the processes.
Unfortunately, that's still not the case with xai-org's grok-1 :/ It still crashes the nvidia drivers and only node reboot fixes this.
stack trace:
Different node, but the same problem; this is for the PID of the python process (xai-org's grok-1):
root@obl-node2:~# cat /proc/1483740/stack
[<0>] uvm_spin_loop+0xf0/0x180 [nvidia_uvm]
[<0>] wait_for_entry_with_spin+0x4d/0x1c0 [nvidia_uvm]
[<0>] uvm_tracker_wait_for_entry+0x94/0xd0 [nvidia_uvm]
[<0>] uvm_push_end_and_wait+0x3e/0x60 [nvidia_uvm]
[<0>] channel_pool_add.constprop.0+0xa29/0x11c0 [nvidia_uvm]
[<0>] uvm_channel_manager_create+0x3c1/0xb50 [nvidia_uvm]
[<0>] uvm_gpu_retain_by_uuid+0xf45/0x2b30 [nvidia_uvm]
[<0>] uvm_va_space_register_gpu+0x4a/0x7f0 [nvidia_uvm]
[<0>] uvm_api_register_gpu+0x77/0xc0 [nvidia_uvm]
[<0>] uvm_ioctl+0xdfb/0x1cd0 [nvidia_uvm]
[<0>] uvm_unlocked_ioctl_entry.part.0+0x7b/0xf0 [nvidia_uvm]
[<0>] uvm_unlocked_ioctl_entry+0x6b/0x90 [nvidia_uvm]
[<0>] __x64_sys_ioctl+0xa3/0xf0
[<0>] do_syscall_64+0x5b/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
And here is the stack trace for pid 8013 of the nvidia-device-plugin process, which was kill -9'ed but doesn't disappear:
root@obl-node2:~# cat /proc/8013/stack
[<0>] uvm_va_space_destroy+0x482/0x710 [nvidia_uvm]
[<0>] uvm_release.constprop.0+0xa5/0x140 [nvidia_uvm]
[<0>] uvm_release_entry.part.0.isra.0+0x54/0xa0 [nvidia_uvm]
[<0>] uvm_release_entry+0x2e/0x40 [nvidia_uvm]
[<0>] __fput+0xfc/0x2c0
[<0>] ____fput+0xe/0x20
[<0>] task_work_run+0x61/0xa0
[<0>] do_exit+0x2ac/0x6f0
[<0>] do_group_exit+0x35/0x90
[<0>] get_signal+0x8dc/0x940
[<0>] arch_do_signal_or_restart+0x39/0x120
[<0>] exit_to_user_mode_loop+0x9a/0x130
[<0>] exit_to_user_mode_prepare+0xa5/0xb0
[<0>] syscall_exit_to_user_mode+0x29/0x60
[<0>] do_syscall_64+0x67/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
It still crashes the nvidia drivers and only node reboot fixes this.
Still the same with NVIDIA H100 PCIe and nvidia driver version 535.183.01 (Ubuntu's nvidia-driver-535-server package), except now there is a stack trace:
(venv) root@grok-1-585644c85-2b7tn:~/grok-1# python run.py
INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
INFO:rank:Initializing mesh for self.local_mesh_config=(1, 8) self.between_hosts_config=(1, 1)...
INFO:rank:Detected 8 devices in mesh
2024-07-28 08:50:02.861309: W external/xla/xla/service/gpu/nvptx_compiler.cc:765] The NVIDIA driver's CUDA version is 12.2 which is older than the ptxas CUDA version (12.5.82). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
INFO:rank:partition rules: <bound method LanguageModelConfig.partition_rules of LanguageModelConfig(model=TransformerConfig(emb_size=6144, key_size=128, num_q_heads=48, num_kv_heads=8, num_layers=64, vocab_size=131072, widening_factor=8, attn_output_multiplier=0.08838834764831845, name=None, num_experts=8, capacity_factor=1.0, num_selected_experts=2, init_scale=1.0, shard_activations=True, data_axis='data', model_axis='model'), vocab_size=131072, pad_token=0, eos_token=2, sequence_len=8192, model_size=6144, embedding_init_scale=1.0, embedding_multiplier_scale=78.38367176906169, output_multiplier_scale=0.5773502691896257, name=None, fprop_dtype=<class 'jax.numpy.bfloat16'>, model_type=None, init_scale_override=None, shard_embeddings=True)>
INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:State sharding type: <class 'model.TrainingState'>
INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:Loading checkpoint at ./checkpoints/ckpt-0
F0728 08:50:36.312760 8365 pjrt_stream_executor_client.cc:452] Check failed: copy_stream->WaitFor(local_device->compute_stream()).ok()
*** Check failure stack trace: ***
@ 0x70823bfc76f4 absl::lts_20230802::log_internal::LogMessage::SendToLog()
@ 0x70823bfc75f4 absl::lts_20230802::log_internal::LogMessage::Flush()
@ 0x70823bfc7a99 absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
@ 0x7082378ddb6c xla::AllocateDestinationBuffer()
@ 0x7082378e1b4b xla::PjRtStreamExecutorClient::BufferFromHostBuffer()
@ 0x7082378e35bd xla::PjRtStreamExecutorClient::BufferFromHostBuffer()
@ 0x70823784d5d3 pjrt::PJRT_Client_BufferFromHostBuffer()
@ 0x70824ac919cb xla::PjRtCApiClient::BufferFromHostBufferInternalImpl()
@ 0x70824ac92603 xla::PjRtCApiClient::BufferFromHostBuffer()
@ 0x70824fc4caaf xla::ifrt::PjRtClient::MakeArrayFromHostBuffer()
@ 0x70824f4f1a0b absl::lts_20230802::internal_any_invocable::RemoteInvoker<>()
@ 0x70824f4b4a74 xla::PyArray::BatchedDevicePut()
@ 0x70824ab6cbe6 nanobind::detail::func_create<>()::{lambda()#1}::__invoke()
@ 0x708251010a8c nanobind::detail::nb_func_vectorcall_complex()
@ 0x5ff998d3059a _PyEval_EvalFrameDefault
Aborted (core dumped)
(venv) root@grok-1-585644c85-2b7tn:~/grok-1# echo $?
134
dmesg report
[Sun Jul 28 08:50:35 2024] NVRM: GPU at PCI:0000:00:07: GPU-3b6ec030-5adc-1847-e155-79f635584b4e
[Sun Jul 28 08:50:35 2024] NVRM: GPU Board Serial Number: 1652923017935
[Sun Jul 28 08:50:35 2024] NVRM: Xid (PCI:0000:00:07): 94, pid='<unknown>', name=<unknown>, Contained: CE User Channel (0xb). RST: No, D-RST: No
[Sun Jul 28 08:50:35 2024] NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 00000008
[Sun Jul 28 08:50:35 2024] NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 00000009
[Sun Jul 28 08:50:35 2024] NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000a
[Sun Jul 28 08:50:35 2024] NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000b
[Sun Jul 28 08:50:35 2024] NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000c
[Sun Jul 28 08:50:35 2024] NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000d
[Sun Jul 28 08:50:35 2024] NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000e
[Sun Jul 28 08:50:35 2024] NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000f
[Sun Jul 28 08:51:34 2024] NVRM: GPU at PCI:0000:00:08: GPU-c24840ec-8de1-83d5-b126-08000173ae32
[Sun Jul 28 08:51:34 2024] NVRM: GPU Board Serial Number: 1652923018111
[Sun Jul 28 08:51:34 2024] NVRM: Xid (PCI:0000:00:08): 95, pid='<unknown>', name=<unknown>, Uncontained: FBHUB. RST: Yes, D-RST: No
[Sun Jul 28 08:54:28 2024] INFO: task nvidia-smi:944968 blocked for more than 120 seconds.
[Sun Jul 28 08:54:28 2024] Tainted: P OE 6.5.0-45-generic #45~22.04.1-Ubuntu
[Sun Jul 28 08:54:28 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Jul 28 08:54:28 2024] task:nvidia-smi state:D stack:0 pid:944968 ppid:936867 flags:0x00004002
[Sun Jul 28 08:54:28 2024] Call Trace:
[Sun Jul 28 08:54:28 2024] <TASK>
[Sun Jul 28 08:54:28 2024] __schedule+0x2cb/0x750
[Sun Jul 28 08:54:28 2024] schedule+0x63/0x110
[Sun Jul 28 08:54:28 2024] schedule_preempt_disabled+0x15/0x30
[Sun Jul 28 08:54:28 2024] __mutex_lock.constprop.0+0x3f8/0x7a0
[Sun Jul 28 08:54:28 2024] __mutex_lock_slowpath+0x13/0x20
[Sun Jul 28 08:54:28 2024] mutex_lock+0x3c/0x50
[Sun Jul 28 08:54:28 2024] uvm_va_space_destroy+0x44d/0x6c0 [nvidia_uvm]
[Sun Jul 28 08:54:28 2024] uvm_release.constprop.0+0xa5/0x140 [nvidia_uvm]
[Sun Jul 28 08:54:28 2024] uvm_release_entry.part.0.isra.0+0x54/0xa0 [nvidia_uvm]
[Sun Jul 28 08:54:28 2024] uvm_release_entry+0x2e/0x40 [nvidia_uvm]
[Sun Jul 28 08:54:28 2024] __fput+0xfc/0x2c0
[Sun Jul 28 08:54:28 2024] ____fput+0xe/0x20
[Sun Jul 28 08:54:28 2024] task_work_run+0x61/0xa0
[Sun Jul 28 08:54:28 2024] exit_to_user_mode_loop+0x105/0x130
[Sun Jul 28 08:54:28 2024] exit_to_user_mode_prepare+0xa5/0xb0
[Sun Jul 28 08:54:28 2024] syscall_exit_to_user_mode+0x29/0x60
[Sun Jul 28 08:54:28 2024] do_syscall_64+0x61/0x90
[Sun Jul 28 08:54:28 2024] ? srso_alias_return_thunk+0x5/0x7f
[Sun Jul 28 08:54:28 2024] ? syscall_exit_to_user_mode+0x37/0x60
[Sun Jul 28 08:54:28 2024] ? srso_alias_return_thunk+0x5/0x7f
[Sun Jul 28 08:54:28 2024] ? do_syscall_64+0x61/0x90
[Sun Jul 28 08:54:28 2024] ? srso_alias_return_thunk+0x5/0x7f
[Sun Jul 28 08:54:28 2024] ? syscall_exit_to_user_mode+0x37/0x60
[Sun Jul 28 08:54:28 2024] ? srso_alias_return_thunk+0x5/0x7f
[Sun Jul 28 08:54:28 2024] ? do_syscall_64+0x61/0x90
[Sun Jul 28 08:54:28 2024] ? do_syscall_64+0x61/0x90
[Sun Jul 28 08:54:28 2024] entry_SYSCALL_64_after_hwframe+0x73/0xdd
[Sun Jul 28 08:54:28 2024] RIP: 0033:0x70f33e914f67
[Sun Jul 28 08:54:28 2024] RSP: 002b:00007ffebf279f18 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[Sun Jul 28 08:54:28 2024] RAX: 0000000000000000 RBX: 000070f33dbd9b00 RCX: 000070f33e914f67
[Sun Jul 28 08:54:28 2024] RDX: 0000000000000000 RSI: 0000000030000002 RDI: 000000000000000c
[Sun Jul 28 08:54:28 2024] RBP: 00007ffebf279f30 R08: 0000000000000000 R09: 0000000000000001
[Sun Jul 28 08:54:28 2024] R10: 000070f33d803878 R11: 0000000000000246 R12: 0000000000000000
[Sun Jul 28 08:54:28 2024] R13: 000070f33dbd9e00 R14: 0000000000000000 R15: 0000000000000000
[Sun Jul 28 08:54:28 2024] </TASK>
nvidia-smi hangs and never exits (cannot be killed):
root@node4:~# nvidia-smi
Sun Jul 28 08:52:04 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
...
REDACTED
...
<hangs-here>
Subsequent python run.py runs (attempts to run grok-1) hang too:
(venv) root@grok-1-585644c85-2b7tn:~/grok-1# python run.py
<just indefinitely hangs here>
apt update && apt install -y python3-pip virtualenv git
cd /root
git clone https://github.com/xai-org/grok-1.git
cd grok-1
virtualenv --python=python3 venv
source venv/bin/activate
pip install -r requirements.txt
pip install -U "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
pip install huggingface_hub[hf_transfer]
export HUGGINGFACE_TOKEN=hf_REDACTED
git config --global credential.helper store
huggingface-cli login --token $HUGGINGFACE_TOKEN --add-to-git-credential
huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/* --local-dir checkpoints --local-dir-use-symlinks False > hf-download.log 2>&1
python run.py
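Before running, it is also worth sanity-checking that the checkpoint download completed (the ckpt-0 weights are on the order of 300 GB, which matches the disk usage mentioned earlier):

```bash
# Rough size and file count of the downloaded checkpoint, plus the tail of the download log
du -sh checkpoints/ckpt-0
ls checkpoints/ckpt-0 | wc -l
tail -n 20 hf-download.log
```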
I've tried HBM3 H100s (nvidia 555.42.06) - xai-org's grok-1 is working flawlessly there!
I guess this must be something to do with the PCIe H100 ...
Cem from Oblivus suggested addressing the warning to see whether it helps on the PCIe H100 system and, lo and behold, it did! :tada: (*almost - see below)
The warning:
# python run.py
...
2024-07-28 12:44:02.558447: W external/xla/xla/service/gpu/nvptx_compiler.cc:765] The NVIDIA driver's CUDA version is 12.2 which is older than the ptxas CUDA version (12.5.82). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
...
Addressed this way:

- Install cuda-compat-12-5 to make sure apps requiring the newest CUDA 12.5 run in compatibility mode even though the nvidia driver only supports an older CUDA (e.g. CUDA 12.2 with nvidia 535):
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
apt install ./cuda-keyring_1.1-1_all.deb
apt update
apt -y install cuda-compat-12-5
export LD_LIBRARY_PATH=/usr/local/cuda-12.5/compat${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
- Use it
> loving how `nvidia-smi` displays different `CUDA version` based on the `cuda-compat-12-5` passed via `LD_LIBRARY_PATH`
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# unset LD_LIBRARY_PATH
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# nvidia-smi | grep CUDA
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# export LD_LIBRARY_PATH=/usr/local/cuda-12.5/compat${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# echo $LD_LIBRARY_PATH
/usr/local/cuda-12.5/compat
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# nvidia-smi | grep CUDA
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.5     |
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# python run.py
![image](https://github.com/user-attachments/assets/e96cdc00-f3cf-4e5f-9ac2-3c4e855fb3a7)
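To avoid re-exporting the compat path in every new shell, one option is to persist it for future shells (a sketch; adjust to your image/entrypoint):

```bash
# Persist the CUDA forward-compat library path for future shells in this container
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.5/compat${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc
```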
Interestingly, it starts crashing after being re-run more than two times, but at least it does not lock up the nvidia driver:
![image](https://github.com/user-attachments/assets/44c18492-3f9a-4687-a16e-fd357278ce6f)
## asciinema recording
[![asciicast](https://asciinema.org/a/669929.svg)](https://asciinema.org/a/669929)
## Update 1: unhealthy GPU reports
Have then noticed node was reporting less GPU count available, despite `nvidia-smi` not reporting any process to be using GPU:
> :mag: Notice `Allocatable` GPU count is now `6` instead of `8`
$ kubectl describe node node7
...
...
Capacity:
  cpu:                252
  ephemeral-storage:  6707082984Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1486158688Ki
  nvidia.com/gpu:     8
  pods:               110
Allocatable:
  cpu:                252
  ephemeral-storage:  6181247667821
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1486056288Ki
  nvidia.com/gpu:     6
  pods:               110
...
and [`nvdp-nvidia-device-plugin` ](https://github.com/NVIDIA/k8s-device-plugin) reported `XidCriticalError` errors marking two GPUs unhealthy:
$ kubectl -n nvidia-device-plugin logs nvdp-nvidia-device-plugin-pqzqg I0727 21:12:54.732685 1 main.go:178] Starting FS watcher. I0727 21:12:54.741394 1 main.go:185] Starting OS watcher. I0727 21:12:54.742075 1 main.go:200] Starting Plugins. I0727 21:12:54.742155 1 main.go:257] Loading configuration. I0727 21:12:54.744224 1 main.go:265] Updating config with default resource matching patterns. I0727 21:12:54.744559 1 main.go:276] Running with config: { "version": "v1", "flags": { "migStrategy": "none", "failOnInitError": true, "mpsRoot": "/run/nvidia/mps", "nvidiaDriverRoot": "/", "gdsEnabled": false, "mofedEnabled": false, "useNodeFeatureAPI": null, "plugin": { "passDeviceSpecs": false, "deviceListStrategy": [ "volume-mounts" ], "deviceIDStrategy": "uuid", "cdiAnnotationPrefix": "cdi.k8s.io/", "nvidiaCTKPath": "/usr/bin/nvidia-ctk", "containerDriverRoot": "/driver-root" } }, "resources": { "gpus": [ { "pattern": "*", "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} } } I0727 21:12:54.744574 1 main.go:279] Retrieving plugins. I0727 21:12:54.745619 1 factory.go:104] Detected NVML platform: found NVML library I0727 21:12:54.745689 1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found I0727 21:13:46.616044 1 server.go:216] Starting GRPC server for 'nvidia.com/gpu' I0727 21:13:46.617485 1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock I0727 21:13:46.622755 1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet I0728 16:26:54.645961 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:41 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.646064 1 health.go:185] XidCriticalError: Xid=41 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.646244 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.646331 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.646554 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.646581 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.646930 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.646963 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.647000 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.647221 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.647241 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. 
I0728 16:26:54.647288 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.647501 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.647529 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.647626 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.647811 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.647843 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.648025 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.648117 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.648131 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.648162 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.648380 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.648406 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.648478 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.648682 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.648712 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.648780 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.535732 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:41 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.535814 1 health.go:185] XidCriticalError: Xid=41 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.535903 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.536211 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.536229 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.536239 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.536467 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.536481 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. 
I0728 16:29:24.536541 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.536670 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.536686 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.536704 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.536889 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.536904 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.536976 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.537106 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.537120 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.537180 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.537323 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.537336 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.537367 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.537543 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.537556 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.537578 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.537760 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.537771 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.537813 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.296998 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:41 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.297067 1 health.go:185] XidCriticalError: Xid=41 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.297165 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.297442 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.297471 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. 
I0728 16:30:31.297536 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.297679 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.297690 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.297731 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.297893 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.297904 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.297995 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.298109 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.298119 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.298137 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.298326 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.298337 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.298396 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.298510 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.298519 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.298563 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.298739 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.298749 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.298793 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.298924 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.298934 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.298956 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 20:21:32.064665 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.064802 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. 
I0728 20:21:32.064871 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.065087 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.065104 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.065179 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.065339 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.065354 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.065396 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.065591 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.065605 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.065648 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.065840 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.065858 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.065890 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.066088 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.066105 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.066164 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.066336 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.066351 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.066369 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.066579 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.066595 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.066632 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2
![image](https://github.com/user-attachments/assets/65e906c9-b750-44a1-af91-7b92e1489176)
Restarting `nvdp-nvidia-device-plugin` on that node did not help.
The `nvdp-nvidia-device-plugin` did not report unhealthy (nor healthy) GPU devices anymore.
Yet, the GPU count has decreased from `8` to `6` under the `Capacity` after `nvdp-nvidia-device-plugin` pod restart on that node:
$ kubectl describe node node7
...
Capacity:
  cpu:                252
  ephemeral-storage:  6707082984Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1486158688Ki
  nvidia.com/gpu:     6
  pods:               110
Allocatable:
  cpu:                252
  ephemeral-storage:  6181247667821
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1486056288Ki
  nvidia.com/gpu:     6
  pods:               110
## Update 2: reset GPU
- 1st attempt to reset the GPUs showed only `7` GPUs were reset
root@node7:~# nvidia-smi --gpu-reset
GPU 00000000:00:05.0 was successfully reset.
GPU 00000000:00:06.0 was successfully reset.
Error encountered during reset of GPU 00000000:00:07.0: Driver Not Loaded
GPU 00000000:00:08.0 was successfully reset.
GPU 00000000:00:09.0 was successfully reset.
GPU 00000000:00:0A.0 was successfully reset.
GPU 00000000:00:0B.0 was successfully reset.
GPU 00000000:00:0C.0 was successfully reset.
1 device did not complete reset successfully, and may be in an unstable state. Please reboot your system.
- 2nd attempt to reset GPU shows only `6` GPUs were reset (instead of the expected `8`)
root@node7:~# nvidia-smi --gpu-reset
GPU 00000000:00:05.0 was successfully reset.
GPU 00000000:00:06.0 was successfully reset.
GPU 00000000:00:09.0 was successfully reset.
GPU 00000000:00:0A.0 was successfully reset.
GPU 00000000:00:0B.0 was successfully reset.
GPU 00000000:00:0C.0 was successfully reset.
All done.
root@node7:~# echo $?
0
- dmesg logs `An uncorrectable ECC error detected (possible firmware handling failure)`
[Mon Jul 29 07:54:16 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='
- `nvdp-nvidia-device-plugin` fails to start now
arno@x1:~$ kubectl get pods -A --sort-by='{.metadata.creationTimestamp}' -o wide | grep nvidia-device-plugin | grep -w node7
nvidia-device-plugin nvdp-nvidia-device-plugin-r2rdb 0/1 RunContainerError 0 2m1s 10.233.100.140 node7
- `nvidia-smi` takes too long to report the GPU stats
root@node7:~# time nvidia-smi -L
GPU 0: NVIDIA H100 PCIe (UUID: GPU-e4cc67c5-b8b0-8362-38e4-b72decfcf87e)
GPU 1: NVIDIA H100 PCIe (UUID: GPU-cfe62ad2-f29c-a156-da1b-2d02847c0dff)
GPU 2: NVIDIA H100 PCIe (UUID: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354)
GPU 3: NVIDIA H100 PCIe (UUID: GPU-3eec7a8d-1d8a-70ea-787a-c21833950688)
GPU 4: NVIDIA H100 PCIe (UUID: GPU-5d2026f6-4a5c-8c7e-ece0-d565d06b86c1)
GPU 5: NVIDIA H100 PCIe (UUID: GPU-46c3f4f2-4b78-c0b6-a45a-09a1694db707)
GPU 6: NVIDIA H100 PCIe (UUID: GPU-f108e1d9-9a22-cd97-c8bf-cef20a1d11fd)

real    0m22.531s
user    0m0.000s
sys     0m20.280s
Which would likely explain why `nvidia-device-plugin` pod cannot start (due to `nvidia-container-cli: initialization error: driver rpc error: timed out: unknown`)
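For completeness, a sketch of the per-GPU recovery steps to try before falling back to a full node reboot (in our case a reboot was ultimately what cleared the lockups):

```bash
# Check recent Xid errors reported for the GPUs on this node
dmesg | grep -i xid | tail -n 20

# Try resetting only the affected GPU by its PCI bus id
# (requires that no process is using it)
nvidia-smi --gpu-reset -i 00000000:00:07.0
```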
Hi, I am trying to run it, but the python3 ./run.py process eventually exits after running for about 10 minutes at 800% CPU usage. I am running it in a K8s pod (with /dev/shm of 640Gi; 58 CPU threads [AMD EPYC 9554]; 1280 Gi RAM) with 8x H100 GPUs.

Not much in the logs:
I can quickly restart the process now as I am in the pod:
Ideas?
Commands used to deploy it
Update 1
I'm trying to run it directly with python3 ./run.py (without gotty right now).