xai-org / grok-1

Grok open release
Apache License 2.0

Segmentation fault in K8s Pod (8x H100's) #164

Open andy108369 opened 8 months ago

andy108369 commented 8 months ago

Hi, I am trying to run it, but the python3 ./run.py process eventually exits after running for about 10 minutes at 800% CPU usage. I am running it in a K8s pod (with /dev/shm of 640Gi; 58 CPU threads [AMD EPYC 9554]; 1280 Gi RAM) with 8x H100 GPUs.

image

Not much in the logs: image

I can quickly restart the process now as I am in the pod:

pkill gotty
cd /grok-1
gotty -w python3 ./run.py

Ideas?

Commands used to deploy it

        apt-get update ; apt-get upgrade -y ;
        apt-get install pip wget git -y;
        pip install dm_haiku==0.0.12;
        pip install jax[cuda12_pip]==0.4.25 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
        pip install numpy==1.26.4;
        pip install sentencepiece==0.2.0;
        pip install -U "huggingface_hub[cli]";
        git clone https://github.com/xai-org/grok-1;
        wget https://github.com/yudai/gotty/releases/download/v2.0.0-alpha.3/gotty_2.0.0-alpha.3_linux_amd64.tar.gz;
        tar -zxvf gotty_2.0.0-alpha.3_linux_amd64.tar.gz ; chmod +x gotty ; rm -rf gotty_2.0.0-alpha.3_linux_amd64.tar.gz ; mv gotty /usr/local/bin/;
        huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/tensor* --local-dir /grok-1/checkpoints --local-dir-use-symlinks False;
        mv /grok-1/checkpoints/ckpt /grok-1/checkpoints/ckpt-0;
        mkdir /root/shm;
        sed -i "s;/dev/shm/;/root/shm/;g" /grok-1/checkpoint.py;
        cd /grok-1 && gotty -w python3 ./run.py;
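
Not part of the original steps, but a quick sanity check (a sketch, assuming the paths above) that the sed rewrite took effect and that the target locations have room for the roughly 300 GiB checkpoint:

    # verify checkpoint.py now points at /root/shm/ instead of /dev/shm/
    grep -n "/root/shm/" /grok-1/checkpoint.py
    # check free space where the checkpoint and its shared-memory copies will live
    df -h / /dev/shm /root/shm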

Update 1

I'm trying to run it directly with python3 ./run.py (without gotty right now)

andy108369 commented 8 months ago

I can't understand any of this.

Did you really register on GitHub just to write this?

image

robvdl commented 8 months ago

Just an observation and I'm probably stating the obvious here, but you should really deploy from the requirements.txt because it will get updated.

And your link with the traceback is super obvious, it's in the last line:

"No space left on device".

Disk is full.

But reading it again, it seems to be something with /dev/shm

andy108369 commented 8 months ago

but you should really deploy from the requirements.txt because it will get updated.

Thanks, I am well aware of that and did use it eventually. But I had to use the commands someone else pasted, as I'd been asked to review them first; hence those pip installs.

# pip install -r requirements.txt
Requirement already satisfied: dm_haiku==0.0.12 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 1)) (0.0.12)
Requirement already satisfied: jax==0.4.25 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 2)) (0.4.25)
Requirement already satisfied: numpy==1.26.4 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 3)) (1.26.4)
Requirement already satisfied: sentencepiece==0.2.0 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 4)) (0.2.0)
Requirement already satisfied: absl-py>=0.7.1 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (2.1.0)
Requirement already satisfied: jmp>=0.0.2 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.0.4)
Requirement already satisfied: tabulate>=0.8.9 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.9.0)
Requirement already satisfied: flax>=0.7.1 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.8.2)
Requirement already satisfied: ml-dtypes>=0.2.0 in /usr/local/lib/python3.12/site-packages (from jax==0.4.25->-r requirements.txt (line 2)) (0.3.2)
Requirement already satisfied: opt-einsum in /usr/local/lib/python3.12/site-packages (from jax==0.4.25->-r requirements.txt (line 2)) (3.3.0)
Requirement already satisfied: scipy>=1.9 in /usr/local/lib/python3.12/site-packages (from jax==0.4.25->-r requirements.txt (line 2)) (1.12.0)
Requirement already satisfied: msgpack in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (1.0.8)
Requirement already satisfied: optax in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.2.1)
Requirement already satisfied: orbax-checkpoint in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.5.6)
Requirement already satisfied: tensorstore in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.1.56)
Requirement already satisfied: rich>=11.1 in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (13.7.1)
Requirement already satisfied: typing-extensions>=4.2 in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (4.10.0)
Requirement already satisfied: PyYAML>=5.4.1 in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (6.0.1)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/site-packages (from rich>=11.1->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/site-packages (from rich>=11.1->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (2.17.2)
Requirement already satisfied: chex>=0.1.7 in /usr/local/lib/python3.12/site-packages (from optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.1.85)
Requirement already satisfied: jaxlib>=0.1.37 in /root/.local/lib/python3.12/site-packages (from optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.4.25+cuda12.cudnn89)
Requirement already satisfied: etils[epath,epy] in /usr/local/lib/python3.12/site-packages (from orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (1.7.0)
Requirement already satisfied: nest_asyncio in /usr/local/lib/python3.12/site-packages (from orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (1.6.0)
Requirement already satisfied: protobuf in /usr/local/lib/python3.12/site-packages (from orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (5.26.0)
Requirement already satisfied: toolz>=0.9.0 in /usr/local/lib/python3.12/site-packages (from chex>=0.1.7->optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.12.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.12/site-packages (from chex>=0.1.7->optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (69.1.1)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.12/site-packages (from markdown-it-py>=2.2.0->rich>=11.1->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.1.2)
Requirement already satisfied: fsspec in /usr/local/lib/python3.12/site-packages (from etils[epath,epy]->orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (2024.3.0)
Requirement already satisfied: importlib_resources in /usr/local/lib/python3.12/site-packages (from etils[epath,epy]->orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (6.3.1)
Requirement already satisfied: zipp in /usr/local/lib/python3.12/site-packages (from etils[epath,epy]->orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (3.18.1)

And your link with the traceback is super obvious, it's in the last line: "No space left on device".

The / disk has 1TiB of space and only about 300 GiB are used; /dev/shm is set to 640 GiB (tmpfs).

Also, I'm not sure where you found the "link with the traceback" you referred to?

robvdl commented 8 months ago

I just realised the other issue was actually from a linked repository, sorry. It just showed up on this PR but it's a different repo.

andy108369 commented 8 months ago

Segfault

I've re-tried it, and the behavior is the same - it segfaults :/

image

image

I guess it's a similar issue to https://github.com/xai-org/grok-1/issues/152 now.

Additional info

root@grok-1-596d68d5c7-5cq9f:/app# nvidia-smi -L
GPU 0: NVIDIA H100 PCIe (UUID: GPU-50f0ee14-b7a1-f0af-616a-f3bb0825ee7d)
GPU 1: NVIDIA H100 PCIe (UUID: GPU-17201481-5148-0983-539d-10ff0e2cf07f)
GPU 2: NVIDIA H100 PCIe (UUID: GPU-ce315b98-20ff-34fd-307b-fe05646f5913)
GPU 3: NVIDIA H100 PCIe (UUID: GPU-3c330414-9d82-ef1b-65c1-3dad9f294dd1)
GPU 4: NVIDIA H100 PCIe (UUID: GPU-81c9e219-4831-4d68-ccef-badb7f2bc599)
GPU 5: NVIDIA H100 PCIe (UUID: GPU-102d94be-31e5-5809-da4b-a1eeb5fee45b)
GPU 6: NVIDIA H100 PCIe (UUID: GPU-bc4095a5-f436-2dec-af84-44fe954a7e6c)
GPU 7: NVIDIA H100 PCIe (UUID: GPU-b6b73324-7d54-f3c3-a4d9-27fb98f564e9)
root@grok-1-596d68d5c7-5cq9f:/app# nvidia-smi
Mon Mar 18 20:28:22 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               Off |   00000000:00:05.0 Off |                    0 |
| N/A   33C    P0             48W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 PCIe               Off |   00000000:00:06.0 Off |                    0 |
| N/A   31C    P0             46W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 PCIe               Off |   00000000:00:07.0 Off |                    0 |
| N/A   40C    P0             51W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 PCIe               Off |   00000000:00:08.0 Off |                    0 |
| N/A   34C    P0             47W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 PCIe               Off |   00000000:00:09.0 Off |                    0 |
| N/A   31C    P0             47W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 PCIe               Off |   00000000:00:0A.0 Off |                    0 |
| N/A   36C    P0             47W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 PCIe               Off |   00000000:00:0B.0 Off |                    0 |
| N/A   30C    P0             47W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 PCIe               Off |   00000000:00:0C.0 Off |                    0 |
| N/A   30C    P0             49W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@grok-1-596d68d5c7-5cq9f:/app# 

root@grok-1-596d68d5c7-5cq9f:/app# cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-6.5.0-21-generic root=UUID=74dd9370-9caa-470b-a711-0d385161522f ro console=tty1 console=ttyS0

root@grok-1-596d68d5c7-5cq9f:/app# uname -a
Linux grok-1-596d68d5c7-5cq9f 6.5.0-21-generic #21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb  9 13:32:52 UTC 2 x86_64 GNU/Linux
root@grok-1-596d68d5c7-5cq9f:/app# 
yarodevuci commented 8 months ago

wow, so people that even have the right hardware can't even run this normally? haha that is a FAIL!

andy108369 commented 8 months ago

Zblocker64 (on Discord) suggested increasing the stack size (ulimit -s), which is set to 8192 by default. I'll try doubling it to 16384 to see if that helps with Python's segmentation fault.

FWIW: on Windows the default stack size is 1 MB for 32-bit applications and 4 MB for 64-bit applications; macOS typically defaults to 8 MB (but this depends on the macOS version and the way the app was compiled).
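
For reference, a minimal sketch of how the stack-size change can be applied (assuming it is done in the same shell that then launches run.py):

    ulimit -s          # show the current soft limit (8192 KiB by default on most Linux distros)
    ulimit -s 16384    # double it for this shell and its children
    python3 ./run.py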

andy108369 commented 8 months ago

I'll try doubling it to 16384 to see if that helps with Python's segmentation fault.

Doubling the stack size limit to 16384 didn't fix the issue. On the contrary, python run.py would stop writing any output and lock up immediately, and it could not be killed.


It appears the issues with running grok-1 arise mostly when the overlay FS is used in the Pod (the default FS containers use). However, issues occur even when the ext4 FS is directly mounted in the Pod, and even when running on the host directly:

1. On the host directly:

image

2. In a Pod (image: ubuntu:22.04) - grok-1 mounted over ext4 FS:

/root/grok-1 was mounted directly from the host (ext4 FS) instead of the overlay FS (!). I'm going to test with overlayfs as I have a hunch it might be the cause of the issues :bulb:

        volumeMounts:
        - mountPath: /root/grok-1
          name: grok-volume
        - mountPath: /dev/shm
          name: shm
      volumes:
      - name: grok-volume
        hostPath:
          path: /root/grok-1
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "640Gi"

image

3. In a Pod (image: ubuntu:22.04) - overlay FS

It appears that this issue occurs mostly when the overlay FS is used.

        volumeMounts:
        - mountPath: /dev/shm
          name: shm
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "640Gi"

image

The next time, I ran gdb python and (gdb) run run.py; the process would lock up again at 100% CPU usage. I could Ctrl+C the gdb and the process would be gone.

However, running gdb /root/grok-1/venv/bin/python + (gdb) run run.py, the process would lock up again at 100% CPU usage, and this time I could not Ctrl+C it nor kill -9 <PID>; nvidia-smi -L would print the 8 GPUs available on the host and then just hang instead of exiting as normal. Only a host reboot releases the nvidia driver.
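
For reference, a rough sketch of the gdb session used here (the full backtrace is linked in a later comment):

    gdb /root/grok-1/venv/bin/python
    (gdb) run run.py
    # when it hangs at 100% CPU and gdb still responds, interrupt with Ctrl+C and dump all threads:
    (gdb) thread apply all bt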

Versions

Nvidia driver: 550.54.15. Linux: Ubuntu 22.04.4 LTS with the 6.5.0-26-generic kernel.

We are using nvidia runtime -- https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.14.5

Update 1: I've tried k8s-device-plugin version 0.15.0-rc.2 - same issues, except that it doesn't seem to lock the process up. It can be killed, and nvidia-smi works well, i.e. it isn't locking up anymore. Maybe just luck. Will keep monitoring this.


K8s manifest for 3rd case

Pod with overlay FS

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-pod
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      runtimeClassName: nvidia
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - "node1"
      containers:
      - name: app
        image: ubuntu:22.04
        command: ["sleep", "infinity"]
        resources:
          requests:
            cpu: "58"
            ephemeral-storage: "1099511627776"
            memory: "1374389534720"
            nvidia.com/gpu: "8"
          limits:
            cpu: "58"
            ephemeral-storage: "1099511627776"
            memory: "1374389534720"
            nvidia.com/gpu: "8"
        volumeMounts:
        - mountPath: /dev/shm
          name: shm
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "640Gi"
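
Not part of the manifest itself, but for completeness, a sketch of applying it and getting a shell in the pod (assuming the manifest above is saved as gpu-pod.yaml):

    kubectl apply -f gpu-pod.yaml
    kubectl exec -it deployment/gpu-pod -- bash
    # then install the dependencies and run grok-1 inside the pod as in the commands above
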
andy108369 commented 8 months ago

3. In a Pod (image: ubuntu:22.04) - overlay FS

recorded the segfault for this case with asciinema:

asciicast

andy108369 commented 8 months ago

2. In a Pod (image: ubuntu:22.04) - grok-1 mounted over ext4 FS:

This time the python process hung, even with the ext4 FS:

at 14:08 in the recording: asciicast

and backtrace https://gist.githubusercontent.com/andy108369/b42f07265928ac11a161165f82ce026d/raw/878f644277180efcc1a681217c7dc58230b67c67/backtrace.md

AdaptiveStep commented 8 months ago

This model should be called SIGINT, because that's what will happen when you try to run it.

andy108369 commented 8 months ago

This model should be called SIGINT, because that's what will happen when you try to run it

I'm just following the original readme. There is no mention that the model should be called SIGINT, and why do you think it would need to be interrupted anyway?

The first issue is that it exits prematurely, before it finishes printing the complete output (even when running directly on the host, not in the K8s container). Update: figured out that's what max_len is for... increasing it increases the output.

The second issue is that it can't seem to run reliably in a K8s pod; sometimes it runs, sometimes it won't. And it seems to always fail when the grok-1 (and checkpoints) directory is on the overlay FS.

robvdl commented 8 months ago

This model should be called SIGINT, because that's what will happen when you try to run it

I'm just following the original readme. There is no mention that the model should be called SIGINT, and why do you think it would need to be interrupted anyway?

The first issue is that it exits prematurely, before it finishes printing the complete output (even when running directly on the host, not in the K8s container).

The second issue is that it can't seem to run reliably in a K8s pod; sometimes it runs, sometimes it won't. And it seems to always fail when the grok-1 (and checkpoints) directory is on the overlay FS.

The user is pulling our leg when they say the "model should be called SIGINT"; they are just making fun of it crashing for them, not adding anything of value to the ticket.

andy108369 commented 8 months ago

For whoever needs this:

The PyTorch version is working well in a K8s pod (over the container's overlay FS, and without the /dev/shm requirement) - no issues!

How to deploy the PyTorch version is described here: https://github.com/xai-org/grok-1/issues/274#issuecomment-2015415953

image

andy108369 commented 8 months ago

It looks like the culprit for the lockup of the processes (python using the nvidia GPU, and the nvidia-smi CLI) was the nvidia driver.

If you have H100 GPUs and are running your provider with an nvidia driver of version 550.X, make sure you have upgraded to at least version 550.54.15, which fixes the nvidia driver lockup problem (where a process using the nvidia driver would permanently lock up and the nvidia-smi command would permanently hang until a server reboot). Update: still crashes.

Fixed a potential corruption when launching kernels on H100 GPUs, which is more likely to occur when the GPU is shared between multiple processes. This may manifest in XID 13 errors such as Graphics Exception: SKEDCHECK11_TOTAL_THREADS. This issue has no user-controllable workaround and is fixable by updating to driver 550.54.15 or higher. 4537349

Refs. https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-54-15/index.html
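
A quick way to confirm which driver version is actually loaded on a node (a sketch; both commands report the running kernel module, not just the installed package):

    nvidia-smi --query-gpu=driver_version --format=csv,noheader
    cat /proc/driver/nvidia/version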

Todo

andy108369 commented 8 months ago

In a K8s pod (overlay FS) and with the newest driver, xai-org/grok-1 still fails. At least the newest nvidia drivers (550.54.15) don't crash/lock up the processes. Update: still crashes.

Recording: asciicast

Until then, I suggest using the PyTorch-based grok-1 version as described here: https://github.com/xai-org/grok-1/issues/274#issuecomment-2015415953

andy108369 commented 8 months ago

At least the newest nvidia drivers (550.54.15) don't crash/lock up the processes.

Unfortunately, that's still not the case with xai-org's grok-1 :/ It still crashes the nvidia drivers, and only a node reboot fixes this.

image

stack trace:

Different node, but the same problem; this is for the PID of the python process (xai-org's grok-1):

root@obl-node2:~# cat /proc/1483740/stack
[<0>] uvm_spin_loop+0xf0/0x180 [nvidia_uvm]
[<0>] wait_for_entry_with_spin+0x4d/0x1c0 [nvidia_uvm]
[<0>] uvm_tracker_wait_for_entry+0x94/0xd0 [nvidia_uvm]
[<0>] uvm_push_end_and_wait+0x3e/0x60 [nvidia_uvm]
[<0>] channel_pool_add.constprop.0+0xa29/0x11c0 [nvidia_uvm]
[<0>] uvm_channel_manager_create+0x3c1/0xb50 [nvidia_uvm]
[<0>] uvm_gpu_retain_by_uuid+0xf45/0x2b30 [nvidia_uvm]
[<0>] uvm_va_space_register_gpu+0x4a/0x7f0 [nvidia_uvm]
[<0>] uvm_api_register_gpu+0x77/0xc0 [nvidia_uvm]
[<0>] uvm_ioctl+0xdfb/0x1cd0 [nvidia_uvm]
[<0>] uvm_unlocked_ioctl_entry.part.0+0x7b/0xf0 [nvidia_uvm]
[<0>] uvm_unlocked_ioctl_entry+0x6b/0x90 [nvidia_uvm]
[<0>] __x64_sys_ioctl+0xa3/0xf0
[<0>] do_syscall_64+0x5b/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8

and here is the stack trace for pid 8013 of the nvidia-device-plugin process which was kill -9'ed but doesn't disappear:

root@obl-node2:~# cat /proc/8013/stack
[<0>] uvm_va_space_destroy+0x482/0x710 [nvidia_uvm]
[<0>] uvm_release.constprop.0+0xa5/0x140 [nvidia_uvm]
[<0>] uvm_release_entry.part.0.isra.0+0x54/0xa0 [nvidia_uvm]
[<0>] uvm_release_entry+0x2e/0x40 [nvidia_uvm]
[<0>] __fput+0xfc/0x2c0
[<0>] ____fput+0xe/0x20
[<0>] task_work_run+0x61/0xa0
[<0>] do_exit+0x2ac/0x6f0
[<0>] do_group_exit+0x35/0x90
[<0>] get_signal+0x8dc/0x940
[<0>] arch_do_signal_or_restart+0x39/0x120
[<0>] exit_to_user_mode_loop+0x9a/0x130
[<0>] exit_to_user_mode_prepare+0xa5/0xb0
[<0>] syscall_exit_to_user_mode+0x29/0x60
[<0>] do_syscall_64+0x67/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
andy108369 commented 4 months ago

It still crashes the nvidia drivers and only node reboot fixes this.

Still the same with the NVIDIA H100 PCIe and nvidia driver version 535.183.01 (Ubuntu's nvidia-driver-535-server package),

except now there is a stack trace:

(venv) root@grok-1-585644c85-2b7tn:~/grok-1# python run.py
INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
INFO:rank:Initializing mesh for self.local_mesh_config=(1, 8) self.between_hosts_config=(1, 1)...
INFO:rank:Detected 8 devices in mesh
2024-07-28 08:50:02.861309: W external/xla/xla/service/gpu/nvptx_compiler.cc:765] The NVIDIA driver's CUDA version is 12.2 which is older than the ptxas CUDA version (12.5.82). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
INFO:rank:partition rules: <bound method LanguageModelConfig.partition_rules of LanguageModelConfig(model=TransformerConfig(emb_size=6144, key_size=128, num_q_heads=48, num_kv_heads=8, num_layers=64, vocab_size=131072, widening_factor=8, attn_output_multiplier=0.08838834764831845, name=None, num_experts=8, capacity_factor=1.0, num_selected_experts=2, init_scale=1.0, shard_activations=True, data_axis='data', model_axis='model'), vocab_size=131072, pad_token=0, eos_token=2, sequence_len=8192, model_size=6144, embedding_init_scale=1.0, embedding_multiplier_scale=78.38367176906169, output_multiplier_scale=0.5773502691896257, name=None, fprop_dtype=<class 'jax.numpy.bfloat16'>, model_type=None, init_scale_override=None, shard_embeddings=True)>
INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:State sharding type: <class 'model.TrainingState'>
INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:Loading checkpoint at ./checkpoints/ckpt-0
F0728 08:50:36.312760    8365 pjrt_stream_executor_client.cc:452] Check failed: copy_stream->WaitFor(local_device->compute_stream()).ok() 
*** Check failure stack trace: ***
    @     0x70823bfc76f4  absl::lts_20230802::log_internal::LogMessage::SendToLog()
    @     0x70823bfc75f4  absl::lts_20230802::log_internal::LogMessage::Flush()
    @     0x70823bfc7a99  absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x7082378ddb6c  xla::AllocateDestinationBuffer()
    @     0x7082378e1b4b  xla::PjRtStreamExecutorClient::BufferFromHostBuffer()
    @     0x7082378e35bd  xla::PjRtStreamExecutorClient::BufferFromHostBuffer()
    @     0x70823784d5d3  pjrt::PJRT_Client_BufferFromHostBuffer()
    @     0x70824ac919cb  xla::PjRtCApiClient::BufferFromHostBufferInternalImpl()
    @     0x70824ac92603  xla::PjRtCApiClient::BufferFromHostBuffer()
    @     0x70824fc4caaf  xla::ifrt::PjRtClient::MakeArrayFromHostBuffer()
    @     0x70824f4f1a0b  absl::lts_20230802::internal_any_invocable::RemoteInvoker<>()
    @     0x70824f4b4a74  xla::PyArray::BatchedDevicePut()
    @     0x70824ab6cbe6  nanobind::detail::func_create<>()::{lambda()#1}::__invoke()
    @     0x708251010a8c  nanobind::detail::nb_func_vectorcall_complex()
    @     0x5ff998d3059a  _PyEval_EvalFrameDefault
Aborted (core dumped)
(venv) root@grok-1-585644c85-2b7tn:~/grok-1# echo $?
134

reproducer

apt update && apt install -y python3-pip virtualenv git
cd /root
git clone https://github.com/xai-org/grok-1.git
cd grok-1
virtualenv --python=python3 venv
source venv/bin/activate
pip install -r requirements.txt
pip install -U "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
pip install huggingface_hub[hf_transfer]
export HUGGINGFACE_TOKEN=hf_REDACTED
git config --global credential.helper store
huggingface-cli login --token $HUGGINGFACE_TOKEN --add-to-git-credential
huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/* --local-dir checkpoints --local-dir-use-symlinks False > hf-download.log 2>&1
python run.py
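
As an extra check (not part of the reproducer), a one-liner to confirm JAX actually sees all eight GPUs in the pod before it starts loading the checkpoint:

    # expects a list of 8 CUDA devices; anything else points at a driver/CUDA setup problem
    python -c "import jax; print(jax.__version__); print(jax.devices())"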
andy108369 commented 4 months ago

I've tried HBM3 H100s (nvidia 555.42.06) - xai-org's grok-1 is working flawlessly there! I guess this must have something to do with the PCIe H100s ... image

andy108369 commented 4 months ago

Cem from Oblivus suggested addressing the warning to see whether it helps on the PCIe H100 system and, lo and behold, it did! :tada: (*almost - see below)

The warning:

# python run.py
...
2024-07-28 12:44:02.558447: W external/xla/xla/service/gpu/nvptx_compiler.cc:765] The NVIDIA driver's CUDA version is 12.2 which is older than the ptxas CUDA version (12.5.82). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
...

Addressed this way:

export LD_LIBRARY_PATH=/usr/local/cuda-12.5/compat${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}


- Use it:
> Loving how `nvidia-smi` displays a different `CUDA Version` based on the `cuda-compat-12-5` libraries passed via `LD_LIBRARY_PATH`

(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# unset LD_LIBRARY_PATH
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# nvidia-smi | grep CUDA
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01     CUDA Version: 12.2     |
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# export LD_LIBRARY_PATH=/usr/local/cuda-12.5/compat${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# echo $LD_LIBRARY_PATH
/usr/local/cuda-12.5/compat
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# nvidia-smi | grep CUDA
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01     CUDA Version: 12.5     |
(venv) root@grok-1-6d6cfb5dfb-7fz4j:~/grok-1# python run.py


![image](https://github.com/user-attachments/assets/e96cdc00-f3cf-4e5f-9ac2-3c4e855fb3a7)
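
For completeness, a sketch of how the forward-compatibility libraries end up under /usr/local/cuda-12.5/compat in the first place (assuming NVIDIA's CUDA apt repository is already configured in the pod):

    apt-get install -y cuda-compat-12-5
    export LD_LIBRARY_PATH=/usr/local/cuda-12.5/compat${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
    python run.py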

Interestingly, it starts crashing after re-running it more than two times, but at least it does not lock up the nvidia driver:
![image](https://github.com/user-attachments/assets/44c18492-3f9a-4687-a16e-fd357278ce6f)

## asciinema recording
[![asciicast](https://asciinema.org/a/669929.svg)](https://asciinema.org/a/669929)

## Update 1: unhealthy GPU reports

I then noticed the node was reporting a lower available GPU count, despite `nvidia-smi` not reporting any process using the GPU:

> :mag: Notice `Allocatable` GPU count is now `6` instead of `8`

$ kubectl describe node node7
...
Capacity:
  cpu:                252
  ephemeral-storage:  6707082984Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1486158688Ki
  nvidia.com/gpu:     8
  pods:               110
Allocatable:
  cpu:                252
  ephemeral-storage:  6181247667821
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1486056288Ki
  nvidia.com/gpu:     6
  pods:               110
...
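
A quick way (sketch) to compare the GPU count each node reports, without scrolling through `kubectl describe`:

    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'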


and [`nvdp-nvidia-device-plugin`](https://github.com/NVIDIA/k8s-device-plugin) reported `XidCriticalError` errors marking two GPUs unhealthy:

$ kubectl -n nvidia-device-plugin logs nvdp-nvidia-device-plugin-pqzqg I0727 21:12:54.732685 1 main.go:178] Starting FS watcher. I0727 21:12:54.741394 1 main.go:185] Starting OS watcher. I0727 21:12:54.742075 1 main.go:200] Starting Plugins. I0727 21:12:54.742155 1 main.go:257] Loading configuration. I0727 21:12:54.744224 1 main.go:265] Updating config with default resource matching patterns. I0727 21:12:54.744559 1 main.go:276] Running with config: { "version": "v1", "flags": { "migStrategy": "none", "failOnInitError": true, "mpsRoot": "/run/nvidia/mps", "nvidiaDriverRoot": "/", "gdsEnabled": false, "mofedEnabled": false, "useNodeFeatureAPI": null, "plugin": { "passDeviceSpecs": false, "deviceListStrategy": [ "volume-mounts" ], "deviceIDStrategy": "uuid", "cdiAnnotationPrefix": "cdi.k8s.io/", "nvidiaCTKPath": "/usr/bin/nvidia-ctk", "containerDriverRoot": "/driver-root" } }, "resources": { "gpus": [ { "pattern": "*", "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} } } I0727 21:12:54.744574 1 main.go:279] Retrieving plugins. I0727 21:12:54.745619 1 factory.go:104] Detected NVML platform: found NVML library I0727 21:12:54.745689 1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found I0727 21:13:46.616044 1 server.go:216] Starting GRPC server for 'nvidia.com/gpu' I0727 21:13:46.617485 1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock I0727 21:13:46.622755 1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet I0728 16:26:54.645961 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:41 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.646064 1 health.go:185] XidCriticalError: Xid=41 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.646244 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.646331 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.646554 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.646581 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.646930 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.646963 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.647000 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.647221 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.647241 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. 
I0728 16:26:54.647288 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.647501 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.647529 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.647626 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.647811 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.647843 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.648025 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.648117 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.648131 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.648162 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.648380 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.648406 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.648478 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:26:54.648682 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:26:54.648712 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:26:54.648780 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.535732 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:41 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.535814 1 health.go:185] XidCriticalError: Xid=41 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.535903 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.536211 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.536229 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.536239 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.536467 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.536481 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. 
I0728 16:29:24.536541 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.536670 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.536686 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.536704 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.536889 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.536904 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.536976 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.537106 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.537120 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.537180 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.537323 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.537336 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.537367 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.537543 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.537556 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.537578 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:29:24.537760 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:29:24.537771 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:29:24.537813 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.296998 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:41 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.297067 1 health.go:185] XidCriticalError: Xid=41 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.297165 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.297442 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.297471 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. 
I0728 16:30:31.297536 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.297679 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.297690 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.297731 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.297893 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.297904 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.297995 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.298109 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.298119 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.298137 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.298326 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.298337 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.298396 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.298510 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.298519 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.298563 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.298739 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.298749 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.298793 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 16:30:31.298924 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 16:30:31.298934 1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-32128bd4-1242-cfa7-6e4f-d52458f69354; marking device as unhealthy. I0728 16:30:31.298956 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354 I0728 20:21:32.064665 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.064802 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. 
I0728 20:21:32.064871 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.065087 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.065104 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.065179 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.065339 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.065354 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.065396 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.065591 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.065605 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.065648 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.065840 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.065858 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.065890 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.066088 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.066105 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.066164 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.066336 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.066351 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.066369 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2 I0728 20:21:32.066579 1 health.go:159] Processing event {Device:{Handle:0x7ecc4f4f7b40} EventType:8 EventData:95 GpuInstanceId:4294967295 ComputeInstanceId:4294967295} I0728 20:21:32.066595 1 health.go:185] XidCriticalError: Xid=95 on Device=GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2; marking device as unhealthy. I0728 20:21:32.066632 1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2


![image](https://github.com/user-attachments/assets/65e906c9-b750-44a1-af91-7b92e1489176)

Restarting `nvdp-nvidia-device-plugin` on that node did not help.
The `nvdp-nvidia-device-plugin` did not report unhealthy (or healthy) GPU devices anymore.

Yet the GPU count decreased from `8` to `6` under `Capacity` after the `nvdp-nvidia-device-plugin` pod restart on that node:

$ kubectl describe node node7
...
Capacity:
  cpu:                252
  ephemeral-storage:  6707082984Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1486158688Ki
  nvidia.com/gpu:     6
  pods:               110
Allocatable:
  cpu:                252
  ephemeral-storage:  6181247667821
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1486056288Ki
  nvidia.com/gpu:     6
  pods:               110


## Update 2: reset GPU

- 1st attempt to reset the GPUs showed only `7` GPUs were successfully reset

root@node7:~# nvidia-smi --gpu-reset
GPU 00000000:00:05.0 was successfully reset.
GPU 00000000:00:06.0 was successfully reset.
Error encountered during reset of GPU 00000000:00:07.0: Driver Not Loaded
GPU 00000000:00:08.0 was successfully reset.
GPU 00000000:00:09.0 was successfully reset.
GPU 00000000:00:0A.0 was successfully reset.
GPU 00000000:00:0B.0 was successfully reset.
GPU 00000000:00:0C.0 was successfully reset.

1 device did not complete reset successfully, and may be in an unstable state. Please reboot your system.


- 2nd attempt to reset the GPUs showed only `6` GPUs were reset (instead of the expected `8`)

root@node7:~# nvidia-smi --gpu-reset
GPU 00000000:00:05.0 was successfully reset.
GPU 00000000:00:06.0 was successfully reset.
GPU 00000000:00:09.0 was successfully reset.
GPU 00000000:00:0A.0 was successfully reset.
GPU 00000000:00:0B.0 was successfully reset.
GPU 00000000:00:0C.0 was successfully reset.
All done.
root@node7:~# echo $?
0


- dmesg logs show `An uncorrectable ECC error detected (possible firmware handling failure)`

[Mon Jul 29 07:54:16 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:17 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:18 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:19 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:19 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404) [Mon Jul 29 07:54:19 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2 [Mon Jul 29 07:54:20 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:21 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:22 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:23 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:23 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404) [Mon Jul 29 07:54:23 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2 [Mon Jul 29 07:54:24 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:25 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:26 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:27 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:27 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404) [Mon Jul 29 07:54:27 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3 [Mon Jul 29 07:54:28 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:29 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:30 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:31 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0 [Mon Jul 29 07:54:31 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! 
(0x62:0xb:2404)
[Mon Jul 29 07:54:31 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:54:52 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:53 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:54 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:55 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:55 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:54:55 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:54:56 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:57 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:58 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:59 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:54:59 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:54:59 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:55:01 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:02 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:03 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:04 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:04 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:55:04 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:55:05 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:06 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:07 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:08 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:08 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:55:08 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:55:29 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:30 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:31 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:32 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:32 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:55:32 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:55:33 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:34 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:35 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:36 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:36 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:55:36 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:55:37 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:38 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:39 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:40 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:40 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:55:40 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:55:41 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:42 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:42 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:43 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:55:43 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:55:43 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:56:39 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:40 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:41 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:42 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:42 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:56:42 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:56:43 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:44 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:45 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:46 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:46 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:56:46 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:56:47 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:48 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:49 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:51 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:51 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:56:51 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:56:52 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:53 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:54 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:55 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:56:55 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:56:55 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:57:15 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:16 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:17 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:18 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:18 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:57:18 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:57:19 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:20 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:22 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:23 2024] NVRM: Xid (PCI:0000:00:07): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:23 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:57:23 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 07:57:24 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:25 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:26 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:27 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:27 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:57:27 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 07:57:28 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:29 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:30 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:31 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 07:57:31 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 07:57:31 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 08:04:28 2024] workqueue: sync_rcu_exp_select_node_cpus hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[Mon Jul 29 08:04:39 2024] workqueue: sync_rcu_exp_select_node_cpus hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[Mon Jul 29 08:05:22 2024] workqueue: free_ioctx hogged CPU for >10000us 32 times, consider switching to WQ_UNBOUND
[Mon Jul 29 08:05:29 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0x55:2404)
[Mon Jul 29 08:05:29 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 08:05:30 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0x55:2404)
[Mon Jul 29 08:05:30 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 08:05:31 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0x55:2404)
[Mon Jul 29 08:05:31 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 08:05:55 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0x55:2404)
[Mon Jul 29 08:05:55 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 08:05:56 2024] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0x55:2404)
[Mon Jul 29 08:05:56 2024] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
[Mon Jul 29 08:05:57 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 08:05:58 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 08:05:59 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 08:06:00 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
[Mon Jul 29 08:06:00 2024] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
[Mon Jul 29 08:06:00 2024] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
[Mon Jul 29 08:06:01 2024] NVRM: Xid (PCI:0000:00:08): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
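
All of these Xid 140 ("uncorrectable ECC error detected") events come from the GPUs at PCI 0000:00:07 and 0000:00:08, so this looks like a hardware/firmware problem on those two cards rather than anything in `run.py`. A minimal way to confirm it on the node itself (a sketch, assuming the driver utilities are available there; the PCI addresses are the ones from the log above):

# count the Xid events per failing GPU in the kernel log
dmesg -T | grep -c 'Xid (PCI:0000:00:07)'
dmesg -T | grep -c 'Xid (PCI:0000:00:08)'

# query the driver's ECC error counters directly
nvidia-smi -q -d ECC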


- `nvdp-nvidia-device-plugin` fails to start now

arno@x1:~$ kubectl get pods -A --sort-by='{.metadata.creationTimestamp}' -o wide | grep nvidia-device-plugin | grep -w node7
nvidia-device-plugin   nvdp-nvidia-device-plugin-r2rdb   0/1   RunContainerError   0   2m1s   10.233.100.140   node7

arno@x1:~$ kubectl -n nvidia-device-plugin describe pod nvdp-nvidia-device-plugin-r2rdb | tail -2
  Warning  Failed  37s  kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: driver rpc error: timed out: unknown
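
To rule out the Kubernetes layer, one check worth trying (a sketch, assuming the NVIDIA container toolkit is installed on node7 and this is run there as root) is to call `nvidia-container-cli` directly and see whether it hits the same driver timeout outside of containerd:

# talk to the driver directly, bypassing containerd and the device plugin
nvidia-container-cli -k -d /dev/tty info

If this also hangs or times out, the failure is below the container stack, in the driver or the GPUs themselves.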


- `nvidia-smi` takes too long to report the GPU stats

root@node7:~# time nvidia-smi -L
GPU 0: NVIDIA H100 PCIe (UUID: GPU-e4cc67c5-b8b0-8362-38e4-b72decfcf87e)
GPU 1: NVIDIA H100 PCIe (UUID: GPU-cfe62ad2-f29c-a156-da1b-2d02847c0dff)
GPU 2: NVIDIA H100 PCIe (UUID: GPU-32128bd4-1242-cfa7-6e4f-d52458f69354)
GPU 3: NVIDIA H100 PCIe (UUID: GPU-3eec7a8d-1d8a-70ea-787a-c21833950688)
GPU 4: NVIDIA H100 PCIe (UUID: GPU-5d2026f6-4a5c-8c7e-ece0-d565d06b86c1)
GPU 5: NVIDIA H100 PCIe (UUID: GPU-46c3f4f2-4b78-c0b6-a45a-09a1694db707)
GPU 6: NVIDIA H100 PCIe (UUID: GPU-f108e1d9-9a22-cd97-c8bf-cef20a1d11fd)

real    0m22.531s
user    0m0.000s
sys     0m20.280s



This would likely explain why the `nvidia-device-plugin` pod cannot start (the `nvidia-container-cli: initialization error: driver rpc error: timed out: unknown` error above).
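
Until the failing GPUs are reset or replaced, it is probably safest to keep new workloads off this node and let a GPU diagnostic confirm which devices are unhealthy. A rough sketch, assuming DCGM (`dcgmi`) is installed on node7:

# stop scheduling new pods onto the degraded node
kubectl cordon node7

# run a medium-length DCGM diagnostic across all visible GPUs
dcgmi diag -r 2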