zonca / jupyterhub-deploy-kubernetes-jetstream

Configuration files for my tutorials on deploying JupyterHub on top of Kubernetes on XSEDE Jetstream (OpenStack)
https://zonca.dev/categories/#jetstream

Pods crash on GPU nodes with Ubuntu 22 #73

Closed zonca closed 3 months ago

zonca commented 7 months ago

GPU nodes work fine on Ubuntu 20. On Ubuntu 22, however, all system pods fail intermittently. After a node is rebooted, the pods appear to work for a few minutes and then crash.

NAMESPACE     NAME                                           READY   STATUS             RESTARTS         AGE   IP             NODE                       NOMINATED NODE   READINESS GATES
kube-system   coredns-588bb58b94-8jdjw                       0/1     CrashLoopBackOff   12 (10s ago)     51m   10.233.65.49   kubejetstream-k8s-node-1   <none>           <none>
kube-system   csi-cinder-controllerplugin-648ffdc6db-88b2v   0/6     CrashLoopBackOff   72 (61s ago)     50m   10.233.65.47   kubejetstream-k8s-node-1   <none>           <none>
kube-system   csi-cinder-nodeplugin-tccts                    0/3     CrashLoopBackOff   37 (23s ago)     50m   10.0.74.64     kubejetstream-k8s-node-1   <none>           <none>
kube-system   kube-flannel-x85zd                             1/1     Running            13 (3m53s ago)   51m   10.0.74.64     kubejetstream-k8s-node-1   <none>           <none>
kube-system   kube-proxy-hq9nf                               0/1     CrashLoopBackOff   13 (36s ago)     52m   10.0.74.64     kubejetstream-k8s-node-1   <none>           <none>
kube-system   nginx-proxy-kubejetstream-k8s-node-1           1/1     Running            14 (6m39s ago)   51m   10.0.74.64     kubejetstream-k8s-node-1   <none>           <none>
kube-system   nodelocaldns-kn4r6                             0/1     CrashLoopBackOff   12 (2m25s ago)   51m   10.0.74.64     kubejetstream-k8s-node-1   <none>           <none>
kube-system   nvidia-device-plugin-daemonset-h7khg           0/1     CrashLoopBackOff   5 (2m14s ago)    11m   10.233.65.46   kubejetstream-k8s-node-1   <none>           <none>
kube-system   snapshot-controller-7d445c66c9-v9z66           0/1     CrashLoopBackOff   11 (4m40s ago)   50m   10.233.65.45   kubejetstream-k8s-node-1   <none>           <none>

See minimal debugging performed here: https://github.com/zonca/jetstream_kubespray/pull/29#issuecomment-1935148755
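
For a quick look at which pods are unhealthy in a listing like the one above, a minimal sketch could parse the text table and flag every pod that is not fully ready. The helper names `parse_pod_table` and `unhealthy` are hypothetical and not part of this repository. The sample reuses three rows from the listing, trimmed to the first six columns. The sketch assumes columns are separated by at least two spaces, which holds for the output shown here.

```python
import re

# Three rows copied from the listing above, trimmed to the first six columns.
SAMPLE = """\
NAMESPACE     NAME                       READY   STATUS             RESTARTS         AGE
kube-system   coredns-588bb58b94-8jdjw   0/1     CrashLoopBackOff   12 (10s ago)     51m
kube-system   kube-flannel-x85zd         1/1     Running            13 (3m53s ago)   51m
kube-system   kube-proxy-hq9nf           0/1     CrashLoopBackOff   13 (36s ago)     52m
"""


def parse_pod_table(text):
    """Turn a whitespace-aligned pod table into a list of dicts keyed by column header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Split on runs of 2+ spaces so values like "12 (10s ago)" stay in one field.
    header = re.split(r"\s{2,}", lines[0].strip())
    return [dict(zip(header, re.split(r"\s{2,}", ln.strip()))) for ln in lines[1:]]


def unhealthy(rows):
    """Return names of pods that are not Running or have containers that are not ready."""
    bad = []
    for row in rows:
        ready, total = row["READY"].split("/")
        if row["STATUS"] != "Running" or ready != total:
            bad.append(row["NAME"])
    return bad


if __name__ == "__main__":
    for name in unhealthy(parse_pod_table(SAMPLE)):
        print(name)
```

Note that this simple check would also flag pods in the `Completed` state, so it is only meant for long-running system pods like the ones in this listing. In the full listing, `kube-flannel` and `nginx-proxy` would pass the check even though their restart counts are high, so the `RESTARTS` column is still worth reading by hand.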

zonca commented 7 months ago

I tested a CPU-only deployment with Ubuntu 22 nodes and it worked fine, so the problem seems to be specific to GPU nodes.

zonca commented 4 months ago

@julienchastang have you noticed any node crashes on GPU nodes recently?

julienchastang commented 4 months ago

Do you have any additional information on why the pods are crashing? We've definitely seen node pressure issues on GPU nodes, which were part of the impetus for running JupyterHubs with more minimalist Linux distributions. In those cases I mounted an external disk to hold the containerd files. I'm not sure this is the same issue, though. cc @ana-v-espinoza
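
To rule disk pressure in or out on a node, one rough check is to look at how much free space remains on the filesystem that holds the containerd files. The sketch below is an illustration only: the function names are hypothetical, `/var/lib/containerd` is assumed to be the containerd root directory (its usual default, but a deployment may configure a different path), and the 10% threshold is an illustrative value rather than the setting of any particular cluster.

```python
import shutil


def free_fraction(path):
    """Return the fraction of free space on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def disk_pressure(path, min_free_fraction=0.10):
    """Return True when free space falls below the given fraction (illustrative threshold)."""
    return free_fraction(path) < min_free_fraction


if __name__ == "__main__":
    # Assumed location of the containerd files; fall back to "/" if it does not exist.
    candidate = "/var/lib/containerd"
    try:
        frac = free_fraction(candidate)
    except FileNotFoundError:
        candidate = "/"
        frac = free_fraction(candidate)
    print(f"{candidate}: {frac:.1%} free, pressure={disk_pressure(candidate)}")
```

This has to run on the node itself, for example over SSH. If it reports plenty of free space while pods keep crashing, disk pressure is probably not the cause.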

zonca commented 3 months ago

The crashes on Ubuntu 22 were caused by issues in the driver. With the new driver, Ubuntu 22 works fine, so I am closing this issue.