Closed by zonca 5 months ago
I tested a CPU-only deployment with Ubuntu 22 nodes and it worked fine, so this seems to be specific to GPU nodes.
@julienchastang have you noticed node crashes on GPU recently?
Do you have any additional information on why the pods are crashing? We've definitely seen node pressure issues on GPU nodes, which was sort of the impetus for running JupyterHubs with more minimalist Linux distributions. In those cases I actually mounted an external disk to accommodate the containerd files. Not sure this is the same issue, though. cc @ana-v-espinoza
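In case it helps rule node pressure in or out, here is a minimal sketch that lists nodes currently reporting a pressure condition. It assumes `kubectl` is configured for the cluster. The function names `nodes_under_pressure` and `fetch_nodes` are made up for illustration, while the condition types (`DiskPressure`, `MemoryPressure`, `PIDPressure`) are the standard Kubernetes node conditions.

```python
import json
import subprocess

# Standard Kubernetes node condition types that signal resource pressure.
PRESSURE_TYPES = {"DiskPressure", "MemoryPressure", "PIDPressure"}


def nodes_under_pressure(node_list):
    """Map node name -> list of pressure conditions whose status is "True".

    `node_list` is the parsed output of `kubectl get nodes -o json`.
    """
    flagged = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        active = [
            cond["type"]
            for cond in node.get("status", {}).get("conditions", [])
            if cond.get("type") in PRESSURE_TYPES and cond.get("status") == "True"
        ]
        if active:
            flagged[name] = active
    return flagged


def fetch_nodes():
    """Hypothetical helper: read the node list through kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    for name, conditions in nodes_under_pressure(fetch_nodes()).items():
        print(f"{name}: {', '.join(conditions)}")
```

If nothing is printed while the system pods are failing, node pressure is probably not the cause and the problem lies elsewhere.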
The crashes on Ubuntu 22 were due to issues in the driver. With the new driver, Ubuntu 22 works fine, so I am closing this issue.
On Ubuntu 20, GPU nodes work fine; on Ubuntu 22, however, all system pods intermittently fail. After a node is rebooted, the pods seem to work fine for a few minutes and then crash.
See minimal debugging performed here: https://github.com/zonca/jetstream_kubespray/pull/29#issuecomment-1935148755