zonca / jupyterhub-deploy-kubernetes-jetstream

Configuration files for my tutorials on deploying JupyterHub on top of Kubernetes on XSEDE Jetstream (Openstack)
https://zonca.dev/categories/#jetstream

How to deal with a 'hub' pod that is stuck in "container creating" status? #74

Closed ktyle closed 9 months ago

ktyle commented 9 months ago

Hi, I wonder if @zonca, @ana-v-espinoza, and/or @julienchastang might have insight on this. For the last several days our JS2 binderhub has been non-functional because the hub pod is stuck in ContainerCreating status. I've restarted the master and worker nodes, but the pod remains stuck in that state.

The cluster had been running without issues for the last ~8 months or so.

Might it be time to rebuild/redeploy with the most recent Terraform recipe?

Below is the output from the kubectl get pods command:

NAME                                                         READY   STATUS              RESTARTS       AGE
binder-855d85bbbf-d7j6d                                      1/1     Terminating         0              5d16h
binder-855d85bbbf-h94zd                                      1/1     Running             0              14m
binderhub-dind-h2ckl                                         1/1     Running             2 (227d ago)   272d
binderhub-dind-m7rcz                                         1/1     Running             4 (20m ago)    272d
binderhub-dind-mtx9z                                         1/1     Running             3 (188d ago)   272d
binderhub-dind-spfs9                                         1/1     Running             3 (20m ago)    193d
binderhub-image-cleaner-489kw                                1/1     Running             2 (20m ago)    169d
binderhub-image-cleaner-g9x9s                                1/1     Running             0              63d
binderhub-image-cleaner-rxctt                                1/1     Running             2 (227d ago)   272d
binderhub-image-cleaner-sbvx6                                1/1     Running             5 (20m ago)    272d
binderhub-proxy-nginx-ingress-controller-697dd69db8-q5x76    1/1     Running             4 (20m ago)    272d
hub-7d7494b966-gjfzb                                         0/1     ContainerCreating   0              14m
hub-7d7494b966-rvqvj                                         0/1     Terminating         0              5d16h
jupyter-projectpythia-2dc-2dns-2daws-2dcookbook-2d9kfzfrnl   1/1     Running             1 (20m ago)    5d18h
jupyter-projectpythia-2de-2dactive-2dcookbook-2daifq0096     1/1     Running             1 (20m ago)    5d17h
jupyter-projectpythia-2de-2dactive-2dcookbook-2dhop4x0ae     1/1     Running             1 (20m ago)    5d18h
jupyter-projectpythia-2de-2dactive-2dcookbook-2doqx3af76     1/1     Running             1 (20m ago)    5d17h
jupyter-projectpythia-2de-2dactive-2dcookbook-2dqwb85bc8     0/1     Evicted             0              6d18h
jupyter-projectpythia-2dhrrr-2daws-2dcookbook-2d227lrxmt     1/1     Running             1 (20m ago)    5d17h
jupyter-projectpythia-2dkerchunk-2dcookbook-2dhgwxi6y2       1/1     Running             1 (20m ago)    9d
jupyter-projectpythia-2dradar-2dcookbook-2dmoh5m3e2          1/1     Running             1 (20m ago)    5d18h
proxy-7555796b5f-kswkr                                       1/1     Running             4 (20m ago)    272d
user-scheduler-5b75968f48-pw2h8                              1/1     Running             0              14m
user-scheduler-5b75968f48-twzg5                              1/1     Terminating         0              5d16h
user-scheduler-5b75968f48-vqplb                              1/1     Running             13 (20m ago)   272d
zonca commented 9 months ago

what does describe pod say?

ktyle commented 9 months ago
kubectl describe pod hub-7d7494b966-gjfzb
Name:           hub-7d7494b966-gjfzb
Namespace:      binderhub
Priority:       0
Node:           projectpythia-k8s-node-1/10.0.0.111
Start Time:     Mon, 12 Feb 2024 20:12:59 +0000
Labels:         app=jupyterhub
                component=hub
                hub.jupyter.org/network-access-proxy-api=true
                hub.jupyter.org/network-access-proxy-http=true
                hub.jupyter.org/network-access-singleuser=true
                pod-template-hash=7d7494b966
                release=binderhub
Annotations:    checksum/config-map: 58d417bf1e34984e76d29278f76cffd9ce5ea5031d5ca302674cfbff7282c281
                checksum/secret: 853d08046f64a310ba36430b3ef639f973f9f4f8c9b19d9d3990f133b413773a
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/hub-7d7494b966
Containers:
  hub:
    Container ID:
    Image:         jupyterhub/k8s-hub:2.0.0
    Image ID:
    Port:          8081/TCP
    Host Port:     0/TCP
    Args:
      jupyterhub
      --config
      /usr/local/etc/jupyterhub/jupyterhub_config.py
      --upgrade-db
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:http/hub/health delay=300s timeout=3s period=10s #success=1 #failure=30
    Readiness:      http-get http://:http/hub/health delay=0s timeout=1s period=2s #success=1 #failure=1000
    Environment:
      PYTHONUNBUFFERED:        1
      HELM_RELEASE_NAME:       binderhub
      POD_NAMESPACE:           binderhub (v1:metadata.namespace)
      CONFIGPROXY_AUTH_TOKEN:  <set to the key 'hub.config.ConfigurableHTTPProxy.auth_token' in secret 'hub'>  Optional: false
    Mounts:
      /srv/jupyterhub from pvc (rw)
      /usr/local/etc/jupyterhub/config/ from config (rw)
      /usr/local/etc/jupyterhub/jupyterhub_config.py from config (rw,path="jupyterhub_config.py")
      /usr/local/etc/jupyterhub/secret/ from secret (rw)
      /usr/local/etc/jupyterhub/z2jh.py from config (rw,path="z2jh.py")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rcrjp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      hub
    Optional:  false
  secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hub
    Optional:    false
  pvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  hub-db-dir
    ReadOnly:   false
  kube-api-access-rcrjp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 hub.jupyter.org/dedicated=core:NoSchedule
                             hub.jupyter.org_dedicated=core:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                   From               Message
  ----     ------       ----                  ----               -------
  Normal   Scheduled    26m                   default-scheduler  Successfully assigned binderhub/hub-7d7494b966-gjfzb to projectpythia-k8s-node-1
  Warning  FailedMount  15m (x3 over 24m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[pvc], unattached volumes=[pvc kube-api-access-rcrjp config secret]: timed out waiting for the condition
  Warning  FailedMount  10m                   kubelet            Unable to attach or mount volumes: unmounted volumes=[pvc], unattached volumes=[kube-api-access-rcrjp config secret pvc]: timed out waiting for the condition
  Warning  FailedMount  5m59s (x2 over 12m)   kubelet            Unable to attach or mount volumes: unmounted volumes=[pvc], unattached volumes=[secret pvc kube-api-access-rcrjp config]: timed out waiting for the condition
  Warning  FailedMount  3m43s (x4 over 21m)   kubelet            Unable to attach or mount volumes: unmounted volumes=[pvc], unattached volumes=[config secret pvc kube-api-access-rcrjp]: timed out waiting for the condition
  Warning  FailedMount  3m36s (x19 over 26m)  kubelet            MountVolume.WaitForAttach failed for volume "pvc-9362424f-4e32-4815-9154-d70cfcf2d7f1" : volume attachment is being deleted
zonca commented 9 months ago

try to understand what is wrong with the volume hub-db-dir

kubectl describe pvc hub-db-dir

or something similar
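
For example, something along these lines should show the claim, the PV it is bound to, and any related events (the binderhub namespace is taken from the describe pod output above; <pv-name> is a placeholder for the VOLUME shown by the first command):

# describe the claim and note which PV it is bound to
kubectl describe pvc hub-db-dir -n binderhub

# then inspect that PV directly
kubectl get pv
kubectl describe pv <pv-name>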

zonca commented 9 months ago

it is possible the volume is stuck and Jetstream support can fix it

ana-v-espinoza commented 9 months ago

Hey Kevin,

We've run into this a few times, and we've never really gotten to the bottom of why it happens, but here are a few things we do whenever we run into a situation like this:

1. See if Kubernetes thinks the Persistent Volume (the Kubernetes resource) is attached: kubectl get volumeattachments

2. See if openstack thinks the volume (the openstack resource) is attached to something: openstack volume show <pvc-name>

3. See if openstack thinks that the server has the volume attached to it (as per kubectl get volumeattachments): openstack server show <node-name>

We've noticed that sometimes different components of the cluster disagree on whether or not the volume is attached to a node. If openstack doesn't think the volume is attached to anything, you can do kubectl delete volumeattachment <va-name> and potentially fix the problem. If openstack volume show and openstack server show tell you differing things, you may need to get JS2 involved.
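
Concretely, that check sequence looks roughly like this (same placeholders as above; the PV name from the volumeattachments output is also the name of the cinder volume on the openstack side):

# 1. what Kubernetes thinks: which PV is attached to which node
kubectl get volumeattachments

# 2. what openstack thinks about the volume itself
openstack volume show <pvc-name>

# 3. what openstack thinks is attached to the node named in the volumeattachment
openstack server show <node-name>

# if openstack says the volume is not attached anywhere, remove the stale attachment
kubectl delete volumeattachment <va-name>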

ktyle commented 9 months ago

Thanks @ana-v-espinoza ! Based on what I see below, it looks like k8s thinks the pvc is attached to one of the worker nodes, but the two openstack commands think it's not. Can you please confirm my interpretation and advise if you think it's okay to proceed with the kubectl delete volumeattachment step?

kubectl get volumeattachments

NAME                                                                   ATTACHER                   PV                                         NODE                       ATTACHED   AGE
csi-aa5a1281cdc776529a0019401319839cc56e314e78b7d24cd39d0c5462fb45b1   cinder.csi.openstack.org   pvc-9362424f-4e32-4815-9154-d70cfcf2d7f1   projectpythia-k8s-node-1   true       64d

openstack volume show pvc-9362424f-4e32-4815-9154-d70cfcf2d7f1

+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field                        | Value                                                                                                                                                                                                                                                                                                     |
+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| attachments                  | [{'id': 'dd669efe-8f93-4066-bef7-2262822f0546', 'attachment_id': 'ff3f71cd-10ba-4d08-8d52-0d83ae2ae5bd', 'volume_id': 'dd669efe-8f93-4066-bef7-2262822f0546', 'server_id': 'fdd3cb15-1ad8-49be-9054-c72b399388ed', 'host_name': None, 'device': '/dev/sdb', 'attached_at': '2023-12-10T21:09:40.000000'}] |
| availability_zone            | nova                                                                                                                                                                                                                                                                                                      |
| bootable                     | false                                                                                                                                                                                                                                                                                                     |
| consistencygroup_id          | None                                                                                                                                                                                                                                                                                                      |
| created_at                   | 2023-05-16T02:43:01.000000                                                                                                                                                                                                                                                                                |
| description                  | Created by OpenStack Cinder CSI driver                                                                                                                                                                                                                                                                    |
| encrypted                    | False                                                                                                                                                                                                                                                                                                     |
| id                           | dd669efe-8f93-4066-bef7-2262822f0546                                                                                                                                                                                                                                                                      |
| multiattach                  | False                                                                                                                                                                                                                                                                                                     |
| name                         | pvc-9362424f-4e32-4815-9154-d70cfcf2d7f1                                                                                                                                                                                                                                                                  |
| os-vol-tenant-attr:tenant_id | 261a8729f00f46c9a2009c33e5112623                                                                                                                                                                                                                                                                          |
| properties                   | cinder.csi.openstack.org/cluster='kubernetes', csi.storage.k8s.io/pv/name='pvc-9362424f-4e32-4815-9154-d70cfcf2d7f1', csi.storage.k8s.io/pvc/name='hub-db-dir', csi.storage.k8s.io/pvc/namespace='binderhub'                                                                                              |
| replication_status           | None                                                                                                                                                                                                                                                                                                      |
| size                         | 1                                                                                                                                                                                                                                                                                                         |
| snapshot_id                  | None                                                                                                                                                                                                                                                                                                      |
| source_volid                 | None                                                                                                                                                                                                                                                                                                      |
| status                       | in-use                                                                                                                                                                                                                                                                                                    |
| type                         | __DEFAULT__                                                                                                                                                                                                                                                                                               |
| updated_at                   | 2024-02-07T04:27:49.000000                                                                                                                                                                                                                                                                                |
| user_id                      | 7c41cf39de9f49bda9c8fab272730c90       

openstack server show projectpythia-k8s-node-1

| Field                               | Value                                                                                                                                                                         |
+-------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                                                                                                                                        |
| OS-EXT-AZ:availability_zone         | nova                                                                                                                                                                          |
| OS-EXT-SRV-ATTR:host                | None                                                                                                                                                                          |
| OS-EXT-SRV-ATTR:hostname            | projectpythia-k8s-node-1                                                                                                                                                      |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None                                                                                                                                                                          |
| OS-EXT-SRV-ATTR:instance_name       | None                                                                                                                                                                          |
| OS-EXT-SRV-ATTR:kernel_id           | None                                                                                                                                                                          |
| OS-EXT-SRV-ATTR:launch_index        | None                                                                                                                                                                          |
| OS-EXT-SRV-ATTR:ramdisk_id          | None                                                                                                                                                                          |
| OS-EXT-SRV-ATTR:reservation_id      | None                                                                                                                                                                          |
| OS-EXT-SRV-ATTR:root_device_name    | None                                                                                                                                                                          |
| OS-EXT-SRV-ATTR:user_data           | None                                                                                                                                                                          |
| OS-EXT-STS:power_state              | Running                                                                                                                                                                       |
| OS-EXT-STS:task_state               | None                                                                                                                                                                          |
| OS-EXT-STS:vm_state                 | active                                                                                                                                                                        |
| OS-SCH-HNT:scheduler_hints          | None                                                                                                                                                                          |
| OS-SRV-USG:launched_at              | 2023-06-19T14:23:27.000000                                                                                                                                                    |
| OS-SRV-USG:terminated_at            | None                                                                                                                                                                          |
| accessIPv4                          |                                                                                                                                                                               |
| accessIPv6                          |                                                                                                                                                                               |
| access_ipv4                         |                                                                                                                                                                               |
| access_ipv6                         |                                                                                                                                                                               |
| addresses                           | projectpythia-network=10.0.0.111, 149.165.172.12                                                                                                                              |
| adminPass                           | None                                                                                                                                                                          |
| admin_password                      | None                                                                                                                                                                          |
| attached_volumes                    | []                                                                                                                                                                            |
| availability_zone                   | nova                                                                                                                                                                          |
| block_device_mapping                | None                                                                                                                                                                          |
| block_device_mapping_v2             | None                                                                                                                                                                          |
| compute_host                        | None                                                                                                                                                                          |
| config_drive                        |                                                                                                                                                                               |
| created                             | 2023-05-04T17:49:22Z                                                                                                                                                          |
| created_at                          | 2023-05-04T17:49:22Z                                                                                                                                                          |
| description                         | projectpythia-k8s-node-1                                                                                                                                                      |
| disk_config                         | MANUAL                                                                                                                                                                        |
| fault                               | None                                                                                                                                                                          |
| flavor                              | m3.xl (m3.xl)                                                                                                                                                                 |
| flavorRef                           | None                                                                                                                                                                          |
| flavor_id                           | None                                                                                                                                                                          |
| has_config_drive                    |                                                                                                                                                                               |
| hostId                              | 5d18d40884e5926203ceddeb354729b4087325622d0c1dc7ae727340                                                                                                                      |
| host_id                             | 5d18d40884e5926203ceddeb354729b4087325622d0c1dc7ae727340                                                                                                                      |
| host_status                         | None                                                                                                                                                                          |
| hostname                            | projectpythia-k8s-node-1                                                                                                                                                      |
| hypervisor_hostname                 | None                                                                                                                                                                          |
| id                                  | fdd3cb15-1ad8-49be-9054-c72b399388ed                                                                                                                                          |
| image                               | 6a7c5c72-23f3-412c-8d15-bdd5756f660f                                                                                                                                          |
| imageRef                            | None                                                                                                                                                                          |
| image_id                            | None                                                                                                                                                                          |
| instance_name                       | None                                                                                                                                                                          |
| interface_ip                        |                                                                                                                                                                               |
| is_locked                           | False                                                                                                                                                                         |
| kernel_id                           | None                                                                                                                                                                          |
| key_name                            | kubernetes-projectpythia                                                                                                                                                      |
| launch_index                        | None                                                                                                                                                                          |
| launched_at                         | 2023-06-19T14:23:27.000000                                                                                                                                                    |
| location                            | Munch({'cloud': '', 'region_name': 'IU', 'zone': 'nova', 'project': Munch({'id': '261a8729f00f46c9a2009c33e5112623', 'name': None, 'domain_id': None, 'domain_name': None})}) |
| locked                              | False                                                                                                                                                                         |
| max_count                           | None                                                                                                                                                                          |
| min_count                           | None                                                                                                                                                                          |
| name                                | projectpythia-k8s-node-1                                                                                                                                                      |
| networks                            | None                                                                                                                                                                          |
| power_state                         | 1                                                                                                                                                                             |
| private_v4                          |                                                                                                                                                                               |
| private_v6                          |                                                                                                                                                                               |
| progress                            | 0                                                                                                                                                                             |
| project_id                          | 261a8729f00f46c9a2009c33e5112623                                                                                                                                              |
| properties                          | depends_on=' 392a5a08-243c-4b22-8c9b-c8b22cbd4d50 ', kubespray_groups='kube_node,k8s_cluster,', ssh_user='ubuntu', use_access_ip='0'                                          |
| public_v4                           |                                                                                                                                                                               |
| public_v6                           |                                                                                                                                                                               |
| ramdisk_id                          | None                                                                                                                                                                          |
| reservation_id                      | None                                                                                                                                                                          |
| root_device_name                    | None                                                                                                                                                                          |
| scheduler_hints                     | None                                                                                                                                                                          |
| security_groups                     | name='projectpythia-k8s-worker'                                                                                                                                               |
|                                     | name='projectpythia-k8s'                                                                                                                                                      |
| server_groups                       | []                                                                                                                                                                            |
| status                              | ACTIVE                                                                                                                                                                        |
| tags                                |                                                                                                                                                                               |
| task_state                          | None                                                                                                                                                                          |
| terminated_at                       | None                                                                                                                                                                          |
| trusted_image_certificates          | None                                                                                                                                                                          |
| updated                             | 2024-02-12T20:06:23Z                                                                                                                                                          |
| updated_at                          | 2024-02-12T20:06:23Z                                                                                                                                                          |
| user_data                           | None                                                                                                                                                                          |
| user_id                             | 7c41cf39de9f49bda9c8fab272730c90                                                                                                                                              |
| vm_state                            | active                                                                                                                                                                        |
| volumes                             | []                                                                                                                                                                            |
| volumes_attached                    |                                                                                                                                                                               |
ana-v-espinoza commented 9 months ago

I just took a second glance at your kubectl get pods output, and it looks like you have a few pods that have been "Terminating" for the last 5 days. As you can imagine, that's unusual. Can you confirm whether the node those pods are running on (they're not fully deleted yet) is operational?

Run kubectl get pods -o wide | grep Terminating, followed by kubectl get nodes -o wide.

The status of that node should be "Ready".

It may also be worthwhile to poke around that node (check that the disk isn't full, check that it's online and reachable, etc.). SSH into that worker node by jumping through the main node of the cluster: ssh -J ubuntu@<main-node-IP> ubuntu@<node-private-ip>
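
Put together, something like this (the df -h at the end is just one quick way to check for a full disk):

# which node are the stuck Terminating pods on?
kubectl get pods -o wide | grep Terminating

# is that node Ready?
kubectl get nodes -o wide

# jump through the main node to the worker and look around
ssh -J ubuntu@<main-node-IP> ubuntu@<node-private-ip>
df -h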

ana-v-espinoza commented 9 months ago

@ktyle Thanks for that output! Yep, looks like openstack thinks everything is normal. This helps narrow things down.

I took the public IP of your worker node from the openstack server show output and tried to ping it, with no luck:

ana@ubuntu:~$ ping xxxxxxxxxx
PING xxxxxxxxxx (xxxxxxxxxx) 56(84) bytes of data.
^C
--- xxxxxxxxxxx ping statistics ---
11 packets transmitted, 0 received, 100% packet loss, time 10236ms

This further helps to narrow down the problem. The node appears to be unreachable. While openstack thinks/knows that the volume isn't attached to this node, Kubernetes doesn't have this information, as the node is apparently offline.

Deleting the Kubernetes volume attachment (you may need to add the --force option, i.e. kubectl delete volumeattachment --force <va-name>) should at the very least tell Kubernetes to disassociate that volume from that node, and allow the Hub pod to be rescheduled on another available worker node.

Unfortunately, I wouldn't know what may have caused the node to go offline to begin with. I believe there was some JS2 maintenance last week?
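
For reference, a minimal sketch of that cleanup; the finalizer patch is a generic last-resort Kubernetes workaround for objects stuck in deletion (it bypasses the CSI driver's own cleanup), not anything specific to this deployment:

# remove the stale attachment so the hub pod can be rescheduled
kubectl delete volumeattachment <va-name> --force

# last resort if the delete hangs indefinitely: clear the finalizers
kubectl patch volumeattachment <va-name> --type=merge -p '{"metadata":{"finalizers":null}}'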

ktyle commented 9 months ago

Ok, thanks @ana-v-espinoza! Do I need to run an openstack server stop on that worker node as well (openstack server stop projectpythia-k8s-node-1)?

ana-v-espinoza commented 9 months ago

A restart of that server probably wouldn't hurt.
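
For example, a soft reboot first, then a hard reboot if the node stays unreachable:

openstack server reboot projectpythia-k8s-node-1
openstack server reboot --hard projectpythia-k8s-node-1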

Although on closer inspection of your output, it does seem that openstack reports the volume as attached to the server, while the server doesn't show any volumes attached to it.

From your openstack volume show (note the server_id):

+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field                        | Value                                                                                                                                                                                                                                                                                                     |
+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| attachments                  | [{'id': 'dd669efe-8f93-4066-bef7-2262822f0546', 'attachment_id': 'ff3f71cd-10ba-4d08-8d52-0d83ae2ae5bd', 'volume_id': 'dd669efe-8f93-4066-bef7-2262822f0546', 'server_id': 'fdd3cb15-1ad8-49be-9054-c72b399388ed', 'host_name': None, 'device': '/dev/sdb', 'attached_at': '2023-12-10T21:09:40.000000'}] |

From your openstack server show (again, note the same server id, but an empty volumes_attached field):

...
| id                                  | fdd3cb15-1ad8-49be-9054-c72b399388ed       
...
| volumes_attached                    |          

This may need some JS2 intervention after all. Apologies for not picking up on that earlier; ironically, I missed that detail because it was at the very top of the output. It does still seem like there's something more going on here, though, since the node appears to be unreachable. Can you confirm this?

ktyle commented 9 months ago

I had just restarted the node, and I can indeed ssh to it by jumping from the public IP on the master node to its private IP. The kubectl delete volumeattachment command prints a message that the attachment is deleted, but it never completes; after Ctrl-C'ing, the volume attachment persists.

Looks like JS2 support would need to get involved, but given that it's been a long time since I've deployed, I might try a new deployment, perhaps taking advantage of the scaling work you've done in #71?

ana-v-espinoza commented 9 months ago

I'll leave that decision up to you! However, I would suggest you ask JS2 to take a look first. I'm not convinced that this problem, which Julien and I have referred to as "volumes being in an inconsistent state" (since different openstack commands tell you different things), is related to the lifetime of a cluster (or of individual instances), as we've had this happen with relatively young clusters.

If they could fix this, and everything functions as expected afterwards, it would save you having to reprovision everything.

Lastly, the "soft scaling" that I mention in that issue and subsequent blog post is meant to scale existing clusters up/down with "pre-provisioned" nodes. In other words, you can soft-scale your existing cluster. I'm not sure if that was unclear, or if I'm misinterpreting what you mean by taking advantage of my scaling work!

Best of luck, and please let me know when this comes back online or if you need any more help troubleshooting.

ktyle commented 9 months ago

Thanks so much for the quick and detailed troubleshooting suggestions, @ana-v-espinoza ! I will take your advice and open a ticket on JS2.