What does describe pod say?
kubectl describe pod hub-7d7494b966-gjfzb
Name: hub-7d7494b966-gjfzb
Namespace: binderhub
Priority: 0
Node: projectpythia-k8s-node-1/10.0.0.111
Start Time: Mon, 12 Feb 2024 20:12:59 +0000
Labels: app=jupyterhub
component=hub
hub.jupyter.org/network-access-proxy-api=true
hub.jupyter.org/network-access-proxy-http=true
hub.jupyter.org/network-access-singleuser=true
pod-template-hash=7d7494b966
release=binderhub
Annotations: checksum/config-map: 58d417bf1e34984e76d29278f76cffd9ce5ea5031d5ca302674cfbff7282c281
checksum/secret: 853d08046f64a310ba36430b3ef639f973f9f4f8c9b19d9d3990f133b413773a
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/hub-7d7494b966
Containers:
hub:
Container ID:
Image: jupyterhub/k8s-hub:2.0.0
Image ID:
Port: 8081/TCP
Host Port: 0/TCP
Args:
jupyterhub
--config
/usr/local/etc/jupyterhub/jupyterhub_config.py
--upgrade-db
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Liveness: http-get http://:http/hub/health delay=300s timeout=3s period=10s #success=1 #failure=30
Readiness: http-get http://:http/hub/health delay=0s timeout=1s period=2s #success=1 #failure=1000
Environment:
PYTHONUNBUFFERED: 1
HELM_RELEASE_NAME: binderhub
POD_NAMESPACE: binderhub (v1:metadata.namespace)
CONFIGPROXY_AUTH_TOKEN: <set to the key 'hub.config.ConfigurableHTTPProxy.auth_token' in secret 'hub'> Optional: false
Mounts:
/srv/jupyterhub from pvc (rw)
/usr/local/etc/jupyterhub/config/ from config (rw)
/usr/local/etc/jupyterhub/jupyterhub_config.py from config (rw,path="jupyterhub_config.py")
/usr/local/etc/jupyterhub/secret/ from secret (rw)
/usr/local/etc/jupyterhub/z2jh.py from config (rw,path="z2jh.py")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rcrjp (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: hub
Optional: false
secret:
Type: Secret (a volume populated by a Secret)
SecretName: hub
Optional: false
pvc:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: hub-db-dir
ReadOnly: false
kube-api-access-rcrjp:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: hub.jupyter.org/dedicated=core:NoSchedule
hub.jupyter.org_dedicated=core:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 26m default-scheduler Successfully assigned binderhub/hub-7d7494b966-gjfzb to projectpythia-k8s-node-1
Warning FailedMount 15m (x3 over 24m) kubelet Unable to attach or mount volumes: unmounted volumes=[pvc], unattached volumes=[pvc kube-api-access-rcrjp config secret]: timed out waiting for the condition
Warning FailedMount 10m kubelet Unable to attach or mount volumes: unmounted volumes=[pvc], unattached volumes=[kube-api-access-rcrjp config secret pvc]: timed out waiting for the condition
Warning FailedMount 5m59s (x2 over 12m) kubelet Unable to attach or mount volumes: unmounted volumes=[pvc], unattached volumes=[secret pvc kube-api-access-rcrjp config]: timed out waiting for the condition
Warning FailedMount 3m43s (x4 over 21m) kubelet Unable to attach or mount volumes: unmounted volumes=[pvc], unattached volumes=[config secret pvc kube-api-access-rcrjp]: timed out waiting for the condition
Warning FailedMount 3m36s (x19 over 26m) kubelet MountVolume.WaitForAttach failed for volume "pvc-9362424f-4e32-4815-9154-d70cfcf2d7f1" : volume attachment is being deleted
Try to understand what is wrong with the volume hub-db-dir:
kubectl describe pvc hub-db-dir
or something similar.
It is possible the volume is stuck and Jetstream support can fix it.
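For example, a rough sequence of checks (just a sketch; the PV name is taken from the FailedMount events above, and I'm assuming the binderhub namespace):
# Inspect the claim and the PersistentVolume it is bound to
kubectl describe pvc hub-db-dir -n binderhub
kubectl describe pv pvc-9362424f-4e32-4815-9154-d70cfcf2d7f1
# Recent events in the namespace usually show why the mount keeps failing
kubectl get events -n binderhub --sort-by=.lastTimestamp | tail -n 20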
Hey Kevin,
We've run into this a few times, and we've never really been able to get to the bottom of why this happens, but here are a few things that we do whenever we run into a situation like this:
See if Kubernetes thinks the Persistent Volume (the Kubernetes resource) is attached
kubectl get volumeattachments
See if openstack thinks the volume (the openstack resource) is attached to something
openstack volume show <pvc-name>
See if openstack thinks that the server has the volume attached to it (as per kubectl get volumeattachments)
openstack server show <node-name>
We've noticed that sometimes different components of the cluster disagree on whether or not the volume is attached to a node. If openstack doesn't think the volume is attached to anything, you can do kubectl delete volumeattachment <va-name> and potentially fix the problem. If openstack volume show and openstack server show tell you differing things, you may need to get JS2 involved.
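Putting those together, a sketch of the whole sequence (with <pvc-name>, <node-name>, and <va-name> as placeholders for your actual resources):
# 1. Does Kubernetes think the PV is attached, and to which node?
kubectl get volumeattachments
# 2. Does openstack think the volume is attached? (the PV name doubles as the openstack volume name)
openstack volume show <pvc-name>
# 3. Does openstack think the server has the volume attached to it?
openstack server show <node-name>
# If openstack shows the volume as detached everywhere, removing the stale attachment may unstick things
kubectl delete volumeattachment <va-name>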
Thanks @ana-v-espinoza ! Based on what I see below, it looks like k8s thinks the pvc is attached to one of the worker nodes, but the two openstack commands think it's not. Can you please confirm my interpretation and advise if you think it's okay to proceed with the kubectl delete volumeattachment step?
kubectl get volumeattachments
csi-aa5a1281cdc776529a0019401319839cc56e314e78b7d24cd39d0c5462fb45b1 cinder.csi.openstack.org pvc-9362424f-4e32-4815-9154-d70cfcf2d7f1 projectpythia-k8s-node-1 true 64d
openstack volume show pvc-9362424f-4e32-4815-9154-d70cfcf2d7f1
+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| attachments | [{'id': 'dd669efe-8f93-4066-bef7-2262822f0546', 'attachment_id': 'ff3f71cd-10ba-4d08-8d52-0d83ae2ae5bd', 'volume_id': 'dd669efe-8f93-4066-bef7-2262822f0546', 'server_id': 'fdd3cb15-1ad8-49be-9054-c72b399388ed', 'host_name': None, 'device': '/dev/sdb', 'attached_at': '2023-12-10T21:09:40.000000'}] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2023-05-16T02:43:01.000000 |
| description | Created by OpenStack Cinder CSI driver |
| encrypted | False |
| id | dd669efe-8f93-4066-bef7-2262822f0546 |
| multiattach | False |
| name | pvc-9362424f-4e32-4815-9154-d70cfcf2d7f1 |
| os-vol-tenant-attr:tenant_id | 261a8729f00f46c9a2009c33e5112623 |
| properties | cinder.csi.openstack.org/cluster='kubernetes', csi.storage.k8s.io/pv/name='pvc-9362424f-4e32-4815-9154-d70cfcf2d7f1', csi.storage.k8s.io/pvc/name='hub-db-dir', csi.storage.k8s.io/pvc/namespace='binderhub' |
| replication_status | None |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | in-use |
| type | __DEFAULT__ |
| updated_at | 2024-02-07T04:27:49.000000 |
| user_id | 7c41cf39de9f49bda9c8fab272730c90
openstack server show projectpythia-k8s-node-1
| Field | Value |
+-------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | None |
| OS-EXT-SRV-ATTR:hostname | projectpythia-k8s-node-1 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None |
| OS-EXT-SRV-ATTR:instance_name | None |
| OS-EXT-SRV-ATTR:kernel_id | None |
| OS-EXT-SRV-ATTR:launch_index | None |
| OS-EXT-SRV-ATTR:ramdisk_id | None |
| OS-EXT-SRV-ATTR:reservation_id | None |
| OS-EXT-SRV-ATTR:root_device_name | None |
| OS-EXT-SRV-ATTR:user_data | None |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SCH-HNT:scheduler_hints | None |
| OS-SRV-USG:launched_at | 2023-06-19T14:23:27.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| access_ipv4 | |
| access_ipv6 | |
| addresses | projectpythia-network=10.0.0.111, 149.165.172.12 |
| adminPass | None |
| admin_password | None |
| attached_volumes | [] |
| availability_zone | nova |
| block_device_mapping | None |
| block_device_mapping_v2 | None |
| compute_host | None |
| config_drive | |
| created | 2023-05-04T17:49:22Z |
| created_at | 2023-05-04T17:49:22Z |
| description | projectpythia-k8s-node-1 |
| disk_config | MANUAL |
| fault | None |
| flavor | m3.xl (m3.xl) |
| flavorRef | None |
| flavor_id | None |
| has_config_drive | |
| hostId | 5d18d40884e5926203ceddeb354729b4087325622d0c1dc7ae727340 |
| host_id | 5d18d40884e5926203ceddeb354729b4087325622d0c1dc7ae727340 |
| host_status | None |
| hostname | projectpythia-k8s-node-1 |
| hypervisor_hostname | None |
| id | fdd3cb15-1ad8-49be-9054-c72b399388ed |
| image | 6a7c5c72-23f3-412c-8d15-bdd5756f660f |
| imageRef | None |
| image_id | None |
| instance_name | None |
| interface_ip | |
| is_locked | False |
| kernel_id | None |
| key_name | kubernetes-projectpythia |
| launch_index | None |
| launched_at | 2023-06-19T14:23:27.000000 |
| location | Munch({'cloud': '', 'region_name': 'IU', 'zone': 'nova', 'project': Munch({'id': '261a8729f00f46c9a2009c33e5112623', 'name': None, 'domain_id': None, 'domain_name': None})}) |
| locked | False |
| max_count | None |
| min_count | None |
| name | projectpythia-k8s-node-1 |
| networks | None |
| power_state | 1 |
| private_v4 | |
| private_v6 | |
| progress | 0 |
| project_id | 261a8729f00f46c9a2009c33e5112623 |
| properties | depends_on=' 392a5a08-243c-4b22-8c9b-c8b22cbd4d50 ', kubespray_groups='kube_node,k8s_cluster,', ssh_user='ubuntu', use_access_ip='0' |
| public_v4 | |
| public_v6 | |
| ramdisk_id | None |
| reservation_id | None |
| root_device_name | None |
| scheduler_hints | None |
| security_groups | name='projectpythia-k8s-worker' |
| | name='projectpythia-k8s' |
| server_groups | [] |
| status | ACTIVE |
| tags | |
| task_state | None |
| terminated_at | None |
| trusted_image_certificates | None |
| updated | 2024-02-12T20:06:23Z |
| updated_at | 2024-02-12T20:06:23Z |
| user_data | None |
| user_id | 7c41cf39de9f49bda9c8fab272730c90 |
| vm_state | active |
| volumes | [] |
| volumes_attached | |
I just took a second glance at your kubectl get pods output, and it looks like you have a few Pods that have been "Terminating" for the last 5 days. As you can imagine, that's unusual. Can you confirm whether or not the node that those pods are running on (they're not quite yet deleted) is operational?
kubectl get pods -o wide | grep Terminating
followed by a
kubectl get nodes -o wide
The status of that node should be "Ready".
It may also be worthwhile to poke around that node (check that the disk isn't full, check that it's online and reachable, etc.). SSH into that worker node by jumping through the main node of the cluster:
ssh -J ubuntu@<main-node-IP> ubuntu@<node-private-ip>
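Once you're on the node, a few quick health checks (just a sketch, not an exhaustive list) could be something like:
# Is the disk full?
df -h
# Is the kubelet running?
sudo systemctl status kubelet --no-pager
# Anything suspicious in the recent kubelet logs?
sudo journalctl -u kubelet --since "1 hour ago" | tail -n 50
# Does the OS see the attached volume at all?
lsblk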
@ktyle Thanks for that output! Yep, looks like openstack thinks everything is normal. This helps narrow things down.
I took the public IP of your worker node from the openstack server show output and tried to ping it, with no luck:
ana@ubuntu:~$ ping xxxxxxxxxx
PING xxxxxxxxxx (xxxxxxxxxx) 56(84) bytes of data.
^C
--- xxxxxxxxxxx ping statistics ---
11 packets transmitted, 0 received, 100% packet loss, time 10236ms
This further helps to narrow down the problem. The node appears to be unreachable. While openstack thinks/knows that the volume isn't attached to this node, Kubernetes doesn't have this information, as the node is apparently offline.
Deleting the Kubernetes volume attachment (you may need to add the --force option, i.e. kubectl delete volumeattachment --force <va-name>) should at the very least tell Kubernetes to disassociate that volume from that node, and allow the Hub pod to be rescheduled on another available worker node.
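For concreteness, using the attachment name from your kubectl get volumeattachments output above, that would look something like this (add -n binderhub to the pod command if binderhub isn't your default namespace):
# Remove the stale attachment record; add --force only if the plain delete hangs
kubectl delete volumeattachment csi-aa5a1281cdc776529a0019401319839cc56e314e78b7d24cd39d0c5462fb45b1
# Then watch whether the hub pod gets past ContainerCreating
kubectl get pods -w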
Unfortunately, I wouldn't know what may have caused the node to go offline to begin with. I believe there was some JS2 maintenance last week?
Ok thanks @ana-v-espinoza ! Do I need to run an openstack server stop on that worker node as well (openstack server stop projectpythia-k8s-node-1)?
A restart of that server probably wouldn't hurt.
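If you do restart it, a reboot through openstack (soft first, then hard if the instance is unresponsive; just a sketch) would be something like:
# soft reboot first
openstack server reboot projectpythia-k8s-node-1
# fall back to a hard reboot if the instance doesn't respond
openstack server reboot --hard projectpythia-k8s-node-1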
Although on closer inspection of your output, it does seem that openstack says the volume is attached, while the server doesn't see any volumes attached to it.
From your openstack volume show (note the server_id):
+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| attachments | [{'id': 'dd669efe-8f93-4066-bef7-2262822f0546', 'attachment_id': 'ff3f71cd-10ba-4d08-8d52-0d83ae2ae5bd', 'volume_id': 'dd669efe-8f93-4066-bef7-2262822f0546', 'server_id': 'fdd3cb15-1ad8-49be-9054-c72b399388ed', 'host_name': None, 'device': '/dev/sdb', 'attached_at': '2023-12-10T21:09:40.000000'}] |
From your openstack server show (again, note the same server id, but an empty volumes_attached field):
...
| id | fdd3cb15-1ad8-49be-9054-c72b399388ed
...
| volumes_attached |
This may need some JS2 intervention after all. Apologies for not picking up on that earlier. Ironically I missed that detail because it was at the very top of the output. Although it does still seem like there's something more going on here, since the node appears to be unreachable. Can you confirm this?
I had just restarted the node and I can indeed ssh to it by jumping to its private IP from the public IP of the master node. The kubectl delete volumeattachment command produces a message that the attachment is deleted, but the command does not complete ... upon ctrl-c'ing, the volume attachment persists.
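My guess is the delete is hanging on a finalizer on the VolumeAttachment; if it's safe to clear that manually (and I'm not sure it is without JS2 weighing in), something like the following might do it, using the attachment name from above:
# Clear the finalizers so the pending delete can complete
kubectl patch volumeattachment csi-aa5a1281cdc776529a0019401319839cc56e314e78b7d24cd39d0c5462fb45b1 --type=merge -p '{"metadata":{"finalizers":null}}'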
Looks like JS2 support would need to get involved, but given that it's been a long time since I've deployed, I might try a new deployment, perhaps taking advantage of the scaling work you've done in #71?
I'll leave that decision up to you! However, I would suggest you ask JS2 to take a look first. I'm not convinced that this problem, which Julien and I have referred to as "volumes being in an inconsistent state" (as different openstack commands tell you different things), is related to the lifetime of a cluster (or individual instances), as we've had this happen with relatively young clusters.
If they could fix this, and everything functions as expected afterwards, it would save you having to reprovision everything.
Lastly, the "soft scaling" that I mention in that issue and subsequent blog post is meant to scale existing clusters up/down with "pre-provisioned" nodes. In other words, you can soft-scale your existing cluster. I'm not sure if that was unclear, or if I'm misinterpreting what you mean by taking advantage of my scaling work!
Best of luck, and please let me know when this comes back online or if you need any more help troubleshooting.
Thanks so much for the quick and detailed troubleshooting suggestions, @ana-v-espinoza ! I will take your advice and open a ticket on JS2.
Hi, wondering if @zonca , @ana-v-espinoza and/or @julienchastang might have insight on this. For the last several days our JS2 binderhub has been non-functional due to the hub pod being stuck in a ContainerCreating status. I've restarted the master and worker nodes but the pod remains stuck in that state.
The cluster had been running without issues for the last ~8 months or so.
Might it be time to rebuild/redeploy with the most recent terraform recipe?
Below is the output from the kubectl get pods command: