zonca / jupyterhub-deploy-kubernetes-jetstream

Configuration files for my tutorials on deploying JupyterHub on top of Kubernetes on XSEDE Jetstream (Openstack)
https://zonca.dev/categories/#jetstream

k8s cluster creation stuck at "create-in-progress" #32

Closed fengggli closed 4 years ago

fengggli commented 4 years ago

Hi Andrea, thanks for providing the nice tutorial on creating a k8s cluster on Jetstream. However, I ran into some issues and tried to troubleshoot them myself, but it seems more complex than I expected... It would be great if you could shed some light!

After I ran create_cluster.sh, the cluster stayed in CREATE_IN_PROGRESS and later went to CREATE_FAILED due to the 60-minute timeout. Below is the cluster info:

ubuntu@spark-master:~/Workspace/zipper-runtime/extern/jupyterhub-deploy-kubernetes-jetstream/kubernetes_magnum$ openstack coe cluster show k8s
+----------------------+------------------------------------------------------------+
| Field                | Value                                                      |
+----------------------+------------------------------------------------------------+
| status               | CREATE_IN_PROGRESS                                         |
| health_status        | None                                                       |
| cluster_template_id  | 42db572f-55cf-4773-b3c6-ce6c219617a3                       |
| node_addresses       | []                                                         |
| uuid                 | db736cd6-c966-4457-a148-15af9140f2f6                       |
| stack_id             | 4e63db99-e061-4ca3-b851-fa408c0efd60                       |
| status_reason        | None                                                       |
| created_at           | 2020-06-01T21:22:51+00:00                                  |
| updated_at           | 2020-06-01T21:22:57+00:00                                  |
| coe_version          | None                                                       |
| labels               | {'cloud-provider-enabled': 'true'}                         |
| faults               |                                                            |
| keypair              | mypi                                                       |
| api_address          | None                                                       |
| master_addresses     | []                                                         |
| create_timeout       | 60                                                         |
| node_count           | 1                                                          |
| discovery_url        | https://discovery.etcd.io/c8c5cf836d832f0e4ad32c8186f7028d |
| master_count         | 1                                                          |
| container_version    | None                                                       |
| name                 | k8s                                                        |
| master_flavor_id     | m1.medium                                                  |
| flavor_id            | m1.medium                                                  |
| health_status_reason | {}                                                         |
| project_id           | 48a4b84a042042828fa4afa651c5e81b                           |
+----------------------+------------------------------------------------------------+
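For reference, when a cluster hangs like this, the underlying Heat stack can be inspected to see which resource is stuck (a sketch, assuming the Heat plugin for the openstack CLI is installed; the stack ID is the stack_id from the output above):

# list all resources of the cluster's Heat stack, including nested stacks
openstack stack resource list --nested-depth 2 4e63db99-e061-4ca3-b851-fa408c0efd60
# show the error messages of any failed resources
openstack stack failures list 4e63db99-e061-4ca3-b851-fa408c0efd60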

I logged into the Horizon GUI and found that the kube_masters resource got stuck (screenshot: https://user-images.githubusercontent.com/10205325/83456669-1cfcc580-a42e-11ea-9c5d-bc0a83fd3b7d.png).

Also, only the master instance is up (the worker node is not).

I logged into the master and saw this error output from dockerd:

Jun 01 20:59:31 k8s-ptxgozv3pfpp-master-0.novalocal systemd[1]: Started Docker Application Container Engine.
Jun 01 20:59:31 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T20:59:31.101424812Z" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n"
Jun 01 20:59:31 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T20:59:31.415989532Z" level=error msg="Handler for GET /v1.26/containers/heat-container-agent/json returned error: No such container: heat-container-agent"
Jun 01 20:59:31 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T20:59:31.416886600Z" level=error msg="Handler for GET /v1.26/containers/heat-container-agent/json returned error: No such container: heat-container-agent"
Jun 01 21:01:18 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:01:18.467206659Z" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n"
Jun 01 21:01:18 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:01:18.550048528Z" level=error msg="Handler for GET /v1.26/containers/etcd/json returned error: No such container: etcd"
Jun 01 21:01:18 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:01:18.550486697Z" level=error msg="Handler for GET /v1.26/containers/etcd/json returned error: No such container: etcd"
Jun 01 21:02:05 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:02:05.673877918Z" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n"
Jun 01 21:02:05 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:02:05.799536186Z" level=error msg="Handler for GET /v1.26/containers/kubelet/json returned error: No such container: kubelet"
Jun 01 21:02:05 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:02:05.800102895Z" level=error msg="Handler for GET /v1.26/containers/kubelet/json returned error: No such container: kubelet"
Jun 01 21:02:53 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:02:53.982935059Z" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n"
Jun 01 21:02:54 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:02:54.078450338Z" level=error msg="Handler for GET /v1.26/containers/kube-apiserver/json returned error: No such container: kube-apiserver"
Jun 01 21:02:54 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:02:54.079004833Z" level=error msg="Handler for GET /v1.26/containers/kube-apiserver/json returned error: No such container: kube-apiserver"
Jun 01 21:03:17 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:17.953455397Z" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n"
Jun 01 21:03:18 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:18.041471159Z" level=error msg="Handler for GET /v1.26/containers/kube-controller-manager/json returned error: No such container: kube-controller-manager"
Jun 01 21:03:18 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:18.041974312Z" level=error msg="Handler for GET /v1.26/containers/kube-controller-manager/json returned error: No such container: kube-controller-manager"
Jun 01 21:03:29 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:29.401609795Z" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n"
Jun 01 21:03:29 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:29.491895559Z" level=error msg="Handler for GET /v1.26/containers/kube-scheduler/json returned error: No such container: kube-scheduler"
Jun 01 21:03:29 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:29.492413865Z" level=error msg="Handler for GET /v1.26/containers/kube-scheduler/json returned error: No such container: kube-scheduler"
Jun 01 21:03:38 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:38.329987496Z" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n"
Jun 01 21:03:38 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:38.476156843Z" level=error msg="Handler for GET /v1.26/containers/kube-proxy/json returned error: No such container: kube-proxy"
Jun 01 21:03:38 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:38.476492662Z" level=error msg="Handler for GET /v1.26/containers/kube-proxy/json returned error: No such container: kube-proxy"
Jun 01 21:03:49 k8s-ptxgozv3pfpp-master-0.novalocal systemd[1]: docker.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Jun 01 21:03:51 k8s-ptxgozv3pfpp-master-0.novalocal systemd[1]: Stopping Docker Application Container Engine...
Jun 01 21:03:51 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:51.497069056Z" level=info msg="Processing signal 'terminated'"
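For reference, the Magnum bootstrap logs on the master can also be checked; a minimal sketch, assuming a Fedora Atomic master image (the log path and unit name below are assumptions):

# output of the cloud-init scripts that bootstrap the node
sudo cat /var/log/cloud-init-output.log
# heat agent that applies the Magnum software deployments (unit name assumed)
sudo journalctl -u heat-container-agent --no-pager | tail -n 100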

Please let me know if you can give any advice on troubleshooting this. Thanks! Feng Li

zonca commented 4 years ago

Bad timing! ;) Right this week there is an issue on Jetstream that is making Magnum fail. I contacted XSEDE support, they confirmed the issue, and they will let me know when it is fixed. I'll let you know.


fengggli commented 4 years ago

Hi Andrea,

Is this the issue you mentioned: "[UPDATE] OpenStack commands failing on tacc.jetstream-cloud.org"?

It seems that it got fixed, but when I reran the cluster creation script I still got the same error. Is it also happening to you currently if you are using Jetstream?

Also, I saw another of your posts about using Kubespray on Jetstream; is that still supported in the current Jetstream environment? (I will try it later too.)

Thanks for your help! Feng

zonca commented 4 years ago

No, I think it is a different issue, related just to Magnum. Yes, Kubespray should work; Julien Chastang regularly uses it.

zonca commented 4 years ago

@julienchastang have you deployed Kubernetes with Kubespray recently? Is everything working fine?

Unfortunately this problem with Magnum still persists; if there is no fix by Wednesday, I'll switch back to deploying with Kubespray.

/cc @pibion

fengggli commented 4 years ago

@zonca, I tried to deploy with the Kubespray method yesterday on Jetstream (following https://zonca.dev/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html), and it works fine on my side.

zonca commented 4 years ago

@fengggli excellent, thank you! I would also like to update that tutorial to a newer version of Kubespray, so that we get a newer Kubernetes.

zonca commented 4 years ago

Not fixed yet; I added a notice to the tutorial, https://zonca.dev/2020/05/kubernetes-jupyterhub-jetstream-magnum.html.

I have deployed it myself with https://zonca.dev/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html and it works fine. It currently installs Kubernetes 1.12.5.
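For anyone verifying their own Kubespray deployment, the installed Kubernetes version can be confirmed from the machine that holds the kubeconfig, for example:

kubectl version --short
kubectl get nodes -o wide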

julienchastang commented 4 years ago

Apologies for not responding earlier. Yes, Kubespray works fine, with the only wrinkle being that the deployment of the certificate is handled a little differently than originally described. I think we covered that in another issue.

julienchastang commented 4 years ago

(I thought I left this comment last night, but I cannot find it.) One caveat: I did have to go back to an earlier version of this project to get things working. If Magnum is having problems and Kubespray is a good alternative, maybe we should look into that.

zonca commented 4 years ago

OK, this issue is fixed, but the Magnum cluster has scheduling issues due to availability zones; see "Note about availability zones" at https://zonca.dev/2020/05/kubernetes-jupyterhub-jetstream-magnum.html.

I created a fix for this (see https://github.com/zonca/magnum/pull/2/files) and asked the Jetstream team if they can apply it.
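As a possible interim workaround (a sketch only, not tested here; the availability_zone label name comes from the Magnum documentation and the other values are placeholders), the availability zone could be pinned explicitly when creating the cluster:

openstack coe cluster create k8s \
  --cluster-template <template-name> \
  --labels availability_zone=nova \
  --master-count 1 \
  --node-count 1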

zonca commented 4 years ago

As of today, this is fixed; my Magnum tutorial works fine: https://zonca.dev/2020/05/kubernetes-jupyterhub-jetstream-magnum.html