Closed (fengggli closed this issue 4 years ago)
Bad timing! ;) Just this week there is an issue on Jetstream that is making Magnum fail. I contacted XSEDE support, they confirmed the issue and will let me know when it is fixed. I'll keep you posted.
On Mon, Jun 1, 2020 at 2:48 PM Feng Li notifications@github.com wrote:
Hi Andrea, Thanks for providing the nice tutorial https://zonca.dev/2019/06/kubernetes-jupyterhub-jetstream-magnum.html on creating a k8s cluster in Jetstream. However, I did run into some issues; I tried to troubleshoot them myself, but it seems more complex than I expected... It would be great if you could shed some light on this!
After I ran create_cluster.sh, the cluster stayed in CREATE_IN_PROGRESS and later went to CREATE_FAILED due to the 60-minute timeout. Below is the cluster info:
```
ubuntu@spark-master:~/Workspace/zipper-runtime/extern/jupyterhub-deploy-kubernetes-jetstream/kubernetes_magnum$ openstack coe cluster show k8s
+----------------------+------------------------------------------------------------+
| Field                | Value                                                      |
+----------------------+------------------------------------------------------------+
| status               | CREATE_IN_PROGRESS                                         |
| health_status        | None                                                       |
| cluster_template_id  | 42db572f-55cf-4773-b3c6-ce6c219617a3                       |
| node_addresses       | []                                                         |
| uuid                 | db736cd6-c966-4457-a148-15af9140f2f6                       |
| stack_id             | 4e63db99-e061-4ca3-b851-fa408c0efd60                       |
| status_reason        | None                                                       |
| created_at           | 2020-06-01T21:22:51+00:00                                  |
| updated_at           | 2020-06-01T21:22:57+00:00                                  |
| coe_version          | None                                                       |
| labels               | {'cloud-provider-enabled': 'true'}                         |
| faults               |                                                            |
| keypair              | mypi                                                       |
| api_address          | None                                                       |
| master_addresses     | []                                                         |
| create_timeout       | 60                                                         |
| node_count           | 1                                                          |
| discovery_url        | https://discovery.etcd.io/c8c5cf836d832f0e4ad32c8186f7028d |
| master_count         | 1                                                          |
| container_version    | None                                                       |
| name                 | k8s                                                        |
| master_flavor_id     | m1.medium                                                  |
| flavor_id            | m1.medium                                                  |
| health_status_reason | {}                                                         |
| project_id           | 48a4b84a042042828fa4afa651c5e81b                           |
+----------------------+------------------------------------------------------------+
```
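(For reference, when a Magnum cluster times out like this, the underlying Heat stack usually shows which resource is stuck. Below is a rough sketch of the commands one could use; the stack ID is the `stack_id` from the output above, and this assumes python-magnumclient and python-heatclient are installed.)

```
# failure-related fields of the Magnum cluster
openstack coe cluster show k8s -c status -c status_reason -c faults

# resources of the underlying Heat stack, including nested stacks,
# to see which one is stuck or failed
openstack stack resource list --nested-depth 2 4e63db99-e061-4ca3-b851-fa408c0efd60

# error messages of any failed resources in the stack
openstack stack failures list 4e63db99-e061-4ca3-b851-fa408c0efd60
```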
I logged into the Horizon GUI and found that kube_masters got stuck (screenshot: https://user-images.githubusercontent.com/10205325/83456669-1cfcc580-a42e-11ea-9c5d-bc0a83fd3b7d.png).
Also, only the master instance is up (the worker/minion node is not).
I logged into the master and saw these errors from dockerd:
```
Jun 01 20:59:31 k8s-ptxgozv3pfpp-master-0.novalocal systemd[1]: Started Docker Application Container Engine.
Jun 01 20:59:31 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T20:59:31.101424812Z" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n"
Jun 01 20:59:31 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T20:59:31.415989532Z" level=error msg="Handler for GET /v1.26/containers/heat-container-agent/json returned error: No such container: heat-contai>
Jun 01 20:59:31 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T20:59:31.416886600Z" level=error msg="Handler for GET /v1.26/containers/heat-container-agent/json returned error: No such container: heat-contai>
Jun 01 21:01:18 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:01:18.467206659Z" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n"
Jun 01 21:01:18 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:01:18.550048528Z" level=error msg="Handler for GET /v1.26/containers/etcd/json returned error: No such container: etcd"
Jun 01 21:01:18 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:01:18.550486697Z" level=error msg="Handler for GET /v1.26/containers/etcd/json returned error: No such container: etcd"
Jun 01 21:02:05 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:02:05.673877918Z" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n"
Jun 01 21:02:05 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:02:05.799536186Z" level=error msg="Handler for GET /v1.26/containers/kubelet/json returned error: No such container: kubelet"
Jun 01 21:02:05 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:02:05.800102895Z" level=error msg="Handler for GET /v1.26/containers/kubelet/json returned error: No such container: kubelet"
Jun 01 21:02:53 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:02:53.982935059Z" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n"
Jun 01 21:02:54 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:02:54.078450338Z" level=error msg="Handler for GET /v1.26/containers/kube-apiserver/json returned error: No such container: kube-apiserver"
Jun 01 21:02:54 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:02:54.079004833Z" level=error msg="Handler for GET /v1.26/containers/kube-apiserver/json returned error: No such container: kube-apiserver"
Jun 01 21:03:17 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:17.953455397Z" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n"
Jun 01 21:03:18 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:18.041471159Z" level=error msg="Handler for GET /v1.26/containers/kube-controller-manager/json returned error: No such container: kube-con>
Jun 01 21:03:18 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:18.041974312Z" level=error msg="Handler for GET /v1.26/containers/kube-controller-manager/json returned error: No such container: kube-con>
Jun 01 21:03:29 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:29.401609795Z" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n"
Jun 01 21:03:29 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:29.491895559Z" level=error msg="Handler for GET /v1.26/containers/kube-scheduler/json returned error: No such container: kube-scheduler"
Jun 01 21:03:29 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:29.492413865Z" level=error msg="Handler for GET /v1.26/containers/kube-scheduler/json returned error: No such container: kube-scheduler"
Jun 01 21:03:38 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:38.329987496Z" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.18.0\n"
Jun 01 21:03:38 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:38.476156843Z" level=error msg="Handler for GET /v1.26/containers/kube-proxy/json returned error: No such container: kube-proxy"
Jun 01 21:03:38 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:38.476492662Z" level=error msg="Handler for GET /v1.26/containers/kube-proxy/json returned error: No such container: kube-proxy"
Jun 01 21:03:49 k8s-ptxgozv3pfpp-master-0.novalocal systemd[1]: docker.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Jun 01 21:03:51 k8s-ptxgozv3pfpp-master-0.novalocal systemd[1]: Stopping Docker Application Container Engine...
Jun 01 21:03:51 k8s-ptxgozv3pfpp-master-0.novalocal dockerd-current[1264]: time="2020-06-01T21:03:51.497069056Z" level=info msg="Processing signal 'terminated'"
```
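(In case it helps with debugging a similar hang: besides the dockerd journal, these are the kinds of logs one could check on the master node. Unit names and paths assume the Fedora Atomic image that Magnum uses on Jetstream, so they may differ on other images.)

```
# Heat agent that runs the Magnum setup scripts on the node
sudo journalctl -u heat-container-agent --no-pager | tail -n 100

# cloud-init output from the instance boot
sudo tail -n 100 /var/log/cloud-init-output.log

# per-script output written by the Heat agent (location may vary by image)
sudo ls /var/log/heat-config/heat-config-script/
```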
Please let me know if you can give any advice on troubleshooting this. Thanks! Feng Li
Hi Andrea,
Is this the issue you mentioned: [UPDATE] OpenStack commands failing on tacc.jetstream-cloud.org?
It seems that it got fixed, but I reran the cluster creation script and still got the same error. Is it also happening for you currently on Jetstream?
Also, I saw another post of yours about using Kubespray on Jetstream; is that still supported in the current Jetstream environment? (I will try it later too.)
Thanks for your help! Feng
No, I think it is a different issue, related just to Magnum. Yes, Kubespray should work; Julien Chastang regularly uses it.
@julienchastang have you deployed Kubernetes with Kubespray recently? everything working fine?
Unfortunately this problem with Magnum still persists; if there is no fix by Wednesday, I'll switch back to deploying with Kubespray.
/cc @pibion
@zonca, I tried to deploy using the Kubespray method yesterday on Jetstream (following https://zonca.dev/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html), and it works fine on my side.
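(For anyone else trying this: after the instances are provisioned and the inventory from the tutorial is generated, the core step is roughly the standard Kubespray playbook run shown below. The inventory path is just a placeholder for whatever your setup produces.)

```
# from the kubespray checkout, run the main playbook against the generated inventory
ansible-playbook -i inventory/mycluster/hosts.ini --become --become-user=root cluster.yml
```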
@fengggli excellent, thank you! I would also like to update that to a newer version of Kubespray, so that we get a newer Kubernetes.
Not fixed yet; I added a notice to the tutorial: https://zonca.dev/2020/05/kubernetes-jupyterhub-jetstream-magnum.html.
I have deployed it myself with https://zonca.dev/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html and it works fine. It currently installs Kubernetes 1.12.5.
Apologies for not responding earlier. Yes, Kubespray works fine, with the only wrinkle that the deployment of the certificate is handled a little differently than originally described. I think we covered that in another issue.
(I thought I left this comment last night, but I cannot find it.) One caveat: I did have to go back to an earlier version of this project to get things working. If Magnum is having problems and Kubespray is a good alternative, maybe we should look into that.
OK, this issue is fixed, but the Magnum cluster has scheduling issues due to availability zones; see "Note about availability zones" at https://zonca.dev/2020/05/kubernetes-jupyterhub-jetstream-magnum.html.
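(For anyone who hits this before the fix is applied, a couple of generic OpenStack commands can show whether the cluster instances ended up in a mismatched availability zone. This is not the fix itself, just a way to confirm the symptom.)

```
# list the availability zones visible to the project
openstack availability zone list

# show which zone each cluster instance was actually scheduled into
openstack server list --long -c Name -c Status -c "Availability Zone"
```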
I created a fix for this, see https://github.com/zonca/magnum/pull/2/files, and asked the Jetstream team if they can apply it.
As of today, this is fixed; my Magnum tutorial works fine: https://zonca.dev/2020/05/kubernetes-jupyterhub-jetstream-magnum.html
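(As a quick sanity check after running the tutorial, something like the following should confirm a healthy cluster; these are generic commands, not part of the tutorial itself.)

```
# the cluster should now reach CREATE_COMPLETE instead of timing out
openstack coe cluster show k8s -c status -c health_status

# fetch the kubeconfig, then all nodes should register as Ready
openstack coe cluster config k8s
kubectl get nodes
```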