weaveworks / wksctl

Open Source Weaveworks Kubernetes System

Cannot use wksctl to start ignite multi-master Firekube #102

Closed: chanwit closed this issue 4 years ago

chanwit commented 4 years ago

When using wksctl to start a multi-master Firekube cluster, the cluster ended up in the state below when the 2nd or the 3rd master started to join. This failure is deterministic and always reproducible.

time="2019-10-19T10:52:19Z" level=debug msg="running command: sudo -n -- sh -c 'cat /etc/machine-id 2>/dev/null || cat /var/lib/dbus/machine-id 2>/dev/null'"
7a98435980b347bf8935da14e206654f
time="2019-10-19T10:52:19Z" level=debug msg="running command: sudo -n -- sh -c 'cat /sys/class/dmi/id/product_uuid 2>/dev/null || cat /etc/machine-id 2>/dev/null'"
7a98435980b347bf8935da14e206654f
time="2019-10-19T10:52:19Z" level=debug msg="running command: sudo -n -- sh -c 'command -v -- \"selinuxenabled\" >/dev/null 2>&1'"
time="2019-10-19T10:52:19Z" level=debug msg="running command: sudo -n -- sh -c 'selinuxenabled'"
time="2019-10-19T10:52:19Z" level=debug msg="running command: sudo -n -- sh -c 'cat /proc/1/environ'"
time="2019-10-19T10:52:19Z" level=debug msg="the following env-specific configuration will be used" config="&{0 true [] true true true weavek8sops }"
time="2019-10-19T10:52:19Z" level=info msg="cordon node \"ff308594e2a1c34e\""
time="2019-10-19T10:52:21Z" level=warning msg="ignoring DaemonSet-managed Pods: kube-system/kube-proxy-9xbjm, weavek8sops/weave-net-2m8l2"
time="2019-10-19T10:52:24Z" level=debug msg="5 pod(s) to be evicted from ff308594e2a1c34e"
time="2019-10-19T10:52:24Z" level=fatal msg="<nil>"

After that, kubectl could no longer connect to the API server.
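For reference, the cordon and eviction steps visible in the log correspond to what these standard kubectl commands do by hand (node name taken from the log above):

# Mark the node unschedulable, then evict its pods; DaemonSet-managed pods
# (kube-proxy, weave-net) are skipped, as the controller log shows:
kubectl cordon ff308594e2a1c34e
kubectl drain ff308594e2a1c34e --ignore-daemonsets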

Here's the config.yaml:

# This file contains high level configuration parameters. The setup.sh script
# takes this file as input and creates lower level manifests.

# backend defines how the machines underpinning Kubernetes nodes are created.
#  - docker: use containers as "VMs" using footloose:
#            https://github.com/weaveworks/footloose
#  - ignite: use footloose with ignite and firecracker to create real VMs using
#            https://github.com/weaveworks/ignite.
#            The ignite backend only works on Linux as it requires KVM.
backend: ignite

# Number of nodes allocated for the Kubernetes control plane and workers.
controlPlane:
  nodes: 3
workers:
  nodes: 1
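The header comment describes how this file is consumed; in a quickstart checkout the flow is roughly as follows (a sketch; the directory name is taken from the shell prompt further down this thread, and the repo layout is assumed):

# Pick the backend and node counts in config.yaml, then let setup.sh
# generate the lower-level manifests (e.g. machines.yaml) and start the cluster:
cd wk-quickstart
./setup.sh
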
chanwit commented 4 years ago

When I changed the configuration to 1 master and 3 workers, the cluster was able to operate. So the problem is narrowed down to the multi-master formation process; the working configuration is sketched below.
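A minimal sketch of the working variant (only the node counts differ from the config above):

backend: ignite

controlPlane:
  nodes: 1
workers:
  nodes: 3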

chanwit commented 4 years ago

I'm quite sure that the master joining process inside wks-controller does not work properly.

I tried many combinations of machines.yaml, committing and pushing each one to check how the joining process worked. They all ended up the same way: the 1st master was gone after the 2nd one joined.

jrryjcksn commented 4 years ago

This works for me (on docker); see the kubectl get nodes -o wide output below. Does docker work for you? Also, is there anything more you can tell me about your environment?

NAME    STATUS   ROLES    AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION     CONTAINER-RUNTIME
node0   Ready    master   37m     v1.14.1   172.17.0.2    <none>        CentOS Linux 7 (Core)   4.9.184-linuxkit   docker://18.9.7
node1   Ready    master   21m     v1.14.1   172.17.0.3    <none>        CentOS Linux 7 (Core)   4.9.184-linuxkit   docker://18.9.7
node2   Ready    master   10m     v1.14.1   172.17.0.4    <none>        CentOS Linux 7 (Core)   4.9.184-linuxkit   docker://18.9.7
node3   Ready    <none>   3m37s   v1.14.1   172.17.0.5    <none>        CentOS Linux 7 (Core)   4.9.184-linuxkit   docker://18.9.7

wk-quickstart on  make-TRACK-switchable [$?] on ☁️  us-east-1 took 2s 
❯ kubectl get pods --all-namespaces -o wide
NAMESPACE     NAME                              READY   STATUS    RESTARTS   AGE     IP           NODE    NOMINATED NODE   READINESS GATES
kube-system   coredns-fb8b8dccf-c4lth           1/1     Running   0          37m     10.32.0.5    node0   <none>           <none>
kube-system   coredns-fb8b8dccf-fvgnd           1/1     Running   0          37m     10.32.0.6    node0   <none>           <none>
kube-system   etcd-node0                        1/1     Running   0          36m     172.17.0.2   node0   <none>           <none>
kube-system   etcd-node1                        1/1     Running   0          21m     172.17.0.3   node1   <none>           <none>
kube-system   etcd-node2                        1/1     Running   0          9m48s   172.17.0.4   node2   <none>           <none>
kube-system   kube-apiserver-node0              1/1     Running   0          36m     172.17.0.2   node0   <none>           <none>
kube-system   kube-apiserver-node1              1/1     Running   1          21m     172.17.0.3   node1   <none>           <none>
kube-system   kube-apiserver-node2              1/1     Running   1          10m     172.17.0.4   node2   <none>           <none>
kube-system   kube-controller-manager-node0     1/1     Running   1          36m     172.17.0.2   node0   <none>           <none>
kube-system   kube-controller-manager-node1     1/1     Running   0          20m     172.17.0.3   node1   <none>           <none>
kube-system   kube-controller-manager-node2     1/1     Running   0          10m     172.17.0.4   node2   <none>           <none>
kube-system   kube-proxy-5l6vr                  1/1     Running   0          3m44s   172.17.0.5   node3   <none>           <none>
kube-system   kube-proxy-fzjmm                  1/1     Running   0          21m     172.17.0.3   node1   <none>           <none>
kube-system   kube-proxy-rhqr7                  1/1     Running   0          10m     172.17.0.4   node2   <none>           <none>
kube-system   kube-proxy-z8qx6                  1/1     Running   0          37m     172.17.0.2   node0   <none>           <none>
kube-system   kube-scheduler-node0              1/1     Running   1          36m     172.17.0.2   node0   <none>           <none>
kube-system   kube-scheduler-node1              1/1     Running   0          20m     172.17.0.3   node1   <none>           <none>
kube-system   kube-scheduler-node2              1/1     Running   0          10m     172.17.0.4   node2   <none>           <none>
weavek8sops   flux-5675c5d88-djjq9              1/1     Running   0          37m     10.32.0.2    node0   <none>           <none>
weavek8sops   memcached-6bc6886f9f-sksx6        1/1     Running   0          37m     10.32.0.3    node0   <none>           <none>
weavek8sops   weave-net-22v2w                   2/2     Running   0          21m     172.17.0.3   node1   <none>           <none>
weavek8sops   weave-net-295gb                   2/2     Running   1          3m44s   172.17.0.5   node3   <none>           <none>
weavek8sops   weave-net-j8m2c                   2/2     Running   0          37m     172.17.0.2   node0   <none>           <none>
weavek8sops   weave-net-kkq5n                   2/2     Running   1          10m     172.17.0.4   node2   <none>           <none>
weavek8sops   wks-controller-8668fcbdb9-hkjjs   1/1     Running   0          37m     10.32.0.4    node0   <none>           <none>

config.yaml:

# This file contains high level configuration parameters. The setup.sh script
# takes this file as input and creates lower level manifests.

# backend defines how the machines underpinning Kubernetes nodes are created.
#  - docker: use containers as "VMs" using footloose:
#            https://github.com/weaveworks/footloose
#  - ignite: use footloose with ignite and firecracker to create real VMs using
#            https://github.com/weaveworks/ignite.
#            The ignite backend only works on Linux as it requires KVM.
backend: docker

# Number of nodes allocated for the Kubernetes control plane and workers.
controlPlane:
  nodes: 3
workers:
  nodes: 1

jrryjcksn commented 4 years ago

Also, could I see one of your machines.yaml files?

chanwit commented 4 years ago

You can also find the generated machines.yaml file there: https://github.com/chanwit/firekube-profile-demo/blob/master/machines.yaml
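For readers who don't open the link, a single master entry in a footloose-generated machines.yaml looks roughly like this (a sketch; field names follow the cluster-api style manifests wksctl consumes, and the addresses and ports are illustrative):

apiVersion: cluster.k8s.io/v1alpha1
kind: MachineList
items:
- apiVersion: cluster.k8s.io/v1alpha1
  kind: Machine
  metadata:
    name: master-0
    labels:
      set: master                  # wksctl selects control-plane nodes by this label
  spec:
    providerSpec:
      value:
        apiVersion: baremetalproviderspec/v1alpha1
        kind: BareMetalMachineProviderSpec
        public:
          address: 127.0.0.1
          port: 2222               # SSH port footloose forwards into the "VM"
        private:
          address: 172.17.0.2
          port: 22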

chanwit commented 4 years ago

@jrryjcksn Confirmed that the joining process works on the docker backend. My earlier conclusion is now invalid.

Jerry, which quickstart repo are you using to bring up the cluster?

chanwit commented 4 years ago

I tested with backend: docker, and the cluster came up and works fine. There must be something wrong that I really don't understand. Thank you @jrryjcksn!

chanwit commented 4 years ago

I recorded an asciinema session of another weird behaviour. This time the eviction process completed fast enough, so the cluster's API server was not broken. But 2 masters were gone: their roles disappeared and they were marked as scheduling disabled (see the kubectl sketch after the recording).

asciicast
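For reference, "scheduling disabled" is how a cordoned node reports itself; these standard kubectl commands show and undo that state (node name illustrative, matching the docker run above):

# A cordoned node reports STATUS Ready,SchedulingDisabled:
kubectl get nodes
# Re-enable scheduling on a cordoned node:
kubectl uncordon node1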

palemtnrider commented 4 years ago

We have isolated this to just the ignite environment. It works with a docker backend and with bare-metal EC2 machines.
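Since the config header notes that the ignite backend requires KVM, one environment check worth running on the ignite host is (standard Linux commands; not a confirmed cause of this issue):

# Verify the KVM modules are loaded and /dev/kvm is accessible:
lsmod | grep kvm
ls -l /dev/kvm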

palemtnrider commented 4 years ago

Moving out of the current milestone. Will replan for a future sprint.

jrryjcksn commented 4 years ago

Fixed by PR: https://github.com/weaveworks/wksctl/pull/118

palemtnrider commented 4 years ago

Changed estimate to 1 when pulling into the release.