weaveworks / wksctl

Open Source Weaveworks Kubernetes System
Apache License 2.0

If 'kubeadm reset' fails to remove an etcd member, then that node will not recover automatically #284

Open bboreham opened 4 years ago

bboreham commented 4 years ago

The symptom is kubeadm join repeatedly failing like this:

time="2020-07-28T02:45:41Z" level=info msg=Applying resource="kubeadm:join"
time="2020-07-28T02:45:41Z" level=info msg="joining Kubernetes cluster"
time="2020-07-28T02:45:41Z" level=debug msg="running command: ..."
[preflight] Running pre-flight checks
...
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: etcd cluster is not healthy: dial tcp 172.31.71.248:2379: connect: connection refused
time="2020-07-28T02:45:50Z" level=error msg="failed to join cluster" stdouterr="..."

It fails because etcd thinks it has three members but only two of them are alive; the missing one was running on this node until wks-controller shut it down via kubeadm reset.
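
To illustrate the mismatch, here is a rough Python sketch that picks the stale member out of `etcdctl member list` output by matching on the node's address. The sample output is hypothetical: the member IDs, names and the first two addresses are invented for illustration, and only 172.31.71.248 comes from the error above. The function name `find_member_id` is likewise an assumption, not something wksctl or kubeadm provides.

```python
from urllib.parse import urlparse

# Hypothetical `etcdctl member list` output in the simple comma-separated
# format (ID, status, name, peer URLs, client URLs, is-learner).
# IDs, names and the first two addresses are made up for this example.
SAMPLE = """\
1a2b3c4d5e6f7081, started, node-a, https://172.31.71.10:2380, https://172.31.71.10:2379, false
2b3c4d5e6f708192, started, node-b, https://172.31.71.11:2380, https://172.31.71.11:2379, false
3c4d5e6f708192a3, started, node-c, https://172.31.71.248:2380, https://172.31.71.248:2379, false
"""


def find_member_id(member_list_output, node_ip):
    """Return the ID of the first member with a URL on node_ip, or None."""
    for line in member_list_output.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 4:
            continue  # skip blank or unexpected lines
        member_id = fields[0]
        # Every field after the name may be a peer or client URL;
        # non-URL fields such as "false" have no hostname and never match.
        for candidate in fields[3:]:
            if urlparse(candidate).hostname == node_ip:
                return member_id
    return None


if __name__ == "__main__":
    print(find_member_id(SAMPLE, "172.31.71.248"))
```

The ID found this way is what would need to be removed from the cluster before the node can rejoin.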

kubeadm does have code to remove the member from etcd, but on this occasion it seems to have failed (possibly because we had a problem earlier and the kubeadm certs had expired):

time="2020-07-28T02:38:46Z" level=debug msg="running command: sudo -n -- sh -c 'kubeadm reset --force'"
[reset] Reading configuration from the cluster...
[reset] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
W0728 02:38:46.733486   22923 reset.go:73] [reset] Unable to fetch the kubeadm-config ConfigMap from cluster: failed to get node registration: failed to get node name from kubelet config: open /etc/kubernetes/kubelet.conf: no such file or directory
W0728 02:38:46.733675   22923 reset.go:234] [reset] No kubeadm config, using etcd pod spec to get data directory
[preflight] Running pre-flight checks
[reset] No etcd config found. Assuming external etcd
[reset] Please manually reset etcd to prevent further issues
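
As a sketch of what "manually reset etcd" could look like here, the Python snippet below builds an `etcdctl member remove` command line to run against one of the surviving members. It assumes the default kubeadm certificate locations under /etc/kubernetes/pki/etcd and that those certificates are still valid; the endpoint and member ID in the usage example are hypothetical placeholders, and `member_remove_command` is a name invented for this example. Older etcdctl builds may also need `ETCDCTL_API=3` set in the environment.

```python
import shlex

# Assumption: default kubeadm location of the etcd certificates.
ETCD_PKI = "/etc/kubernetes/pki/etcd"


def member_remove_command(endpoint, member_id):
    """Build the argv for removing member_id via a healthy etcd endpoint."""
    return [
        "etcdctl",
        "--endpoints", endpoint,
        "--cacert", f"{ETCD_PKI}/ca.crt",
        "--cert", f"{ETCD_PKI}/healthcheck-client.crt",
        "--key", f"{ETCD_PKI}/healthcheck-client.key",
        "member", "remove", member_id,
    ]


if __name__ == "__main__":
    # Hypothetical endpoint of a surviving member and hypothetical stale ID.
    cmd = member_remove_command("https://172.31.71.10:2379", "3c4d5e6f708192a3")
    # Print rather than execute, so the command can be reviewed first.
    print(shlex.join(cmd))
```

Printing the command instead of running it is deliberate: removing the wrong member from a two-out-of-three cluster would lose quorum.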