weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

Allow the expected size of the cluster to be specified #2602

Open awh opened 7 years ago

awh commented 7 years ago

From @bboreham on September 8, 2016 11:10

Currently we have a bit of a risk of cliques forming at startup.

Suppose you fire up 10 nodes A, B, ... J. When A runs kube-peers it gets back A and B, and when J runs kube-peers it gets back all 10. It is possible (though rare) for A to form consensus with B, and for J to form consensus with E, F, G, H and I.

If the user can configure their expectation that there will be 10 nodes, we can pass this through and avoid the A, B clique.

Copied from original issue: weaveworks/weave-kube#2
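The arithmetic behind the clique risk can be sketched as follows. The majority rule and node names here are illustrative assumptions, not Weave's exact algorithm; Weave exposes a hint of this kind via `weave launch --init-peer-count`:

```python
# Sketch (assumption): consensus requires a majority of the *expected* peer
# count. Without the hint, each node takes "expected" to be the peers it has
# actually discovered, so an early node can reach quorum with a tiny clique.

def quorum(expected_peers: int) -> int:
    """Minimum number of peers that must agree before consensus is attempted."""
    return expected_peers // 2 + 1

# Node A, launched early, has only discovered A and B via kube-peers.
seen_by_a = {"A", "B"}

# Without the hint, A assumes the cluster is just what it has seen:
# 2 >= quorum(2) == 2, so A and B can form a consensus clique on their own.
assert len(seen_by_a) >= quorum(len(seen_by_a))

# With the expected size (10) passed through, A must keep waiting:
# 2 < quorum(10) == 6, so no premature consensus.
assert not len(seen_by_a) >= quorum(10)
```

With the expected size configured, the early nodes simply wait for more peers instead of electing among themselves.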

awh commented 7 years ago

From @zilman on September 8, 2016 15:47

Another concern that I didn't mention when we discussed this:

There's a window where nodes are marked 'ready' but the Weave overlay might not be operational yet, either cluster-wide (because consensus has yet to be achieved) or on an individual node (because the Weave DaemonSet hasn't been fully deployed there yet).

I think at the very least there should be a Service defined with a readiness check that only flips on when appropriate. Otherwise a user might end up deploying things into a cluster in an incoherent state.

P.S. - The ideal would be of course to hook into the node lifecycle.

awh commented 7 years ago

From @bboreham on September 8, 2016 15:53

OK; that's really a separate issue. I don't think the state can be temporarily incoherent, but it can be that the network is not ready yet, and k8s will retry periodically for all pods that are supposed to be attached.

A readiness check for "ipam ready" should be straightforward to add.
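One hypothetical shape for such a check is a readiness probe on the weave container in the DaemonSet. The path and port here are assumptions for illustration, based on Weave's HTTP status endpoint conventionally listening on 6784; a real check would need an endpoint that reports IPAM readiness specifically:

```yaml
# Hypothetical readiness probe for the weave container in the DaemonSet.
# /status and port 6784 are assumptions based on Weave's status endpoint.
readinessProbe:
  httpGet:
    host: 127.0.0.1
    path: /status
    port: 6784
  initialDelaySeconds: 5
  periodSeconds: 10
```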

awh commented 7 years ago

From @zilman on September 8, 2016 16:12

Cluster of initial size 10.

Nodes come online and become 'Ready', all receive the Weave DaemonSet, quorum is reached, the ipam-ready check passes, and the Weave Service is ready. Conditional on that we can deploy things into the cluster now; great, better than before.

We go up to 11, as it were. Node11 becomes 'Ready' and receives a pod, Job, or anything else before the Weave DaemonSet. The pod starts without networking being ready, and unexpected behavior ensues.

Shortly thereafter networking will start working on that node (and hopefully for the pod), but in the meantime? (Yes, most things I can think of would be resilient to that, at worst failing and retrying, but it seems brittle.)

Also, what about things that were not written with our Weave Service in mind and don't know to rely on that readiness check?

awh commented 7 years ago

From @bboreham on September 9, 2016 8:44

Pod starts without networking being ready

This can't happen. If the Weave DaemonSet hasn't installed the CNI config file yet, Kubelet will refuse to create the pod.

If it has, Kubelet invokes the Weave CNI plugin; either it succeeds, or it waits, or it fails. If it fails, Kubelet will destroy the pod and try again later.
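The gate here is the CNI config file: Kubelet reports the node's network as not ready until a network configuration appears in its CNI config directory (typically `/etc/cni/net.d/`), and won't create pods on it before then. A sketch of the kind of file the Weave DaemonSet drops there; the exact filename (e.g. `10-weave.conf`) and fields vary by version, so treat this as illustrative rather than the shipped file:

```json
{
  "name": "weave",
  "type": "weave-net"
}
```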

awh commented 7 years ago

From @zilman on September 9, 2016 9:29

Ah, aces! That's exactly how it should be; I didn't know that plumbing was there.