weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

Pending connections with encrypted fast datapath on large clusters #3904

Open Cynovski opened 3 years ago

Cynovski commented 3 years ago

What you expected to happen?

No pending connections on a cluster of 100+ nodes using both fast datapath and encryption on Kubernetes.

What happened?

We tried to scale a cluster of 50 machines using encryption and weave fast datapath by adding another 50 machines (for a total of 100). At around 70 nodes, we started to see some "pending" connections (in the weave status output) that never resolved. We continued until we reached 100 nodes. Of the 9,900 expected connections (each of the 100 machines connecting to the 99 others), only 9,000 were established and the rest remained pending. There is no obvious pattern to which connections stay pending: some servers connect correctly to others while others don't, and even servers among the first 50 can no longer connect to each other once the ~70-node threshold is passed. (Screenshot: Selection_025)

So we turned off encryption, and then we had a solid 9,900 established connections. We also tried forcing the use of sleeve with the environment variable WEAVE_NO_FASTDP=true, with the same result: all connections were established correctly.
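For completeness, forcing sleeve was done roughly like this; the DaemonSet and container names assume the stock weave-net deployment in kube-system, so adjust them to your setup:

kubectl set env daemonset/weave-net -n kube-system -c weave WEAVE_NO_FASTDP=true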

We tried the exact same setup on 100 EC2 instances with the same result: at around ~70 nodes, pending connections start to appear.

How to reproduce it?

Set up a 100+ node Kubernetes cluster and install Weave with encryption enabled:

kubectl create secret -n kube-system generic weave-passwd --from-file=weave-passwd=<(pwgen -1 50)
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')&password-secret=weave-passwd"

Then check the output of the status command on any weave pod. You can use status peers to get more details.

kubectl exec -n kube-system weave-net-zbwvn -c weave -- /home/weave/weave --local status

You should see some pending connections.
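To get a quick count of pending connections on a node, something like the following should work (the pod name is just the example from above):

kubectl exec -n kube-system weave-net-zbwvn -c weave -- /home/weave/weave --local status connections | grep -c pending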

Anything else we need to know?

k8s workers: t2.medium
k8s control plane (3 nodes): t2.xlarge
Tested with a fully open iptables setup

Versions:

$ weave version
2.8.1
$ docker version
20.10.6
$ uname -a
Linux weavetest-1 4.19.0-16-cloud-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux (bare metal + ec2)
Linux weavetest-1 5.10.0-0.bpo.5-cloud-amd64 #1 SMP Debian 5.10.24-1~bpo10+1 (2021-03-29) x86_64 GNU/Linux (ec2)
$ kubectl version
1.18.10

Logs:

Here are the logs between two hosts that can't establish a connection: weavetest-7 <----> weavetest-53.

Logs of weavetest-53:

INFO: 2021/05/25 15:08:12.735708 ->[x.x.x.x:6783|da:67:4f:70:22:cf(weavetest-7)]: connection ready; using protocol version 2
INFO: 2021/05/25 15:08:12.735785 overlay_switch ->[da:67:4f:70:22:cf(weavetest-7)] using fastdp
INFO: 2021/05/25 15:08:12.821543 ->[x.x.x.x:6783|da:67:4f:70:22:cf(weavetest-7)]: connection added (new peer)
INFO: 2021/05/25 15:08:12.824714 Setting up IPsec between ce:ae:64:b3:0a:a5(weavetest-53) and da:67:4f:70:22:cf(weavetest-7)
DEBU: 2021/05/25 15:08:17.554670 fastdp ->[x.x.x.x:6784|da:67:4f:70:22:cf(weavetest-7)]: confirmed

Logs of weavetest-7:

INFO: 2021/05/25 15:08:12.728336 ->[x.x.x.x:50991|ce:ae:64:b3:0a:a5(weavetest-53)]: connection ready; using protocol version 2
INFO: 2021/05/25 15:08:12.728454 overlay_switch ->[ce:ae:64:b3:0a:a5(weavetest-53)] using fastdp
INFO: 2021/05/25 15:08:12.728481 ->[x.x.x.x:50991|ce:ae:64:b3:0a:a5(weavetest-53)]: connection added (restarted peer)
INFO: 2021/05/25 15:08:12.728528 Setting up IPsec between da:67:4f:70:22:cf(weavetest-7) and ce:ae:64:b3:0a:a5(weavetest-53)
DEBU: 2021/05/25 15:08:12.783721 fastdp ->[x.x.x.x:6784|ce:ae:64:b3:0a:a5(weavetest-53)]: confirmed
INFO: 2021/05/25 15:08:19.808635 Discovered remote MAC ce:ae:64:b3:0a:a5 at ce:ae:64:b3:0a:a5(weavetest-53)

DEBU: 2021/05/25 15:19:04.568360 Expired MAC ce:ae:64:b3:0a:a5 at ce:ae:64:b3:0a:a5(weavetest-53)
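In case it helps with debugging, the IPsec state that fastdp encryption sets up (the "Setting up IPsec" lines above) can be inspected directly on an affected host; this is a generic kernel-level check rather than part of the weave tooling, and it needs root on the node:

sudo ip -s xfrm state    # IPsec SAs with packet/byte counters
sudo ip xfrm policy      # IPsec policies (fastdp encryption uses the kernel xfrm stack)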

On a healthy connection there is an endless heartbeat exchange between the peers. For example, here are the logs of weavetest-7 talking to weavetest-54:

.
.
DEBU: 2021/05/25 15:37:30.610530 fastdp ->[x.x.x.x:6784|be:f4:d0:de:b2:01(weavetest-54)]: handleVxlanSpecialPacket
DEBU: 2021/05/25 15:37:30.610540 fastdp ->[x.x.x.x:6784|be:f4:d0:de:b2:01(weavetest-54)]: Got Heartbeat Ack from peer
DEBU: 2021/05/25 15:37:30.844400 fastdp ->[x.x.x.x:6784|be:f4:d0:de:b2:01(weavetest-54)]: sending Heartbeat to peer
DEBU: 2021/05/25 15:37:39.709031 sleeve ->[x.x.x.x:6783|be:f4:d0:de:b2:01(weavetest-54)]: handleHeartbeat
DEBU: 2021/05/25 15:37:40.258665 sleeve ->[x.x.x.x:6783|be:f4:d0:de:b2:01(weavetest-54)]: sendHeartbeat
DEBU: 2021/05/25 15:37:40.611157 fastdp ->[x.x.x.x:6784|be:f4:d0:de:b2:01(weavetest-54)]: handleVxlanSpecialPacket
DEBU: 2021/05/25 15:37:40.611166 fastdp ->[x.x.x.x:6784|be:f4:d0:de:b2:01(weavetest-54)]: Got Heartbeat Ack from peer
DEBU: 2021/05/25 15:37:40.844657 fastdp ->[x.x.x.x:6784|be:f4:d0:de:b2:01(weavetest-54)]: sending Heartbeat to peer
DEBU: 2021/05/25 15:37:49.709060 sleeve ->[x.x.x.x:6783|be:f4:d0:de:b2:01(weavetest-54)]: handleHeartbeat
DEBU: 2021/05/25 15:37:50.258721 sleeve ->[x.x.x.x:6783|be:f4:d0:de:b2:01(weavetest-54)]: sendHeartbeat
DEBU: 2021/05/25 15:37:50.611359 fastdp ->[x.x.x.x:6784|be:f4:d0:de:b2:01(weavetest-54)]: handleVxlanSpecialPacket
DEBU: 2021/05/25 15:37:50.611368 fastdp ->[x.x.x.x:6784|be:f4:d0:de:b2:01(weavetest-54)]: Got Heartbeat Ack from peer
DEBU: 2021/05/25 15:37:50.844976 fastdp ->[x.x.x.x:6784|be:f4:d0:de:b2:01(weavetest-54)]: sending Heartbeat to peer
.
.

Note: there seem to be some sleeve logs even though the connection is established with fast datapath:

-> x.x.x.x:6783    established encrypted   fastdp be:f4:d0:de:b2:01(weavetest-54) encrypted=true mtu=1376
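(For reference, the line above is from the connections listing, which can be pulled from any weave pod in the same way as the status above:)

kubectl exec -n kube-system weave-net-zbwvn -c weave -- /home/weave/weave --local status connections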

On peers that can't connect to others, we see non-zero values in the Recv-Q and Send-Q columns. Here is an example:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name
tcp    63216 202192 172.31.46.200:46683     X.X.X.X:6783       ESTABLISHED 0          30225      3780/weaver   # Non working connection   
tcp        0      0 172.31.46.200:44133     Y.Y.Y.Y:6783       ESTABLISHED 0          30182      3780/weaver   # Fully functional       
tcp    63419 273672 172.31.46.200:37893     Z.Z.Z.Z:6783       ESTABLISHED 0          30219      3780/weaver   # Non working connection
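(The table above is netstat output from the affected node; something along these lines reproduces it, filtered to the weave TCP port:)

sudo netstat -atnpe | grep ':6783'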
Cynovski commented 3 years ago

Any ETA for taking a look at this issue? Thanks

GabMgt commented 3 years ago

If you are looking for a solution to this issue, I don't think there is one at the moment. Weave is a good CNI for Kubernetes, but it's hard to run it with encryption on 50+ nodes. It is very sad that Weave is not responding to this issue, because we love Weave and have been using it for more than 4 years.

The solution for us: we are using Cilium. It is a really good alternative Kubernetes CNI with encryption. It uses eBPF, which allows code execution inside the kernel for better performance and reliability. It is very efficient and keeps networking up even when Cilium itself is not fully running (during a rolling update, for example).

We have been using it with 140 nodes for a week now and have had no problems since.

Bonus information: Cilium allows you to create a cluster mesh to connect multiple clusters if you need to.
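For anyone taking the same route, here is a rough sketch of enabling encryption when installing Cilium with Helm; the chart values come from the Cilium docs and may differ between versions, so double-check against the release you deploy:

helm repo add cilium https://helm.cilium.io/
# encryption.type can be wireguard or ipsec, depending on kernel support
helm install cilium cilium/cilium --namespace kube-system \
  --set encryption.enabled=true \
  --set encryption.type=wireguard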