Open Cynovski opened 3 years ago
Any ETA on taking a look at this issue? Thanks
If you are searching for a solution to this issue, there is none at the moment, I think. Weave is a good CNI for Kubernetes, but it is hard to get encryption working with 50+ nodes. It is very sad that Weave is not responding to this issue, because we love Weave and have been using it for 4+ years.
The solution for us: we are using Cilium. It is a really good alternative Kubernetes CNI with encryption. It uses eBPF, which allows code execution inside the kernel for better performance and reliability. It is very efficient and keeps networking up even if Cilium is not fully working (during a rolling update, for example).
We have been using it with 140 nodes for a week now, and no problems so far.
Bonus information: Cilium allows you to create a cluster mesh to connect multiple clusters if you need to.
What you expected to happen?
No pending connections on a cluster of 100+ nodes using both `fast datapath` and `encryption` on Kubernetes.

What happened?
We tried to scale a cluster of 50 machines using `encryption` and Weave `fast datapath` by adding another 50 machines (for a total of 100). When we got to around 70 nodes, we started to see some pending connections (using the `weave status` command) that weren't resolving. We continued until we reached 100 nodes. Of the 9,900 expected connections (100 machines each needing to connect to the 99 others), only 9,000 were established and the rest remained pending. There is no obvious pattern to the pending connections: some servers connect correctly to others while others don't, and even servers among the first 50 can't connect to each other after reaching the ~70-node limit.

So we tried turning off encryption, and there we had a solid 9,900 established connections. We also tried to force the use of `sleeve` with the environment variable `WEAVE_NO_FASTDP=true`, with the same result: all connections were established correctly.

We tried the exact same setup on 100 EC2 instances with the same result: at around ~70 nodes, pending connections start to appear.
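For reference, forcing `sleeve` as described can be done by adding the `WEAVE_NO_FASTDP` variable to the `weave` container of the weave-net DaemonSet. A minimal sketch of the relevant manifest fragment (container and DaemonSet names assumed from the standard Weave Net manifest):

```yaml
# Fragment of the weave-net DaemonSet pod spec (names assumed from the
# standard Weave Net manifest). WEAVE_NO_FASTDP=true disables fast datapath,
# so all traffic falls back to sleeve.
containers:
  - name: weave
    env:
      - name: WEAVE_NO_FASTDP
        value: "true"
```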
How to reproduce it?
Set up a 100+ node Kubernetes cluster and install Weave. Then, check the result of the `status` command on any Weave pod; you can use `status peers` to get more details. You should see some pending connections.
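One way to quantify the problem is to count pending connections on a node and compare against the full-mesh expectation of n − 1 connections per node. The status output below is a made-up sample for illustration; on a real cluster it would come from running `weave status connections` inside a Weave pod (e.g. via `kubectl exec`):

```shell
n=100                    # cluster size
expected=$(( n - 1 ))    # full mesh: each node should hold n - 1 connections

# Made-up sample of `weave status connections` output (format is an
# assumption for illustration, not output from this cluster).
status_output='-> 10.0.1.53:6783  pending
-> 10.0.1.54:6783  established fastdp
<- 10.0.1.55:49152 established fastdp'

pending=$(printf '%s\n' "$status_output" | grep -c pending)
echo "pending: $pending of $expected expected connections"
```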
Anything else we need to know?
k8s workers: t2.medium
k8s control plane (3 nodes): t2.xlarge
Tested with a fully open iptables setup.
Versions:
Logs:
Here are the logs between 2 hosts that can't establish a connection: weavetest-7 <----> weavetest-53

Logs of weavetest-53:

Logs of weavetest-7:
On a healthy connection, there is an endless `Heartbeat` between peers. For example, here are the logs of `weavetest-7` to `weavetest-54`:

Note: there seem to be some `sleeve` logs even though the connection is established with `fast datapath`.
On peers that can't connect to others, we see some non-zero numbers in the `Recv-Q` and `Send-Q` columns. Here is an example:
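Non-zero queue values like these can be flagged mechanically. The sample below is made-up illustrative `ss -tn` output for the Weave control port (6783/tcp), not the output from this cluster; a non-zero Recv-Q or Send-Q means queued data is not being drained by the peer or the application:

```shell
# Made-up sample of `ss -tn` output for illustration (not this cluster's data).
sample='State  Recv-Q Send-Q Local-Address:Port Peer-Address:Port
ESTAB  0      0      10.0.1.7:6783      10.0.1.54:49152
ESTAB  212992 0      10.0.1.7:6783      10.0.1.53:49153'

# Print peers whose sockets have data stuck in either queue.
stalled=$(printf '%s\n' "$sample" | awk 'NR > 1 && ($2 > 0 || $3 > 0) { print $5 }')
echo "stalled peers: $stalled"
```

On a live node, the same filtering could be applied to real output from something like `ss -tn '( sport = :6783 or dport = :6783 )'`.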