weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

weave on kubernetes - communication to other nodes no longer possible #3825

Open vernimmen-textkernel opened 4 years ago

vernimmen-textkernel commented 4 years ago

What you expected to happen?

We expect communication between pods on different Kubernetes nodes not to break.

What happened?

Symptoms: pods on one Kubernetes worker node stop being able to communicate with pods on other worker nodes; all other worker nodes remain fine. To work around the problem, we delete the weave pod on the affected worker node. Once the new pod is up, everything returns to normal. After a while (anywhere between 24 and 96 hours) the problem happens again, sometimes on the same worker node, sometimes on a different one. When looking at the connections, some or all of them are using sleeve instead of fastdp.
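
For reference, the workaround is roughly the following (a sketch assuming the stock weave-net DaemonSet label and namespace; the node name is just an example):

# delete the weave pod on the affected node; the DaemonSet recreates it
kubectl -n kube-system delete pod -l name=weave-net \
  --field-selector spec.nodeName=kubew-03.p.nl01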

How to reproduce it?

It is currently happening about once per 48 hours for us. We do not yet have a way to trigger the problem. To try to trigger it we disconnected the network on one of the worker nodes for a few seconds, but that did not do anything.

Anything else we need to know?

$ kubectl get nodes -o wide
NAME              STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION           CONTAINER-RUNTIME
kubem-01.p.nl01   Ready    master   45d   v1.14.6   10.30.12.1    <none>        CentOS Linux 7 (Core)   3.10.0-1127.el7.x86_64   docker://18.9.7
kubem-02.p.nl01   Ready    master   45d   v1.14.6   10.30.12.2    <none>        CentOS Linux 7 (Core)   3.10.0-1127.el7.x86_64   docker://18.9.7
kubem-03.p.nl01   Ready    master   45d   v1.14.6   10.30.12.3    <none>        CentOS Linux 7 (Core)   3.10.0-1127.el7.x86_64   docker://18.9.7
kubew-01.p.nl01   Ready    <none>   45d   v1.14.6   10.30.12.4    <none>        CentOS Linux 7 (Core)   3.10.0-1127.el7.x86_64   docker://18.9.7
kubew-02.p.nl01   Ready    <none>   45d   v1.14.6   10.30.12.5    <none>        CentOS Linux 7 (Core)   3.10.0-1127.el7.x86_64   docker://18.9.7
kubew-03.p.nl01   Ready    <none>   45d   v1.14.6   10.30.12.6    <none>        CentOS Linux 7 (Core)   3.10.0-1127.el7.x86_64   docker://18.9.7

Created by kubespray 2.11. This runs in VMs on 3 hypervisors on-prem.

In my eyes the symptoms of this issue resemble https://github.com/weaveworks/weave/issues/3641 and https://github.com/weaveworks/weave/issues/3773

Versions:

$ weave version
        Version: 2.5.2 (version check update disabled)

        Service: router
       Protocol: weave 1..2
           Name: 7e:a4:fd:ed:fb:eb(kubew-03.p.nl01)
     Encryption: enabled
  PeerDiscovery: enabled
        Targets: 6
    Connections: 6 (5 established, 1 failed)
          Peers: 6 (with 30 established connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.233.64.0/18
  DefaultSubnet: 10.233.64.0/18

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-21T14:51:23Z", GoVersion:"go1.14.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.6", GitCommit:"96fac5cd13a5dc064f7d9f4f23030a6aeface6cc", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:16Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

Logs:

From this moment the communication problem started:

INFO: 2020/06/25 22:52:21.822433 overlay_switch ->[6e:d6:5e:a3:11:e7(kubem-03.p.nl01)] using sleeve
INFO: 2020/06/25 22:52:21.831501 overlay_switch ->[7a:ed:82:5e:d8:49(kubew-02.p.nl01)] using sleeve
INFO: 2020/06/25 22:52:21.840212 overlay_switch ->[b2:d9:ec:8b:d8:ff(kubem-01.p.nl01)] using sleeve
INFO: 2020/06/25 22:52:21.862142 overlay_switch ->[fa:b4:e9:63:60:33(kubem-02.p.nl01)] using sleeve
INFO: 2020/06/25 22:52:24.123512 overlay_switch ->[b6:55:d4:04:30:d3(kubew-01.p.nl01)] using sleeve

full logs of that worker node's weave pod are in https://gist.github.com/vernimmen-textkernel/110a8219a7ea33eeeea3997adf18bf6c

vernimmen-textkernel commented 4 years ago

The problem reoccurred, this time after 3 days. I've now enabled debug logging to get more information when it happens again.

vernimmen-textkernel commented 4 years ago

And happened again this morning. The debug log for the broken pod (and some other details) is here: https://gist.github.com/vernimmen-textkernel/7b99aa7c076b4458684669dea4092c3f

vernimmen-textkernel commented 4 years ago

And another one: https://gist.github.com/vernimmen-textkernel/a8e3959f2c856ca9519c05640eba7ab0 I have now applied the automatic weave pod restart when there are too many sleeve connections, as mentioned in issue 3773, so we probably won't notice this problem any more. I hope the above logs and debug logs are enough to find the cause of the problem.
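
Roughly this, for reference (a sketch with an assumed label, container name and threshold, not the exact script from that issue):

# hypothetical check-and-restart loop, e.g. run from cron; assumes the weave-net
# DaemonSet lives in kube-system and the weave script is at /home/weave/weave
MAX_SLEEVE=2
for pod in $(kubectl -n kube-system get pods -l name=weave-net -o name); do
  sleeve=$(kubectl -n kube-system exec "$pod" -c weave -- \
             /home/weave/weave --local status connections | grep -c sleeve)
  if [ "${sleeve:-0}" -gt "$MAX_SLEEVE" ]; then
    echo "$(date) $pod has $sleeve sleeve connections, restarting it"
    kubectl -n kube-system delete "$pod"
  fi
done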

bboreham commented 4 years ago

Hi @vernimmen-textkernel

Where you get a message like this:

INFO: 2020/06/25 22:51:21.442962 ->[10.30.12.2:6783|fa:b4:e9:63:60:33(kubem-02.p.nl01)]: connection shutting down due to error: read tcp4 10.30.12.6:46468->10.30.12.2:6783: read: connection reset by peer

we need the logs of the other side, to see why it dropped the connection.

Could you please run weave status connections at the time of the outage to show what is and isn't working?

There are no errors or connection drops in the other two gists. (Generally we can see what happened from INFO logs and don't need DEBUG)

nesc58 commented 4 years ago

I thought that my network problems were related to this issue.

In my case, Weave never used sleeve mode. After investigating a lot of issues on GitHub I found the solution to my problem:

The connection between nodes was cancelled and broken because the iptables rules were incorrect. It took me a lot of time to understand and solve this. My kube-proxy.yaml did not contain the xtables.lock file mount, so weave-net used the /run/xtables.lock file but kube-proxy did not, and the two applications had a kind of race condition while manipulating the iptables rules.

wizard580 commented 4 years ago

Just checked. In my case (KOPS-managed cluster) the kube-proxy manifest has the xtables.lock file mount. Same for weave. @nesc58 can you please show the exact part you missed?

wizard580 commented 4 years ago

In my case, if I were able to check connectivity (from the weave pod) to the cluster services, I could set up a liveness check. The tricky part for me now is how to do this with weave, since it's a privileged pod with host networking, and I would have to do such checks over the weave-provided layer. Anyone have ideas?
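
One idea I'm toying with (an untested sketch; all names below are made up): run a tiny non-hostNetwork "canary" DaemonSet whose liveness probe resolves a cluster service. Since the canary sits on the weave-provided pod network, the probe failing (and the canary's restart count rising) on a node would flag broken overlay connectivity there.

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: weave-canary
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: weave-canary
  template:
    metadata:
      labels:
        app: weave-canary
    spec:
      containers:
      - name: canary
        image: busybox:1.32
        command: ["sh", "-c", "while true; do sleep 3600; done"]
        livenessProbe:
          exec:
            # the DNS lookup goes to the cluster DNS service over the pod
            # network, so it exercises cross-node connectivity through weave
            command: ["nslookup", "kubernetes.default.svc.cluster.local"]
          periodSeconds: 30
          timeoutSeconds: 5
          failureThreshold: 3
EOF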

nesc58 commented 4 years ago

Hi, apologies for the late reply.

The weave DaemonSet contains the following mounts. I have removed a lot of other lines (metadata, hostNetwork and so on).

containers:
- name: weave
  image: weaveworks/weave-kube:2.6.5
  ...
  volumeMounts:
    ...
    - name: xtables-lock
      mountPath: /run/xtables.lock
      readOnly: false
- name: weave-npc
  image: weaveworks/weave-npc:2.6.5
  ...
  volumeMounts:
  - name: xtables-lock
    mountPath: /run/xtables.lock
    readOnly: false
volumes:
...
- name: xtables-lock
  hostPath:
    path: /run/xtables.lock
    type: FileOrCreate

These mounts must also be available to the kube-proxy containers. (For me it is located at /etc/kubernetes/manifests/kube-proxy.yaml; I don't know where the static pod files are stored on your system.)

apiVersion: v1
kind: Pod
...
spec:
  hostNetwork: true
  containers:
    - name: kube-proxy
      image: gcr.io/google-containers/kube-proxy-amd64:v1.18.6
      command:
        - kube-proxy
        - --config=/var/lib/kubelet/kube-proxy.config
      securityContext:
        privileged: true
      volumeMounts:
        ...
        - mountPath: /run/xtables.lock
          name: iptableslock
          readOnly: false
  volumes:
     ...
    - hostPath:
        path: /run/xtables.lock
        type: FileOrCreate
      name: iptableslock

The mounts for kube-proxy were missing on my system. If these mounts are already set on your system, iptables modifications should work fine.
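
A quick way to verify this (sketched below; the DaemonSet names and manifest path are assumptions that may not match your cluster):

# check that both weave-net and kube-proxy reference the shared lock file
kubectl -n kube-system get ds weave-net -o yaml | grep -B2 -A2 xtables.lock
kubectl -n kube-system get ds kube-proxy -o yaml | grep -B2 -A2 xtables.lock \
  || grep -B2 -A2 xtables.lock /etc/kubernetes/manifests/kube-proxy.yaml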

We had a lot of configuration issues, which resulted in an unstable cluster, so I think our problems differ a lot.

Maybe kops created all parameters correctly and no manual fixes are needed.

Here is what I changed in our configuration. We use the Debian Linux distribution, where cgroups are managed by systemd by default (on other distributions this is done by cgroupfs), so I had to change the Docker daemon to use systemd as the cgroup driver (the default is cgroupfs), and the kubelet configuration must then also be changed to use the systemd cgroup driver (https://github.com/kubernetes/kubeadm/issues/1394); a rough sketch of that change follows after the snippet below. After that I also changed the systemd configuration: I created /etc/systemd/system.conf.d/accounting.conf with the following content to ensure that CPU, memory and block-IO accounting are enabled by default.

[Manager]
DefaultCPUAccounting=yes
DefaultMemoryAccounting=yes
DefaultBlockIOAccounting=yes
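
For completeness, the Docker-side change mentioned above looked roughly like this (a sketch; the file locations and the way kubelet is configured vary between setups, so treat the paths as assumptions):

# switch docker to the systemd cgroup driver (the default is cgroupfs)
cat > /etc/docker/daemon.json <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
# kubelet must use the same driver, e.g. cgroupDriver: systemd in its config
# file or --cgroup-driver=systemd in its extra args, depending on your setup
systemctl daemon-reload
systemctl restart docker kubelet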

Next, there seems to be a bug in cgroup handling when a lot of containers are started/recreated. For this we had to change some kernel parameters: I added cgroup_enable=memory cgroup.memory=nokmem to GRUB_CMDLINE_LINUX (GRUB is the default bootloader for Debian systems); see the sketch after this paragraph. (https://github.com/kubernetes/kubernetes/issues/70324#issuecomment-433612120 -> referenced issues)
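
Roughly, on a Debian system with GRUB, something like this (a sketch only; adapt it to your existing /etc/default/grub):

# append the cgroup parameters to the kernel command line and rebuild the grub config
sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 cgroup_enable=memory cgroup.memory=nokmem"/' /etc/default/grub
update-grub   # a reboot is needed for the new command line to take effect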

There have been no crashes on our clusters since these changes. I would say the fix works for me, but I can't say it is the universal solution for all problems; I'm unable to figure out all the side effects of these changes.

I am sorry that I cannot help you further.

Edit: Today Debian 10.5 was released with a new kernel version that has many fixes backported, e.g. a lot of memory fixes (mm/slub, cgroups and so on). I hope this will solve a lot of issues and memory leaks.