weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

Update from 2.2.0 to 2.2.1 leaves Weave pods in CrashLoopBackOff #3259

Closed: carlosedp closed this issue 6 years ago

carlosedp commented 6 years ago

What you expected to happen?

The Weave Net pods to be updated successfully.

What happened?

I fetched the newest manifest to update from 2.2.0 to 2.2.1 using: kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"

After this, the Weave pods in the cluster restarted and went into CrashLoopBackOff. The logs asked for a weave reset.

I did a weave reset on all nodes; the pods went to Running, but all communication to my application pods was lost (although weave status showed that the nodes were connected). I needed a full reboot of all nodes since the rules were wiped.
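
Roughly, the reset on each node looked like this (a sketch, not the exact commands; it assumes the weave script is on the node's PATH and uses the pod name the DaemonSet created there, e.g. weave-net-b5hcj on kubenode1):

# On each node: wipe the Weave bridge, IPAM state, and iptables rules
sudo weave reset --force

# From the master: delete that node's weave-net pod so the DaemonSet recreates it with the new image
kubectl -n kube-system delete pod weave-net-b5hcj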

How to reproduce it?

This happened twice when updating the Weave pods.

Versions:

Linux kubemaster1 4.4.77-rockchip-ayufan-136 #1 SMP Thu Oct 12 09:14:48 UTC 2017 aarch64 GNU/Linux

$ kubectl version
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T11:55:20Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/arm64"}

$ docker version
Client:
 Version:       17.12.1-ce
 API version:   1.35
 Go version:    go1.9.4
 Git commit:    7390fc6
 Built:         Tue Feb 27 22:12:26 2018
 OS/Arch:       linux/arm64

Server:
 Engine:
  Version:      17.12.1-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.4
  Git commit:   7390fc6
  Built:        Tue Feb 27 22:10:35 2018
  OS/Arch:      linux/arm64
  Experimental: false

$ weave version
weave script 2.2.0
weave 2.2.1

$ weave status connections
-> 192.168.1.55:6783     established sleeve a6:b4:8e:f4:d3:1c(kubenode1) mtu=1438
<- 192.168.1.56:52794    established sleeve 86:82:1a:3d:5b:64(kubenode2) mtu=1438
-> 192.168.1.50:6783     failed      cannot connect to ourself, retry: never

$ weave status peers
a6:b4:8e:f4:d3:1c(kubenode1)
   <- 192.168.1.50:53428    6a:97:aa:86:2d:76(kubemaster1)        established
   <- 192.168.1.56:33020    86:82:1a:3d:5b:64(kubenode2)          established
86:82:1a:3d:5b:64(kubenode2)
   -> 192.168.1.55:6783     a6:b4:8e:f4:d3:1c(kubenode1)          established
   -> 192.168.1.50:6783     6a:97:aa:86:2d:76(kubemaster1)        established
6a:97:aa:86:2d:76(kubemaster1)
   -> 192.168.1.55:6783     a6:b4:8e:f4:d3:1c(kubenode1)          established
   <- 192.168.1.56:52794    86:82:1a:3d:5b:64(kubenode2)          established

Logs:

Pod description:

Pod: weave-net-b5hcj

Name:           weave-net-b5hcj
Namespace:      kube-system
Node:           kubenode1/192.168.1.55
Start Time:     Thu, 15 Mar 2018 10:37:54 -0500
Labels:         controller-revision-hash=2458412186
                name=weave-net
                pod-template-generation=2
Annotations:    <none>
Status:         Running
IP:             192.168.1.55
Controlled By:  DaemonSet/weave-net
Containers:
  weave:
    Container ID:  docker://7c9758fbd2579b66656c436a0e1a96772c3ca2a3efcfe2c58b85cfa507768d02
    Image:         weaveworks/weave-kube:2.2.1
    Image ID:      docker-pullable://weaveworks/weave-kube@sha256:801f6e87b6d0f87d1391d8d7e5eeb5f9fbee81012dbcbf488809f48ef3479ce2
    Port:          <none>
    Command:
      /home/weave/launch.sh
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 15 Mar 2018 10:38:36 -0500
      Finished:     Thu, 15 Mar 2018 10:38:37 -0500
    Ready:          False
    Restart Count:  1
    Requests:
      cpu:     10m
    Liveness:  http-get http://127.0.0.1:6784/status delay=30s timeout=1s period=10s #success=1 #failure=3
    Environment:
      HOSTNAME:   (v1:spec.nodeName)
    Mounts:
      /host/etc from cni-conf (rw)
      /host/home from cni-bin2 (rw)
      /host/opt from cni-bin (rw)
      /host/var/lib/dbus from dbus (rw)
      /lib/modules from lib-modules (rw)
      /run/xtables.lock from xtables-lock (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from weave-net-token-5wjnr (ro)
      /weavedb from weavedb (rw)
  weave-npc:
    Container ID:   docker://dcf0e68ddc63adb40f47d9330c38eabf7fbda389a8381fec9984d543c02fc479
    Image:          weaveworks/weave-npc:2.2.1
    Image ID:       docker-pullable://weaveworks/weave-npc@sha256:6a39dfa58f1a3a0afc103ab751fd75b26dd6cc591e93c4f738675f50f59b391a
    Port:           <none>
    State:          Running
      Started:      Thu, 15 Mar 2018 10:38:33 -0500
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:  10m
    Environment:
      HOSTNAME:   (v1:spec.nodeName)
    Mounts:
      /run/xtables.lock from xtables-lock (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from weave-net-token-5wjnr (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  weavedb:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/weave
    HostPathType:
  cni-bin:
    Type:          HostPath (bare host directory volume)
    Path:          /opt
    HostPathType:
  cni-bin2:
    Type:          HostPath (bare host directory volume)
    Path:          /home
    HostPathType:
  cni-conf:
    Type:          HostPath (bare host directory volume)
    Path:          /etc
    HostPathType:
  dbus:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/dbus
    HostPathType:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:
  weave-net-token-5wjnr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  weave-net-token-5wjnr
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     :NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason                 Age                From                Message
  ----     ------                 ----               ----                -------
  Normal   SuccessfulMountVolume  53s                kubelet, kubenode1  MountVolume.SetUp succeeded for volume "cni-bin"
  Normal   SuccessfulMountVolume  53s                kubelet, kubenode1  MountVolume.SetUp succeeded for volume "lib-modules"
  Normal   SuccessfulMountVolume  53s                kubelet, kubenode1  MountVolume.SetUp succeeded for volume "xtables-lock"
  Normal   SuccessfulMountVolume  53s                kubelet, kubenode1  MountVolume.SetUp succeeded for volume "dbus"
  Normal   SuccessfulMountVolume  53s                kubelet, kubenode1  MountVolume.SetUp succeeded for volume "weavedb"
  Normal   SuccessfulMountVolume  53s                kubelet, kubenode1  MountVolume.SetUp succeeded for volume "cni-conf"
  Normal   SuccessfulMountVolume  53s                kubelet, kubenode1  MountVolume.SetUp succeeded for volume "weave-net-token-5wjnr"
  Normal   Pulling                53s                kubelet, kubenode1  pulling image "weaveworks/weave-kube:2.2.1"
  Normal   SuccessfulMountVolume  53s                kubelet, kubenode1  MountVolume.SetUp succeeded for volume "cni-bin2"
  Normal   Pulling                27s                kubelet, kubenode1  pulling image "weaveworks/weave-npc:2.2.1"
  Normal   Pulled                 27s                kubelet, kubenode1  Successfully pulled image "weaveworks/weave-kube:2.2.1"
  Normal   Pulled                 16s                kubelet, kubenode1  Successfully pulled image "weaveworks/weave-npc:2.2.1"
  Normal   Created                16s                kubelet, kubenode1  Created container
  Normal   Started                15s                kubelet, kubenode1  Started container
  Normal   Created                13s (x2 over 27s)  kubelet, kubenode1  Created container
  Normal   Pulled                 13s                kubelet, kubenode1  Container image "weaveworks/weave-kube:2.2.1" already present on machine
  Normal   Started                12s (x2 over 27s)  kubelet, kubenode1  Started container
  Warning  BackOff                4s (x3 over 9s)    kubelet, kubenode1  Back-off restarting failed container

Pod Logs:

rock64@kubemaster1:~/repos/kubernetes-ARM (kubearm:kube-system) $ klog weave-net-b5hcj weave
Pod: weave-net-b5hcj

modprobe: module br_netfilter not found in modules.dep
Ignore the error if "br_netfilter" is built-in in the kernel
DEBU: 2018/03/15 15:38:37.647853 [kube-peers] Checking peer "a6:b4:8e:f4:d3:1c" against list &{[{6a:97:aa:86:2d:76 kubemaster1} {a6:b4:8e:f4:d3:1c kubenode1} {86:82:1a:3d:5b:64 kubenode2}]}
INFO: 2018/03/15 15:38:37.869850 Command line options: map[ipalloc-init:consensus=3 expect-npc:true host-root:/host ipalloc-range:10.32.0.0/12 port:6783 datapath:datapath nickname:kubenode1 no-dns:true name:a6:b4:8e:f4:d3:1c status-addr:0.0.0.0:6782 conn-limit:30 db-prefix:/weavedb/weave-net docker-api: http-addr:127.0.0.1:6784]
INFO: 2018/03/15 15:38:37.870036 weave  2.2.1
FATA: 2018/03/15 15:38:37.870839 Existing bridge type "bridge" is different than requested "bridged_fastdp". Please do 'weave reset' and try again
rpc error: code = Unknown desc = Error: No such container: 7c9758fbd2579b66656c436a0e1a96772c3ca2a3efcfe2c58b85cfa507768d02

Network:

IPTables after update:

rock64@kubemaster1:~/repos/kubernetes-ARM (kubearm:kube-system) $ sudo iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  anywhere             anywhere             /* kubernetes service portals */
KUBE-FIREWALL  all  --  anywhere             anywhere

Chain FORWARD (policy DROP)
target     prot opt source               destination
WEAVE-NPC  all  --  anywhere             anywhere             /* NOTE: this must go before '-j KUBE-FORWARD' */
NFLOG      all  --  anywhere             anywhere             state NEW nflog-group 86
DROP       all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
KUBE-FORWARD  all  --  anywhere             anywhere             /* kubernetes forward rules */
DOCKER-USER  all  --  anywhere             anywhere
DOCKER-ISOLATION  all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  anywhere             anywhere             /* kubernetes service portals */
KUBE-FIREWALL  all  --  anywhere             anywhere

Chain DOCKER (1 references)
target     prot opt source               destination

Chain DOCKER-ISOLATION (1 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere

Chain DOCKER-USER (1 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere

Chain KUBE-FIREWALL (2 references)
target     prot opt source               destination
DROP       all  --  anywhere             anywhere             /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000

Chain KUBE-FORWARD (1 references)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere             /* kubernetes forwarding rules */ mark match 0x4000/0x4000

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
REJECT     tcp  --  anywhere             10.96.0.10           /* kube-system/kube-dns:dns-tcp has no endpoints */ tcp dpt:domain reject-with icmp-port-unreachable
REJECT     tcp  --  anywhere             anywhere             /* monitoring/prometheus-k8s:web has no endpoints */ ADDRTYPE match dst-type LOCAL tcp dpt:30900 reject-with icmp-port-unreachable
REJECT     tcp  --  anywhere             10.96.233.162        /* monitoring/prometheus-k8s:web has no endpoints */ tcp dpt:9090 reject-with icmp-port-unreachable
REJECT     udp  --  anywhere             10.96.0.10           /* kube-system/kube-dns:dns has no endpoints */ udp dpt:domain reject-with icmp-port-unreachable

Chain WEAVE-NPC (1 references)
target     prot opt source               destination

Chain WEAVE-NPC-DEFAULT (0 references)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere             match-set weave-E.1.0W^NGSp]0_t5WwH/]gX@L dst /* DefaultAllow isolation for namespace: default */
ACCEPT     all  --  anywhere             anywhere             match-set weave-0EHD/vdN#O4]V?o4Tx7kS;APH dst /* DefaultAllow isolation for namespace: kube-public */
ACCEPT     all  --  anywhere             anywhere             match-set weave-?b%zl9GIe0AET1(QI^7NWe*fO dst /* DefaultAllow isolation for namespace: kube-system */
ACCEPT     all  --  anywhere             anywhere             match-set weave-nnn:05q%gSR8#E0t4|T#A$Mu1 dst /* DefaultAllow isolation for namespace: logging */
ACCEPT     all  --  anywhere             anywhere             match-set weave-~zf6jE)_J7*w!=f=0gxUC[x*H dst /* DefaultAllow isolation for namespace: mediaserver */
ACCEPT     all  --  anywhere             anywhere             match-set weave-:FcGPRhYdfA7EiC42@F@|EzU7 dst /* DefaultAllow isolation for namespace: metallb-system */
ACCEPT     all  --  anywhere             anywhere             match-set weave-SpmnUYN8K)#H:z3~=6U00i5dN dst /* DefaultAllow isolation for namespace: monitoring */
ACCEPT     all  --  anywhere             anywhere             match-set weave-eD49/JG8ppm|2{LA7)TC.N8nY dst /* DefaultAllow isolation for namespace: openfaas */
ACCEPT     all  --  anywhere             anywhere             match-set weave-9E2erS!C3iw}]:W2Nt0=V.5K dst /* DefaultAllow isolation for namespace: openfaas-fn */

zetaab commented 6 years ago

I am trying to do a clean installation using the same commands. The result is

% kubectl logs -n kube-system -c weave weave-net-gkltc
Failed to get peers

Now I am trying a different version.

carlosedp commented 6 years ago

@zetaab I had similar problems while deploying Kubernetes, where I needed to reset everything between installs. I made a gist with the commands to reset it all: https://gist.github.com/carlosedp/5040f4a1b2c97c1fa260a3409b5f14f9

Start from line 10 for resetting Weave.

zetaab commented 6 years ago

This has worked for about a year, but now the Ansible configuration is broken :(

brb commented 6 years ago

@carlosedp Thanks for the issue.

The error FATA: 2018/03/15 15:38:37.870839 Existing bridge type "bridge" is different than requested "bridged_fastdp". Please do 'weave reset' and try again indicates that before the update you were running Weave with fastdp disabled. Did you do that intentionally? If not, do you have the weave-kube logs from before the update?
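
To see what is actually on the host, something along these lines can help (a sketch; the device names assume Weave's defaults):

# The "weave" bridge exists in both modes; -d shows the bridge details
ip -d link show weave
# The OVS "datapath" device is only created when fastdp is in use
ip -d link show datapath
# fastdp needs the openvswitch kernel module
lsmod | grep openvswitch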

I did a weave reset on all nodes; the pods went to Running, but all communication to my application pods was lost (although weave status showed that the nodes were connected).

It's expected that existing connections to pods are lost after you reset Weave Net. However, rebooting the machines is not required to re-connect. Do your client applications implement any connection-retry mechanism?
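
If the applications don't re-establish connections on their own, recreating the affected pods (so they pick up fresh interfaces on the new bridge) should be enough rather than a node reboot. A hypothetical example (namespace and label are placeholders):

# Recreate the pods of one workload; its controller will start replacements
kubectl -n default delete pod -l app=my-app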

brb commented 6 years ago

@zetaab Could you please open a separate issue?

carlosedp commented 6 years ago

@brb, I haven't disabled fastdp; I just deployed using the default manifest. The nodes are connecting in sleeve mode by default:

$ weave status connections
<- 192.168.1.56:41907    established sleeve 86:82:1a:3d:5b:64(kubenode2) mtu=1438
<- 192.168.1.55:45352    established sleeve a6:b4:8e:f4:d3:1c(kubenode1) mtu=1438
-> 192.168.1.50:6783     failed      cannot connect to ourself, retry: never

Maybe there's a missing option in my kernel? I'm running on the ARM64 platform:

$ uname -a
Linux kubemaster1 4.4.77-rockchip-ayufan-136 #1 SMP Thu Oct 12 09:14:48 UTC 2017 aarch64 GNU/Linux

carlosedp commented 6 years ago

I found that my kernel does not have the OVS modules compiled. I will build them and try again to see if fastdp gets enabled.
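
For anyone checking the same thing, this is roughly what I looked at (a sketch; where the kernel config lives depends on how the kernel was packaged):

# Is the openvswitch module loaded, or at least available?
lsmod | grep openvswitch
modinfo openvswitch
# Was it enabled at build time? (=y or =m)
grep -i CONFIG_OPENVSWITCH /boot/config-$(uname -r)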

carlosedp commented 6 years ago

I just updated the kernel on my nodes to one that contains the openvswitch module, and now the Weave nodes connect in fastdp mode:

rock64@kubemaster1:~ (kubearm:kube-system) $ weave status connections
-> 192.168.1.55:6783     established fastdp a6:b4:8e:f4:d3:1c(kubenode1) mtu=1376
<- 192.168.1.56:34049    established fastdp 86:82:1a:3d:5b:64(kubenode2) mtu=1376
-> 192.168.1.50:6783     failed      cannot connect to ourself, retry: never

I will follow up if the problem happens again. Will close the issue.

brb commented 6 years ago

@carlosedp Thanks for the update.

I find it strange that weave previously created a bridge of the bridged_fastdp type even though your machine didn't have the required OVS modules.

jmreicha commented 6 years ago

@carlosedp Which kernel did you update to?

carlosedp commented 6 years ago

@jmreicha I compiled kernel 4.4.114 for the Pine64 Rock64 boards myself from Ayufan's repository (https://github.com/ayufan-rock64/linux-build) and added the modules to the config. It's been pretty stable for a while: no hangs, freezes, or dumps. Maybe his latest versions are just as stable, but I don't want to change what's working :)

$ uname -a
Linux kubemaster1 4.4.114-rockchip-ayufan-1 #1 SMP Thu Mar 22 16:02:29 UTC 2018 aarch64 GNU/Linux
$ uptime
 11:27:12 up 40 days, 14:59,  1 user,  load average: 4.22, 4.73, 5.60

jmreicha commented 6 years ago

@carlosedp Interesting. What version of Kubernetes are you running? I was having some stability issues with Kubernetes 1.10.x and newer kernels.

carlosedp commented 6 years ago

K8s 1.9.7. I tried to update but faced some timeout problems with kubeadm. I had stability issues and kernel dumps before, but since this kernel it's been very stable; I'm not sure what fixed it, though. In case you want to try, I've uploaded the files to: https://we.tl/ivPbhswOY2

jmreicha commented 6 years ago

@carlosedp Awesome thanks 🎉