weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

Overlay network stops working #3773

Open oscar-martin opened 4 years ago

oscar-martin commented 4 years ago

We have been using weave v2.6.0 as the CNI in our Kubernetes environment for several months, in a cluster with more than 30 VMs. From time to time (approximately once per week), containers in a VM stop being able to connect to containers in other VMs, although they can still connect to containers running within the same VM.

The symptoms we see are Destination Host Unreachable, Unknown Host Exception and related networking errors.

Looking at weave status connections for this VM, it shows sleeve mode with all the other peers.

The only way to make it work again was to restart the weave pod in that VM.

Troubleshooting

We spent a couple of days looking into it to find the root cause, and this is what we found. A netlink "socket" connected to the weave process was dropping all packets:

cat /proc/net/netlink
sk       Eth Pid    Groups   Rmem     Wmem     Dump     Locks     **Drops**     Inode
...
ffffa1479c048000 16  4294963118 00000000 214440   0        0 2        **37259969** 70732
...

Then, we looked for inode 70732 in the weave process:

$ sudo ls -l /proc/7986/fd | grep 70732
lrwx------ 1 root root 64 Jan 24 11:35 8 -> socket:[70732]

So file descriptor 8 was being used in user space to read/write to it. Additionally, we listed the open netlink sockets:

 ss -f netlink | grep genl
Recv-Q Send-Q Local Address:Port                 Peer Address:Port
0      0            genl:-4180                          *
0      0            genl:weaver/7986                    *
0      0            genl:kernel                         *
214440 0            genl:-4178                          *
0      0            genl:-4179                          *
0      0            genl:-4177                          *
0      0            genl:-4180                          *
0      0            genl:-4179                          *

We found that the Rmem value and the Recv-Q value match (214440). So everything pointed to the read queue being full and new packets being dropped. The question then was: what is being sent over that socket, and what should be reading it?
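
In case it helps anyone repeat this check, here is a minimal Go sketch (we did it by hand with the commands above) that scans /proc/net/netlink and flags sockets with a non-zero Drops counter, assuming the column layout shown above (sk, Eth, Pid, Groups, Rmem, Wmem, Dump, Locks, Drops, Inode):

// netlinkdrops.go: scan /proc/net/netlink and report sockets that are
// dropping messages, so their inode can be matched against /proc/<pid>/fd.
package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

func main() {
    f, err := os.Open("/proc/net/netlink")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer f.Close()

    sc := bufio.NewScanner(f)
    sc.Scan() // skip the header line (sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode)
    for sc.Scan() {
        fields := strings.Fields(sc.Text())
        if len(fields) < 10 {
            continue
        }
        rmem, _ := strconv.ParseUint(fields[4], 10, 64)
        drops, _ := strconv.ParseUint(fields[8], 10, 64)
        inode := fields[9]
        if drops > 0 {
            // A large Rmem plus a growing Drops counter means nothing in
            // user space is draining this socket's receive queue.
            fmt.Printf("inode=%s rmem=%d drops=%d\n", inode, rmem, drops)
        }
    }
}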

We forced a dump of the goroutine stack traces for weave and looked for goroutines reading from fd 8 (0x8). We found none. We did the same on another VM where weave was working fine, and there we got this nice stack trace:

goroutine 91 [syscall]:
syscall.Syscall6(0x2d, **0x8**, 0xc000642000, 0x10000, 0x0, 0xc001bc7d48, 0xc001bc7d3c, 0x1a8b468, 0xc001bc7ce8, 0x1)
        /usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
syscall.recvfrom(**0x8**, 0xc000642000, 0x10000, 0x10000, 0x0, 0xc001bc7d48, 0xc001bc7d3c, 0x100000000, 0x1c79a60, 0x8)
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:1618 +0xa3
syscall.Recvfrom(**0x8**, 0xc000642000, 0x10000, 0x10000, 0x0, 0x87595d, 0xc0004548f0, 0xc0000f6428, 0x2a, 0x31)
        /usr/local/go/src/syscall/syscall_unix.go:273 +0xaf
github.com/weaveworks/weave/vendor/github.com/weaveworks/go-odp/odp.(*NetlinkSocket).recv(0xc0005b8540, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/go-odp/odp/netlink.go:552 +0x67
github.com/weaveworks/weave/vendor/github.com/weaveworks/go-odp/odp.(*NetlinkSocket).Receive(0xc0005b8540, 0xc001bc7f20, 0xc000026f40, 0x40c2d5)
        /go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/go-odp/odp/netlink.go:575 +0x8f
github.com/weaveworks/weave/vendor/github.com/weaveworks/go-odp/odp.(*NetlinkSocket).consume(0xc0005b8540, 0x7f8b98511218, 0xc0004548f0, 0xc001bc7f80)
        /go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/go-odp/odp/netlink.go:680 +0x70
github.com/weaveworks/weave/vendor/github.com/weaveworks/go-odp/odp.DatapathHandle.consumeMisses(0xc0000e3680, 0x6, 0x1c79a20, 0xc0004548f0, 0xc0000e3720)
        /go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/go-odp/odp/packet.go:124 +0xba
created by github.com/weaveworks/weave/vendor/github.com/weaveworks/go-odp/odp.DatapathHandle.ConsumeMisses
        /go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/go-odp/odp/packet.go:74 +0x3ac

So odp.DatapathHandle.ConsumeMisses was not running in our problematic VM. We dug into the code and understood that when the datapath cannot find a Flow for a packet, it sends that (missed) packet to user space to be handled, so that a new Flow can be created for it. We also understood that the goroutine in charge of receiving those packets is created here: https://github.com/weaveworks/weave/blob/v2.6.0/vendor/github.com/weaveworks/go-odp/odp/packet.go#L74.
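
For readers not familiar with that code, the general shape of the pattern is roughly the following. This is a simplified, self-contained sketch with hypothetical names (missSocket, consumeMisses, handleMiss), not the actual go-odp implementation:

package main

import (
    "errors"
    "fmt"
)

// missSocket stands in for the netlink socket that carries "upcalls":
// packets the datapath had no Flow for. Hypothetical interface.
type missSocket interface {
    Receive() ([]byte, error)
}

// consumeMisses mirrors the pattern referenced above: one goroutine drains
// the socket and hands each missed packet to a handler that can install a
// new Flow. If Receive fails, the loop returns; after that nothing reads
// the socket, its receive queue fills up, and the kernel starts counting
// Drops, as seen in /proc/net/netlink. The returned channel is closed when
// the goroutine exits.
func consumeMisses(s missSocket, handleMiss func([]byte)) <-chan struct{} {
    done := make(chan struct{})
    go func() {
        defer close(done)
        for {
            pkt, err := s.Receive()
            if err != nil {
                fmt.Println("miss consumer stopped:", err)
                return
            }
            handleMiss(pkt)
        }
    }()
    return done
}

// fakeSocket lets the sketch run without a real datapath: it yields a few
// "missed packets" and then fails, the way a netlink error would.
type fakeSocket struct{ left int }

func (f *fakeSocket) Receive() ([]byte, error) {
    if f.left == 0 {
        return nil, errors.New("netlink receive error")
    }
    f.left--
    return []byte("missed packet"), nil
}

func main() {
    done := consumeMisses(&fakeSocket{left: 3}, func(p []byte) {
        fmt.Println("would create a Flow for:", string(p))
    })
    <-done
}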

So it seems that, without that goroutine reading the missed packets, weave is not able to create new Flows in the datapath, and packets are "lost" in the openvswitch module:

ovs-dpctl show
system@datapath:
        lookups: hit:3665353374 missed:**43031265** lost:**37274598**
        flows: 0
        port 0: datapath (internal)
        port 1: vethwe-datapath
        port 2: vxlan-6784 (vxlan: df_default=false, ttl=0)

EDIT: The following is not true (as @bboreham pointed out here)

Finally, we saw a reason why this goroutine could end: https://github.com/weaveworks/weave/blob/v2.6.0/vendor/github.com/weaveworks/go-odp/odp/netlink.go#L693-L696

So if an error comes out of NetlinkSocket.Receive, the goroutine ends, nothing reads from that socket any more, and weave stops processing "upcalls" from openvswitch. /EDIT

What you expected to happen?

The aforementioned goroutine should either not terminate, or a new one should be spawned to resume reading from this netlink socket.
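
One possible shape for that restart, purely as an illustration and under the assumption that the error is transient (this is not weave's actual code, and a real fix would still need to treat fatal errors differently, e.g. by exiting the process):

package main

import (
    "errors"
    "fmt"
    "time"
)

// supervise re-runs worker whenever it returns an error, with a small
// backoff, so a transient failure does not leave the netlink socket
// permanently unread. Here worker represents one run of the consume loop,
// which blocks until it errors or finishes.
func supervise(name string, worker func() error, maxRestarts int) {
    for attempt := 1; attempt <= maxRestarts; attempt++ {
        if err := worker(); err != nil {
            fmt.Printf("%s attempt %d/%d failed: %v\n", name, attempt, maxRestarts, err)
            time.Sleep(100 * time.Millisecond) // back off before respawning
            continue
        }
        return // worker finished cleanly, nothing to restart
    }
    fmt.Println(name, "gave up after too many restarts")
}

func main() {
    runs := 0
    supervise("miss-consumer", func() error {
        runs++
        if runs < 3 {
            return errors.New("netlink receive error") // simulated transient failure
        }
        fmt.Println("consumer running cleanly on run", runs)
        return nil
    }, 5)
}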

What happened?

This is already described above.

How to reproduce it?

We have not found a way to reproduce it; it just happens. We do not know how to "force" the netlink errors that could cause the goroutine to end.

Additional information

We also tried restarting another peer to see how the new connection would be handled. Watching weave status connections, we saw it start with fastdp, but after the heartbeat failed (after exactly 1 minute) it fell back to sleeve.

It is using bridged_fastdp as the bridge type and the communication between peers is unencrypted.

Versions:

weave version
weave script 2.6.0
weave 2.6.0
docker version
Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.40
 Go version:        go1.12.12
 Git commit:        633a0ea
 Built:             Wed Nov 13 07:25:41 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.5
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.12
  Git commit:       633a0ea
  Built:            Wed Nov 13 07:24:18 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
uname -a
Linux - 3.10.0-1062.1.2.el7.x86_64 #1 SMP Mon Sep 30 14:19:46 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
murali-reddy commented 4 years ago

@oscar-martin thanks for reporting the issue, and for the great troubleshooting you did to figure out the root cause.

The aforementioned goroutine should either not terminate, or a new one should be spawned to resume reading from this netlink socket.

Agreed. The goroutine handling the misses should keep running or be recreated.

Looking at weave status connections for this VM, it shows sleeve mode with all the other peers.

While fastdp, which uses ODP, is not working, sleeve mode is used as a fallback, so pods should still have connectivity to pods on other nodes. On the VMs which lost connectivity, did weave status indicate sleeve mode with the rest of the nodes in the cluster?

oscar-martin commented 4 years ago

While fastdp, which uses ODP, is not working, sleeve mode is used as a fallback, so pods should still have connectivity to pods on other nodes. On the VMs which lost connectivity, did weave status indicate sleeve mode with the rest of the nodes in the cluster?

We initially thought that too. Then we realized that, although the overlay network did fall back to sleeve, the configuration connecting containers to the weave overlay kept using the datapath, because our bridge type was set to BridgedFastdp: https://github.com/weaveworks/weave/blob/v2.6.0/net/bridge.go#L38. So we assume containers within the problematic VM can talk among themselves because they do not need to reach the datapath "device", but when a container tries to talk to a container on another VM, the traffic must go through the datapath (even though sleeve mode is set up for the overlay network), and the datapath has no Flows set up for it to work.

And yes, weave status connections between the problematic VM and the rest of the peers showed sleeve mode.

As a separate note, we also saw that the ARP cache could not be filled with the MACs of external containers, while it worked fine for "local" containers within the affected VM.

wizard580 commented 4 years ago

We are experiencing the same issue in all our clusters, including production, but I wasn't able to debug it to this depth. Would love to see this bug fixed.

oscar-martin commented 4 years ago

Lately we have been suffering from this more often (two or three times per day), so we applied a workaround to the weave DaemonSet by adding a livenessProbe like this:

livenessProbe:
  exec:
    command:
      - sh
      - -c
      - |
         sleeveCount=$(/home/weave/weave --local status connections | grep 'sleeve' | wc -l)
         if [ "$sleeveCount" -gt "15" ]; then exit 1; fi
  periodSeconds: 20
  failureThreshold: 1
  timeoutSeconds: 5

We know that when all of a peer's connections change to sleeve, weave is suffering this issue locally and the only way to get it working again is to restart it. So we set up this probe to restart the container when it detects that the number of sleeve connections is greater than 15 (our cluster has 40 nodes).

We consider that reaching 15 means weave is misbehaving locally and we need to recover networking asap (that is why we do not wait until all connections are using sleeve mode). The threshold should not be too low either: bear in mind that once a weave instance switches its connections to its peers to sleeve mode, those peers will also switch their connections to it to sleeve, so you need to account for that when setting this value, to avoid restarting a weave instance that is working fine.
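
To illustrate with rough numbers (assuming a single misbehaving node in a 40-node cluster): each healthy peer only sees its one connection to the broken node switch to sleeve, while the broken node sees all ~39 of its connections switch. A threshold like 15 is therefore well above what a healthy node can observe, but comfortably below the count on the node that actually needs restarting.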

wizard580 commented 4 years ago

Is there any progress?

timboven commented 4 years ago

Since last week we have also been noticing this issue in our production environment. We haven't implemented the livenessProbe yet, but we did a manual restart of all the weave pods this morning and are now validating whether everything is back up and running. Could somebody share the latest on this?

bboreham commented 4 years ago

@timboven are you sure you have the exact same symptoms?
It would be better to open a new issue with your details so we don't get confused. Please include the cat /proc/net/netlink result, and ideally logs from the time it went wrong (the problem suggested at https://github.com/weaveworks/weave/issues/3773#issuecomment-587425607 would leave a message in the log).

bboreham commented 4 years ago

If anyone here has a message in the logs beginning "Error while listening on ODP datapath:", please could I see it.

bboreham commented 4 years ago

Finally, we saw a reason why this goroutine could end: https://github.com/weaveworks/weave/blob/v2.6.0/vendor/github.com/weaveworks/go-odp/odp/netlink.go#L693-L696

@oscar-martin that should lead to https://github.com/weaveworks/weave/blob/v2.6.0/router/fastdp.go#L1104 which will exit the program. So unless I'm missing something, this cannot be the problem.

Please can you attach your logs.

oscar-martin commented 4 years ago

@bboreham You're right; I overlooked that Error function. Sorry for adding confusion.

I will provide logs as soon as I cleanse them. Is it enough to provide the logs from one of the nodes?

bboreham commented 4 years ago

@oscar-martin sure, one node that showed the "netlink not being read" symptom should be enough. Also the complete goroutine dump would be of interest.

If it's easier you can email logs to support at weave dot works; reference this issue.

anshul-sehgal22 commented 3 years ago

Hi,

I think we are facing the same issue with our setup. k8s version: 1.17.4 (3 masters and 56 worker nodes). Weave version: 2.6.5.

$ cat /proc/net/netlink | grep "326287820|Drop"
sk       Eth Pid    Groups   Rmem     Wmem     Dump     Locks     Drops     Inode
ffff9a504eb99800 16  4291555982 00000000 213759   0        0 2        743025 326287820

$ top -p 2892
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2892 root      20   0 1915756  74848  15356 S   6.7  0.2 249:29.03 weaver

$ ls -l /proc/2892/fd | grep 326287820
lrwx------ 1 root root 64 Nov 25 14:59 8 -> socket:[326287820]

$ ss -f netlink | grep genl
0      0    genl:systemd/1
0      0    genl:kernel
0      0    genl:-3411315
0      0    genl:-3411313
0      0    genl:weaver/2892
213759 0    genl:-3411314
0      0    genl:-3411316
0      0    genl:-3411316
0      0    genl:-3411315                        *

The Drops count keeps increasing on the problematic node:

$ cat /proc/net/netlink | grep "326287820|Drop"
sk       Eth Pid    Groups   Rmem     Wmem     Dump     Locks     Drops     Inode
ffff9a504eb99800 16  4291555982 00000000 213759   0        0 2        743939 326287820

$ cat /proc/net/netlink | grep "326287820|Drop"
sk       Eth Pid    Groups   Rmem     Wmem     Dump     Locks     Drops     Inode
ffff9a504eb99800 16  4291555982 00000000 213759   0        0 2        744006 326287820

After upgrading from 2.6.0 to 2.6.5, this issue was not observed for around 2 months, but then it started appearing again.

We have already checked weave status, which is ready:

$ /usr/local/bin/weave status

    Version: 2.6.5 (failed to check latest version - see logs; next check at 2020/11/30 14:41:16)

    Service: router
   Protocol: weave 1..2
       Name: c6:35:98:71:f9:4b(elk-cluster-2)
 Encryption: disabled

PeerDiscovery: enabled
      Targets: 58
  Connections: 58 (56 established, 2 failed)
        Peers: 57 (with 3192 established connections)
TrustedSubnets: none

    Service: ipam
     Status: ready
      Range: 10.32.0.0/12

DefaultSubnet: 10.32.0.0/12

Note: the 2 failed connections are because two worker nodes are not reachable.

$ /usr/local/bin/weave status connections | grep -i failed
-> 10.151.16.75:6783   failed   dial tcp :0->10.151.16.75:6783: connect: connection refused, retry: 2020-11-30 08:29:13.054336435 +0000 UTC m=+428402.331273771
-> 10.151.16.126:6783  failed   dial tcp :0->10.151.16.126:6783: connect: connection refused, retry: 2020-11-30 08:28:16.737857259 +0000 UTC m=+428346.014794648

If I do "ping 10.32.0.1 -I weave" from the non-working node, it doesn't work (10.32.0.1 is the IP address of the weave interface on the master node; from a working node this works fine). Pinging any pod on the same node works fine, so connectivity between pods on the same node is OK.

$ ping 10.32.0.1 -I weave
PING 10.32.0.1 (10.32.0.1) from 10.45.0.0 weave: 56(84) bytes of data.
^C
--- 10.32.0.1 ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 5999ms

$ ping 10.45.0.1 -I weave
PING 10.45.0.1 (10.45.0.1) from 10.45.0.0 weave: 56(84) bytes of data.
64 bytes from 10.45.0.1: icmp_seq=1 ttl=64 time=0.237 ms
64 bytes from 10.45.0.1: icmp_seq=2 ttl=64 time=0.119 ms
64 bytes from 10.45.0.1: icmp_seq=3 ttl=64 time=0.056 ms
^C
--- 10.45.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.056/0.137/0.237/0.075 ms

We have deployed weave-scope in our cluster, which lets us identify the problematic nodes (highlighted as Unmanaged). On those nodes, the sysctl config also has IP forwarding enabled.

sysctl --system

After restarting the weave pod on that particular node, the problem goes away.

We also have a production setup with a smaller number of nodes (3 masters, 6 workers) where we haven't faced this issue. So I think this problem occurs more often when the cluster has a large number of nodes.

@oscar-martin: Very nice finding. If there is any update on this issue, please post it here.

aviav commented 2 years ago

We are observing this in rather small clusters with a dynamic number of nodes, so for our case, I adjusted @oscar-martin's liveness probe.

Key difference of the modified version: instead of a fixed sleeve-count threshold, it restarts the container only when more than one connection is in sleeve mode and the number of sleeve connections equals the number of established connections, i.e. when every established connection has fallen back to sleeve. That way no threshold needs to be tuned to the (changing) number of nodes.

The connectivity issue happens rarely for us, so I have yet to see a restart due to failure of this probe.

livenessProbe:
  exec:
    command:
      - sh
      - -c
      - |
         weaveConnections=$(/home/weave/weave --local status connections)
         sleeveCount=$(echo "$weaveConnections" | grep 'sleeve' | wc -l)
         establishedCount=$(echo "$weaveConnections" | grep 'established' | wc -l)
         if [ "$sleeveCount" -gt 1 ] && [ "$sleeveCount" -eq "$establishedCount" ]; then exit 1; fi
  failureThreshold: 1
  periodSeconds: 20
  successThreshold: 1
  timeoutSeconds: 5