oscar-martin opened this issue 4 years ago
@oscar-martin thanks for reporting the issue, and for the great troubleshooting that was done to figure out the root cause.
Somehow the aforementioned goroutine should not finalize, or a new one should be spawned to read from this netlink socket again.
Agreed. The goroutine handling the misses should be kept running or recreated.
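For illustration only, here is a minimal sketch (not weave's actual code) of the kind of supervision being suggested: a hypothetical consumeMisses reader is restarted whenever it returns with an error, so a failed netlink read does not permanently stop miss handling.

package main

import (
	"errors"
	"log"
	"time"
)

// consumeMisses stands in for the real loop that reads flow-miss upcalls
// from the ODP netlink socket; here it simply fails after a short while to
// simulate the error path discussed in this issue.
func consumeMisses() error {
	time.Sleep(500 * time.Millisecond)
	return errors.New("netlink receive failed")
}

// superviseMisses keeps a miss consumer running: whenever the reader
// returns with an error it logs the failure and starts a fresh one, so the
// netlink socket never ends up with nothing reading from it.
func superviseMisses() {
	for {
		err := consumeMisses()
		if err == nil {
			return // clean shutdown, stop supervising
		}
		log.Printf("miss consumer exited: %v; restarting", err)
		time.Sleep(time.Second) // crude backoff before respawning
	}
}

func main() {
	go superviseMisses()
	time.Sleep(3 * time.Second) // let the demo run a couple of restart cycles
}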
Looking at the weave status connections for this VM, it says it is using sleeve mode with all the rest of the peers.
While fastdp, which uses ODP, is not working, sleeve mode is used as a fallback. Pods should still have connectivity with pods on other nodes. On the nodes whose VMs lost connectivity, did weave status indicate sleeve mode with the rest of the nodes in the cluster?
We initially thought that. Then we realized that although it did fall back to sleeve for the overlay network, the configuration for connecting containers to the Weave overlay kept using the datapath, because our bridge configuration was set to BridgedFastdp: https://github.com/weaveworks/weave/blob/v2.6.0/net/bridge.go#L38. So we assumed that containers within the problematic VM can talk among themselves because they do not need to reach the datapath "device", but when a container tries to talk to a container on another VM, the traffic must go through the datapath (even though sleeve mode is set up for the overlay network), which has no Flows set up for it to work.
And yes, the weave status connections between the problematic VM and the rest were using sleeve mode.
As a separate note, we also saw that the arp cache could not be filled with the MACs of external containers, while it worked fine for "local" containers within the affected VM.
We experience the same issues in all our clusters, including production, but I wasn't that good at debugging the issue. Would love for someone to fix that bug.
We have lately been suffering from this more often (two or three times per day), so we applied a workaround to the weave DaemonSet by adding a livenessProbe like this:
livenessProbe:
  exec:
    command:
    - sh
    - -c
    - |
      sleeveCount=$(/home/weave/weave --local status connections | grep 'sleeve' | wc -l)
      if [ "$sleeveCount" -gt "15" ]; then exit 1; fi
  periodSeconds: 20
  failureThreshold: 1
  timeoutSeconds: 5
We know that when all of a peer's connections change to sleeve, weave is suffering this issue locally and the only way to get it working again is to restart it. So we set up this probe, which restarts the container when it detects that the number of sleeve connections is greater than 15 (our cluster has 40 nodes). We consider that reaching 15 means weave is misbehaving locally, so we need to recover networking asap (that's the reason we do not wait until all connections are using sleeve mode). That threshold should not be too low: bear in mind that once one weave instance starts switching its connections to sleeve mode, the rest of the weave peers will also switch to sleeve, so you need to account for that when setting this value to avoid restarting a weave instance that is working fine.
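Purely as an illustration of that sizing argument (not something weave provides), the threshold could be derived from the peer count rather than hard-coded; the 40% fraction below is an assumption, not a recommended value.

package main

import "fmt"

// sleeveThreshold picks how many sleeve connections should trip the probe.
// Roughly 40% of the peers tolerates a handful of peers that legitimately
// fell back to sleeve, while still firing well before every connection has
// degraded. The 40% figure is an assumption for this sketch.
func sleeveThreshold(peers int) int {
	t := peers * 2 / 5
	if t < 2 {
		t = 2 // never restart because of a single sleeve connection
	}
	return t
}

func main() {
	fmt.Println(sleeveThreshold(39)) // 39 peers in a 40-node cluster -> 15
}

With 39 peers this gives 15, matching the value used above; smaller clusters get a proportionally smaller trigger.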
Is there any progress?
Since last week we have also been noticing this issue in our production environment. We haven't implemented the livenessProbe yet, but did a manual restart of all the weave pods this morning and are now validating whether it is back up and running. Could somebody share the latest on this?
@timboven are you sure you have the exact same symptoms?
It would be better to open a new issue with your details so we don't get confused.
If you could add the cat /proc/net/netlink result, and ideally logs from the time it went wrong, that would help (the problem suggested at https://github.com/weaveworks/weave/issues/3773#issuecomment-587425607 would leave a message in the log).
If anyone here has a message in the logs beginning "Error while listening on ODP datapath:", please could I see it.
Finally, we saw a reason why this goroutine could end: https://github.com/weaveworks/weave/blob/v2.6.0/vendor/github.com/weaveworks/go-odp/odp/netlink.go#L693-L696
@oscar-martin that should lead to https://github.com/weaveworks/weave/blob/v2.6.0/router/fastdp.go#L1104 which will exit the program. So unless I'm missing something, this cannot be the problem.
Please can you attach your logs.
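For context, a heavily simplified sketch of the behaviour described above; the function name and signature are illustrative, not copied from the weave source, but the idea is the same: a consumer error is treated as fatal, so the process exits rather than silently losing the reader goroutine.

package main

import (
	"errors"
	"log"
)

// odpError is an illustrative stand-in for the handler referenced above:
// when the ODP miss consumer reports a failure, the router does not try to
// revive the goroutine, it logs and terminates the whole process.
func odpError(err error) {
	if err == nil {
		return
	}
	// log.Fatalf exits the program, which is why a silently vanished
	// reader goroutine should not be reachable through this path.
	log.Fatalf("Error while listening on ODP datapath: %v", err)
}

func main() {
	odpError(errors.New("netlink receive failed"))
}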
@bboreham You're right; I overlooked that Error function. Sorry for adding confusion.
I will provide logs as soon as I cleanse them. Is it enough to provide the logs from one of the nodes?
@oscar-martin sure, one node that showed the "netlink not being read" symptom should be enough. Also the complete goroutine dump would be of interest.
If it's easier you can email logs to support at weave dot works; reference this issue.
Hi,
I think we are facing the same issue with our setup. k8s version: 1.17.4 (3 masters and 56 worker nodes). Weave version: 2.6.5.
$ cat /proc/net/netlink | grep "326287820\|Drop"
sk               Eth Pid        Groups   Rmem   Wmem Dump Locks Drops  Inode
ffff9a504eb99800 16  4291555982 00000000 213759 0    0    2     743025 326287820

$ top -p 2892
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2892 root      20   0 1915756  74848  15356 S   6.7  0.2 249:29.03 weaver

$ ls -l /proc/2892/fd | grep 326287820
lrwx------ 1 root root 64 Nov 25 14:59 8 -> socket:[326287820]
$ ss -f netlink | grep genl
0      0 genl:systemd/1
0      0 genl:kernel
0      0 genl:-3411315
0      0 genl:-3411313
0      0 genl:weaver/2892
213759 0 genl:-3411314
0      0 genl:-3411316
0      0 genl:-3411316
0      0 genl:-3411315 *
The Drops count keeps increasing on the problematic node.
$ cat /proc/net/netlink | grep "326287820\|Drop"
sk               Eth Pid        Groups   Rmem   Wmem Dump Locks Drops  Inode
ffff9a504eb99800 16  4291555982 00000000 213759 0    0    2     743939 326287820
$ cat /proc/net/netlink | grep "326287820\|Drop"
sk               Eth Pid        Groups   Rmem   Wmem Dump Locks Drops  Inode
ffff9a504eb99800 16  4291555982 00000000 213759 0    0    2     744006 326287820
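For anyone who wants to repeat this check, here is a small standalone sketch (not part of weave) that parses /proc/net/netlink and prints sockets with a non-empty receive queue or a non-zero drop counter; column positions follow the header shown above.

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

// Scan /proc/net/netlink and report sockets whose Rmem (receive queue) or
// Drops column is non-zero, which is the symptom discussed in this issue.
func main() {
	f, err := os.Open("/proc/net/netlink")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	s.Scan() // skip the header: sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) < 10 {
			continue
		}
		rmem, drops, inode := fields[4], fields[8], fields[9]
		if rmem != "0" || drops != "0" {
			fmt.Printf("inode %s: Rmem=%s Drops=%s\n", inode, rmem, drops)
		}
	}
	if err := s.Err(); err != nil {
		log.Fatal(err)
	}
}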
After upgrading from 2.6.0 to 2.6.5, this issue was not observed for around 2 months, but then it started appearing again.
We have already checked weave status, which reports ready.
$ /usr/local/bin/weave status
Version: 2.6.5 (failed to check latest version - see logs; next check at 2020/11/30 14:41:16)
Service: router
Protocol: weave 1..2
Name: c6:35:98:71:f9:4b(elk-cluster-2)
Encryption: disabled
PeerDiscovery: enabled
Targets: 58
Connections: 58 (56 established, 2 failed)
Peers: 57 (with 3192 established connections)
TrustedSubnets: none
Service: ipam
Status: ready
Range: 10.32.0.0/12
DefaultSubnet: 10.32.0.0/12
Note: the 2 failed connections are because two worker nodes are not reachable.
$ /usr/local/bin/weave status connections | grep -i failed
-> 10.151.16.75:6783   failed   dial tcp :0->10.151.16.75:6783: connect: connection refused, retry: 2020-11-30 08:29:13.054336435 +0000 UTC m=+428402.331273771
-> 10.151.16.126:6783  failed   dial tcp :0->10.151.16.126:6783: connect: connection refused, retry: 2020-11-30 08:28:16.737857259 +0000 UTC m=+428346.014794648
If I do "ping 10.32.0.1 -I weave" from non working node, it doesn't work.(10.32.0.1 is IP address of weave interface on mater node. From working node, this works fine) And if we try to ping any pod's which is on same node, that works fine.(So connection between pods of same node works fine.) $ ping 10.32.0.1 -I weave PING 10.32.0.1 (10.32.0.1) from 10.45.0.0 weave: 56(84) bytes of data. ^C --- 10.32.0.1 ping statistics --- 7 packets transmitted, 0 received, 100% packet loss, time 5999ms
$ ping 10.45.0.1 -I weave
PING 10.45.0.1 (10.45.0.1) from 10.45.0.0 weave: 56(84) bytes of data.
64 bytes from 10.45.0.1: icmp_seq=1 ttl=64 time=0.237 ms
64 bytes from 10.45.0.1: icmp_seq=2 ttl=64 time=0.119 ms
64 bytes from 10.45.0.1: icmp_seq=3 ttl=64 time=0.056 ms
^C
--- 10.45.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.056/0.137/0.237/0.075 ms
We have deployed weave-scope in our cluster, which shows us the problematic nodes (highlighted as Unmanaged). On those nodes, the sysctl config also has IP forwarding enabled:
sysctl --system
After restarting the weave pod on that particular node, the problem goes away.
We also have a production setup with a smaller number of nodes (3 masters, 6 workers) where we haven't faced this issue. So I think this problem occurs more often when your cluster has a high number of nodes.
@oscar-martin: Very nice finding, sir. If there is any update on this issue, please publish it here.
We are observing this in rather small clusters with a dynamic number of nodes, so for our case, I adjusted @oscar-martin's liveness probe.
Key differences of the modified version: there is no hard-coded node-count threshold; instead, the probe fires when more than one connection is in sleeve mode and every established connection is sleeve, and it runs in the weave container. The connectivity issue happens rarely for us, so I have yet to see a restart due to failure of this probe.
livenessProbe:
  exec:
    command:
    - sh
    - -c
    - |
      weaveConnections=$(/home/weave/weave --local status connections)
      sleeveCount=$(echo "$weaveConnections" | grep 'sleeve' | wc -l)
      establishedCount=$(echo "$weaveConnections" | grep 'established' | wc -l)
      if [ "$sleeveCount" -gt 1 ] && [ "$sleeveCount" -eq "$establishedCount" ]; then exit 1; fi
  failureThreshold: 1
  periodSeconds: 20
  successThreshold: 1
  timeoutSeconds: 5
We have been using weave v2.6.0 as the CNI in our Kubernetes environment for some months, in a cluster with more than 30 VMs. From time to time (approximately once per week), containers in a VM stop being able to connect to containers in different VMs, while they can still connect to containers running within the same VM.
Seen problems are Destination Host Unreachable, Unknown Host Exception and related networking issues.
Looking at the weave status connections for this VM, it says it is using sleeve mode with all the rest of the peers.
The only way to make it work again was to restart the weave pod in that VM.
Troubleshooting
We spent a couple of days looking into it to find the reason, and this is what we found. A netlink "socket" connected to the weave process was dropping all packets:
Then, we looked for inode 70732 in the weave process:
So file descriptor 8 was being used in user space to read/write to it. Additionally, we listed the open netlink sockets:
And we found that the Rmem and the Recv-Q matched (214440). So everything pointed to the read queue being full and new packets being dropped. The question then was: what is being sent over that socket?
We forced a dump of the goroutine stack traces for weave and looked for goroutines reading from fd 8 (0x8). We found none. We did the same on another VM where weave was working fine and got this nice stack trace:
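For reference, a Go process can produce this kind of dump itself: sending SIGQUIT to a Go program prints all goroutine stacks before it exits, and the runtime/pprof goroutine profile gives the same information programmatically. A minimal standalone example:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// Write the stack trace of every goroutine to stdout; debug level 2 prints
// full stacks in the same format as an unrecovered panic, which is what we
// grepped for the file descriptor above.
func main() {
	if err := pprof.Lookup("goroutine").WriteTo(os.Stdout, 2); err != nil {
		log.Fatal(err)
	}
}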
So, odp.DatapathHandle.ConsumeMisses was not executing in our problematic VM. We dived into the code and understood that when the datapath is not able to find a Flow for a packet, it sends that (missed) packet to user space to be handled there (so a new Flow can be created for such packets). We also understood that the goroutine in charge of receiving those packets is created here: https://github.com/weaveworks/weave/blob/v2.6.0/vendor/github.com/weaveworks/go-odp/odp/packet.go#L74. So it seems that not having that goroutine reading missed packets leaves weave unable to create new Flows in the datapath, so packets are "lost" in the openvswitch module.
EDIT This is not true (as @bboreham pointed out here)
Finally, we saw a reason why this goroutine could end: https://github.com/weaveworks/weave/blob/v2.6.0/vendor/github.com/weaveworks/go-odp/odp/netlink.go#L693-L696
So in case of an error coming out of NetlinkSocket.Receive, the goroutine ends, so nothing reads from that socket any more, which leaves weave unable to process "upcalls" from openvswitch. /EDIT
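To make that failure mode concrete, here is a simplified sketch (not the vendored go-odp code itself) of a receive loop with the shape described above: any error from the blocking receive ends the loop, and nothing reads from the socket afterwards unless the caller reacts to the reported error.

package main

import (
	"errors"
	"fmt"
)

// makeReceiver returns a stand-in for a blocking netlink read; it fails on
// the third call to simulate the error path discussed above.
func makeReceiver() func() ([]byte, error) {
	n := 0
	return func() ([]byte, error) {
		n++
		if n > 2 {
			return nil, errors.New("recvfrom: no buffer space available")
		}
		return []byte("packet"), nil
	}
}

// consume mirrors the shape of the loop linked above: it keeps reading
// until an error occurs, reports it once, and then returns, which is why
// the socket is never read again unless something restarts the loop.
func consume(receive func() ([]byte, error), onError func(error)) {
	for {
		pkt, err := receive()
		if err != nil {
			onError(err)
			return
		}
		fmt.Printf("handled %d-byte packet\n", len(pkt))
	}
}

func main() {
	consume(makeReceiver(), func(err error) {
		fmt.Println("consumer stopped:", err)
	})
}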
What you expected to happen?
Somehow the aforementioned goroutine should not finalize, or a new one should be spawned to read from this netlink socket again.
What happened?
This is already described above.
How to reproduce it?
We have not found ways to reproduce it. It just happens. We do not know how to "force" netlink issues that can cause the goroutine to end.
Additional information
We also tried restarting another peer to see how the new connection is handled. Watching weave status connections, we saw it start with fastdp, but after the heartbeat failed (exactly 1 minute), it fell back to sleeve.
It is using bridged_fastdp as the bridge type, and the communication between peers is unencrypted.
Versions: