weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0
6.62k stars 670 forks source link

weave losing connections to other nodes with error: Multiple connections (Kubernetes CNI) #3619

Closed Tatsu-Kishi closed 4 years ago

Tatsu-Kishi commented 5 years ago

What you expected to happen?

Inter-node cluster-internal traffic to work

What happened?

At random times one nodes' pod network becomes unreachable/can't connect to other nodes' pod-network. Nodes internal traffic still works

Deleting the pod fixes the issue temporarily

Anything else we need to know?

Baremetal deployment with 3 nodes (1 master, 2 workers), metallb in L2 mode and WEAVE_MTU set to 1500 and NO_MASQ_LOCAL set to 1

Versions:

$ weave version
weave script 2.5.1
weave 2.5.1

$ docker version
 Version:           18.09.2
 API version:       1.39
 Go version:        go1.10.4
 Git commit:        6247962
 Built:             Tue Feb 26 23:52:23 2019
 OS/Arch:           linux/amd64
 Experimental:      false

$ uname -a
Linux k8sm1 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

$ kubectl version
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:00:57Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Logs:

Error occurred around 08:36 server time

$ kubectl logs -n kube-system <weave-net-pod> weave
INFO: 2019/03/20 08:34:47.205497 Sending ICMP 3,4 (10.32.0.71 -> 10.40.0.87): PMTU=1438
INFO: 2019/03/20 08:35:31.789047 Sending ICMP 3,4 (10.32.0.69 -> 10.40.0.87): PMTU=1438
INFO: 2019/03/20 08:36:21.533420 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:36:21.533536 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: no working forwarders to ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:36:21.533888 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:36:21.534236 ->[192.168.100.72:6783] attempting connection
INFO: 2019/03/20 08:36:21.534529 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:36:21.534565 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: no working forwarders to 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:36:21.534642 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:36:21.534686 Removed unreachable peer 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:36:21.534702 Removed unreachable peer ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:36:21.534815 ->[192.168.100.83:6783] attempting connection
INFO: 2019/03/20 08:36:21.535475 ->[192.168.100.72:34749] connection accepted
INFO: 2019/03/20 08:36:21.536439 ->[192.168.100.83:56185] connection accepted
INFO: 2019/03/20 08:36:21.536784 ->[192.168.100.72:34749|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:36:21.536816 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:36:21.536911 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:36:21.536958 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:36:21.536982 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:36:21.537033 ->[192.168.100.72:34749|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:36:21.537067 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:36:21.537121 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:36:21.537254 ->[192.168.100.83:56185|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:36:21.537418 ->[192.168.100.72:34749|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:36:21.537428 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:36:21.537466 ->[192.168.100.72:34749|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: Multiple connections to ae:3f:91:18:2e:cb(k8sm1) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:36:21.537573 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:36:21.538144 ->[192.168.100.83:56185|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: Multiple connections to 9e:34:c5:bd:b8:9e(k8sw2) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:36:21.538952 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using sleeve
INFO: 2019/03/20 08:36:21.538997 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection fully established
INFO: 2019/03/20 08:36:21.539047 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/20 08:36:21.539820 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using sleeve
INFO: 2019/03/20 08:36:21.539865 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection fully established
INFO: 2019/03/20 08:36:21.539889 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/20 08:36:21.540080 sleeve ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: Effective MTU verified at 1438
INFO: 2019/03/20 08:36:21.540633 sleeve ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: Effective MTU verified at 1438
INFO: 2019/03/20 08:37:21.538009 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:37:21.538103 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: no working forwarders to 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:37:21.538208 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:37:21.538918 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:37:21.538991 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: no working forwarders to ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:37:21.539300 ->[192.168.100.83:58245] connection accepted
INFO: 2019/03/20 08:37:21.539487 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:37:21.539927 Removed unreachable peer ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:37:21.540016 Removed unreachable peer 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:37:21.540245 ->[192.168.100.83:6783] attempting connection
INFO: 2019/03/20 08:37:21.540329 ->[192.168.100.72:6783] attempting connection
INFO: 2019/03/20 08:37:21.540428 ->[192.168.100.83:58245|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:37:21.540520 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:37:21.540564 ->[192.168.100.83:58245|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:37:21.541363 ->[192.168.100.72:34149] connection accepted
INFO: 2019/03/20 08:37:21.541773 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:37:21.541884 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:37:21.541919 ->[192.168.100.83:58245|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:37:21.542005 ->[192.168.100.83:58245|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: Multiple connections to 9e:34:c5:bd:b8:9e(k8sw2) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:37:21.542359 Removed unreachable peer ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:37:21.542428 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:37:21.544058 ->[192.168.100.83:60179] connection accepted
INFO: 2019/03/20 08:37:21.544313 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:37:21.544401 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:37:21.544435 ->[192.168.100.72:34149|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:37:21.544515 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:37:21.544654 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:37:21.544768 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:37:21.544806 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using sleeve
INFO: 2019/03/20 08:37:21.544841 ->[192.168.100.72:34149|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:37:21.544929 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection fully established
INFO: 2019/03/20 08:37:21.544771 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/20 08:37:21.545106 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: Multiple connections to ae:3f:91:18:2e:cb(k8sm1) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:37:21.545485 sleeve ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: Effective MTU verified at 1438
INFO: 2019/03/20 08:37:21.545499 ->[192.168.100.83:60179|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:37:21.545576 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:37:21.545624 ->[192.168.100.83:60179|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: Multiple connections to 9e:34:c5:bd:b8:9e(k8sw2) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:38:21.543908 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:38:21.544002 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: no working forwarders to 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:38:21.544094 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:38:21.544839 ->[192.168.100.83:6783] attempting connection
INFO: 2019/03/20 08:38:21.545145 ->[192.168.100.83:53093] connection accepted
INFO: 2019/03/20 08:38:21.545377 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using sleeve
INFO: 2019/03/20 08:38:21.545653 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:38:21.545717 ->[192.168.100.72:34149|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: no working forwarders to ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:38:21.545796 ->[192.168.100.72:34149|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:38:21.545840 Removed unreachable peer 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:38:21.545859 Removed unreachable peer ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:38:21.545971 ->[192.168.100.83:53093|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:38:21.546050 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:38:21.546107 ->[192.168.100.83:53093|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:38:21.546289 ->[192.168.100.72:6783] attempting connection
INFO: 2019/03/20 08:38:21.546325 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:38:21.546526 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:38:21.546586 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: Multiple connections to 9e:34:c5:bd:b8:9e(k8sw2) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:38:21.546786 ->[192.168.100.72:36693] connection accepted
INFO: 2019/03/20 08:38:21.547484 ->[192.168.100.72:36693|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:38:21.547587 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:38:21.547669 ->[192.168.100.72:36693|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:38:21.547876 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:38:21.547987 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:38:21.548020 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: Multiple connections to ae:3f:91:18:2e:cb(k8sm1) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:39:21.546646 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:39:21.546717 ->[192.168.100.83:53093|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: no working forwarders to 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:39:21.546826 ->[192.168.100.83:53093|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:39:21.547406 ->[192.168.100.83:6783] attempting connection
INFO: 2019/03/20 08:39:21.548006 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:39:21.548072 ->[192.168.100.83:40009] connection accepted
INFO: 2019/03/20 08:39:21.548155 ->[192.168.100.72:36693|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: no working forwarders to ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:39:21.548281 ->[192.168.100.72:36693|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:39:21.548336 Removed unreachable peer 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:39:21.548371 Removed unreachable peer ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:39:21.548525 ->[192.168.100.72:6783] attempting connection
INFO: 2019/03/20 08:39:21.549229 ->[192.168.100.83:40009|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:39:21.549295 ->[192.168.100.72:45581] connection accepted
INFO: 2019/03/20 08:39:21.549327 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:39:21.549346 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:39:21.549437 ->[192.168.100.83:40009|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:39:21.549462 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:39:21.549797 ->[192.168.100.83:40009|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:39:21.549858 ->[192.168.100.72:45581|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:39:21.549914 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:39:21.549925 ->[192.168.100.83:40009|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: write tcp4 192.168.100.79:6783->192.168.100.83:40009: write: connection reset by peer
INFO: 2019/03/20 08:39:21.549980 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:39:21.550006 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:39:21.550154 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:39:21.550257 ->[192.168.100.72:45581|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:39:21.550368 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: Multiple connections to ae:3f:91:18:2e:cb(k8sm1) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:39:21.552197 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using sleeve
INFO: 2019/03/20 08:39:21.552260 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection fully established
INFO: 2019/03/20 08:39:21.552399 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/20 08:39:21.552982 sleeve ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: Effective MTU verified at 1438

weave status in working state:

# weave status

        Version: 2.5.1 (up to date; next check at 2019/03/20 14:13:47)

        Service: router
       Protocol: weave 1..2
           Name: f2:e1:3e:28:f1:80(k8sw1)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 3
    Connections: 3 (2 established, 1 failed)
          Peers: 3 (with 6 established connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.32.0.0/12
  DefaultSubnet: 10.32.0.0/12

# weave status peers
ae:3f:91:18:2e:cb(k8sm1)
   -> 192.168.100.83:6783   9e:34:c5:bd:b8:9e(k8sw2)              established
   <- 192.168.100.79:50229  f2:e1:3e:28:f1:80(k8sw1)              established
9e:34:c5:bd:b8:9e(k8sw2)
   <- 192.168.100.72:38745  ae:3f:91:18:2e:cb(k8sm1)              established
   <- 192.168.100.79:39057  f2:e1:3e:28:f1:80(k8sw1)              established
f2:e1:3e:28:f1:80(k8sw1)
   -> 192.168.100.72:6783   ae:3f:91:18:2e:cb(k8sm1)              established
   -> 192.168.100.83:6783   9e:34:c5:bd:b8:9e(k8sw2)              established

weave status connections
-> 192.168.100.72:6783   established sleeve ae:3f:91:18:2e:cb(k8sm1) mtu=1438
-> 192.168.100.83:6783   established sleeve 9e:34:c5:bd:b8:9e(k8sw2) mtu=1438
-> 192.168.100.79:6783   failed      cannot connect to ourself, retry: never

I will try to get the weave outputs during failure state, but didn't have the weave script installed at the time/looked up weave troubleshooting and had to get the issue fixed asap.

murali-reddy commented 5 years ago

I see two issues. First there are heartbeat misses. Under normal circumstances this should not have happened.

INFO: 2019/03/20 08:36:21.533420 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:36:21.534529 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] sleeve timed out waiting for UDP heartbeat

Second, looks like fastdp forwarder is not getting used, and only functional forwarder sleeve having timeouts resulting in connection shutdown.

INFO: 2019/03/20 08:36:21.533536 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: no working forwarders to ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:36:21.534565 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: no working forwarders to 9e:34:c5:bd:b8:9e(k8sw2)

Could you please enable DEBUG logging and share the logs when you encounter this issue again?

Tatsu-Kishi commented 5 years ago

It happened again, so here are the weave outputs in failure state:

# weave status

        Version: 2.5.1 (up to date; next check at 2019/03/20 14:13:47)

        Service: router
       Protocol: weave 1..2
           Name: f2:e1:3e:28:f1:80(k8sw1)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 3
    Connections: 3 (1 established, 1 pending, 1 failed)
          Peers: 3 (with 3 established, 3 pending connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.32.0.0/12
  DefaultSubnet: 10.32.0.0/12
# weave status peers
f2:e1:3e:28:f1:80(k8sw1)
   <- 192.168.100.72:42967  ae:3f:91:18:2e:cb(k8sm1)              pending
   <- 192.168.100.83:50687  9e:34:c5:bd:b8:9e(k8sw2)              pending
ae:3f:91:18:2e:cb(k8sm1)
   -> 192.168.100.83:6783   9e:34:c5:bd:b8:9e(k8sw2)              established
   -> 192.168.100.79:6783   f2:e1:3e:28:f1:80(k8sw1)              pending
9e:34:c5:bd:b8:9e(k8sw2)
   <- 192.168.100.72:38745  ae:3f:91:18:2e:cb(k8sm1)              established
   -> 192.168.100.79:6783   f2:e1:3e:28:f1:80(k8sw1)              pending
# weave status connections
<- 192.168.100.72:56677  pending     fastdp ae:3f:91:18:2e:cb(k8sm1) mtu=1500
-> 192.168.100.83:6783   established sleeve 9e:34:c5:bd:b8:9e(k8sw2) mtu=1438
-> 192.168.100.79:6783   failed      cannot connect to ourself, retry: never

Will enable debug logging now too.

Probably noteworthy is that there is heavy UDP traffic going on already in the cluster. Is there maybe a parameter to do the heartbeat over TCP?

murali-reddy commented 5 years ago

Probably noteworthy is that there is heavy UDP traffic going on already in the cluster. Is there maybe a parameter to do the heartbeat over TCP?

Its over UDP only. Also weave considers a hearbeat miss only on consecutive(hard-coded to be 6) misses. DEBUG logs should help to see whats going on.

Tatsu-Kishi commented 5 years ago

weave.log Here is the log of it happening again (first 150 lines and a bit before it actually happened) And the weave status of it at the time:

# weave status

        Version: 2.5.1 (up to date; next check at 2019/03/21 14:48:42)

        Service: router
       Protocol: weave 1..2
           Name: f2:e1:3e:28:f1:80(k8sw1)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 3
    Connections: 3 (1 established, 1 pending, 1 failed)
          Peers: 3 (with 3 established, 3 pending connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.32.0.0/12
  DefaultSubnet: 10.32.0.0/12

# weave status peers
f2:e1:3e:28:f1:80(k8sw1)
   <- 192.168.100.83:57165  9e:34:c5:bd:b8:9e(k8sw2)              pending
   -> 192.168.100.72:6783   ae:3f:91:18:2e:cb(k8sm1)              established
ae:3f:91:18:2e:cb(k8sm1)
   -> 192.168.100.83:6783   9e:34:c5:bd:b8:9e(k8sw2)              established
   <- 192.168.100.79:33711  f2:e1:3e:28:f1:80(k8sw1)              pending
9e:34:c5:bd:b8:9e(k8sw2)
   <- 192.168.100.72:50645  ae:3f:91:18:2e:cb(k8sm1)              established
   -> 192.168.100.79:6783   f2:e1:3e:28:f1:80(k8sw1)              pending
mikebryant commented 5 years ago

I think we're hitting the same issue. Symptoms look identical.

We have nodes with these log chunks repeating over and over:

On weave-net-nq27f 3/3 Running 2 7d 10.186.0.157 kubernetes-kubernetes-cr2-2-1547767704:

INFO: 2019/03/26 10:43:19.850007 ->[10.186.0.187:32815] connection accepted
INFO: 2019/03/26 10:43:19.851079 ->[10.186.0.187:32815|f6:de:0b:60:9f:17(kubernetes-kubernetes-cr1-13-1547767704)]: connection ready; using protocol version 2
INFO: 2019/03/26 10:43:19.851282 overlay_switch ->[f6:de:0b:60:9f:17(kubernetes-kubernetes-cr1-13-1547767704)] using fastdp
INFO: 2019/03/26 10:43:19.851336 ->[10.186.0.187:32815|f6:de:0b:60:9f:17(kubernetes-kubernetes-cr1-13-1547767704)]: connection shutting down due to error: Multiple connections to f6:de:0b:60:9f:17(kubernetes-kubernetes-cr1-13-1547767704) added to 6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)

On weave-net-hh9rj 3/3 Running 27 7d 10.186.0.187 kubernetes-kubernetes-cr1-13-1547767704:

INFO: 2019/03/26 10:43:19.848875 ->[10.186.0.157:6783] attempting connection
INFO: 2019/03/26 10:43:19.850807 ->[10.186.0.157:6783|6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)]: connection ready; using protocol version 2
INFO: 2019/03/26 10:43:19.850973 overlay_switch ->[6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)] using fastdp
INFO: 2019/03/26 10:43:19.851002 ->[10.186.0.157:6783|6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)]: connection added
INFO: 2019/03/26 10:43:19.853848 Setting up IPsec between f6:de:0b:60:9f:17(kubernetes-kubernetes-cr1-13-1547767704) and 6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)
INFO: 2019/03/26 10:43:19.854323 ipsec: InitSALocal: 10.186.0.157 -> 10.186.0.187 :6784 0x73b96332
ERRO: 2019/03/26 10:43:19.876556 fastdp ->[10.186.0.157:6784|6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)]: ipsec init SA local failed: send InitSARemote: write tcp4 10.186.0.187:32815->10.186.0.157:6783: write: broken pipe
INFO: 2019/03/26 10:43:19.876658 ->[10.186.0.157:6783|6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)]: connection shutting down due to error: write tcp4 10.186.0.187:32815->10.186.0.157:6783: write: connection reset by peer
INFO: 2019/03/26 10:43:19.876719 ->[10.186.0.157:6783|6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)]: connection deleted
INFO: 2019/03/26 10:43:19.876735 Destroying IPsec between f6:de:0b:60:9f:17(kubernetes-kubernetes-cr1-13-1547767704) and 6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)
INFO: 2019/03/26 10:43:19.876802 ipsec: destroy: in 10.186.0.157 -> 10.186.0.187 0x73b96332
INFO: 2019/03/26 10:43:19.877741 overlay_switch ->[6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)] fastdp send InitSARemote: write tcp4 10.186.0.187:32815->10.186.0.157:6783: write: broken pipe
INFO: 2019/03/26 10:43:19.877776 overlay_switch ->[6e:f2:36:34:7d:df(kubernetes-kubernetes-cr2-2-1547767704)] using sleeve
ReillyTevera commented 5 years ago

We've experienced the same error a couple of times now, each time hitting a node that hosts a Traefik load balancer instance. We're seeing the same pattern of this block repeating:

INFO: 2019/05/06 03:50:54.920197 ->[10.153.42.237:6783|4e:d9:88:c1:94:67(ip-10-153-42-237.us-east-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/05/06 03:50:54.920259 overlay_switch ->[4e:d9:88:c1:94:67(ip-10-153-42-237.us-east-2.compute.internal)] using fastdp
INFO: 2019/05/06 03:50:54.920282 ->[10.153.42.237:6783|4e:d9:88:c1:94:67(ip-10-153-42-237.us-east-2.compute.internal)]: connection shutting down due to error: Multiple connections to 4e:d9:88:c1:94:67(ip-10-153-42-237.us-east-2.compute.internal) added to ca:c7:e5:c7:be:80(ip-10-153-17-191.us-east-2.compute.internal)

We're seeing the same log messages as @mikebryant . In each case rebooting/terminating the node resolves the issue but restarting Weave on the node without doing that does not.

The symptoms are that the Weave Net appears to be otherwise healthy except that all connections are reported as "sleeve" (all good nodes report "fastdp") and all packets that would transit the Weave network are dropped with "no route to host". iptables rules appear to be normal and consistent with good nodes.

I've attached reports generated by both the unhealthy node and a health node for comparison. Hopefully they will help shed some light on this:

weave-bad.log weave-good.log

I also enabled debug logging so hopefully I have some more concrete information to report if this happens again.

jntakpe commented 5 years ago

Hello,

Is there a workarround or a version that is not affected by this bug ?

murali-reddy commented 5 years ago

Common theme across the shared logs is connection reset by peer. I tried to reproduce the scenario by blocking ports 6783/784 and reopening the ports again with not much luck. I am able to reproduce connection shutting down due to error: Multiple connections but connections gets reestablished to fastdp. I will investigate further.

Is there a workarround or a version that is not affected by this bug ?

@jntakpe We dont the root cause yet to suggest a work around.

raoofm commented 5 years ago

we faced the same issue too. logs repeated over again and again

INFO: 2019/06/06 11:50:55.443926 ->[10.201.20.164:6783|96:d4:22:2a:65:00(ip-10-201-20-164.ec2.internal)]: connection deleted
INFO: 2019/06/06 11:50:55.446266 ->[10.201.20.164:6783] attempting connection
INFO: 2019/06/06 11:50:55.475459 ->[10.201.20.164:6783|96:d4:22:2a:65:00(ip-10-201-20-164.ec2.internal)]: connection ready; using protocol version 2
INFO: 2019/06/06 11:50:55.475560 overlay_switch ->[96:d4:22:2a:65:00(ip-10-201-20-164.ec2.internal)] using fastdp
INFO: 2019/06/06 11:50:55.475601 ->[10.201.20.164:6783|96:d4:22:2a:65:00(ip-10-201-20-164.ec2.internal)]: connection added
INFO: 2019/06/06 11:50:55.541047 overlay_switch ->[5a:68:a6:8e:36:f5(ip-10-201-20-113.ec2.internal)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/06/06 11:50:55.553389 ->[10.201.20.164:27939] connection accepted
INFO: 2019/06/06 11:50:55.640593 ->[10.201.20.164:27939|96:d4:22:2a:65:00(ip-10-201-20-164.ec2.internal)]: connection ready; using protocol version 2
INFO: 2019/06/06 11:50:55.640704 overlay_switch ->[96:d4:22:2a:65:00(ip-10-201-20-164.ec2.internal)] using fastdp
INFO: 2019/06/06 11:50:55.640750 ->[10.201.20.164:27939|96:d4:22:2a:65:00(ip-10-201-20-164.ec2.internal)]: connection shutting down due to error: Multiple connections to 96:d4:22:2a:65:00(ip-10-201-20-164.ec2.internal) added to 72:7c:ed:95:b2:24(ip-10-201-21-112.ec2.internal)
INFO: 2019/06/06 11:50:56.079073 ->[10.201.20.164:6783|96:d4:22:2a:65:00(ip-10-201-20-164.ec2.internal)]: connection fully established
INFO: 2019/06/06 11:50:56.552727 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/06/06 11:50:56.561926 sleeve ->[10.201.20.164:6783|96:d4:22:2a:65:00(ip-10-201-20-164.ec2.internal)]: Effective MTU verified at 8939
INFO: 2019/06/06 11:51:11.578086 Discovered remote MAC ba:38:97:c2:5a:48 at 46:9c:5b:6c:20:0f(ip-10-201-19-177.ec2.internal)
INFO: 2019/06/06 11:51:21.137827 ->[10.201.20.113:6783|5a:68:a6:8e:36:f5(ip-10-201-20-113.ec2.internal)]: connection shutting down due to error: Inconsistent entries for 100.109.64.0: owned by 3a:0f:18:e2:50:99 but incoming message says b6:fd:31:4c:3c:7c
INFO: 2019/06/06 11:51:21.137938 ->[10.201.20.113:6783|5a:68:a6:8e:36:f5(ip-10-201-20-113.ec2.internal)]: connection deleted
INFO: 2019/06/06 11:51:21.144054 ->[10.201.20.113:6783] attempting connection
INFO: 2019/06/06 11:51:21.154210 ->[10.201.20.113:6783|5a:68:a6:8e:36:f5(ip-10-201-20-113.ec2.internal)]: connection ready; using protocol version 2
INFO: 2019/06/06 11:51:21.154344 overlay_switch ->[5a:68:a6:8e:36:f5(ip-10-201-20-113.ec2.internal)] using fastdp
INFO: 2019/06/06 11:51:21.154382 ->[10.201.20.113:6783|5a:68:a6:8e:36:f5(ip-10-201-20-113.ec2.internal)]: connection added
INFO: 2019/06/06 11:51:22.132283 ->[10.201.20.113:52959] connection accepted
INFO: 2019/06/06 11:51:22.238391 ->[10.201.20.113:52959|5a:68:a6:8e:36:f5(ip-10-201-20-113.ec2.internal)]: connection ready; using protocol version 2
INFO: 2019/06/06 11:51:22.238521 overlay_switch ->[5a:68:a6:8e:36:f5(ip-10-201-20-113.ec2.internal)] using fastdp
INFO: 2019/06/06 11:51:22.238587 ->[10.201.20.113:52959|5a:68:a6:8e:36:f5(ip-10-201-20-113.ec2.internal)]: connection shutting down due to error: Multiple connections to 5a:68:a6:8e:36:f5(ip-10-201-20-113.ec2.internal) added to 72:7c:ed:95:b2:24(ip-10-201-21-112.ec2.internal)
INFO: 2019/06/06 11:51:22.817703 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/06/06 11:51:22.817869 overlay_switch ->[5a:68:a6:8e:36:f5(ip-10-201-20-113.ec2.internal)] using sleeve
INFO: 2019/06/06 11:51:22.817913 ->[10.201.20.113:6783|5a:68:a6:8e:36:f5(ip-10-201-20-113.ec2.internal)]: connection fully established
INFO: 2019/06/06 11:51:22.824401 sleeve ->[10.201.20.113:6783|5a:68:a6:8e:36:f5(ip-10-201-20-113.ec2.internal)]: Effective MTU verified at 8939
INFO: 2019/06/06 11:51:22.838396 overlay_switch ->[5a:68:a6:8e:36:f5(ip-10-201-20-113.ec2.internal)] using fastdp
INFO: 2019/06/06 11:51:25.244635 ->[10.201.19.246:6783|f6:25:32:b6:96:86(ip-10-201-19-246.ec2.internal)]: connection shutting down due to error: read tcp4 10.201.21.112:33567->10.201.19.246:6783: read: connection reset by peer
monicagangwar commented 5 years ago

any workaround to this ? I'm using weave 2.5.2 on kubernetes version 1.11.10 and seeing these errors continuously. In addition, I'm also seeing these errors

INFO: 2019/07/04 07:23:33.520512 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/07/04 06:52:56.549384 sleeve ->[10.11.3.170:6783|ee:10:f8:5f:7a:b0(ip-10-11-3-170.ap-southeast-1.compute.internal)]: Effective MTU verified at 8939

I'm using default MTU value = 8912. Does this mean my MTU configuration is incorrect ? Do I need to update it ?

murali-reddy commented 5 years ago

I'm using weave 2.5.2 on kubernetes version 1.11.10 and seeing these errors continuously.

Please share your logs to reason why you are seeing this error connection shutting down due to error: Multiple connections.

Note that original issue is case where there no working forwarders found, and there were heartbeat timeouts etc. Please see if you have similar symptoms else open a new issue with relevant logs.

Does this mean my MTU configuration is incorrect ? Do I need to update it ?

Regarding MTU, are you using sleeve mode? In case of sleeve mode MTU is dynamically negotiated. If you were using fastdp you should check why connection drooped to sleeve mode. Either way sharing your logs would help.

monicagangwar commented 5 years ago

@murali-reddy sure. let me share the log lines. Also i'm using fastdp mode only let me know if you need any other logs/info

INFO: 2019/07/04 07:23:33.520512 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/07/04 06:52:56.549384 sleeve ->[10.11.3.170:6783|ee:10:f8:5f:7a:b0(ip-10-11-3-170.ap-southeast-1.compute.internal)]: Effective MTU verified at 8939
INFO: 2019/07/04 07:30:13.986939 ->[10.11.3.24:6783|8e:4a:6b:b5:b5:e0(ip-10-11-3-24.ap-southeast-1.compute.internal)]: connection shutting down due to error: cannot connect to ourself
INFO: 2019/07/04 07:20:25.694554 ->[10.11.4.147:6783] error during connection attempt: dial tcp4 :0->10.11.4.147:6783: connect: connection refused
INFO: 2019/07/04 07:22:47.667506 ->[10.11.4.147:6783] error during connection attempt: dial tcp4 :0->10.11.4.147:6783: connect: connection timed out
INFO: 2019/07/04 07:23:32.962780 ->[10.11.4.124:6783|be:25:ec:67:b2:53(ip-10-11-4-124.ap-southeast-1.compute.internal)]: connection shutting down due to error: Multiple connections to be:25:ec:67:b2:53(ip-10-11-4-124.ap-southeast-1.compute.internal) added to 8e:4a:6b:b5:b5:e0(ip-10-11-3-24.ap-southeast-1.compute.internal)

weave status

Use 'kubectl describe pod/weave-net-s7fhk -n kube-system' to see all of the containers in this pod.
/home/weave # ./weave --local status

        Version: 2.5.2 (up to date; next check at 2019/07/04 13:56:41)

        Service: router
       Protocol: weave 1..2
           Name: 8e:4a:6b:b5:b5:e0(ip-10-11-3-24.ap-southeast-1.compute.internal)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 54
    Connections: 54 (53 established, 1 failed)
          Peers: 54 (with 2862 established connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 100.96.0.0/11
  DefaultSubnet: 100.96.0.0/11

weave connections status

/home/weave # ./weave --local status connections
<- 10.11.3.186:42093     established fastdp de:4e:e2:52:09:5e(ip-10-11-3-186.ap-southeast-1.compute.internal) mtu=8912
-> 10.11.3.199:6783      established fastdp 2e:07:5a:86:47:db(ip-10-11-3-199.ap-southeast-1.compute.internal) mtu=8912
<- 10.11.4.242:16539     established fastdp 4e:57:b6:df:bf:2e(ip-10-11-4-242.ap-southeast-1.compute.internal) mtu=8912
-> 10.11.3.193:6783      established fastdp 7a:00:13:3d:7f:37(ip-10-11-3-193.ap-southeast-1.compute.internal) mtu=8912
<- 10.11.4.76:55588      established fastdp f6:45:18:0f:2f:9b(ip-10-11-4-76.ap-southeast-1.compute.internal) mtu=8912
<- 10.11.3.177:37937     established fastdp be:78:1e:60:68:c6(ip-10-11-3-177.ap-southeast-1.compute.internal) mtu=8912
<- 10.11.4.116:20028     established fastdp 7e:d4:ca:29:3d:b6(ip-10-11-4-116.ap-southeast-1.compute.internal) mtu=8912
<- 10.11.4.96:47457      established fastdp 6e:40:11:fd:69:b7(ip-10-11-4-96.ap-southeast-1.compute.internal) mtu=8912
<- 10.11.3.137:63090     established fastdp 7a:34:9d:f6:4a:92(ip-10-11-3-137.ap-southeast-1.compute.internal) mtu=8912
<- 10.11.3.139:17730     established fastdp 06:3c:c0:54:34:bc(ip-10-11-3-139.ap-southeast-1.compute.internal) mtu=8912
<- 10.11.4.225:18892     established fastdp ca:00:4e:48:ed:7c(ip-10-11-4-225.ap-southeast-1.compute.internal) mtu=8912
<- 10.11.3.197:58850     established fastdp c6:45:7e:43:1e:06(ip-10-11-3-197.ap-southeast-1.compute.internal) mtu=8912
<- 10.11.4.103:43112     established fastdp ca:ff:3f:f2:d5:c2(ip-10-11-4-103.ap-southeast-1.compute.internal) mtu=8912
<- 10.11.3.142:18248     established fastdp 62:27:64:05:d2:8a(ip-10-11-3-142.ap-southeast-1.compute.internal) mtu=8912
mars64 commented 5 years ago

What OS are you guys using?

We've encountered tons of connection reset by peer issues, and I've finally traced it back to networkd usage on CoreOS (2079.5.1 as of this writing, but have been running various permutations since ~600). K8s 1.11.10, Weave 2.5.2 at this time.

See #3: https://www.weave.works/blog/running-a-weave-network-on-coreos/

Since I'm using Kops, adding a drop-in to install this file:

## /etc/systemd/network/10-weave.network
[Match]
Name=weave datapath vxlan-6784 dummy0

[Network]
Description=Network interfaces managed by weave

[Link]
Unmanaged=true

Seems to have helped.

[edit] Made the example drop-in more explicit

monicagangwar commented 5 years ago

@mars64 thanks for the input, but this resolution is for connection reset by peers error. What about the following errors ? Also, how do i determine correct MTU configuration for weave ? There are multiple hosts on my host - docker0, eth0, number of vethwep* with different MTU values.

INFO: 2019/07/04 07:30:13.986939 ->[10.11.3.24:6783|8e:4a:6b:b5:b5:e0(ip-10-11-3-24.ap-southeast-1.compute.internal)]: connection shutting down due to error: cannot connect to ourself
INFO: 2019/07/04 07:23:32.962780 ->[10.11.4.124:6783|be:25:ec:67:b2:53(ip-10-11-4-124.ap-southeast-1.compute.internal)]: connection shutting down due to error: Multiple connections to be:25:ec:67:b2:53(ip-10-11-4-124.ap-southeast-1.compute.internal) added to 8e:4a:6b:b5:b5:e0(ip-10-11-3-24.ap-southeast-1.compute.internal)
mars64 commented 5 years ago

I'm saying that I encountered many connection reset by peers errors that were tough to track down and my solution above seems to be helping.

Re: MTU, please see @murali-reddy's comment on your other issue.

ezlo-eugenek commented 5 years ago

Have same issue: Kubernetes: Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.2", weave 2.5.2 We use ubuntu 18.04, with WireGuard vpn between nodes, it works fine in case of 3 nodes, now we have 7 nodes and this issue happens almost every day.

Logs:

INFO: 2019/07/17 08:00:00.460067 Sending ICMP 3,4 (10.42.0.0 -> 10.47.0.20): PMTU=1358
INFO: 2019/07/17 08:00:30.040359 Sending ICMP 3,4 (10.36.0.18 -> 10.47.0.2): PMTU=1358
INFO: 2019/07/17 08:00:30.040614 Sending ICMP 3,4 (10.36.0.18 -> 10.47.0.2): PMTU=1358
INFO: 2019/07/17 08:00:30.040948 Sending ICMP 3,4 (10.36.0.18 -> 10.47.0.2): PMTU=1358
INFO: 2019/07/17 08:00:30.041070 Sending ICMP 3,4 (10.36.0.18 -> 10.47.0.2): PMTU=1358
INFO: 2019/07/17 08:00:31.643604 Discovered remote MAC ce:e1:74:78:e9:99 at aa:4f:55:0d:c9:e8(kube1)
INFO: 2019/07/17 07:59:50.232678 ->[10.0.1.4:42333|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 07:59:50.232777 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 07:59:50.232799 ->[10.0.1.4:42333|a6:a3:89:97:87:9b(kube4)]: connection added (new peer)
INFO: 2019/07/17 07:59:50.233947 ->[10.0.1.4:42333|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: read tcp4 10.0.1.3:6783->10.0.1.4:42333: read: connection reset by peer
INFO: 2019/07/17 07:59:50.234049 ->[10.0.1.4:42333|a6:a3:89:97:87:9b(kube4)]: connection deleted
INFO: 2019/07/17 07:59:50.234168 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 07:59:50.234457 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 07:59:50.234505 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection added (new peer)
INFO: 2019/07/17 08:00:00.401502 Discovered remote MAC 46:5d:b8:73:d2:f8 at ee:8c:04:f1:3d:5b(kube7)
INFO: 2019/07/17 08:00:31.649026 Discovered remote MAC ce:e1:74:78:e9:99 at aa:4f:55:0d:c9:e8(kube1)
INFO: 2019/07/17 07:59:53.678243 ->[10.0.1.4:6783] attempting connection
INFO: 2019/07/17 07:59:53.684373 ->[10.0.1.4:58107|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 07:59:53.684468 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 07:59:53.684498 ->[10.0.1.4:58107|a6:a3:89:97:87:9b(kube4)]: connection added (new peer)
INFO: 2019/07/17 07:59:53.707558 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 07:59:53.708448 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 07:59:53.708495 ->[10.0.1.4:58107|a6:a3:89:97:87:9b(kube4)]: connection deleted
INFO: 2019/07/17 07:59:53.708609 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection added (new peer)
INFO: 2019/07/17 07:59:53.709189 ->[10.0.1.4:58107|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: Multiple connections to a6:a3:89:97:87:9b(kube4) added to aa:4f:55:0d:c9:e8(kube1)
ERRO: 2019/07/17 08:00:00.400693 Captured frame from MAC (02:f0:18:50:dd:f7) to (46:5d:b8:73:d2:f8) associated with another peer 02:f0:18:50:dd:f7(kube5)
INFO: 2019/07/17 07:59:54.783518 ->[10.0.1.4:48885|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: Multiple connections to a6:a3:89:97:87:9b(kube4) added to 22:49:f6:fc:cc:cd(kube6)
ERRO: 2019/07/17 08:00:00.400535 Captured frame from MAC (02:f0:18:50:dd:f7) to (46:5d:b8:73:d2:f8) associated with another peer 02:f0:18:50:dd:f7(kube5)
INFO: 2019/07/17 08:00:05.039007 Sending ICMP 3,4 (10.42.0.0 -> 10.43.128.21): PMTU=1358
INFO: 2019/07/17 08:00:31.649188 Discovered remote MAC ce:e1:74:78:e9:99 at aa:4f:55:0d:c9:e8(kube1)
INFO: 2019/07/17 08:00:34.145086 Sending ICMP 3,4 (10.40.0.12 -> 10.43.128.2): PMTU=1358
INFO: 2019/07/17 08:00:34.145652 Sending ICMP 3,4 (10.40.0.12 -> 10.43.128.2): PMTU=1358
INFO: 2019/07/17 08:00:34.145809 Sending ICMP 3,4 (10.40.0.12 -> 10.43.128.2): PMTU=1358
INFO: 2019/07/17 08:00:34.145912 Sending ICMP 3,4 (10.40.0.12 -> 10.43.128.2): PMTU=1358
INFO: 2019/07/17 07:59:58.323908 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 07:59:58.324035 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 08:00:34.145993 Sending ICMP 3,4 (10.40.0.12 -> 10.43.128.2): PMTU=1358
INFO: 2019/07/17 08:00:34.146112 Sending ICMP 3,4 (10.40.0.12 -> 10.43.128.2): PMTU=1358
INFO: 2019/07/17 07:59:58.324066 ->[10.0.1.4:54443|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 07:59:58.324152 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 07:59:58.324180 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection added (new peer)
INFO: 2019/07/17 07:59:58.324860 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection deleted
INFO: 2019/07/17 07:59:58.325021 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: Multiple connections to a6:a3:89:97:87:9b(kube4) added to 02:f0:18:50:dd:f7(kube5)
INFO: 2019/07/17 07:59:58.325317 ->[10.0.1.4:54443|a6:a3:89:97:87:9b(kube4)]: connection added (new peer)
INFO: 2019/07/17 08:00:00.399926 Discovered remote MAC 46:5d:b8:73:d2:f8 at ee:8c:04:f1:3d:5b(kube7)
INFO: 2019/07/17 08:00:31.646942 Discovered remote MAC ce:e1:74:78:e9:99 at aa:4f:55:0d:c9:e8(kube1)
INFO: 2019/07/17 07:59:58.325698 overlay_switch ->[02:f0:18:50:dd:f7(kube5)] using fastdp
INFO: 2019/07/17 07:59:58.325746 overlay_switch ->[02:f0:18:50:dd:f7(kube5)] using fastdp
INFO: 2019/07/17 07:59:58.325763 ->[10.0.1.5:44479|02:f0:18:50:dd:f7(kube5)]: connection added (new peer)
INFO: 2019/07/17 07:59:58.325992 ->[10.0.1.5:44479|02:f0:18:50:dd:f7(kube5)]: connection deleted
INFO: 2019/07/17 07:59:58.326093 ->[10.0.1.5:44479|02:f0:18:50:dd:f7(kube5)]: connection shutting down due to error: Multiple connections to 02:f0:18:50:dd:f7(kube5) added to a6:a3:89:97:87:9b(kube4)
INFO: 2019/07/17 07:59:58.326866 ->[10.0.1.5:6783|02:f0:18:50:dd:f7(kube5)]: connection added (new peer)
INFO: 2019/07/17 07:59:58.331304 overlay_switch ->[02:f0:18:50:dd:f7(kube5)] using sleeve
INFO: 2019/07/17 07:59:58.331743 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/07/17 07:59:58.331905 ->[10.0.1.5:6783|02:f0:18:50:dd:f7(kube5)]: connection fully established
INFO: 2019/07/17 07:59:58.334000 sleeve ->[10.0.1.5:6783|02:f0:18:50:dd:f7(kube5)]: Effective MTU verified at 1358
INFO: 2019/07/17 07:59:50.216976 ->[10.0.1.4:6783] attempting connection
INFO: 2019/07/17 07:59:50.218344 ->[10.0.1.4:46457] connection accepted
INFO: 2019/07/17 07:59:50.219728 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 07:59:50.220004 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 07:59:50.220073 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection added (new peer)
INFO: 2019/07/17 07:59:50.220176 ->[10.0.1.4:46457|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 07:59:50.220294 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 07:59:50.220727 ->[10.0.1.4:46457|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: Multiple connections to a6:a3:89:97:87:9b(kube4) added to 86:0a:f7:72:aa:a2(kube2)
ERRO: 2019/07/17 08:00:00.400779 Captured frame from MAC (02:f0:18:50:dd:f7) to (46:5d:b8:73:d2:f8) associated with another peer 02:f0:18:50:dd:f7(kube5)
INFO: 2019/07/17 08:00:31.648688 Discovered remote MAC ce:e1:74:78:e9:99 at aa:4f:55:0d:c9:e8(kube1)
INFO: 2019/07/17 08:00:42.857008 Discovered remote MAC e6:3b:5b:d9:e6:a4 at 02:f0:18:50:dd:f7(kube5)
INFO: 2019/07/17 08:00:42.862008 Discovered remote MAC e6:3b:5b:d9:e6:a4 at 02:f0:18:50:dd:f7(kube5)
INFO: 2019/07/17 08:00:42.862272 Discovered remote MAC e6:3b:5b:d9:e6:a4 at 02:f0:18:50:dd:f7(kube5)
INFO: 2019/07/17 08:00:42.862243 Discovered remote MAC e6:3b:5b:d9:e6:a4 at 02:f0:18:50:dd:f7(kube5)
INFO: 2019/07/17 08:00:48.395014 Sending ICMP 3,4 (10.42.0.22 -> 10.47.0.3): PMTU=1358
INFO: 2019/07/17 08:00:48.395202 Sending ICMP 3,4 (10.42.0.22 -> 10.47.0.3): PMTU=1358
INFO: 2019/07/17 08:00:50.221401 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: write kg tcp4 10.0.1.2:51739->10.0.1.4:6783: write: connection reset by peer
INFO: 2019/07/17 08:00:50.220229 overlay_switch ->[86:0a:f7:72:aa:a2(kube2)] using sleeve
INFO: 2019/07/17 08:00:50.220413 overlay_switch ->[86:0a:f7:72:aa:a2(kube2)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/07/17 08:00:50.220505 ->[10.0.1.2:51739|86:0a:f7:72:aa:a2(kube2)]: connection shutting down due to error: no working forwarders to 86:0a:f7:72:aa:a2(kube2)
INFO: 2019/07/17 08:00:50.220673 ->[10.0.1.2:51739|86:0a:f7:72:aa:a2(kube2)]: connection deleted
INFO: 2019/07/17 08:00:50.221292 ->[10.0.1.2:6783] attempting connection
INFO: 2019/07/17 08:00:50.221449 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using sleeve
INFO: 2019/07/17 08:00:50.221847 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/07/17 08:00:50.222207 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection deleted
INFO: 2019/07/17 08:00:50.227493 ->[10.0.1.4:6783] attempting connection
INFO: 2019/07/17 08:00:50.235191 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using sleeve
INFO: 2019/07/17 08:00:50.235592 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/07/17 08:00:50.235999 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: no working forwarders to a6:a3:89:97:87:9b(kube4)
INFO: 2019/07/17 08:00:50.233175 ->[10.0.1.2:42041] connection accepted
INFO: 2019/07/17 08:00:50.235406 overlay_switch ->[6e:1f:12:00:ea:d2(kube3)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/07/17 08:00:50.235492 ->[10.0.1.3:36577|6e:1f:12:00:ea:d2(kube3)]: connection shutting down due to error: no working forwarders to 6e:1f:12:00:ea:d2(kube3)
INFO: 2019/07/17 08:00:50.235611 ->[10.0.1.3:36577|6e:1f:12:00:ea:d2(kube3)]: connection deleted
INFO: 2019/07/17 08:00:50.235855 ->[10.0.1.3:6783] attempting connection
INFO: 2019/07/17 08:00:50.239181 ->[10.0.1.3:35709] connection accepted
INFO: 2019/07/17 08:00:50.239438 ->[10.0.1.4:36293] connection accepted
INFO: 2019/07/17 08:00:50.236275 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection deleted
INFO: 2019/07/17 08:00:50.236856 ->[10.0.1.4:6783] attempting connection
INFO: 2019/07/17 08:00:50.238397 ->[10.0.1.4:60201] connection accepted
INFO: 2019/07/17 08:00:50.239553 ->[10.0.1.4:60201|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:50.239654 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 08:00:50.239697 ->[10.0.1.4:60201|a6:a3:89:97:87:9b(kube4)]: connection added (new peer)
INFO: 2019/07/17 08:00:50.243002 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:50.243155 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 08:00:50.243783 ->[10.0.1.4:60201|a6:a3:89:97:87:9b(kube4)]: connection deleted
INFO: 2019/07/17 08:00:50.243841 ->[10.0.1.4:60201|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: Multiple connections to a6:a3:89:97:87:9b(kube4) added to 6e:1f:12:00:ea:d2(kube3)
INFO: 2019/07/17 08:00:50.244937 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection added (new peer)
INFO: 2019/07/17 08:00:50.241887 ->[10.0.1.4:36293|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:50.241982 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 08:00:50.242011 ->[10.0.1.4:36293|a6:a3:89:97:87:9b(kube4)]: connection added (new peer)
INFO: 2019/07/17 08:00:50.239771 ->[10.0.1.3:6783|6e:1f:12:00:ea:d2(kube3)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:50.239928 overlay_switch ->[6e:1f:12:00:ea:d2(kube3)] using fastdp
INFO: 2019/07/17 08:00:50.239957 ->[10.0.1.3:6783|6e:1f:12:00:ea:d2(kube3)]: connection added (new peer)
INFO: 2019/07/17 08:00:50.241573 ->[10.0.1.3:35709|6e:1f:12:00:ea:d2(kube3)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:50.241667 overlay_switch ->[6e:1f:12:00:ea:d2(kube3)] using fastdp
INFO: 2019/07/17 08:00:50.241697 ->[10.0.1.3:6783|6e:1f:12:00:ea:d2(kube3)]: connection deleted
INFO: 2019/07/17 08:00:50.241829 ->[10.0.1.3:6783|6e:1f:12:00:ea:d2(kube3)]: connection shutting down due to error: Multiple connections to 6e:1f:12:00:ea:d2(kube3) added to a6:a3:89:97:87:9b(kube4)
INFO: 2019/07/17 08:00:50.241910 ->[10.0.1.3:35709|6e:1f:12:00:ea:d2(kube3)]: connection added (new peer)
INFO: 2019/07/17 08:00:50.243326 ->[10.0.1.2:42041|86:0a:f7:72:aa:a2(kube2)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:50.243436 ->[10.0.1.2:6783|86:0a:f7:72:aa:a2(kube2)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:50.243483 overlay_switch ->[86:0a:f7:72:aa:a2(kube2)] using fastdp
INFO: 2019/07/17 08:00:50.243511 ->[10.0.1.2:42041|86:0a:f7:72:aa:a2(kube2)]: connection added (new peer)
INFO: 2019/07/17 08:00:50.243529 overlay_switch ->[86:0a:f7:72:aa:a2(kube2)] using fastdp
INFO: 2019/07/17 08:00:50.243641 ->[10.0.1.2:6783|86:0a:f7:72:aa:a2(kube2)]: connection shutting down due to error: Multiple connections to 86:0a:f7:72:aa:a2(kube2) added to a6:a3:89:97:87:9b(kube4)
INFO: 2019/07/17 08:00:50.243571 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:50.243692 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 08:00:50.243733 ->[10.0.1.4:36293|a6:a3:89:97:87:9b(kube4)]: connection deleted
INFO: 2019/07/17 08:00:50.243819 ->[10.0.1.4:36293|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: Multiple connections to a6:a3:89:97:87:9b(kube4) added to 86:0a:f7:72:aa:a2(kube2)
INFO: 2019/07/17 08:00:50.243897 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection added (new peer)
INFO: 2019/07/17 08:00:53.709085 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using sleeve
INFO: 2019/07/17 08:00:53.710034 overlay_switch ->[aa:4f:55:0d:c9:e8(kube1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/07/17 08:00:53.710180 ->[10.0.1.1:59301|aa:4f:55:0d:c9:e8(kube1)]: connection shutting down due to error: no working forwarders to aa:4f:55:0d:c9:e8(kube1)
INFO: 2019/07/17 08:00:53.710324 ->[10.0.1.1:59301|aa:4f:55:0d:c9:e8(kube1)]: connection deleted
INFO: 2019/07/17 08:00:53.709944 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/07/17 08:00:53.710019 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: no working forwarders to a6:a3:89:97:87:9b(kube4)
INFO: 2019/07/17 08:00:53.710177 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection deleted
INFO: 2019/07/17 08:00:53.710537 ->[10.0.1.4:6783] attempting connection
INFO: 2019/07/17 08:00:53.712846 ->[10.0.1.4:50223] connection accepted
INFO: 2019/07/17 08:00:53.714692 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:53.710938 ->[10.0.1.1:6783] attempting connection
INFO: 2019/07/17 08:00:53.712821 ->[10.0.1.1:59375] connection accepted
INFO: 2019/07/17 08:00:53.714244 ->[10.0.1.1:6783|aa:4f:55:0d:c9:e8(kube1)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:53.715008 ->[10.0.1.4:50223|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:53.715126 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 08:00:53.715197 ->[10.0.1.4:50223|a6:a3:89:97:87:9b(kube4)]: connection added (new peer)
INFO: 2019/07/17 08:00:53.715572 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 08:00:53.715722 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: Multiple connections to a6:a3:89:97:87:9b(kube4) added to aa:4f:55:0d:c9:e8(kube1)
INFO: 2019/07/17 08:00:53.714418 overlay_switch ->[aa:4f:55:0d:c9:e8(kube1)] using fastdp
INFO: 2019/07/17 08:00:53.714524 ->[10.0.1.1:6783|aa:4f:55:0d:c9:e8(kube1)]: connection added (new peer)
INFO: 2019/07/17 08:00:53.714549 ->[10.0.1.1:59375|aa:4f:55:0d:c9:e8(kube1)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:53.714669 overlay_switch ->[aa:4f:55:0d:c9:e8(kube1)] using fastdp
INFO: 2019/07/17 08:00:53.714819 ->[10.0.1.1:59375|aa:4f:55:0d:c9:e8(kube1)]: connection shutting down due to error: Multiple connections to aa:4f:55:0d:c9:e8(kube1) added to a6:a3:89:97:87:9b(kube4)
INFO: 2019/07/17 08:00:53.717910 overlay_switch ->[aa:4f:55:0d:c9:e8(kube1)] using sleeve
INFO: 2019/07/17 08:00:53.717963 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/07/17 08:00:53.717984 ->[10.0.1.1:6783|aa:4f:55:0d:c9:e8(kube1)]: connection fully established
INFO: 2019/07/17 08:00:53.720227 sleeve ->[10.0.1.1:6783|aa:4f:55:0d:c9:e8(kube1)]: Effective MTU verified at 1358
INFO: 2019/07/17 08:00:54.782490 overlay_switch ->[22:49:f6:fc:cc:cd(kube6)] using sleeve
INFO: 2019/07/17 08:00:54.782578 overlay_switch ->[22:49:f6:fc:cc:cd(kube6)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/07/17 08:00:54.782607 ->[10.0.1.6:54783|22:49:f6:fc:cc:cd(kube6)]: connection shutting down due to error: no working forwarders to 22:49:f6:fc:cc:cd(kube6)
INFO: 2019/07/17 08:00:54.782728 ->[10.0.1.6:54783|22:49:f6:fc:cc:cd(kube6)]: connection deleted
INFO: 2019/07/17 08:00:54.783201 ->[10.0.1.6:6783] attempting connection
INFO: 2019/07/17 08:00:54.783921 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: read tcp4 10.0.1.6:54783->10.0.1.4:6783: read: connection reset by peer
INFO: 2019/07/17 08:00:54.784291 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using sleeve
INFO: 2019/07/17 08:00:54.791787 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/07/17 08:00:54.791867 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection deleted
INFO: 2019/07/17 08:00:54.802761 ->[10.0.1.4:6783] attempting connection
INFO: 2019/07/17 08:00:54.804503 ->[10.0.1.4:49943] connection accepted
INFO: 2019/07/17 08:00:54.806602 ->[10.0.1.4:49943|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:54.806766 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 08:00:54.806808 ->[10.0.1.4:49943|a6:a3:89:97:87:9b(kube4)]: connection added (new peer)
INFO: 2019/07/17 08:00:54.807706 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:54.807807 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 08:00:54.807843 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: Multiple connections to a6:a3:89:97:87:9b(kube4) added to 22:49:f6:fc:cc:cd(kube6)
INFO: 2019/07/17 08:00:54.805306 ->[10.0.1.6:40613] connection accepted
INFO: 2019/07/17 08:00:54.807246 ->[10.0.1.6:40613|22:49:f6:fc:cc:cd(kube6)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:54.807432 overlay_switch ->[22:49:f6:fc:cc:cd(kube6)] using fastdp
INFO: 2019/07/17 08:00:54.807464 ->[10.0.1.6:6783|22:49:f6:fc:cc:cd(kube6)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:54.807607 overlay_switch ->[22:49:f6:fc:cc:cd(kube6)] using fastdp
INFO: 2019/07/17 08:00:54.807622 ->[10.0.1.6:40613|22:49:f6:fc:cc:cd(kube6)]: connection added (new peer)
INFO: 2019/07/17 08:00:54.807887 ->[10.0.1.6:40613|22:49:f6:fc:cc:cd(kube6)]: connection deleted
INFO: 2019/07/17 08:00:54.807976 ->[10.0.1.6:40613|22:49:f6:fc:cc:cd(kube6)]: connection shutting down due to error: Multiple connections to 22:49:f6:fc:cc:cd(kube6) added to a6:a3:89:97:87:9b(kube4)
INFO: 2019/07/17 08:00:54.808353 ->[10.0.1.6:6783|22:49:f6:fc:cc:cd(kube6)]: connection added (new peer)
INFO: 2019/07/17 08:00:54.812050 overlay_switch ->[22:49:f6:fc:cc:cd(kube6)] using sleeve
INFO: 2019/07/17 08:00:54.812164 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/07/17 08:00:54.812167 ->[10.0.1.6:6783|22:49:f6:fc:cc:cd(kube6)]: connection fully established
INFO: 2019/07/17 08:00:54.814045 sleeve ->[10.0.1.6:6783|22:49:f6:fc:cc:cd(kube6)]: Effective MTU verified at 1358
INFO: 2019/07/17 08:00:57.638657 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using sleeve
INFO: 2019/07/17 08:00:57.643321 overlay_switch ->[ee:8c:04:f1:3d:5b(kube7)] using sleeve
INFO: 2019/07/17 08:00:57.643405 overlay_switch ->[ee:8c:04:f1:3d:5b(kube7)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/07/17 08:00:57.643446 ->[10.0.1.7:33687|ee:8c:04:f1:3d:5b(kube7)]: connection shutting down due to error: no working forwarders to ee:8c:04:f1:3d:5b(kube7)
INFO: 2019/07/17 08:00:57.643579 ->[10.0.1.7:33687|ee:8c:04:f1:3d:5b(kube7)]: connection deleted
INFO: 2019/07/17 08:00:57.639063 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/07/17 08:00:57.639107 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: no working forwarders to a6:a3:89:97:87:9b(kube4)
INFO: 2019/07/17 08:00:57.639170 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection deleted
INFO: 2019/07/17 08:00:57.639650 ->[10.0.1.4:6783] attempting connection
INFO: 2019/07/17 08:00:57.640546 ->[10.0.1.4:43675] connection accepted
INFO: 2019/07/17 08:00:57.641988 ->[10.0.1.4:43675|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:57.642145 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 08:00:57.642243 ->[10.0.1.4:43675|a6:a3:89:97:87:9b(kube4)]: connection added (new peer)
INFO: 2019/07/17 08:00:57.642908 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:57.642986 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 08:00:57.643014 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: Multiple connections to a6:a3:89:97:87:9b(kube4) added to ee:8c:04:f1:3d:5b(kube7)
INFO: 2019/07/17 08:00:57.644105 ->[10.0.1.7:6783] attempting connection
INFO: 2019/07/17 08:00:57.645832 ->[10.0.1.7:57153] connection accepted
INFO: 2019/07/17 08:00:57.647364 ->[10.0.1.7:6783|ee:8c:04:f1:3d:5b(kube7)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:57.647461 overlay_switch ->[ee:8c:04:f1:3d:5b(kube7)] using fastdp
INFO: 2019/07/17 08:00:57.647494 ->[10.0.1.7:6783|ee:8c:04:f1:3d:5b(kube7)]: connection added (new peer)
INFO: 2019/07/17 08:00:57.648990 overlay_switch ->[ee:8c:04:f1:3d:5b(kube7)] using sleeve
INFO: 2019/07/17 08:00:57.649055 ->[10.0.1.7:6783|ee:8c:04:f1:3d:5b(kube7)]: connection fully established
INFO: 2019/07/17 08:00:57.649082 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/07/17 08:00:57.649537 ->[10.0.1.7:57153|ee:8c:04:f1:3d:5b(kube7)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:57.649621 overlay_switch ->[ee:8c:04:f1:3d:5b(kube7)] using fastdp
INFO: 2019/07/17 08:00:57.649653 ->[10.0.1.7:57153|ee:8c:04:f1:3d:5b(kube7)]: connection shutting down due to error: Multiple connections to ee:8c:04:f1:3d:5b(kube7) added to a6:a3:89:97:87:9b(kube4)
INFO: 2019/07/17 08:00:57.652522 sleeve ->[10.0.1.7:6783|ee:8c:04:f1:3d:5b(kube7)]: Effective MTU verified at 1358
INFO: 2019/07/17 08:00:58.326640 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using sleeve
INFO: 2019/07/17 08:00:58.328993 overlay_switch ->[02:f0:18:50:dd:f7(kube5)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/07/17 08:00:58.329058 ->[10.0.1.5:6783|02:f0:18:50:dd:f7(kube5)]: connection shutting down due to error: no working forwarders to 02:f0:18:50:dd:f7(kube5)
INFO: 2019/07/17 08:00:58.329164 ->[10.0.1.5:6783|02:f0:18:50:dd:f7(kube5)]: connection deleted
INFO: 2019/07/17 08:00:58.329509 ->[10.0.1.5:6783] attempting connection
INFO: 2019/07/17 08:00:58.327961 ->[10.0.1.4:54443|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: read tcp4 10.0.1.5:6783->10.0.1.4:54443: read: connection reset by peer
INFO: 2019/07/17 08:00:58.328079 ->[10.0.1.4:54443|a6:a3:89:97:87:9b(kube4)]: connection deleted
INFO: 2019/07/17 08:00:58.328594 ->[10.0.1.4:6783] attempting connection
INFO: 2019/07/17 08:00:58.330977 ->[10.0.1.4:58473] connection accepted
INFO: 2019/07/17 08:00:58.336319 ->[10.0.1.5:55475] connection accepted
INFO: 2019/07/17 08:00:58.334913 ->[10.0.1.4:58473|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:58.335043 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 08:00:58.335073 ->[10.0.1.4:58473|a6:a3:89:97:87:9b(kube4)]: connection added (new peer)
INFO: 2019/07/17 08:00:58.336778 ->[10.0.1.5:6783|02:f0:18:50:dd:f7(kube5)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:58.336868 overlay_switch ->[02:f0:18:50:dd:f7(kube5)] using fastdp
INFO: 2019/07/17 08:00:58.336895 ->[10.0.1.5:6783|02:f0:18:50:dd:f7(kube5)]: connection added (new peer)
INFO: 2019/07/17 08:00:58.337942 ->[10.0.1.5:55475|02:f0:18:50:dd:f7(kube5)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:58.338056 overlay_switch ->[02:f0:18:50:dd:f7(kube5)] using fastdp
INFO: 2019/07/17 08:00:58.336858 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection ready; using protocol version 2
INFO: 2019/07/17 08:00:58.337021 overlay_switch ->[a6:a3:89:97:87:9b(kube4)] using fastdp
INFO: 2019/07/17 08:00:58.337104 ->[10.0.1.4:6783|a6:a3:89:97:87:9b(kube4)]: connection shutting down due to error: Multiple connections to a6:a3:89:97:87:9b(kube4) added to 02:f0:18:50:dd:f7(kube5)
INFO: 2019/07/17 08:00:58.338099 ->[10.0.1.5:55475|02:f0:18:50:dd:f7(kube5)]: connection shutting down due to error: Multiple connections to 02:f0:18:50:dd:f7(kube5) added to a6:a3:89:97:87:9b(kube4)
INFO: 2019/07/17 08:00:58.340614 overlay_switch ->[02:f0:18:50:dd:f7(kube5)] using sleeve
INFO: 2019/07/17 08:00:58.341034 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/07/17 08:00:58.341043 ->[10.0.1.5:6783|02:f0:18:50:dd:f7(kube5)]: connection fully established
INFO: 2019/07/17 08:00:58.342250 sleeve ->[10.0.1.5:6783|02:f0:18:50:dd:f7(kube5)]: Effective MTU verified at 1358
INFO: 2019/07/17 08:01:34.828438 Sending ICMP 3,4 (10.42.0.28 -> 10.47.0.13): PMTU=1358
INFO: 2019/07/17 08:01:34.828624 Sending ICMP 3,4 (10.42.0.28 -> 10.47.0.13): PMTU=1358
ezlo-eugenek commented 5 years ago

After last issue reported, I've enabled DEBUG logging, and it happened again. Pls check attached log. weawe_error.log

And as usual, issue fixed after restart of all weave pods.

storax commented 4 years ago

Same issue on K8s 1.16 and CoreOS stable (2303.3.0).

It correlated exactly with memory pressure caused by a pod running on the affected nodes. Weave started logging what others have posted here, when the node hit around 150MB-0MB free memory reported by the Prometheus node-exporter. So far, I could correlate 3 incidents with memory pressure.

Is anyone else seeing this correlation? EDIT: Seems to be related to #3695 in my case.

dstrimble commented 4 years ago

This is the same as my issue I think

https://github.com/weaveworks/weave/issues/3757

Any help is appreciated.

bboreham commented 4 years ago

The original trail is long-dead and it's too confusing to respond to multiple threads of conversation in a GitHub issue; please open a new issue rather than commenting on this. The new issue template will request info that is essential to debug.

Note that "multiple connections" is not really a problem; just a transient condition that gets reported in passing. People commenting here were having other issues.

TechnicalVegeta commented 2 years ago

I faced the same issue, If you see the message: IP allocation was seeded by different peers, this means that some Weave Net peers were initialized into one cluster and some into another cluster; Weave Net cannot operate in this state. To recover, you need to eliminate the IPAM data from the affected nodes and restart. If you installed via the Kubernetes Addon, this data will be in a file under /var/lib/weave on the node , this solution worked for me