weaveworks / weave

Simple, resilient multi-host container networking and more.
https://www.weave.works
Apache License 2.0

Weave network unreachable upon restarting a VM #3993

[Open] vmalempa opened this issue 1 year ago

vmalempa commented 1 year ago

What you expected to happen?

Restarting a VM in a weave network should not affect the connectivity of any other running VMs.

What happened?

We have a Docker environment where multiple VMs / containers are connected over weave. While traffic is running, restarting one of the VMs affects the VMs that are handling high traffic: they lose connectivity, all weave connections drop to sleeve mode, and we observe ping timeouts over the weave network.

Restarting weave does not help; we have to restart the affected VM to recover from this state.
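As a rough first diagnostic on an affected host we run the following (this is just our own sketch, not an official procedure; the peer IP is drc01v's address taken from the status output below, and the port roles reflect our understanding of Weave's defaults, where 6783 carries control/sleeve traffic and UDP 6784 carries fast-datapath traffic):

# Count connections that have fallen back to sleeve mode on this peer
weave status connections | grep -c sleeve

# Check that Weave's control and data ports are reachable towards a peer
nc -vz  172.26.240.13 6783    # TCP control channel
nc -vzu 172.26.240.13 6784    # UDP data port used by fast datapath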

Weave Version: 1.9.4 Ubuntu: 18.04

cps@drd03v:~$ weave status connections
-> 172.26.240.14:6783  established sleeve 7e:1f:42:18:d4:02(drc02v) mtu=1438
-> 172.26.240.24:6783  established sleeve de:85:b9:76:17:47(drd06v) mtu=1438
-> 172.26.240.36:6783  established sleeve fe:11:42:fe:36:ef(drw10v) mtu=1438
<- 172.26.240.35:37781 established sleeve 2e:89:5b:59:f0:98(drw09v) mtu=1438
-> 172.26.240.17:6783  established sleeve 2a:e2:81:43:1c:f4(drl03v) mtu=1438
<- 172.26.240.13:57201 established sleeve ea:03:03:50:3c:d6(drc01v) mtu=1438
<- 172.26.240.34:38899 established sleeve a6:40:f3:54:f5:ae(drw08v) mtu=1438
<- 172.26.240.26:53301 established sleeve 22:61:77:5f:75:05(drd08v) mtu=1438
<- 172.26.240.23:36895 established sleeve 12:85:e4:ff:14:19(drd05v) mtu=1438
<- 172.26.240.22:46821 established sleeve 7e:ab:fd:30:70:e4(drd04v) mtu=1438
<- 172.26.240.19:56903 established sleeve 8a:35:92:7f:2e:96(drd01v) mtu=1438
<- 172.26.240.18:41897 established sleeve 2a:0c:5b:4a:04:c9(drl04v) mtu=1438
-> 172.26.240.28:6783  established sleeve c2:46:c0:b1:63:ce(drw02v) mtu=1438
-> 172.26.240.12:6783  established sleeve ce:d3:0f:78:61:ef(drm01v) mtu=1438
<- 172.26.240.31:43823 established sleeve ae:bc:98:52:ac:07(drw05v) mtu=1438
-> 172.26.240.30:6783  established sleeve 6e:06:4c:51:58:f7(drw04v) mtu=1438
-> 172.26.240.25:6783  established sleeve 2e:f0:03:1d:c3:c2(drd07v) mtu=1438
-> 172.26.240.29:6783  established sleeve 1e:7a:a7:d4:48:b7(drw03v) mtu=1438
-> 172.26.240.27:6783  established sleeve 5e:79:d5:c1:cb:fa(drw01v) mtu=1438
<- 172.26.240.16:56293 established sleeve ce:75:86:80:06:5b(drl02v) mtu=1438
-> 172.26.240.20:6783  established sleeve 8a:da:c7:bd:49:08(drd02v) mtu=1438
<- 172.26.240.15:57335 established sleeve da:df:34:ec:2e:f5(drl01v) mtu=1438
-> 172.26.240.32:6783  established sleeve aa:5b:63:18:1f:c7(drw06v) mtu=1438
<- 172.26.240.33:38283 established sleeve ce:44:b2:76:95:0f(drw07v) mtu=1438

cps@drd03v:~$ weave status ipam
06:70:6b:a2:16:8e(drd03v)   32768 IPs (03.1% of total) (15 active)
ce:75:86:80:06:5b(drl02v)   32768 IPs (03.1% of total)
ea:03:03:50:3c:d6(drc01v)   32768 IPs (03.1% of total)
ce:d3:0f:78:61:ef(drm01v)  344064 IPs (32.8% of total)
7e:1f:42:18:d4:02(drc02v)   49152 IPs (04.7% of total)
2a:0c:5b:4a:04:c9(drl04v)   32768 IPs (03.1% of total)
c2:46:c0:b1:63:ce(drw02v)   32768 IPs (03.1% of total)
ce:44:b2:76:95:0f(drw07v)   16384 IPs (01.6% of total)
12:85:e4:ff:14:19(drd05v)   16384 IPs (01.6% of total)
5e:79:d5:c1:cb:fa(drw01v)   32768 IPs (03.1% of total)
ae:bc:98:52:ac:07(drw05v)   49152 IPs (04.7% of total)
2e:89:5b:59:f0:98(drw09v)   16384 IPs (01.6% of total)
da:df:34:ec:2e:f5(drl01v)   49152 IPs (04.7% of total)
7e:ab:fd:30:70:e4(drd04v)   16384 IPs (01.6% of total)
8a:35:92:7f:2e:96(drd01v)   49152 IPs (04.7% of total)
fe:11:42:fe:36:ef(drw10v)   16384 IPs (01.6% of total)
6e:06:4c:51:58:f7(drw04v)   32768 IPs (03.1% of total)
2e:f0:03:1d:c3:c2(drd07v)   16384 IPs (01.6% of total)
22:61:77:5f:75:05(drd08v)   16384 IPs (01.6% of total)
8a:da:c7:bd:49:08(drd02v)   16384 IPs (01.6% of total)
1e:7a:a7:d4:48:b7(drw03v)   49152 IPs (04.7% of total)
de:85:b9:76:17:47(drd06v)   16384 IPs (01.6% of total)
aa:5b:63:18:1f:c7(drw06v)   49152 IPs (04.7% of total)
2a:e2:81:43:1c:f4(drl03v)   16384 IPs (01.6% of total)
a6:40:f3:54:f5:ae(drw08v)   16384 IPs (01.6% of total)
cps@drd03v:~$ logout
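A quick way to summarise the connection modes in the output above (assuming the column layout shown, where the mode is the fourth whitespace-separated field):

weave status connections | awk '{print $4}' | sort | uniq -c

On this host it currently reports all 24 connections in sleeve mode and none in fastdp.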

How to reproduce it?

Restart a VM while traffic is running over the weave network.
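In more detail, the sequence we follow looks roughly like this (the container name and target IP are placeholders, not taken from our setup):

# On VM A: keep continuous traffic running between two containers on the weave network
docker exec app-container ping <weave-IP-of-a-container-on-another-VM>

# Reboot a different VM (VM B) while the traffic is running
sudo reboot

# On a surviving VM: watch the connection modes and the ping output;
# this is where we see the timeouts and the fall-back to sleeve
watch -n 5 'weave status connections'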

Anything else we need to know?

Versions:

$ weave version
weave script 1.9.4
weave router 1.9.4
weave proxy  1.9.4
weave plugin 1.9.4

$ docker version
Client: Docker Engine - Community
 Version:           20.10.8
 API version:       1.41
 Go version:        go1.16.6
 Git commit:        3967b7d
 Built:             Fri Jul 30 19:54:08 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.8
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.6
  Git commit:       75249d8
  Built:            Fri Jul 30 19:52:16 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.9
  GitCommit:        e25210fe30a0a703442421b0f60afac609f950a3
 runc:
  Version:          1.0.1
  GitCommit:        v1.0.1-0-g4144b63
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

$ uname -a
Linux vpas-B1-master-0 4.15.0-167-generic #175-Ubuntu SMP Wed Jan 5 01:56:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Logs:

$ docker logs weave

The messages below are flooding the logs of containers on all the other VMs. This log was taken from one of the affected VMs [DRD02] after a different VM [DRC01] was restarted.

Jun 09 06:04:09 drd02v 055857a89e35[4285]: ERRO: 2023/06/09 06:04:09.770378 Captured frame from MAC (a6:10:c7:fd:b0:6e) to (32:01:98:ea:ba:16) associated with another peer 06:70:6b:a2:16:8e(hstntx1drd03v)
Jun 09 06:04:09 drd02v 055857a89e35[4285]: ERRO: 2023/06/09 06:04:09.770557 Captured frame from MAC (92:18:b5:cf:cf:41) to (f2:70:80:47:4a:1e) associated with another peer a6:40:f3:54:f5:ae(hstntx1drw08v)
Jun 09 06:04:09 drd02v 7cf25e8b3155[4285]: Checking DRA configuration files in repository...
Jun 09 06:04:09 drd02v 94344de52cd3[4285]: [WARNING] 159/060409 (12687) : Server hi_res_prometheus_servers/prometheus5 is DOWN, reason: Layer4 connection problem, info: "No route to host", check duration: 3059ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 09 06:04:09 drd02v 94344de52cd3[4285]: [WARNING] 159/060409 (12687) : Server prometheus_servers/prometheus2 is DOWN, reason: Layer4 connection problem, info: "No route to host", check duration: 3059ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
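To get a rough idea of how badly the router log is flooded on an affected VM, we count the repeated error (the container name "weave" is the router container started by the weave script, as in the docker logs command above; adjust if yours is named differently):

docker logs weave 2>&1 | grep -c 'associated with another peer'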




vmalempa commented 1 year ago

Attached the journalctl log (journal.log).