weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

ports expose does listen but does not have access #3615

Open mfinkelstine opened 5 years ago

mfinkelstine commented 5 years ago

Issue redirecting exposed port to worker nodes

First k8s cluster: I have an 11-node environment with 3 master nodes (a main plus 2 replicas). With several services and pods running, my main master node stopped exposing ports, while the other masters (replica1, replica2) still expose the ports for my pods.

Second k8s cluster: I have the same issue with a second cluster that has 1 master and 3 worker nodes.

The logs I have added are from my second cluster.

Here is an example with my k8s dashboard, which works on my 2 replicas but not on the master node where the port is exposed.

On my k8s-master-main, this is the output of lsof.

We can see that the host is listening on the port but does not forward the traffic:

root@k8s-master-main:~# lsof -i:30465
COMMAND     PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
kube-prox 32627 root   25u  IPv6 7465821      0t0  TCP *:30465 (LISTEN)

Here are the iptables rules I have for the dashboard; they are identical on all master nodes:

root@k8s-master-main:~# iptables -t nat -nL | grep dashboard
KUBE-MARK-MASQ  tcp  --  0.0.0.0/0            0.0.0.0/0            /* kube-system/kubernetes-dashboard: */ tcp dpt:30465
KUBE-SVC-XGLOHA7QRQ3V22RZ  tcp  --  0.0.0.0/0            0.0.0.0/0            /* kube-system/kubernetes-dashboard: */ tcp dpt:30465
KUBE-MARK-MASQ  all  --  10.32.0.3            0.0.0.0/0            /* kube-system/kubernetes-dashboard: */
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            /* kube-system/kubernetes-dashboard: */ tcp to:10.32.0.3:8443
KUBE-MARK-MASQ  tcp  -- !10.32.0.0/12         10.110.60.225        /* kube-system/kubernetes-dashboard: cluster IP */ tcp dpt:443
KUBE-SVC-XGLOHA7QRQ3V22RZ  tcp  --  0.0.0.0/0            10.110.60.225        /* kube-system/kubernetes-dashboard: cluster IP */ tcp dpt:443
KUBE-SEP-LPUGT7E25KUQ5PUI  all  --  0.0.0.0/0            0.0.0.0/0            /* kube-system/kubernetes-dashboard: */
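These chains implement the standard kube-proxy NodePort path: a packet to :30465 is marked for masquerade, dispatched to the service chain KUBE-SVC-XGLOHA7QRQ3V22RZ, then to the endpoint chain KUBE-SEP-LPUGT7E25KUQ5PUI, whose DNAT rewrites the destination to the pod at 10.32.0.3:8443. A minimal sketch for pulling that final endpoint out of such output (run here against the two rules quoted above as a fixture; on a live node you would pipe in `iptables -t nat -nL | grep dashboard` instead):

```shell
# Fixture: the relevant rules from the node; the awk program extracts the
# "to:<ip>:<port>" target of the DNAT rule.
target=$(awk '$1 == "DNAT" { t = $NF; sub(/^to:/, "", t); print t }' <<'EOF'
KUBE-MARK-MASQ  tcp  --  0.0.0.0/0            0.0.0.0/0            /* kube-system/kubernetes-dashboard: */ tcp dpt:30465
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            /* kube-system/kubernetes-dashboard: */ tcp to:10.32.0.3:8443
EOF
)
echo "NodePort 30465 is DNATed to pod endpoint $target"
# prints: NodePort 30465 is DNATed to pod endpoint 10.32.0.3:8443
```

Since this DNAT target is a pod IP on the weave overlay, a correct rule set on the master is not enough: the rewritten packet still has to cross the overlay to reach the pod.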

The dashboard listens on port 30465; here is the output of my tcpdump:

root@k8s-master-main:~# tcpdump -ni any port 30465
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
13:01:13.002312 IP 192.168.10.139.65533 > 192.168.132.133.30465: Flags [S], seq 3169086234, win 64260, options [mss 1428,nop,wscale 8,nop,nop,sackOK], length 0
13:01:13.252722 IP 192.168.10.139.65535 > 192.168.132.133.30465: Flags [S], seq 1338864689, win 64260, options [mss 1428,nop,wscale 8,nop,nop,sackOK], length 0
13:01:16.000630 IP 192.168.10.139.65533 > 192.168.132.133.30465: Flags [S], seq 3169086234, win 64260, options [mss 1428,nop,wscale 8,nop,nop,sackOK], length 0
13:01:16.253443 IP 192.168.10.139.65535 > 192.168.132.133.30465: Flags [S], seq 1338864689, win 64260, options [mss 1428,nop,wscale 8,nop,nop,sackOK], length 0
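The capture on k8s-master-main shows only inbound SYNs, each repeated with the same sequence number and never answered with a SYN-ACK: classic SYN retransmission, meaning the DNATed packet is dropped somewhere past the listener. A small sketch (assuming only that a retransmitted SYN reuses its sequence number) that counts such stuck flows in a capture like the one above:

```shell
# Count SYN flows retransmitted with no reply: a sequence number that
# appears more than once among bare [S] packets was never SYN-ACKed.
retrans=$(awk '
  /Flags \[S\]/ {                      # bare SYNs only; [S.] would be a SYN-ACK
    for (i = 1; i < NF; i++)
      if ($i == "seq") { s = $(i + 1); sub(/,$/, "", s); seen[s]++ }
  }
  END { n = 0; for (s in seen) if (seen[s] > 1) n++; print n }' <<'EOF'
13:01:13.002312 IP 192.168.10.139.65533 > 192.168.132.133.30465: Flags [S], seq 3169086234, win 64260
13:01:13.252722 IP 192.168.10.139.65535 > 192.168.132.133.30465: Flags [S], seq 1338864689, win 64260
13:01:16.000630 IP 192.168.10.139.65533 > 192.168.132.133.30465: Flags [S], seq 3169086234, win 64260
13:01:16.253443 IP 192.168.10.139.65535 > 192.168.132.133.30465: Flags [S], seq 1338864689, win 64260
EOF
)
echo "flows stuck retransmitting SYN with no reply: $retrans"   # prints 2
```

In the replica1 capture below, by contrast, every SYN is answered and the handshake completes.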

On my k8s-master-replica1, this is the output from tcpdump:

root@k8s-master-replica1:~# tcpdump -ni any port 30465
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
13:06:15.891608 IP 192.168.10.139.49475 > 192.168.132.135.30465: Flags [S], seq 3874981346, win 64260, options [mss 1428,nop,wscale 8,nop,nop,sackOK], length 0
13:06:15.892994 IP 192.168.132.135.30465 > 192.168.10.139.49475: Flags [S.], seq 4062915301, ack 3874981347, win 26720, options [mss 1336,nop,nop,sackOK,nop,wscale 7], length 0
13:06:15.893576 IP 192.168.10.139.49475 > 192.168.132.135.30465: Flags [.], ack 1, win 260, length 0
13:06:15.895480 IP 192.168.10.139.49475 > 192.168.132.135.30465: Flags [P.], seq 1:518, ack 1, win 260, length 517
13:06:15.895658 IP 192.168.132.135.30465 > 192.168.10.139.49475: Flags [.], ack 518, win 218, length 0
13:06:15.895788 IP 192.168.132.135.30465 > 192.168.10.139.49475: Flags [P.], seq 1:147, ack 518, win 218, length 146
13:06:15.896432 IP 192.168.10.139.49475 > 192.168.132.135.30465: Flags [P.], seq 518:569, ack 147, win 260, length 51
13:06:15.896865 IP 192.168.132.135.30465 > 192.168.10.139.49475: Flags [P.], seq 147:203, ack 569, win 218, length 56
13:06:15.933856 IP 192.168.10.139.49475 > 192.168.132.135.30465: Flags [P.], seq 569:746, ack 203, win 260, length 177
13:06:15.933885 IP 192.168.10.139.49475 > 192.168.132.135.30465: Flags [P.], seq 746:1022, ack 203, win 260, length 276

I created a Docker image for nginx and saw that the system does forward the HTTP port:

root@k8s-master-main:~# tcpdump  -ni any port 8080
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
13:35:19.394587 IP 192.168.10.139.52032 > 192.168.132.133.8080: Flags [F.], seq 3885521030, ack 1729975292, win 252, length 0
13:35:19.394755 IP 192.168.132.133.8080 > 192.168.10.139.52032: Flags [F.], seq 1, ack 1, win 245, length 0
13:35:19.395317 IP 192.168.10.139.52032 > 192.168.132.133.8080: Flags [.], ack 2, win 252, length 0
13:35:19.428629 IP 192.168.10.139.52071 > 192.168.132.133.8080: Flags [S], seq 3014852904, win 64260, options [mss 1428,nop,wscale 8,nop,nop,sackOK], length 0
13:35:19.428724 IP 192.168.132.133.8080 > 192.168.10.139.52071: Flags [S.], seq 1168464434, ack 3014852905, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
13:35:19.430191 IP 192.168.10.139.52071 > 192.168.132.133.8080: Flags [.], ack 1, win 256, length 0
13:35:19.431712 IP 192.168.10.139.52071 > 192.168.132.133.8080: Flags [P.], seq 1:380, ack 1, win 256, length 379: HTTP: GET / HTTP/1.1
13:35:19.431751 IP 192.168.132.133.8080 > 192.168.10.139.52071: Flags [.], ack 380, win 237, length 0
13:35:19.431905 IP 192.168.132.133.8080 > 192.168.10.139.52071: Flags [P.], seq 1:239, ack 380, win 237, length 238: HTTP: HTTP/1.1 200 OK
13:35:19.431983 IP 192.168.132.133.8080 > 192.168.10.139.52071: Flags [P.], seq 239:851, ack 380, win 237, length 612: HTTP
13:35:19.433158 IP 192.168.10.139.52071 > 192.168.132.133.8080: Flags [.], ack 851, win 253, length 0
13:35:19.453609 IP 192.168.10.139.52071 > 192.168.132.133.8080: Flags [P.], seq 380:740, ack 851, win 253, length 360: HTTP: GET /favicon.ico HTTP/1.1
13:35:19.453927 IP 192.168.132.133.8080 > 192.168.10.139.52071: Flags [P.], seq 851:1159, ack 740, win 245, length 308: HTTP: HTTP/1.1 404 Not Found
13:35:19.498549 IP 192.168.10.139.52071 > 192.168.132.133.8080: Flags [.], ack 1159, win 252, length 0

What you expected to happen: To have access to my exposed ports

Versions:

$ weave version
root@pa-k8s-master:~/weave-net# weave version
weave script 2.5.1
weave 2.3.0

$ docker version
root@pa-k8s-master:~/weave-net# docker version
Client:
 Version:           18.09.2
 API version:       1.39
 Go version:        go1.10.6
 Git commit:        6247962
 Built:             Sun Feb 10 04:13:50 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.2
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       6247962
  Built:            Sun Feb 10 03:42:13 2019
  OS/Arch:          linux/amd64
  Experimental:     false

$ uname -a
root@pa-k8s-master:~# uname -a
Linux pa-k8s-master 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ kubectl version
root@pa-k8s-master:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:17:28Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:08:34Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}

Logs:

$ kubectl logs -n kube-system weave https://gist.github.com/mfinkelstine/dea4be768aa62d86af2182a9d8709e10

root@pa-k8s-master:~/weave-net# journalctl -u docker.service --no-pager
-- No entries --
root@pa-k8s-master:~/weave-net# journalctl -u kubelet --no-pager
-- Logs begin at Thu 2019-02-28 00:31:59 IST, end at Thu 2019-03-14 11:35:49 IST. --
Mar 07 13:24:07 pa-k8s-master kubelet[28591]: E0307 13:24:07.007113   28591 logs.go:351] Failed with err write tcp 192.168.132.186:10250->192.168.132.186:32856: write: broken pipe when writing log for log file "/var/log/pods/4e27137d2593f9ab454597831e362169/kube-apiserver/15.log": &{timestamp:{wall:714520332 ext:63687554646 loc:<nil>} stream:stderr log:[69 48 51 48 55 32 49 49 58 50 52 58 48 54 46 55 49 52 51 51 57 32 32 32 32 32 32 32 49 32 97 117 116 104 101 110 116 105 99 97 116 105 111 110 46 103 111 58 54 50 93 32 85 110 97 98 108 101 32 116 111 32 97 117 116 104 101 110 116 105 99 97 116 101 32 116 104 101 32 114 101 113 117 101 115 116 32 100 117 101 32 116 111 32 97 110 32 101 114 114 111 114 58 32 91 105 110 118 97 108 105 100 32 98 101 97 114 101 114 32 116 111 107 101 110 44 32 91 105 110 118 97 108 105 100 32 98 101 97 114 101 114 32 116 111 107 101 110 44 32 84 111 107 101 110 32 104 97 115 32 98 101 101 110 32 105 110 118 97 108 105 100 97 116 101 100 93 93 10]}
Mar 07 13:24:07 pa-k8s-master kubelet[28591]: I0307 13:24:07.341926   28591 logs.go:49] http: multiple response.WriteHeader calls
Mar 11 14:14:57 pa-k8s-master kubelet[28591]: W0311 14:14:57.611575   28591 prober.go:103] No ref for container "docker://9cee398328521844c22454df3f4d40fa3bb7aefa1c4805d07eeb341ca16f3f8e" (kube-apiserver-pa-k8s-master_kube-system(4e27137d2593f9ab454597831e362169):kube-apiserver)
Mar 11 14:14:58 pa-k8s-master kubelet[28591]: E0311 14:14:58.118725   28591 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "pa-k8s-master": Get https://192.168.132.186:6443/api/v1/nodes/pa-k8s-master?resourceVersion=0&timeout=10s: context deadline exceeded
Mar 12 16:31:28 pa-k8s-master kubelet[28591]: W0312 16:31:28.418604   28591 reflector.go:341] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: watch of *v1.Pod ended with: too old resource version: 17780644 (19932211)
Mar 13 11:39:26 pa-k8s-master kubelet[28591]: E0313 11:39:26.277339   28591 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 127.0.0.1:41664->127.0.0.1:46855: write tcp 127.0.0.1:41664->127.0.0.1:46855: write: broken pipe
Mar 13 16:39:57 pa-k8s-master kubelet[28591]: E0313 16:39:57.482488   28591 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 127.0.0.1:58346->127.0.0.1:46855: write tcp 127.0.0.1:58346->127.0.0.1:46855: write: broken pipe
Mar 13 16:59:14 pa-k8s-master kubelet[28591]: E0313 16:59:14.553210   28591 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 127.0.0.1:36622->127.0.0.1:46855: write tcp 127.0.0.1:36622->127.0.0.1:46855: write: broken pipe
Mar 14 02:41:20 pa-k8s-master kubelet[28591]: E0314 02:41:20.187796   28591 kubelet_node_status.go:391] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"OutOfDisk\"},{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"PIDPressure\"},{\"type\":\"Ready\"}],\"conditions\":[{\"lastHeartbeatTime\":\"2019-03-14T00:40:53Z\",\"type\":\"OutOfDisk\"},{\"lastHeartbeatTime\":\"2019-03-14T00:40:53Z\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2019-03-14T00:40:53Z\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2019-03-14T00:40:53Z\",\"type\":\"PIDPressure\"},{\"lastHeartbeatTime\":\"2019-03-14T00:40:53Z\",\"type\":\"Ready\"}]}}" for node "pa-k8s-master": Timeout: request did not complete within allowed duration
Mar 14 02:41:20 pa-k8s-master kubelet[28591]: I0314 02:41:20.546726   28591 kubelet.go:1777] skipping pod synchronization - [container runtime is down]
Mar 14 09:50:46 pa-k8s-master kubelet[28591]: E0314 09:50:46.825555   28591 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 127.0.0.1:38190->127.0.0.1:46855: write tcp 127.0.0.1:38190->127.0.0.1:46855: write: broken pipe
Mar 14 10:02:12 pa-k8s-master kubelet[28591]: E0314 10:02:12.935091   28591 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 127.0.0.1:42268->127.0.0.1:46855: write tcp 127.0.0.1:42268->127.0.0.1:46855: write: broken pipe
Mar 14 10:10:33 pa-k8s-master kubelet[28591]: E0314 10:10:33.146568   28591 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 127.0.0.1:45186->127.0.0.1:46855: write tcp 127.0.0.1:45186->127.0.0.1:46855: write: broken pipe
Mar 14 10:12:27 pa-k8s-master kubelet[28591]: E0314 10:12:27.742693   28591 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 127.0.0.1:45872->127.0.0.1:46855: write tcp 127.0.0.1:45872->127.0.0.1:46855: write: broken pipe
Mar 14 10:12:37 pa-k8s-master kubelet[28591]: E0314 10:12:37.307838   28591 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 127.0.0.1:45930->127.0.0.1:46855: write tcp 127.0.0.1:45930->127.0.0.1:46855: write: broken pipe
Mar 14 10:27:24 pa-k8s-master kubelet[28591]: E0314 10:27:24.068813   28591 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 127.0.0.1:51042->127.0.0.1:46855: write tcp 127.0.0.1:51042->127.0.0.1:46855: write: broken pipe
Mar 14 10:29:38 pa-k8s-master kubelet[28591]: E0314 10:29:38.437402   28591 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 127.0.0.1:51882->127.0.0.1:46855: write tcp 127.0.0.1:51882->127.0.0.1:46855: write: broken pipe
Mar 14 10:54:58 pa-k8s-master kubelet[28591]: W0314 10:54:58.014142   28591 container.go:393] Failed to create summary reader for "/docker/e1b0bbd236d6b035582928223b20424935cba381ea4fc96819b5c3be9b4f4474": none of the resources are being tracked.
root@pa-k8s-master:~/weave-net# kubectl get events
No resources found.
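Incidentally, the kubelet entry above that prints its log buffer as a decimal byte array is plain ASCII; decoding it shows what appears to be an apiserver authentication error ("invalid bearer token, Token has been invalidated"). A minimal decoder sketch, run here on the first five bytes:

```shell
# Decode decimal byte values (as kubelet dumps raw log buffers) to ASCII.
# 69 48 51 48 55 are the first five bytes of the array in the log above.
prefix=$(printf '69 48 51 48 55' | awk '{ for (i = 1; i <= NF; i++) printf "%c", $i }')
echo "$prefix"   # prints E0307 - the klog error header of the buffered line
```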

Network:

root@pa-k8s-master:~/weave-net# ip route
default via 192.168.132.1 dev ens32
10.32.0.0/12 dev weave  proto kernel  scope link  src 10.32.0.1
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1 linkdown
192.168.132.0/23 dev ens32  proto kernel  scope link  src 192.168.132.186

root@pa-k8s-master:~/weave-net# ip -4 -o addr
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: ens32    inet 192.168.132.186/23 brd 192.168.133.255 scope global ens32\       valid_lft forever preferred_lft forever
3: docker0    inet 172.17.0.1/16 scope global docker0\       valid_lft forever preferred_lft forever
6: weave    inet 10.32.0.1/12 brd 10.47.255.255 scope global weave\       valid_lft forever preferred_lft forever

$ sudo iptables-save
https://gist.github.com/mfinkelstine/830d08747717ea53dc8c9e40f11e4b8f

weave connections

root@pa-k8s-master:~# weave status connections
-> 192.168.132.235:6783  failed      Received update for IP range I own at 10.32.0.0 v88: incoming message says owner 92:5d:97:f3:5a:4f v251, retry: 2019-03-14 11:24:47.111003924 +0000 UTC m=+2514398.775392642
-> 192.168.132.189:6783  failed      read tcp4 192.168.132.186:54328->192.168.132.189:6783: read: connection reset by peer, retry: 2019-03-14 11:28:08.371442911 +0000 UTC m=+2514600.035831619
-> 192.168.132.166:6783  failed      Received update for IP range I own at 10.32.0.0 v88: incoming message says owner 92:5d:97:f3:5a:4f v251, retry: 2019-03-14 11:24:02.60213972 +0000 UTC m=+2514354.266528438
-> 192.168.132.186:6783  failed      cannot connect to ourself, retry: never

weave peers

root@pa-k8s-master:~# weave status peers
92:5d:97:f3:5a:4f(pa-k8s-master)
murali-reddy commented 5 years ago

Thanks for reporting the issue. From the errors below in the logs, it seems k8s-master-main is not connected to any peers, hence the difficulty routing traffic. Could you please share the output of weave status connections on k8s-master-main?

INFO: 2019/03/14 08:18:13.804725 ->[192.168.132.166:6783|ea:5b:96:50:b9:38(pa-k8s-worker5)]: connection shutting down due to error: Received update for IP range I own at 10.32.0.0 v88: incoming message says owner 92:5d:97:f3:5a:4f v251
INFO: 2019/03/14 08:21:10.156378 ->[192.168.132.235:6783|be:b8:cd:ee:cc:a1(pa-k8s-worker1)]: connection shutting down due to error: Received update for IP range I own at 10.32.0.0 v88: incoming message says owner 92:5d:97:f3:5a:4f v251
INFO: 2019/03/14 08:22:19.265730 ->[192.168.132.189:6783|16:7c:78:f6:24:bb(pa-k8s-worker2)]: connection shutting down due to error: Received update for IP range I own at 10.32.0.0 v88: incoming message says owner 92:5d:97:f3:5a:4f v251
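The shutdown messages above are weave's IPAM consistency check firing: two persisted views of the address ring disagree about the range starting at 10.32.0.0 (here the incoming claim even names the master's own peer at a much newer version, suggesting stale on-disk state). A deliberately simplified toy of that merge check, using hypothetical peer names pA and pB (weave's real check also compares ring versions, which this sketch ignores):

```shell
# Toy model (not weave's implementation): each peer persists claims of
# (range-token, owner). Merging two views that bind the same token to
# different owners is the kind of conflict the log messages report.
conflict=$(awk '
  ($1 in owner) && owner[$1] != $2 {
    printf "range %s claimed by both %s and %s\n", $1, owner[$1], $2
  }
  { owner[$1] = $2 }' <<'EOF'
10.32.0.0 pA
10.32.0.0 pB
EOF
)
echo "$conflict"   # prints: range 10.32.0.0 claimed by both pA and pB
```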
mfinkelstine commented 5 years ago

This is from my second cluster

root@k8s-master-main:~# kubectl exec -n kube-system weave-net-wcmdw -c weave -- /home/weave/weave --local status connections
-> 192.168.132.133:6783  failed      cannot connect to ourself, retry: never
-> 192.168.132.210:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:24:09.149257621 +0000 UTC m=+2176534.442359450
-> 192.168.132.156:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:25:46.154122588 +0000 UTC m=+2176631.447224427
-> 192.168.132.135:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:29:35.991252955 +0000 UTC m=+2176861.284354804
-> 192.168.132.136:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:28:45.099765689 +0000 UTC m=+2176810.392867518
-> 192.168.132.203:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:24:06.533736434 +0000 UTC m=+2176531.826838263
-> 192.168.132.168:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:27:13.805004941 +0000 UTC m=+2176719.098106760
-> 192.168.132.175:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:26:55.611503829 +0000 UTC m=+2176700.904605668
-> 192.168.132.243:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:24:11.017299887 +0000 UTC m=+2176536.310401726
-> 192.168.132.180:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:26:15.549998236 +0000 UTC m=+2176660.843100065
-> 192.168.132.174:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:30:39.171722283 +0000 UTC m=+2176924.464824112

I have also checked the peers between the hosts

[INFO] weave-net-wcmdw WeaveNET [ k8s-master-main ] ##########
ce:73:f4:54:77:80(k8s-master-main)
[INFO] weave-net-vx6t4 WeaveNET [ k8s-master-replica1 ] ##########
6e:09:e1:bb:13:0d(k8s-node7)
   -> 192.168.132.156:6783  2e:b2:36:6b:35:8c(k8s-node5)          established
   <- 192.168.132.135:60302 12:6a:c8:c1:9f:96(k8s-master-replica1) established
   <- 192.168.132.136:55721 36:73:b6:38:d7:29(k8s-master-replica2) established
   <- 192.168.132.180:48642 1e:fe:30:e3:d2:c3(k8s-node4)          established
   -> 192.168.132.168:6783  ca:73:6b:33:61:25(k8s-node8)          established
   <- 192.168.132.210:38524 7e:5e:97:65:34:43(k8s-node3)          established
   <- 192.168.132.175:44555 02:02:64:37:51:db(k8s-node2)          established
   <- 192.168.132.174:34144 ea:55:d7:b9:54:f5(k8s-node6)          established
   <- 192.168.132.243:55516 ce:1b:a6:1e:67:4d(k8s-node1)          established
...

All the other nodes have established connections to all the workers and masters except the main.

The Kubernetes ConfigMap shows the right masters and nodes:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    kube-peers.weave.works/peers: '{"Peers":[{"PeerName":"ce:73:f4:54:77:80","NodeName":"k8s-master-main"},{"PeerName":"36:73:b6:38:d7:29","NodeName":"k8s-master-replica2"},{"PeerName":"12:6a:c8:c1:9f:96","NodeName":"k8s-master-replica1"},{"PeerName":"ce:1b:a6:1e:67:4d","NodeName":"k8s-node1"},{"PeerName":"02:02:64:37:51:db","NodeName":"k8s-node2"},{"PeerName":"7e:5e:97:65:34:43","NodeName":"k8s-node3"},{"PeerName":"1e:fe:30:e3:d2:c3","NodeName":"k8s-node4"},{"PeerName":"2e:b2:36:6b:35:8c","NodeName":"k8s-node5"},{"PeerName":"ea:55:d7:b9:54:f5","NodeName":"k8s-node6"},{"PeerName":"6e:09:e1:bb:13:0d","NodeName":"k8s-node7"},{"PeerName":"ca:73:6b:33:61:25","NodeName":"k8s-node8"}]}'
  creationTimestamp: 2018-08-15T15:26:48Z
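For a quick sanity check that every node is registered in that annotation, the NodeName values can be extracted without jq. A small sketch against an abbreviated copy of the annotation (the real one lists all eleven peers; only two are kept here for brevity):

```shell
# Abbreviated kube-peers annotation fixture; pull out each NodeName.
annotation='{"Peers":[{"PeerName":"ce:73:f4:54:77:80","NodeName":"k8s-master-main"},{"PeerName":"12:6a:c8:c1:9f:96","NodeName":"k8s-master-replica1"}]}'
names=$(printf '%s\n' "$annotation" | tr ',' '\n' \
          | sed -n 's/.*"NodeName":"\([^"]*\)".*/\1/p')
printf '%s\n' "$names"
echo "registered peers: $(printf '%s\n' "$names" | wc -l)"
```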
mfinkelstine commented 5 years ago

I want to know how to reconnect my master to the other nodes without losing any rules/data,

and what these failures mean:

<- 192.168.132.210:55228 established fastdp 7e:5e:97:65:34:43(k8s-node3) mtu=1376
-> 192.168.132.133:6783  failed      Merge of incoming data causes: Entry 10.40.0.0-10.41.128.0 reporting too much free space: 131068 > 98304, retry: 2019-03-17 09:40:44.753235327 +0000 UTC m=+1715829.190134316
-> 192.168.132.156:6783  failed      cannot connect to ourself, retry: never
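The "too much free space" figure compares a free-address count against the size of the entry's range: 10.40.0.0 up to (but not including) 10.41.128.0 contains only 98,304 addresses, so a claim of 131,068 free addresses is impossible and the merge is rejected. The arithmetic, as a sketch:

```shell
# Size of the entry 10.40.0.0-10.41.128.0 that the error complains about.
ip2int() {
  old_ifs=$IFS; IFS=.
  set -- $1                     # split dotted quad into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}
size=$(( $(ip2int 10.41.128.0) - $(ip2int 10.40.0.0) ))
echo "entry size: $size addresses"   # prints 98304; a free count of 131068 cannot fit
```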
bboreham commented 5 years ago

reporting too much free space

That's a new one. Can you run weave report on the two nodes that are trying to connect and upload those files here?

You can probably remove the condition by deleting the data file under /var/lib/weave and restarting the pod (or rebooting that node). But please get the report data first so we can understand how it went wrong.
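The remediation described above boils down to removing weave's persisted state and letting a pod restart re-sync it from the cluster. A dry-run sketch against a scratch directory (on the affected node the real path is /var/lib/weave with the weave-net.db persistence file mentioned later in this thread, and afterwards you would delete the weave-net pod, e.g. with kubectl delete pod -n kube-system, so the DaemonSet restarts it with fresh state):

```shell
# Dry-run of the remediation against a scratch directory, not a real node.
scratch=$(mktemp -d)
mkdir -p "$scratch/var/lib/weave"
: > "$scratch/var/lib/weave/weave-net.db"      # stand-in persistence file
rm -f "$scratch/var/lib/weave/weave-net.db"    # the actual remediation step
[ ! -e "$scratch/var/lib/weave/weave-net.db" ] && result="persistence file removed"
echo "$result"
rm -rf "$scratch"
```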

mfinkelstine commented 5 years ago

hi @bboreham ,

Here are the report results from weave-net-wcmdw, which runs on k8s-master-main:

root@k8s-master-main:~/weave-net# kubectl exec -n kube-system weave-net-wcmdw -c weave -- /home/weave/weave --local status connections
-> 192.168.132.133:6783  failed      cannot connect to ourself, retry: never
-> 192.168.132.210:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:46:17.553913063 +0000 UTC m=+2501862.847014932
-> 192.168.132.156:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:47:17.589009548 +0000 UTC m=+2501922.882111387
-> 192.168.132.135:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:45:23.067798828 +0000 UTC m=+2501808.360900657
-> 192.168.132.136:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:46:50.327646137 +0000 UTC m=+2501895.620747966
-> 192.168.132.203:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:47:48.880334997 +0000 UTC m=+2501954.173436826
-> 192.168.132.168:6783  failed      read tcp4 192.168.132.133:38976->192.168.132.168:6783: read: connection reset by peer, retry: 2019-03-18 09:51:15.536063617 +0000 UTC m=+2502160.829165516
-> 192.168.132.175:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:50:03.200498611 +0000 UTC m=+2502088.493600440
-> 192.168.132.243:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:46:01.098143619 +0000 UTC m=+2501846.391245468
-> 192.168.132.180:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:46:43.444096765 +0000 UTC m=+2501888.737198594
-> 192.168.132.174:6783  failed      Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:47:50.858212885 +0000 UTC m=+2501956.151314724
{
    "Ready": true,
    "Version": "2.4.0",
    "VersionCheck": {
        "Enabled": true,
        "Success": true,
        "NewVersion": "2.5.1",
        "NextCheckAt": "2019-03-18T09:55:46.006493402Z"
    },
    "Router": {
        "Protocol": "weave",
        "ProtocolMinVersion": 1,
        "ProtocolMaxVersion": 2,
        "Encryption": false,
        "PeerDiscovery": true,
        "Name": "ce:73:f4:54:77:80",
        "NickName": "k8s-master-main",
        "Port": 6783,
        "Peers": [
            {
                "Name": "ce:73:f4:54:77:80",
                "NickName": "k8s-master-main",
                "UID": 12988469968597128521,
                "ShortID": 780,
                "Version": 289447,
                "Connections": null
            }
        ],
        "UnicastRoutes": [
            {
                "Dest": "ce:73:f4:54:77:80",
                "Via": "00:00:00:00:00:00"
            }
        ],
        "BroadcastRoutes": [
            {
                "Source": "ce:73:f4:54:77:80",
                "Via": null
            }
        ],
        "Connections": [
            {
                "Address": "192.168.132.168:6783",
                "Outbound": true,
                "State": "failed",
                "Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:24:38.338984999 +0000 UTC m=+2500563.632086828",
                "Attrs": null
            },
            {
                "Address": "192.168.132.175:6783",
                "Outbound": true,
                "State": "failed",
                "Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:29:44.896211096 +0000 UTC m=+2500870.189313035",
                "Attrs": null
            },
            {
                "Address": "192.168.132.243:6783",
                "Outbound": true,
                "State": "failed",
                "Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:24:59.860844645 +0000 UTC m=+2500585.153946494",
                "Attrs": null
            },
            {
                "Address": "192.168.132.180:6783",
                "Outbound": true,
                "State": "failed",
                "Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:26:03.762294852 +0000 UTC m=+2500649.055396731",
                "Attrs": null
            },
            {
                "Address": "192.168.132.174:6783",
                "Outbound": true,
                "State": "failed",
                "Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:24:36.065972774 +0000 UTC m=+2500561.359074623",
                "Attrs": null
            },
            {
                "Address": "192.168.132.203:6783",
                "Outbound": true,
                "State": "failed",
                "Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:23:13.174067983 +0000 UTC m=+2500478.467169842",
                "Attrs": null
            },
            {
                "Address": "192.168.132.210:6783",
                "Outbound": true,
                "State": "failed",
                "Info": "no working forwarders to 7e:5e:97:65:34:43(k8s-node3), retry: 2019-03-18 09:23:39.754483788 +0000 UTC m=+2500505.047585627",
                "Attrs": null
            },
            {
                "Address": "192.168.132.156:6783",
                "Outbound": true,
                "State": "failed",
                "Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:21:37.013001354 +0000 UTC m=+2500382.306103183",
                "Attrs": null
            },
            {
                "Address": "192.168.132.135:6783",
                "Outbound": true,
                "State": "failed",
                "Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:25:03.735827028 +0000 UTC m=+2500589.028928867",
                "Attrs": null
            },
            {
                "Address": "192.168.132.136:6783",
                "Outbound": true,
                "State": "failed",
                "Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:25:52.910799993 +0000 UTC m=+2500638.203901812",
                "Attrs": null
            },
            {
                "Address": "192.168.132.133:6783",
                "Outbound": true,
                "State": "failed",
                "Info": "cannot connect to ourself, retry: never",
                "Attrs": null
            }
        ],
        "TerminationCount": 139542,
        "Targets": [
            "192.168.132.135",
            "192.168.132.180",
            "192.168.132.136",
            "192.168.132.174",
            "192.168.132.203",
            "192.168.132.168",
            "192.168.132.133",
            "192.168.132.175",
            "192.168.132.210",
            "192.168.132.243",
            "192.168.132.156"
        ],
        "OverlayDiagnostics": {
            "fastdp": {
                "Vports": [
                    {
                        "ID": 0,
                        "Name": "datapath",
                        "TypeName": "internal"
                    },
                    {
                        "ID": 1,
                        "Name": "vethwe-datapath",
                        "TypeName": "netdev"
                    },
                    {
                        "ID": 2,
                        "Name": "vxlan-6784",
                        "TypeName": "vxlan"
                    }
                ],
                "Flows": []
            },
            "sleeve": null
        },
        "TrustedSubnets": [],
        "Interface": "datapath (via ODP)",
        "CaptureStats": {
            "FlowMisses": 138011
        },
        "MACs": null
    },
    "IPAM": {
        "Paxos": null,
        "Range": "10.32.0.0/12",
        "RangeNumIPs": 1048576,
        "ActiveIPs": 4,
        "DefaultSubnet": "10.32.0.0/12",
        "Entries": [
            {
                "Token": "10.32.0.0",
                "Size": 131072,
                "Peer": "36:73:b6:38:d7:29",
                "Nickname": "k8s-master-replica2",
                "IsKnownPeer": false,
                "Version": 7
            },
            {
                "Token": "10.34.0.0",
                "Size": 32768,
                "Peer": "7e:5e:97:65:34:43",
                "Nickname": "k8s-node3",
                "IsKnownPeer": false,
                "Version": 24806
            },
            {
                "Token": "10.34.128.0",
                "Size": 32768,
                "Peer": "ca:73:6b:33:61:25",
                "Nickname": "k8s-node8",
                "IsKnownPeer": false,
                "Version": 1615
            },
            {
                "Token": "10.35.0.0",
                "Size": 32768,
                "Peer": "7e:5e:97:65:34:43",
                "Nickname": "k8s-node3",
                "IsKnownPeer": false,
                "Version": 0
            },
            {
                "Token": "10.35.128.0",
                "Size": 32768,
                "Peer": "2e:b2:36:6b:35:8c",
                "Nickname": "k8s-node5",
                "IsKnownPeer": false,
                "Version": 6720
            },
            {
                "Token": "10.36.0.0",
                "Size": 65536,
                "Peer": "ea:55:d7:b9:54:f5",
                "Nickname": "k8s-node6",
                "IsKnownPeer": false,
                "Version": 1765
            },
            {
                "Token": "10.37.0.0",
                "Size": 65536,
                "Peer": "6e:09:e1:bb:13:0d",
                "Nickname": "k8s-node7",
                "IsKnownPeer": false,
                "Version": 2187
            },
            {
                "Token": "10.38.0.0",
                "Size": 131072,
                "Peer": "ce:1b:a6:1e:67:4d",
                "Nickname": "k8s-node1",
                "IsKnownPeer": false,
                "Version": 5016
            },
            {
                "Token": "10.40.0.0",
                "Size": 131072,
                "Peer": "ce:73:f4:54:77:80",
                "Nickname": "k8s-master-main",
                "IsKnownPeer": true,
                "Version": 16
            },
            {
                "Token": "10.42.0.0",
                "Size": 131072,
                "Peer": "02:02:64:37:51:db",
                "Nickname": "k8s-node2",
                "IsKnownPeer": false,
                "Version": 6057
            },
            {
                "Token": "10.44.0.0",
                "Size": 65536,
                "Peer": "ea:55:d7:b9:54:f5",
                "Nickname": "k8s-node6",
                "IsKnownPeer": false,
                "Version": 515
            },
            {
                "Token": "10.45.0.0",
                "Size": 65536,
                "Peer": "1e:fe:30:e3:d2:c3",
                "Nickname": "k8s-node4",
                "IsKnownPeer": false,
                "Version": 4993
            },
            {
                "Token": "10.46.0.0",
                "Size": 131072,
                "Peer": "12:6a:c8:c1:9f:96",
                "Nickname": "k8s-master-replica1",
                "IsKnownPeer": false,
                "Version": 1
            }
        ],
        "PendingClaims": null,
        "PendingAllocates": null
    }
}
bboreham commented 5 years ago

OK, I think I understand the message now. Remediation in your cluster is the same: remove the persistence file and restart. This is broadly the same as #3310.

It would be good to understand how your cluster got into this state. #1962 is our best idea of how to get out of it without human interaction.

mfinkelstine commented 5 years ago

Hi @bboreham,

Thanks for the reply. We (our team) don't know exactly how we got into this situation; it happened on 2 different clusters.

Are you talking about the file weave-net.db, which is located in /var/lib/weave/?