mfinkelstine opened this issue 5 years ago
Thanks for reporting the issue. From the errors below in the logs, it seems like k8s-master-main is not connected to any peers, hence the difficulty in routing traffic. Could you please share the output of weave status connections on k8s-master-main?
INFO: 2019/03/14 08:18:13.804725 ->[192.168.132.166:6783|ea:5b:96:50:b9:38(pa-k8s-worker5)]: connection shutting down due to error: Received update for IP range I own at 10.32.0.0 v88: incoming message says owner 92:5d:97:f3:5a:4f v251
INFO: 2019/03/14 08:21:10.156378 ->[192.168.132.235:6783|be:b8:cd:ee:cc:a1(pa-k8s-worker1)]: connection shutting down due to error: Received update for IP range I own at 10.32.0.0 v88: incoming message says owner 92:5d:97:f3:5a:4f v251
INFO: 2019/03/14 08:22:19.265730 ->[192.168.132.189:6783|16:7c:78:f6:24:bb(pa-k8s-worker2)]: connection shutting down due to error: Received update for IP range I own at 10.32.0.0 v88: incoming message says owner 92:5d:97:f3:5a:4f v251
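For reference, these messages come from the weave container's log and can be pulled with kubectl; the pod name below is a placeholder, substitute the weave-net pod on the affected node:
kubectl logs -n kube-system <weave-net-pod> -c weave | grep 'IP range'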
This is from my second cluster
root@k8s-master-main:~# kubectl exec -n kube-system weave-net-wcmdw -c weave -- /home/weave/weave --local status connections
-> 192.168.132.133:6783 failed cannot connect to ourself, retry: never
-> 192.168.132.210:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:24:09.149257621 +0000 UTC m=+2176534.442359450
-> 192.168.132.156:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:25:46.154122588 +0000 UTC m=+2176631.447224427
-> 192.168.132.135:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:29:35.991252955 +0000 UTC m=+2176861.284354804
-> 192.168.132.136:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:28:45.099765689 +0000 UTC m=+2176810.392867518
-> 192.168.132.203:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:24:06.533736434 +0000 UTC m=+2176531.826838263
-> 192.168.132.168:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:27:13.805004941 +0000 UTC m=+2176719.098106760
-> 192.168.132.175:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:26:55.611503829 +0000 UTC m=+2176700.904605668
-> 192.168.132.243:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:24:11.017299887 +0000 UTC m=+2176536.310401726
-> 192.168.132.180:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:26:15.549998236 +0000 UTC m=+2176660.843100065
-> 192.168.132.174:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-14 15:30:39.171722283 +0000 UTC m=+2176924.464824112
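The allocator's view of the ring can be inspected the same way; a minimal example using the same pod as above:
root@k8s-master-main:~# kubectl exec -n kube-system weave-net-wcmdw -c weave -- /home/weave/weave --local status ipam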
I have also checked the peers between the hosts:
[INFO] weave-net-wcmdw WeaveNET [ k8s-master-main ] ##########
ce:73:f4:54:77:80(k8s-master-main)
[INFO] weave-net-vx6t4 WeaveNET [ k8s-master-replica1 ] ##########
6e:09:e1:bb:13:0d(k8s-node7)
-> 192.168.132.156:6783 2e:b2:36:6b:35:8c(k8s-node5) established
<- 192.168.132.135:60302 12:6a:c8:c1:9f:96(k8s-master-replica1) established
<- 192.168.132.136:55721 36:73:b6:38:d7:29(k8s-master-replica2) established
<- 192.168.132.180:48642 1e:fe:30:e3:d2:c3(k8s-node4) established
-> 192.168.132.168:6783 ca:73:6b:33:61:25(k8s-node8) established
<- 192.168.132.210:38524 7e:5e:97:65:34:43(k8s-node3) established
<- 192.168.132.175:44555 02:02:64:37:51:db(k8s-node2) established
<- 192.168.132.174:34144 ea:55:d7:b9:54:f5(k8s-node6) established
<- 192.168.132.243:55516 ce:1b:a6:1e:67:4d(k8s-node1) established
...
All the other nodes have established connections between all the workers and masters except the main one.
The k8s ConfigMap shows the right masters and nodes:
apiVersion: v1
kind: ConfigMap
metadata:
annotations:
kube-peers.weave.works/peers: '{"Peers":[{"PeerName":"ce:73:f4:54:77:80","NodeName":"k8s-master-main"},{"PeerName":"36:73:b6:38:d7:29","NodeName":"k8s-master-replica2"},{"PeerName":"12:6a:c8:c1:9f:96","NodeName":"k8s-master-replica1"},{"PeerName":"ce:1b:a6:1e:67:4d","NodeName":"k8s-node1"},{"PeerName":"02:02:64:37:51:db","NodeName":"k8s-node2"},{"PeerName":"7e:5e:97:65:34:43","NodeName":"k8s-node3"},{"PeerName":"1e:fe:30:e3:d2:c3","NodeName":"k8s-node4"},{"PeerName":"2e:b2:36:6b:35:8c","NodeName":"k8s-node5"},{"PeerName":"ea:55:d7:b9:54:f5","NodeName":"k8s-node6"},{"PeerName":"6e:09:e1:bb:13:0d","NodeName":"k8s-node7"},{"PeerName":"ca:73:6b:33:61:25","NodeName":"k8s-node8"}]}'
creationTimestamp: 2018-08-15T15:26:48Z
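For anyone following along, the annotation above can be read back with kubectl; this assumes the default installation, where the ConfigMap is named weave-net in kube-system:
kubectl get configmap weave-net -n kube-system -o yaml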
I want to know how I can reconnect my master to the other nodes without losing any rules/data, and what these failures mean:
<- 192.168.132.210:55228 established fastdp 7e:5e:97:65:34:43(k8s-node3) mtu=1376
-> 192.168.132.133:6783 failed Merge of incoming data causes: Entry 10.40.0.0-10.41.128.0 reporting too much free space: 131068 > 98304, retry: 2019-03-17 09:40:44.753235327 +0000 UTC m=+1715829.190134316
-> 192.168.132.156:6783 failed cannot connect to ourself, retry: never
"reporting too much free space"

That's a new one. Can you run weave report on the two nodes that are trying to connect and upload those files here?
You can probably remove the condition by deleting the data file under /var/lib/weave and restarting the pod (or rebooting that node). But please get the report data first so we can understand how it went wrong.
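A minimal sketch of that sequence, using the pod name from this thread and assuming the persistence file sits directly under /var/lib/weave (check the directory for the exact file name):

# 1. capture the report before touching anything
kubectl exec -n kube-system weave-net-wcmdw -c weave -- /home/weave/weave --local report > weave-report.json
# 2. on the affected node, remove the persisted IPAM state
ls /var/lib/weave/                         # confirm the file name first
sudo rm /var/lib/weave/weave-netdata.db    # name may differ by version
# 3. delete the pod so the DaemonSet restarts it with clean state
kubectl delete pod -n kube-system weave-net-wcmdw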
Hi @bboreham,
Here are the report results from weave-net-wcmdw, which runs on k8s-master-main:
root@k8s-master-main:~/weave-net# kubectl exec -n kube-system weave-net-wcmdw -c weave -- /home/weave/weave --local status connections
-> 192.168.132.133:6783 failed cannot connect to ourself, retry: never
-> 192.168.132.210:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:46:17.553913063 +0000 UTC m=+2501862.847014932
-> 192.168.132.156:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:47:17.589009548 +0000 UTC m=+2501922.882111387
-> 192.168.132.135:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:45:23.067798828 +0000 UTC m=+2501808.360900657
-> 192.168.132.136:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:46:50.327646137 +0000 UTC m=+2501895.620747966
-> 192.168.132.203:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:47:48.880334997 +0000 UTC m=+2501954.173436826
-> 192.168.132.168:6783 failed read tcp4 192.168.132.133:38976->192.168.132.168:6783: read: connection reset by peer, retry: 2019-03-18 09:51:15.536063617 +0000 UTC m=+2502160.829165516
-> 192.168.132.175:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:50:03.200498611 +0000 UTC m=+2502088.493600440
-> 192.168.132.243:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:46:01.098143619 +0000 UTC m=+2501846.391245468
-> 192.168.132.180:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:46:43.444096765 +0000 UTC m=+2501888.737198594
-> 192.168.132.174:6783 failed Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:47:50.858212885 +0000 UTC m=+2501956.151314724
{
"Ready": true,
"Version": "2.4.0",
"VersionCheck": {
"Enabled": true,
"Success": true,
"NewVersion": "2.5.1",
"NextCheckAt": "2019-03-18T09:55:46.006493402Z"
},
"Router": {
"Protocol": "weave",
"ProtocolMinVersion": 1,
"ProtocolMaxVersion": 2,
"Encryption": false,
"PeerDiscovery": true,
"Name": "ce:73:f4:54:77:80",
"NickName": "k8s-master-main",
"Port": 6783,
"Peers": [
{
"Name": "ce:73:f4:54:77:80",
"NickName": "k8s-master-main",
"UID": 12988469968597128521,
"ShortID": 780,
"Version": 289447,
"Connections": null
}
],
"UnicastRoutes": [
{
"Dest": "ce:73:f4:54:77:80",
"Via": "00:00:00:00:00:00"
}
],
"BroadcastRoutes": [
{
"Source": "ce:73:f4:54:77:80",
"Via": null
}
],
"Connections": [
{
"Address": "192.168.132.168:6783",
"Outbound": true,
"State": "failed",
"Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:24:38.338984999 +0000 UTC m=+2500563.632086828",
"Attrs": null
},
{
"Address": "192.168.132.175:6783",
"Outbound": true,
"State": "failed",
"Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:29:44.896211096 +0000 UTC m=+2500870.189313035",
"Attrs": null
},
{
"Address": "192.168.132.243:6783",
"Outbound": true,
"State": "failed",
"Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:24:59.860844645 +0000 UTC m=+2500585.153946494",
"Attrs": null
},
{
"Address": "192.168.132.180:6783",
"Outbound": true,
"State": "failed",
"Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:26:03.762294852 +0000 UTC m=+2500649.055396731",
"Attrs": null
},
{
"Address": "192.168.132.174:6783",
"Outbound": true,
"State": "failed",
"Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:24:36.065972774 +0000 UTC m=+2500561.359074623",
"Attrs": null
},
{
"Address": "192.168.132.203:6783",
"Outbound": true,
"State": "failed",
"Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:23:13.174067983 +0000 UTC m=+2500478.467169842",
"Attrs": null
},
{
"Address": "192.168.132.210:6783",
"Outbound": true,
"State": "failed",
"Info": "no working forwarders to 7e:5e:97:65:34:43(k8s-node3), retry: 2019-03-18 09:23:39.754483788 +0000 UTC m=+2500505.047585627",
"Attrs": null
},
{
"Address": "192.168.132.156:6783",
"Outbound": true,
"State": "failed",
"Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:21:37.013001354 +0000 UTC m=+2500382.306103183",
"Attrs": null
},
{
"Address": "192.168.132.135:6783",
"Outbound": true,
"State": "failed",
"Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:25:03.735827028 +0000 UTC m=+2500589.028928867",
"Attrs": null
},
{
"Address": "192.168.132.136:6783",
"Outbound": true,
"State": "failed",
"Info": "Peer ca:73:6b:33:61:25 says it owns the IP range from 10.41.128.0, which I think I own, retry: 2019-03-18 09:25:52.910799993 +0000 UTC m=+2500638.203901812",
"Attrs": null
},
{
"Address": "192.168.132.133:6783",
"Outbound": true,
"State": "failed",
"Info": "cannot connect to ourself, retry: never",
"Attrs": null
}
],
"TerminationCount": 139542,
"Targets": [
"192.168.132.135",
"192.168.132.180",
"192.168.132.136",
"192.168.132.174",
"192.168.132.203",
"192.168.132.168",
"192.168.132.133",
"192.168.132.175",
"192.168.132.210",
"192.168.132.243",
"192.168.132.156"
],
"OverlayDiagnostics": {
"fastdp": {
"Vports": [
{
"ID": 0,
"Name": "datapath",
"TypeName": "internal"
},
{
"ID": 1,
"Name": "vethwe-datapath",
"TypeName": "netdev"
},
{
"ID": 2,
"Name": "vxlan-6784",
"TypeName": "vxlan"
}
],
"Flows": []
},
"sleeve": null
},
"TrustedSubnets": [],
"Interface": "datapath (via ODP)",
"CaptureStats": {
"FlowMisses": 138011
},
"MACs": null
},
"IPAM": {
"Paxos": null,
"Range": "10.32.0.0/12",
"RangeNumIPs": 1048576,
"ActiveIPs": 4,
"DefaultSubnet": "10.32.0.0/12",
"Entries": [
{
"Token": "10.32.0.0",
"Size": 131072,
"Peer": "36:73:b6:38:d7:29",
"Nickname": "k8s-master-replica2",
"IsKnownPeer": false,
"Version": 7
},
{
"Token": "10.34.0.0",
"Size": 32768,
"Peer": "7e:5e:97:65:34:43",
"Nickname": "k8s-node3",
"IsKnownPeer": false,
"Version": 24806
},
{
"Token": "10.34.128.0",
"Size": 32768,
"Peer": "ca:73:6b:33:61:25",
"Nickname": "k8s-node8",
"IsKnownPeer": false,
"Version": 1615
},
{
"Token": "10.35.0.0",
"Size": 32768,
"Peer": "7e:5e:97:65:34:43",
"Nickname": "k8s-node3",
"IsKnownPeer": false,
"Version": 0
},
{
"Token": "10.35.128.0",
"Size": 32768,
"Peer": "2e:b2:36:6b:35:8c",
"Nickname": "k8s-node5",
"IsKnownPeer": false,
"Version": 6720
},
{
"Token": "10.36.0.0",
"Size": 65536,
"Peer": "ea:55:d7:b9:54:f5",
"Nickname": "k8s-node6",
"IsKnownPeer": false,
"Version": 1765
},
{
"Token": "10.37.0.0",
"Size": 65536,
"Peer": "6e:09:e1:bb:13:0d",
"Nickname": "k8s-node7",
"IsKnownPeer": false,
"Version": 2187
},
{
"Token": "10.38.0.0",
"Size": 131072,
"Peer": "ce:1b:a6:1e:67:4d",
"Nickname": "k8s-node1",
"IsKnownPeer": false,
"Version": 5016
},
{
"Token": "10.40.0.0",
"Size": 131072,
"Peer": "ce:73:f4:54:77:80",
"Nickname": "k8s-master-main",
"IsKnownPeer": true,
"Version": 16
},
{
"Token": "10.42.0.0",
"Size": 131072,
"Peer": "02:02:64:37:51:db",
"Nickname": "k8s-node2",
"IsKnownPeer": false,
"Version": 6057
},
{
"Token": "10.44.0.0",
"Size": 65536,
"Peer": "ea:55:d7:b9:54:f5",
"Nickname": "k8s-node6",
"IsKnownPeer": false,
"Version": 515
},
{
"Token": "10.45.0.0",
"Size": 65536,
"Peer": "1e:fe:30:e3:d2:c3",
"Nickname": "k8s-node4",
"IsKnownPeer": false,
"Version": 4993
},
{
"Token": "10.46.0.0",
"Size": 131072,
"Peer": "12:6a:c8:c1:9f:96",
"Nickname": "k8s-master-replica1",
"IsKnownPeer": false,
"Version": 1
}
],
"PendingClaims": null,
"PendingAllocates": null
}
}
OK, I think I understand the message now. Remediation in your cluster is the same - remove the persistence file and restart. This is broadly the same as #3310
It would be good to understand how your cluster got into this state. #1962 is our best idea of how to get out of it without human interaction.
Hi @bboreham,
Thanks for the reply. We (our team) don't know exactly how we got into this situation; it happened on two different clusters.
Are you talking about the file weave-net.db, which is located under /var/lib/weave/?
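A quick way to confirm the exact file name is to list the directory on the node, since the name has differed between versions:
root@k8s-master-main:~# ls -l /var/lib/weave/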
Issue redirecting exposed port to worker nodes
First k8s cluster: I have a k8s environment with 11 nodes, of which 3 are master nodes (2 replicas). With several services and pods running, my main master node stopped exposing ports, while the other masters (replica1, replica2) are exposing the ports for my pods.
Second k8s cluster: I have the same issue with a second cluster that has 1 master and 3 worker nodes.
The logs that I have added are from my second cluster.
Here is an example with the k8s dashboard, which works through my 2 replicas but not through the main master node, even though the port is exposed.
On my k8s-master-main, this is the output from tcpdump:
We can see that the host is listening on the port but does not forward the traffic.
Here are the iptables rules that I have for the dashboard; they are identical on all master nodes.
The dashboard listens on port 30465; here is the output of my tcpdump:
On my k8s-master-replica1, this is the output from tcpdump:
I created a Docker image for nginx and saw that the system did redirect the HTTP port.
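For reference, the checks described above look something like this, assuming NodePort 30465 from this report:

# watch traffic arriving on the NodePort
sudo tcpdump -n -i any port 30465
# list the NAT rules kube-proxy installed for that port
sudo iptables-save -t nat | grep 30465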
What you expected to happen: To have access to my exposed ports
Versions:
Logs:
$ kubectl logs -n kube-system weave
https://gist.github.com/mfinkelstine/dea4be768aa62d86af2182a9d8709e10
Network:
weave
weave peers