HackToHell opened this issue
I was able to isolate the logs better; I'm getting:
WARN: 2019/02/28 16:09:51.536184 Vetoed installation of hairpin flow FlowSpec{keys: [InPortFlowKey{vport: 1} EthernetFlowKey{src: 12:cc:f5:35:61:28, dst: 66:2f:0f:7c:15:6d}], actions: [OutputAction{vport: 1}]}
WARN: 2019/02/28 16:09:52.536743 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 12:cc:f5:35:61:28, dst: 66:2f:0f:7c:15:6d} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
WARN: 2019/02/28 16:09:54.540654 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 12:cc:f5:35:61:28, dst: 66:2f:0f:7c:15:6d} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
WARN: 2019/02/28 16:09:58.548697 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 12:cc:f5:35:61:28, dst: 66:2f:0f:7c:15:6d} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
WARN: 2019/02/28 16:10:06.570492 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 12:cc:f5:35:61:28, dst: 66:2f:0f:7c:15:6d} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
WARN: 2019/02/28 16:10:08.568584 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 12:cc:f5:35:61:28, dst: 66:2f:0f:7c:15:6d} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
12:cc:f5:35:61:28 is the MAC address of the weave bridge:
6: weave: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state UP group default qlen 1000
    link/ether 12:cc:f5:35:61:28 brd ff:ff:ff:ff:ff:ff
    inet 10.32.200.0/16 brd 10.32.255.255 scope global weave
       valid_lft forever preferred_lft forever
    inet6 fe80::10cc:f5ff:fe35:6128/64 scope link
       valid_lft forever preferred_lft forever
> Looks like it was an issue in Openshift SDN
OpenShift and Weave Net are implemented in completely different ways; there is no reason to suppose that a similar symptom in one would be related to another.
> Vetoed installation of hairpin flow
This can happen in harmful and non-harmful cases - see #2808
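For context, a hairpin flow is one whose output action would send the packet back out of the same datapath port it arrived on; note InPortFlowKey{vport: 1} paired with OutputAction{vport: 1} in the warnings above. A rough sketch of that kind of check, using simplified made-up types rather than Weave's actual implementation:

```go
// Illustrative sketch only (not Weave's actual code): a "hairpin" flow is one
// whose output action forwards the packet back out of the vport it arrived on,
// so the router refuses to install it.
package main

import "fmt"

type FlowSpec struct {
	InPort      int   // datapath vport the packet arrived on
	OutputPorts []int // vports the flow would forward to
}

// isHairpin reports whether the flow forwards a packet back out of the same
// vport it came in on (e.g. InPort 1 -> Output vport 1, as in the log above).
func isHairpin(f FlowSpec) bool {
	for _, out := range f.OutputPorts {
		if out == f.InPort {
			return true
		}
	}
	return false
}

func main() {
	f := FlowSpec{InPort: 1, OutputPorts: []int{1}}
	if isHairpin(f) {
		fmt.Println("Vetoed installation of hairpin flow", f)
	}
}
```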
From the Weave logs, this is a sign of something definitely wrong:
INFO: 2019/03/01 05:30:40.299239 overlay_switch ->[12:e6:b4:29:a8:6e(ocg2vfett)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/01 05:30:40.299244 overlay_switch ->[32:2b:7e:8b:3d:ed(ocg2bupgd)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/01 05:30:40.299474 overlay_switch ->[c6:29:85:f4:63:a9(ocbuild-g2smppi)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/01 05:30:40.302852 overlay_switch ->[42:77:f6:7f:39:1c(ocg2ogfsz)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/01 05:30:40.307102 overlay_switch ->[be:e2:0b:28:c4:be(oc-infra-1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/01 05:30:40.310343 overlay_switch ->[12:cc:f5:35:61:28(ocg2xpkdi)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/01 05:30:40.314442 overlay_switch ->[7e:ec:c5:cd:ee:bc(oc-infra-0)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/01 05:30:40.317610 overlay_switch ->[76:e3:6d:b1:9b:0a(ocbuild-g2ijddl)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/01 05:30:40.317910 overlay_switch ->[42:81:a1:34:2e:da(oc-master-1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/01 05:30:40.318902 overlay_switch ->[b2:2e:c7:e3:fe:80(oc-node-2)] sleeve timed out waiting for UDP heartbeat
This means that several heartbeats were missed. Yet those nodes continue to appear in later log messages.
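To illustrate what that log line implies: the sleeve overlay expects periodic UDP heartbeats from each peer, and flags the connection when none arrives within the timeout window. A simplified sketch of such a watchdog (just the general pattern, not Weave's actual code):

```go
// Simplified heartbeat watchdog: if no heartbeat arrives within the timeout
// window, report the peer as timed out, mirroring the log message above.
package main

import (
	"fmt"
	"time"
)

// watchHeartbeats waits for heartbeat signals on beats and complains whenever
// the gap between them exceeds timeout. It returns when beats is closed.
func watchHeartbeats(peer string, beats <-chan struct{}, timeout time.Duration) {
	for {
		select {
		case _, ok := <-beats:
			if !ok {
				return // connection closed
			}
			// Heartbeat arrived in time; keep waiting for the next one.
		case <-time.After(timeout):
			fmt.Printf("sleeve ->[%s] timed out waiting for UDP heartbeat\n", peer)
		}
	}
}

func main() {
	beats := make(chan struct{})
	go watchHeartbeats("ocg2vfett", beats, 500*time.Millisecond)
	beats <- struct{}{}                 // first heartbeat arrives in time
	time.Sleep(1200 * time.Millisecond) // then silence: the watchdog fires at least once
	close(beats)
}
```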
These connections seem to be working better (there is some overlap with the set above):
INFO: 2019/03/01 05:29:40.351383 sleeve ->[10.2.0.16:6783|82:b8:f1:6f:61:e6(ocg2kgxea)]: Effective MTU verified at 1438
INFO: 2019/03/01 05:29:40.353232 sleeve ->[10.2.0.7:6783|92:ba:d1:8d:5f:b4(ocg2bbyqo)]: Effective MTU verified at 1438
INFO: 2019/03/01 05:29:40.800847 sleeve ->[10.1.0.5:6783|96:d2:39:f2:ee:dd(oc-infra-2)]: Effective MTU verified at 1438
INFO: 2019/03/01 05:29:40.807056 sleeve ->[10.1.0.9:6783|1a:a4:db:93:9b:fb(oc-master-2)]: Effective MTU verified at 1438
INFO: 2019/03/01 05:29:40.811124 sleeve ->[10.2.0.6:6783|3a:0e:f0:a3:94:35(ocg2cshhl)]: Effective MTU verified at 1438
INFO: 2019/03/01 05:29:40.812061 sleeve ->[10.1.0.8:6783|52:cd:da:04:c9:04(oc-master-0)]: Effective MTU verified at 1438
INFO: 2019/03/01 05:30:40.819752 sleeve ->[10.1.0.7:6783|be:e2:0b:28:c4:be(oc-infra-1)]: Effective MTU verified at 1438
Is there any correlation between pods that work and don't work, and those two sets of nodes above?
I understand from Slack that your symptoms are intermittent, so it would be useful to know of any correlation between when the problem hit and the times in the logs.
What you expected to happen?
When a pod comes up, it should be able to hit any internal service consistently; for this issue, kubernetes.default.svc.
What happened?
I'm getting:
error: couldn't get deployment devicebroker-12: Get https://172.30.0.1:443/api/v1/namespaces/v14-rapyuta-core/replicationcontrollers/devicebroker-12: dial tcp 172.30.0.1:443: i/o timeout
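A minimal probe that shows the same failure from inside an affected pod (an illustrative sketch only; 172.30.0.1:443 is the kubernetes service address from the error above):

```go
// Connectivity probe (illustrative only): dial the kubernetes service IP from
// the error above with a short timeout, so a broken datapath shows up as
// "i/o timeout" instead of hanging indefinitely.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	conn, err := net.DialTimeout("tcp", "172.30.0.1:443", 5*time.Second)
	if err != nil {
		fmt.Println("dial failed:", err) // e.g. "dial tcp 172.30.0.1:443: i/o timeout"
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```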
How to reproduce it?
Happens with OpenShift deployer pods that bring up new pods when a deployment is rolled. It is a little sporadic but can be easily reproduced.
Anything else we need to know?
It looks like this was an issue in OpenShift SDN that was fixed by changing a flow: https://github.com/openshift/origin/issues/5796. The fix (?): https://github.com/openshift/openshift-sdn/pull/236/files. Since Weave also uses OVS, could it be related?
Network policies are enabled for some namespaces in the cluster.
Versions:
OpenShift 3.9 running on Microsoft Azure.
OpenShift version
oc v3.9.0+ba7faec-1
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Uname
Linux oc-master-0 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Weave
/home/weave # ./weave --local status
 PeerDiscovery: enabled
       Targets: 9
   Connections: 18 (17 established, 1 failed)
         Peers: 18 (with 306 established connections)
TrustedSubnets: none
 DefaultSubnet: 10.32.0.0/16
Docker
Client:
 Version:         1.13.1
 API version:     1.26
 Package version: docker-1.13.1-88.git07f3374.el7.centos.x86_64
 Go version:      go1.9.4
 Git commit:      07f3374/1.13.1
 Built:           Fri Dec 7 16:13:51 2018
 OS/Arch:         linux/amd64

Server:
 Version:         1.13.1
 API version:     1.26 (minimum version 1.12)
 Package version: docker-1.13.1-88.git07f3374.el7.centos.x86_64
 Go version:      go1.9.4
 Git commit:      07f3374/1.13.1
 Built:           Fri Dec 7 16:13:51 2018
 OS/Arch:         linux/amd64
 Experimental:    false
Logs:
There's a lot of
Vetoed installation of hairpin flow
messages. Complete logs for Weave are at https://gist.github.com/HackToHell/5ffc79ca73f0bbf83c7697857fb34395
Startup logs for kubelet at https://gist.github.com/HackToHell/97605144a51a5850c8828ef5f45cb745
Network:
Routes
default via 10.2.0.1 dev eth0 proto dhcp metric 100
10.2.0.0/16 dev eth0 proto kernel scope link src 10.2.0.8 metric 100
10.32.0.0/16 dev weave proto kernel scope link src 10.32.16.0
168.63.129.16 via 10.2.0.1 dev eth0 proto dhcp metric 100
169.254.169.254 via 10.2.0.1 dev eth0 proto dhcp metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
Addrs
1: lo    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0    inet 10.2.0.8/16 brd 10.2.255.255 scope global noprefixroute eth0
       valid_lft forever preferred_lft forever
3: docker0    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
6: weave    inet 10.32.16.0/16 brd 10.32.255.255 scope global weave
       valid_lft forever preferred_lft forever
IP Tables Save
https://gist.github.com/HackToHell/73a249b0ab9818703905d976d41ab262
Output of ip a: