weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

Kubectl works and all the pods are up but no traffic goes through #3703

Open avarf opened 5 years ago

avarf commented 5 years ago

What happened?

We have a small in-house cluster consisting of 5 nodes on which we run our platform. The platform consists of different components that communicate via HTTP or AMQP, both among themselves and with services outside the cluster.

Since yesterday, no traffic reaches the components and they have become unreachable even though they are up. There is no error, neither in our components nor in the k8s components (DNS, proxy, etc.), BUT I can access the cluster and the components via kubectl, and all kubectl commands work properly. What I mean is that I can run kubectl exec, kubectl logs, helm install, etc., but if I try to open the web page I get This site can’t be reached, and there are no logs in the nginx pod or in any of the k8s components, which means they never received the request and no traffic goes through.

The only error that I can see in weave net is:

ERRO: 2019/09/27 13:40:03.357182 Captured frame from MAC (d2:14:2a:47:62:d9) to (02:01:5b:b9:8e:fd) associated with another peer 4a:8d:75:d7:59:ff(serflex-argus-2)
INFO: 2019/09/27 13:42:31.455040 Discovered remote MAC 12:de:a9:b4:03:61 at 12:de:a9:b4:03:61(viscom-titan)
INFO: 2019/09/27 13:43:16.854651 Discovered remote MAC ee:18:75:33:59:da at 86:03:7c:d8:f9:1a(cutie)
INFO: 2019/09/27 13:43:56.282079 Discovered remote MAC d6:58:fd:ac:5b:bb at 02:01:5b:b9:8e:fd(serflex-giant-7)
ERRO: 2019/09/27 13:45:03.365062 Captured frame from MAC (d2:14:2a:47:62:d9) to (02:01:5b:b9:8e:fd) associated with another peer 4a:8d:75:d7:59:ff(serflex-argus-2)
ERRO: 2019/09/27 13:50:03.359428 Captured frame from MAC (d2:14:2a:47:62:d9) to (02:01:5b:b9:8e:fd) associated with another peer 4a:8d:75:d7:59:ff(serflex-argus-2)
ERRO: 2019/09/27 13:55:03.360561 Captured frame from MAC (d2:14:2a:47:62:d9) to (02:01:5b:b9:8e:fd) associated with another peer 4a:8d:75:d7:59:ff(serflex-argus-2)
INFO: 2019/09/27 13:57:08.694753 Discovered remote MAC f2:b4:10:14:a8:b1 at 4a:8d:75:d7:59:ff(serflex-argus-2)
INFO: 2019/09/27 13:57:40.078919 Discovered remote MAC e2:b0:06:4f:ad:92 at 4a:8d:75:d7:59:ff(serflex-argus-2)
INFO: 2019/09/27 14:00:00.031049 Discovered remote MAC 3e:75:38:73:e2:46 at 12:de:a9:b4:03:61(viscom-titan)
ERRO: 2019/09/27 14:00:03.359176 Captured frame from MAC (d2:14:2a:47:62:d9) to (02:01:5b:b9:8e:fd) associated with another peer 4a:8d:75:d7:59:ff(serflex-argus-2)
INFO: 2019/09/27 14:00:46.225141 Discovered remote MAC 1e:19:c5:b1:e0:8b at 86:03:7c:d8:f9:1a(cutie)
INFO: 2019/09/27 14:00:47.213079 Discovered remote MAC f2:49:83:44:49:ee at 86:03:7c:d8:f9:1a(cutie)
INFO: 2019/09/27 14:00:47.336703 Discovered remote MAC f6:b4:ad:a3:99:46 at 4a:8d:75:d7:59:ff(serflex-argus-2)
ERRO: 2019/09/27 14:05:03.362763 Captured frame from MAC (d2:14:2a:47:62:d9) to (02:01:5b:b9:8e:fd) associated with another peer 4a:8d:75:d7:59:ff(serflex-argus-2)
INFO: 2019/09/27 14:07:18.647069 Discovered remote MAC 9a:20:d9:58:df:98 at 86:03:7c:d8:f9:1a(cutie)
INFO: 2019/09/27 14:08:04.055021 Discovered remote MAC 76:91:51:4d:c8:f4 at 86:03:7c:d8:f9:1a(cutie)
INFO: 2019/09/27 14:08:04.055442 Discovered remote MAC 7e:e7:5e:42:f4:9c at 86:03:7c:d8:f9:1a(cutie)
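
The "Captured frame … associated with another peer" errors appear to mean that this node is seeing traffic from a MAC address it believes belongs to a different peer. A quick way to check whether the mesh itself looks healthy from the master node (a sketch, assuming the standard Weave Net DaemonSet whose pods carry the name=weave-net label; the pod name is a placeholder):

$ kubectl get pods -n kube-system -l name=weave-net -o wide
$ kubectl exec -n kube-system <weave-net-pod> -c weave -- /home/weave/weave --local status
$ kubectl exec -n kube-system <weave-net-pod> -c weave -- /home/weave/weave --local status connections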

How to reproduce it?

Because I don't know what caused this problem, I don't know how to reproduce it.

Versions:

Environment:

Network:

On K8s master:

$ ip route
default via 10.203.0.1 dev eno1 proto static 
10.32.0.0/12 dev weave proto kernel scope link src 10.32.0.1 
10.203.0.0/17 dev eno1 proto kernel scope link src 10.203.20.160 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown

$ ip -4 -o addr
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.160/17 brd 10.203.127.255 scope global eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.161/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.162/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.163/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.164/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.165/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.166/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.167/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.168/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.169/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.170/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.171/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.172/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.173/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.174/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.175/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.176/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.177/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.178/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.179/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.180/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.181/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.182/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.183/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.184/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.185/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.186/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.187/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.188/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
6: eno1    inet 10.203.20.189/17 brd 10.203.127.255 scope global secondary eno1\       valid_lft forever preferred_lft forever
8: docker0    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\       valid_lft forever preferred_lft forever
11: weave    inet 10.32.0.1/12 brd 10.47.255.255 scope global weave\       valid_lft forever preferred_lft forever
22: flannel.1    inet 10.1.56.0/32 scope global flannel.1\       valid_lft forever preferred_lft forever
bboreham commented 5 years ago

The only error that I can see in weave net is:

Please don't do this. Post the whole log, or at least the first 50KB.
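
For example, something along these lines should do it (a sketch; substitute the actual weave-net pod name on the affected node):

$ kubectl logs -n kube-system <weave-net-pod> -c weave > weave-full.log
$ head -c 50000 weave-full.log > weave-first-50KB.log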

bboreham commented 5 years ago

22: flannel.1 inet 10.1.56.0/32 scope global flannel.1\ valid_lft forever preferred_lft forever

You are running Flannel at the same time as Weave Net?

Can you post the kubelet log please?
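
If the kubelet runs under systemd (an assumption about your setup), something like this would capture it:

$ journalctl -u kubelet --no-pager --since "2019-09-26" > kubelet.log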

avarf commented 5 years ago

After running the command you suggested (ip -4 -o addr) I saw the flannel interface myself. No, we are not running Flannel and have never used it; I have to find out where that interface came from.
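
A few commands that can help track down the stray flannel.1 interface (a sketch; only delete the interface once it is confirmed to be a leftover and Flannel is genuinely not in use):

$ ip -d link show flannel.1        # details of the VXLAN device
$ ls /etc/cni/net.d/               # any leftover Flannel CNI config on the node?
$ kubectl get ds -n kube-system    # any kube-flannel DaemonSet still deployed?
$ sudo ip link delete flannel.1    # only if confirmed to be a leftover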

Please find the logs for the weave-net pod of the master node attached (kubectl logs -n kube-system weave-net-8mdp5 -c weave > k8s-master-weave.log)

k8s-master-weave.log

avarf commented 5 years ago

I deleted Weave Net and installed it again with the command below, but I am still facing the same problem:

kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
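
Note: re-applying the manifest may not fully reset Weave Net's state. As far as I know, peer and IPAM data is persisted on each node under a hostPath (/var/lib/weave in the standard manifest; this path is an assumption worth checking against the YAML). A rough sketch of a full reset, only if a clean reinstall is really intended:

$ kubectl delete -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
# on every node, once the weave-net pods are gone (assumed hostPath, see above):
$ sudo rm -rf /var/lib/weave
# then re-apply the manifest as before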

bboreham commented 5 years ago

INFO: 2019/09/10 13:27:05.423705 ->[10.203.0.18:45065|86:03:7c:d8:f9:1a(cutie)]: connection shutting down due to error: Received update for IP range I own at 10.32.0.0 v3: incoming message says owner 02:01:5b:b9:8e:fd v14

This indicates an inconsistency in the data used by Weave Net. However the message does not repeat after this time, so maybe the source went away.
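
If the inconsistency is still present, the IPAM view of each peer can be inspected from inside the weave container, for example (a sketch; the pod name is a placeholder, one per node):

$ kubectl exec -n kube-system <weave-net-pod> -c weave -- /home/weave/weave --local status ipam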

Let's try to break the problem down a little:

Since yesterday

You mean it was working fine before with Weave Net? What changed on the 26th? Nothing much changes in your log after the 26th.

if I try to open the web page I get This site can’t be reached

What webpage? From where? Can you try curl -v to the webpage address on the host where your webserver pod runs and post the response here?

Can you reach one pod from another inside the cluster, using curl? By pod IP? By service IP? By DNS name?
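
Something along these lines would cover those cases (a sketch; the namespace, pod, and service names are placeholders from your setup):

# find pod and service IPs
$ kubectl get pods -n <namespace> -o wide
$ kubectl get svc -n <namespace>
# pod -> pod, by pod IP
$ kubectl exec -n <namespace> <client-pod> -- curl -sv http://<other-pod-ip>:80
# pod -> service, by cluster IP
$ kubectl exec -n <namespace> <client-pod> -- curl -sv http://<service-cluster-ip>:80
# pod -> service, by DNS name
$ kubectl exec -n <namespace> <client-pod> -- curl -sv http://<service-name>.<namespace>.svc.cluster.local:80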

avarf commented 5 years ago

Yes, everything was working properly, but late on the 26th and early on the 27th we had some internal networking problems with the company DNS, plus some other minor issues.

We are running a platform consisting of different components. We have a reverse proxy based on Nginx that forwards requests to another component that hosts the static HTML pages, and this is the easiest way to see whether a request arrived or not. When traffic is not going through, I see no logs in Nginx at all and curl just hangs:

curl -v -k http://10.203.20.164
* Rebuilt URL to: http://10.203.20.164/
*   Trying 10.203.20.164...
* TCP_NODELAY set
* Connected to 10.203.20.164 (10.203.20.164) port 80 (#0)
> GET / HTTP/1.1
> Host: 10.203.20.164
> User-Agent: curl/7.58.0
> Accept: */*
> 

But when traffic does go through, I can see the static HTML in the curl response and there are also info logs in our Nginx.

I ran a small test: I executed the command below 100 times from inside a pod and it was 100% successful:

kubectl exec -n 164 -ti gateway-7cf68998b6-phwkf -- curl -v -k http://proxy:80

proxy is the Nginx reverse proxy I mentioned. When I tried to access it from outside at the same time the test was running, I could not, and I got a similar result: This site can’t be reached. 10.203.20.164 took too long to respond.
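
One way to narrow down where the external request is being dropped is to capture on the master's external interface while reproducing the timeout (a sketch; eno1 and the virtual IP are taken from the ip output earlier in the issue):

$ sudo tcpdump -ni eno1 host 10.203.20.164 and tcp port 80

If the incoming SYN packets show up here but nothing ever reaches Nginx, the drop is happening on the node between the host network and the pod network (iptables/kube-proxy/Weave); if they do not show up at all, the problem is upstream of the node.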

bboreham commented 5 years ago

What sort of address is 10.203.20.164? (e.g. host IP, cluster (virtual) IP, pod IP)

avarf commented 5 years ago

That is a virtual IP. We defined 40 virtual IPs so that we can use each of them for one namespace; they are in the same range as our physical machines' IPs and are defined on our K8s master node.
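
For reference, these are the "secondary" addresses visible on eno1 in the ip -4 -o addr output above. If they were added by hand, it would have been with something like the following (an assumption about how they are configured; keepalived or similar tooling may be managing them instead):

$ sudo ip addr add 10.203.20.164/17 dev eno1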