Closed: cpressland closed this issue 6 years ago.
I launched an Ubuntu pod and put curl in a while true loop: while true; do curl my360-jeff:8080; done
Just to narrow down the issue (taking kube-proxy out of the data path), would you mind trying the same tests directly from pod to pod, i.e. without accessing the my360-jeff service and instead accessing the pod IP directly?
Thanks @murali-reddy - unfortunately we see the same thing:
root@bash-7df6778f79-574dm:/# while true; do curl 10.43.0.5:8080; done
"echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo""echo"curl: (56) Recv failure: Connection reset by peer"echo""echo"
Sorry, you may need to scroll a bit to see the curl: (56) Recv failure: Connection reset by peer
Thanks @cpressland for confirming. It does look like an issue with Weave. I will try and see if I can reproduce it. If your Rust application's Docker image is available publicly, could you please share it? It could be handy.
It's worth checking whether the number of Captured <..> errors is equal to TX-DRP (as we drop a packet when the error occurs) and whether the errors are logged due to https://github.com/weaveworks/weave/issues/2877.
@murali-reddy you can find the image in question here. Please excuse any bad behaviour - it was a very quick hack!
I am not able to reproduce this issue. Are you able to easily replicate this scenario on any other cluster?
@murali-reddy - sorry for the delayed response.
I have confirmed that this happens in all three of our Kubernetes Clusters, which isn't all that surprising given that they're in the same Azure region, share the same Terraform deployment modules and are configured with the same Cookbooks. However, I did note that it does not happen if the curl pod and the jeff pod are on the same node, which I suppose makes sense as we should be hitting iptables before weave, right?
Is there any specific information I can provide to help troubleshoot this further? I'm on holiday for the next week so @BackwardSpy should be able to help with anything but I'll try to answer any questions as and when I see them.
@BackwardSpy - on the off chance this is some weirdness in Azure's network, can you load Jeff onto some new VMs in a new VNet and try again? Probably worth taking Docker out of the mix too and just running the binary on the system. If you are able to replicate it, can you try in a different region too?
However, I did note that it does not happen if the curl pod and the jeff pod are on the same node, which I suppose makes sense as we should be hitting iptables before weave, right?
Right, packets might be getting bridged locally without going through the datapath set up by Weave.
Is there any specific information I can provide to help troubleshoot this further?
Could you please share the details for the question @brb asked, basically whether the number of packet drops seen correlates with the errors:
ERRO: 2018/07/24 14:36:20.139952 Captured frame from MAC (2e:2e:c5:80:80:c6) to (12:61:a3:9e:b0:b0) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)
ERRO: 2018/07/24 14:36:20.136592 Captured frame from MAC (2e:2e:c5:80:80:c6) to (ae:d5:37:9e:f4:77) associated with another peer 66:d9:25:58:38:7c(prod-worker-09)
@cpressland I just tested it on a fresh VM in a fresh VNet in Azure, both with and without Docker, and I could not replicate the connection issue at all.
@murali-reddy / @brb - I’ll get those stats for you on Monday when I’m back in the office.
@BackwardSpy - thanks for this. I guess this proves there isn’t an issue with the underlay Azure Network. Can you task somebody with spinning up a cluster with kubeadm tomorrow to repeat the above test with Weave and Kubernetes?
Finally had time to further test this. We've spun up a kubeadm Kubernetes Cluster in Azure with the following configuration (thankfully this is very easy to reproduce in Azure):
1 x Virtual Network with 2 Subnets
1 x Kubernetes Master in Subnet 1
2 x Kubernetes Nodes in Subnet 2
The Kubernetes Master is configured with the latest Docker version from https://get.docker.com/ followed by kubeadm init; Weave is installed with kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')". The Kubernetes Nodes simply have the latest Docker installed and the kubeadm join command run on them.
All three servers are configured with the default Canonical Ubuntu 16.04 image in Azure.
The "Jeff" container is deployed with:
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: jeff
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: jeff
    spec:
      containers:
        - name: jeff
          image: backwardspy/jeff
          ports:
            - containerPort: 8080
Once deployed I can see that Jeff is running on node 02
laadmin@weave-controller:~$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
jeff-6c66c6855b-s75z2 1/1 Running 0 9m 10.38.0.1 weave-node-02
From weave-node-01 I can run while true; do curl 10.38.0.1:8080; done and see instances of curl: (56) Recv failure: Connection reset by peer; however, I do NOT observe the TX-DRP number increasing on any interface on weave-node-02:
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
datapath 1376 0 1241 0 0 0 160 0 0 0 BMRU
docker0 1500 0 0 0 0 0 0 0 0 0 BMU
eth0 1500 0 5566011 0 0 0 5293097 0 0 0 BMRU
lo 65536 0 534327 0 0 0 534327 0 0 0 LRU
vethwe-bridge 1376 0 21050 0 0 0 24875 0 0 0 BMRU
vethwe-datapath 1376 0 24875 0 0 0 21050 0 0 0 BMRU
vethwepl1a4027b 1376 0 24672 0 0 0 19826 0 0 0 BMRU
vxlan-6784 65535 0 127498 0 0 0 132281 0 0 0 BMRU
weave 1376 0 1240 0 0 0 161 0 0 0 BMRU
Given that this is a cluster outside of any of our organisational assets, if either @brb or @murali-reddy want to provide me with a public SSH key, I'd be happy to give you access to help with any debugging efforts.
@cpressland So I tried on AWS, and I am able to reproduce the exact issue: (56) Recv failure: Connection reset by peer. Strangely, on my local cluster I am not able to reproduce it. I will try to find the root cause now that I can reproduce it easily.
I tried to find any possible issue on the Weave side that could be causing the reported problem. While I am able to reproduce the connection failure with the image provided, I don't see any issues on the Weave side. All the OVS flow rules are configured properly and in place, so there is no reason to believe packets are getting dropped in the datapath set up by Weave. I have not seen any errors in the Weave logs either.
So I did while true; do curl --trace-ascii dump.txt 100.120.0.3:8080 || break; done
== Info: Hostname was NOT found in DNS cache
== Info: Trying 100.120.0.3...
== Info: Connected to 100.120.0.3 (100.120.0.3) port 8080 (#0)
=> Send header, 80 bytes (0x50)
0000: GET / HTTP/1.1
0010: User-Agent: curl/7.38.0
0029: Host: 100.120.0.3:8080
0041: Accept: */*
004e:
== Info: Recv failure: Connection reset by peer
== Info: Closing connection 0
Clearly this is a TCP connection reset from the peer, which is the pod in this case.
Again, to check that there are no network issues: while true; do ping -n -i 0.001 100.120.0.3; done runs fine.
I wonder if the application is not able to handle the load you are sending.
// accept connections and process them serially
Can you please make the processing concurrent and see if you still see the issue?
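Something along these lines, as a rough sketch only: it assumes the server is built on std::net::TcpListener with a fixed fake-HTTP response (the actual jeff implementation may differ), and simply hands each accepted connection to its own thread instead of processing it inline in the accept loop.
// Rough sketch: thread-per-connection instead of serial handling.
// Assumes std::net::TcpListener and a fixed fake-HTTP "echo" response;
// the real jeff binary may be structured differently.
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

fn handle(mut stream: TcpStream) {
    // Read (and ignore) the request, then reply with a fixed body.
    let mut buf = [0u8; 1024];
    let _ = stream.read(&mut buf);
    let body = "\"echo\"";
    let response = format!(
        "HTTP/1.1 200 OK\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        body.len(),
        body
    );
    let _ = stream.write_all(response.as_bytes());
}

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:8080")?;
    for stream in listener.incoming() {
        match stream {
            // Spawning a thread per connection keeps the accept loop free,
            // so a slow client can no longer hold up new connections.
            Ok(stream) => {
                thread::spawn(move || handle(stream));
            }
            Err(e) => eprintln!("accept failed: {}", e),
        }
    }
    Ok(())
}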
I'll close this off as we've been able to replicate the issue with Flannel; it's clearly not an issue with Weave, but it's really interesting that this only happens in Kubernetes environments on cloud providers. My local Kubernetes clusters and Docker environments never experience this issue regardless of the speed at which I throw requests at them. I guess our journey continues. Thanks for the assist!
@cpressland Thanks for the update.
Just FYI, we are seeing this issue with weave-2.5.0:
vxlan-6784 65485 0 786351303 0 0 0 416236109 0 39 0 BMRU
vxlan-6784 65485 0 585180470 0 0 0 274931757 0 52 0 BMRU
vxlan-6784 65485 0 771039758 0 0 0 382114976 0 133 0 BMRU
vxlan-6784 65485 0 7091928266 0 0 0 7466642540 0 11771 0 BMRU
vxlan-6784 65485 0 4782238458 0 0 0 4434919932 0 16847 0 BMRU
Should this cause a bunch of packet retransmissions?
@Hashfyre Please open a new issue.
The issue reported in this bug has nothing to do with Weave.
After seeking advice on Slack, @brb advised that I log this as an issue.
We're seeing a small amount of intermittent packet loss in our production environment, so @backwardspy wrote a small Rust application which pretends to speak HTTP in order to further troubleshoot the issue.
Running the above locally allows us to make approximately 800 connections per second using a Python client, so it's reasonably performant. Once we deployed this into our Kubernetes cluster, I launched an Ubuntu pod and put curl in a while true loop:
while true; do curl my360-jeff:8080; done
After approximately 350 connections we see instances of curl: (56) Recv failure: Connection reset by peer every now and then.
On Slack I was advised to look at the Weave logs, which showed instances of the following (about 100 per hour):
Additionally, netstat -i shows some TX-DRP on the vxlan-6784 interface on some nodes:
What you expected to happen?
Connections arrive at the correct pod
What happened?
curl produces
curl: (56) Recv failure: Connection reset by peer
How to reproduce it?
For us, just sending lots of traffic over the network shows the issue quite well.
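If it helps, here is a rough Rust equivalent of the curl loop above (a sketch only; it assumes the my360-jeff service name and port 8080 mentioned earlier, and just hammers the service with minimal HTTP requests):
// Rough sketch of the reproduction loop, equivalent to
// `while true; do curl my360-jeff:8080; done`.
// The target address is the in-cluster service name used in this report.
use std::io::{Read, Write};
use std::net::TcpStream;

fn main() {
    let target = "my360-jeff:8080"; // assumed service name and port from above
    loop {
        match TcpStream::connect(target) {
            Ok(mut stream) => {
                // Send a minimal HTTP/1.1 request and read whatever comes back.
                let req = "GET / HTTP/1.1\r\nHost: my360-jeff\r\nConnection: close\r\n\r\n";
                if stream.write_all(req.as_bytes()).is_err() {
                    eprintln!("send failed");
                    continue;
                }
                let mut body = Vec::new();
                if let Err(e) = stream.read_to_end(&mut body) {
                    // The intermittent "Connection reset by peer" shows up here.
                    eprintln!("recv failed: {}", e);
                }
            }
            Err(e) => eprintln!("connect failed: {}", e),
        }
    }
}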
Anything else we need to know?
Versions:
Weave:
Docker:
uname -a:
kubectl:
Logs:
Weave Logs: weave-logs-1-hour.xlsx
Nothing interesting or concerning in the kubelet or docker logs.
Network:
ip route (worker-06):
ip -4 -o addr
sudo iptables-save: https://gist.github.com/cpressland/9745d9e2fd9547c06e84b6ba11aede5e