weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

weave prometheus metrics gatherer reports false blocked connections #3339

Open mmerrill3 opened 6 years ago

mmerrill3 commented 6 years ago

What you expected to happen?

We are running CronJobs in Kubernetes that spin up quickly and access existing services in the cluster. We have network policies enabled which allow access from the client pods (the jobs) to the services in the same namespace. We expect the clients to reach the services correctly and weave to report no blocked connections.

What happened?

We do not observe any blocked connections from the application's point of view, but the metrics.go Prometheus handler reports blocked connections. This appears to be a timing issue: weave-npc does add the cronjob pod's IP address to the whitelist for the service, but only after the pod has already sent its initial TCP SYN, which gets blocked. Because that SYN is never ACK'd, the OS retransmits it after a backoff delay and a retry eventually succeeds. So the client job actually works, but the metric still reports blocked connections.

How to reproduce it?

We can easily reproduce this with a tiny tools job; running it causes a Prometheus scrape to report a blocked-connection metric. Any small OS image that has curl can be used to duplicate the issue. Below are the job, the network policy, and the service.

apiVersion: batch/v1
kind: Job
metadata:
  name: curl-with-timeout
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 100
  template:
    metadata:
      labels:
        chime-gateway-batch-client: "true"
    spec:
      containers:
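The manifest above is truncated in the report (the containers list is cut off). Purely for illustration, a minimal self-contained job along the same lines might look like the sketch below; the image and command are assumptions, and the target URL simply points at the chime-gateway-batch service on port 8081 from the policy and service shown next.

# Hypothetical repro job, not the original spec: image and command are assumed.
apiVersion: batch/v1
kind: Job
metadata:
  name: curl-with-timeout
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 100
  template:
    metadata:
      labels:
        chime-gateway-batch-client: "true"
    spec:
      restartPolicy: Never
      containers:
      - name: curl
        image: curlimages/curl:latest   # any small image with curl will do
        command: ["curl", "-sS", "--max-time", "10", "http://chime-gateway-batch:8081/"]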

The network policy:

apiVersion: extensions/v1beta1
kind: NetworkPolicy
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"extensions/v1beta1","kind":"NetworkPolicy","metadata":{"annotations":{},"name":"chime-gateway-batch","namespace":"chime"},"spec":{"ingress":[{"ports":[{"port":8081,"protocol":"TCP"}]}],"podSelector":{"matchLabels":{"k8s-app":"chime-gateway-batch"}},"policyTypes":["Ingress"]}}
  creationTimestamp: null
  generation: 1
  name: chime-gateway-batch
  selfLink: /apis/extensions/v1beta1/namespaces/chime/networkpolicies/chime-gateway-batch
spec:
  ingress:

The service:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    k8s-app: chime-gateway-batch
    prometheus-app: chime-gateway-batch
  name: chime-gateway-batch
  selfLink: /api/v1/namespaces/chime/services/chime-gateway-batch
spec:
  ports:

Anything else we need to know?

Running k8s in AWS using kops

Versions:

$ weave version
weave-npc 2.3.0
$ docker version
$ uname -a
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-05-12T04:12:47Z", GoVersion:"go1.9.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.8", GitCommit:"c138b85178156011dc934c2c9f4837476876fb07", GitTreeState:"clean", BuildDate:"2018-05-21T18:53:18Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Logs:

$ docker logs weave

or, if using Kubernetes:

Here are the logs, with some additional debug output I added: when a blocked connection is detected, the entire packet layers are printed out. You can see the job start and its IP get added to the ipsets, yet the blocked-connections metric is incremented even though the TCP connection eventually does get ACK'd.

DEBU: 2018/07/02 14:15:04.808067 EVENT UpdatePod {"metadata":{"creationTimestamp":"2018-07-02T14:15:02Z","generateName":"chime-gateway-dtlap-1530540900-","labels":{"chime-gateway-batch-client":"true","controller-uid":"4f6108e4-7e02-11e8-a32f-0628c82b87a2","job-name":"chime-gateway-dtlap-1530540900","k8s-app":"chime-job"},"name":"chime-gateway-dtlap-1530540900-hr9dn","namespace":"dev-chime","resourceVersion":"162022719","selfLink":"/api/v1/namespaces/dev-chime/pods/chime-gateway-dtlap-1530540900-hr9dn","uid":"4f639489-7e02-11e8-a32f-0628c82b87a2"},"spec":{"containers":[{"image":"080385600816.dkr.ecr.us-east-1.amazonaws.com/batch-job-api-caller:0.0.4","imagePullPolicy":"IfNotPresent","name":"chime-gateway-dtlap-batch-job","terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File"}],"dnsPolicy":"ClusterFirst","nodeName":"ip-10-203-19-88.ec2.internal","restartPolicy":"Never","schedulerName":"default-scheduler","securityContext":{"runAsNonRoot":true},"serviceAccount":"default","serviceAccountName":"default","terminationGracePeriodSeconds":30},"status":{"conditions":[{"lastProbeTime":null,"lastTransitionTime":"2018-07-02T14:15:02Z","status":"True","type":"Initialized"},{"lastProbeTime":null,"lastTransitionTime":"2018-07-02T14:15:02Z","message":"containers with unready status: [chime-gateway-dtlap-batch-job]","reason":"ContainersNotReady","status":"False","type":"Ready"},{"lastProbeTime":null,"lastTransitionTime":"2018-07-02T14:15:02Z","status":"True","type":"PodScheduled"}],"hostIP":"10.203.19.88","phase":"Pending","qosClass":"Guaranteed","startTime":"2018-07-02T14:15:02Z"}} {"metadata":{"creationTimestamp":"2018-07-02T14:15:02Z","generateName":"chime-gateway-dtlap-1530540900-","labels":{"chime-gateway-batch-client":"true","controller-uid":"4f6108e4-7e02-11e8-a32f-0628c82b87a2","job-name":"chime-gateway-dtlap-1530540900","k8s-app":"chime-job"},"name":"chime-gateway-dtlap-1530540900-hr9dn","namespace":"dev-chime","resourceVersion":"162022739","selfLink":"/api/v1/namespaces/dev-chime/pods/chime-gateway-dtlap-1530540900-hr9dn","uid":"4f639489-7e02-11e8-a32f-0628c82b87a2"},"spec":{"containers":[{"image":"080385600816.dkr.ecr.us-east-1.amazonaws.com/batch-job-api-caller:0.0.4","imagePullPolicy":"IfNotPresent","name":"chime-gateway-dtlap-batch-job","terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File"}],"dnsPolicy":"ClusterFirst","nodeName":"ip-10-203-19-88.ec2.internal","restartPolicy":"Never","schedulerName":"default-scheduler","securityContext":{"runAsNonRoot":true},"serviceAccount":"default","serviceAccountName":"default","terminationGracePeriodSeconds":30},"status":{"conditions":[{"lastProbeTime":null,"lastTransitionTime":"2018-07-02T14:15:02Z","status":"True","type":"Initialized"},{"lastProbeTime":null,"lastTransitionTime":"2018-07-02T14:15:04Z","status":"True","type":"Ready"},{"lastProbeTime":null,"lastTransitionTime":"2018-07-02T14:15:02Z","status":"True","type":"PodScheduled"}],"hostIP":"10.203.19.88","phase":"Running","podIP":"100.108.0.41","qosClass":"Guaranteed","startTime":"2018-07-02T14:15:02Z"}}
INFO: 2018/07/02 14:15:04.808940 adding entry 100.108.0.41 to weave-+EuyUS0gs%UMj^52y1Kx]BQi2 of 4f639489-7e02-11e8-a32f-0628c82b87a2
INFO: 2018/07/02 14:15:04.809007 added entry 100.108.0.41 to weave-+EuyUS0gs%UMj^52y1Kx]BQi2 of 4f639489-7e02-11e8-a32f-0628c82b87a2
INFO: 2018/07/02 14:15:04.810147 adding entry 100.108.0.41 to weave-B22E#]OQ{04fi6D8q@:V%eZQO of 4f639489-7e02-11e8-a32f-0628c82b87a2
INFO: 2018/07/02 14:15:04.810315 added entry 100.108.0.41 to weave-B22E#]OQ{04fi6D8q@:V%eZQO of 4f639489-7e02-11e8-a32f-0628c82b87a2
WARN: 2018/07/02 14:15:05.623377 raw packet PACKET: 60 bytes
- Layer 1 (20 bytes) = IPv4     {Contents=[..20..] Payload=[..40..] Version=4 IHL=5 TOS=0 Length=60 Id=29202 Flags=DF FragOffset=0 TTL=64 Protocol=TCP Checksum=65439 SrcIP=100.108.0.41 DstIP=100.115.0.2 Options=[] Padding=[]}
- Layer 2 (40 bytes) = TCP      {Contents=[..40..] Payload=[] SrcPort=36778 DstPort=8081(sunproxyadmin) Seq=2134678169 Ack=0 DataOffset=10 FIN=false SYN=true RST=false PSH=false ACK=false URG=false ECE=false CWR=false NS=false Window=26720 Checksum=25059 Urgent=0 Options=[..5..] Padding=[]}

WARN: 2018/07/02 14:15:05.624474 TCP connection from 100.108.0.41:36778 to 100.115.0.2:8081 blocked by Weave NPC.

Network:

$ ip route
$ ip -4 -o addr
$ sudo iptables-save
mmerrill3 commented 6 years ago

The initial workaround for now is to have an init container in our cronjob that sleeps for a few seconds before the main container for the cron job runs.
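For reference, a sketch of that workaround is below. The init container name, image, and sleep duration are assumptions; the delay only shrinks the race window between pod start and weave-npc programming its rules, it does not remove it.

# Sketch only: an init container that sleeps before the job's main container
# starts, giving weave-npc time to add the pod IP to its ipsets/iptables rules.
# Names, images, and the duration are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: curl-with-timeout
spec:
  template:
    spec:
      restartPolicy: Never
      initContainers:
      - name: wait-for-npc
        image: busybox:1.36
        command: ["sleep", "5"]
      containers:
      - name: curl
        image: curlimages/curl:latest
        command: ["curl", "-sS", "http://chime-gateway-batch:8081/"]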

bboreham commented 6 years ago

Blocked connections are logged in the weave-npc container logs.

You're right that there is a timing window between when the pod is set up and when weave-npc opens up the iptables rules. Given that knowledge, I'm not sure why you call them "false" - they may not be important, but the connections almost certainly have been blocked.

The "PodReady++" feature of Kubernetes is intended for this kind of situation, to signal when these async tasks are all completed.

mmerrill3 commented 6 years ago

Hi @bboreham, thanks for the quick response. I only say false since from the client pod's (the job) perspective, everything works OK when talking to the service. Something does get blocked though, I agree (the initial SYN). So, the alert is working. But if you are just interested in whether your applications are being affected, you would consider this a false alarm. From a network perspective, it's a definite positive alarm.
Wouldn't the SYN be ACK'd eventually though, through retransmission of the SYN? For instance, during the next scrape, the ACK value would be true as opposed to when it was false from the previous scrape. It is tolerable to have an artificial delay built into our cronjob deployments, so we are not blocked in any way. I wanted to highlight this edge case for the folks at weave.

bboreham commented 6 years ago

from the client pod's (the job) perspective, everything works OK when talking to the service

Right, as you say, the SYN is retransmitted until it gets through.

For instance, during the next scrape, the ACK value would be true as opposed to when it was false from the previous scrape

Are you suggesting we match up block reports to subsequent successful connections and remove the blocked ones from the count? That's an interesting idea, albeit quite hard to implement. All the TCP connection stuff, and the packet filtering, is happening inside the kernel.

Another approach would be to delay the return from the CNI ADD operation until the network policy setup is complete, so your code won't start running inside the pod until after that point. Currently the network daemon and NPC work quite independently.

I wanted to highlight this edge case for the folks at weave.

Thanks; it's always useful to hear reports from the field.

Pending any software changes, I think it's worth putting a note in the docs that one or two counts per pod may come from this timing window and to set thresholds accordingly.
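As an illustration of setting thresholds accordingly, an alert can require a sustained rate rather than firing on a single count. This is a sketch only: it assumes the counter weave-npc exposes is named weavenpc_blocked_connections_total (check the /metrics endpoint for the actual name), and the threshold and windows are arbitrary.

# Sketch of a PrometheusRule that tolerates the one or two blocked SYNs seen
# around pod startup. Metric name, threshold, and windows are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: weave-npc-blocked-connections
spec:
  groups:
  - name: weave-npc
    rules:
    - alert: WeaveNPCSustainedBlockedConnections
      expr: sum(rate(weavenpc_blocked_connections_total[10m])) > 0.1
      for: 15m
      annotations:
        summary: Sustained blocked connections reported by weave-npc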

mmerrill3 commented 6 years ago

Thanks @bboreham, I just double-checked and realized that as packets enter ulogd they are consumed by the goroutine in metrics.go, so the metrics check does not parse the ulogd pcap file from the beginning each time. Yeah, matching blocked connections up with later successes would be painful to implement and resource intensive. Thanks for your suggestions. It's plain to see that having the pod wait until the network policy is implemented is a much better approach.

kppullin commented 5 years ago

I've run into this, or a very similar, issue. We're using the Apache Airflow workflow engine, which runs each "task" of a workflow in a cron-like manner.

Upon startup of a new task/job, Airflow immediately opens a database connection (using Python + SQLAlchemy). We see these connections occasionally fail with DNS resolution errors ("Temporary failure in name resolution") and can correlate them with "UDP connection from 100.122.0.4:51308 to 100.126.0.5:53 blocked by Weave NPC" messages in the weave logs.

When Airflow is unable to establish this database connection the process exits, the pod terminates, and the task fails.

To work around this we've added a command that sleeps for 15 seconds, which works most of the time, but failures still tend to pop up under high load. Depending on how the cron schedules align, we launch upwards of 8 pods concurrently.
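A sketch of that kind of delay is below; the image and the wrapped command are placeholders rather than the actual Airflow task spec, and as noted it only reduces the window rather than closing it.

# Sketch only: prefix the task command with a sleep so the first DNS lookup
# happens after weave-npc has (usually) programmed rules for the new pod IP.
# Image and the wrapped command are placeholders.
containers:
- name: airflow-task
  image: apache/airflow:2.7.3
  command: ["bash", "-c", "sleep 15 && exec airflow version"]  # replace 'airflow version' with the real task command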

Note that there are no NetworkPolicies defined in this configuration (perhaps there is an implicit/default policy?).

murali-reddy commented 5 years ago

@kppullin Please see the comment https://github.com/weaveworks/weave/issues/3464#issuecomment-443730137

Yes, there is a race between the application running in the pod and the network policy controller permitting egress traffic from the pod.

mmerrill3 commented 4 years ago

It's been a while, but coming back to this one. Ready++ will work for pods that make up a service. But, let's say I have a pod that is a job, and performs just one function. It's not part of a service. When the pod starts, it does its job, and exits. Ready++ doesn't capture this "job" use case.

I believe there needs to be a runningGate feature for k8s pods, much like Ready++, where conditions must be met before the pod's docker process is started - somewhere between the pod getting its IP and the docker process starting. That way, weave's npc pod can receive the event that a pod has been added and the network policies can converge; then the npc can update the condition on the pod to satisfy the runningGate.

This doesn't exist in k8s today, to my knowledge.
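Purely to make the proposal concrete, a hypothetical pod spec might look like the sketch below, by analogy with the existing readinessGates field. Neither the runningGates field nor the condition type exists in Kubernetes; the idea is that the kubelet would hold off starting the containers until a controller such as weave-npc sets the condition to True.

# Hypothetical only: runningGates and this condition type do not exist in
# Kubernetes. Shown just to illustrate the proposal above.
apiVersion: v1
kind: Pod
metadata:
  name: example-job-pod
spec:
  runningGates:                                    # hypothetical field
  - conditionType: "weave.works/policy-converged"  # hypothetical condition
  containers:
  - name: job
    image: curlimages/curl:latest
    command: ["curl", "-sS", "http://chime-gateway-batch:8081/"]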

I think every CNI implementation that relies upon receiving addPod events from the API Server to converge the overlay networks would have this same issue. It's not particular to weave.