mmerrill3 opened 6 years ago
Our initial workaround is to have an init container in the cronjob sleep for a few seconds before the main container for the cron job runs.
The blocked connections are logged in the weave-npc container logs.
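For illustration, a minimal sketch of that workaround, assuming a CronJob on the batch/v1beta1 API and a generic busybox image; the names, image, schedule, and 5-second delay are placeholders rather than our actual manifest:

```yaml
apiVersion: batch/v1beta1   # batch/v1 in newer clusters
kind: CronJob
metadata:
  name: example-cronjob
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          # Delay the main container so weave-npc has a chance to converge
          # the network policy for this pod's IP before any traffic is sent.
          initContainers:
          - name: wait-for-npc
            image: busybox
            command: ["sh", "-c", "sleep 5"]
          containers:
          - name: main
            image: busybox
            command: ["sh", "-c", "echo 'main workload runs here'"]
```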
You're right, there is a timing window between when the pod is set up and when weave-npc opens up the iptables rules. Given that knowledge, I'm not sure why you call them "false" - they may not be important, but the connections almost certainly have been blocked.
The "PodReady++" feature of Kubernetes is intended for this kind of situation, to signal when these async tasks are all completed.
Hi @bboreham, thanks for the quick response. I only say "false" because, from the client pod's (the job's) perspective, everything works OK when talking to the service. Something does get blocked, though, I agree (the initial SYN), so the alert is working. But if you are just interested in whether your applications are being affected, you would consider this a false alarm. From a network perspective, it's a definite positive alarm.
Wouldn't the SYN be ACK'd eventually, though, through retransmission of the SYN? For instance, during the next scrape the ACK value would be true, as opposed to false in the previous scrape.
It is tolerable to have an artificial delay built into our cronjob deployments, so we are not blocked in any way. I wanted to highlight this edge case for the folks at weave.
> from the client pod's (the job's) perspective, everything works OK when talking to the service
Right, as you say, the SYN is retransmitted until it gets through.
> For instance, during the next scrape the ACK value would be true, as opposed to false in the previous scrape
Are you suggesting we match up block reports to subsequent successful connections and remove the blocked ones from the count? That's an interesting idea, albeit quite hard to implement. All the TCP connection stuff, and the packet filtering, is happening inside the kernel.
Another approach would be to delay the return from the CNI ADD operation until the network policy setup is complete, so your code won't start running inside the pod until after that point. Currently the network daemon and NPC work quite independently.
> I wanted to highlight this edge case for the folks at weave.
Thanks; it's always useful to hear reports from the field.
Pending any software changes, I think it's worth putting a note in the docs that one or two counts per pod may come from this timing window, and recommending that alert thresholds be set accordingly.
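As a sketch of what "set thresholds accordingly" could look like, here is a Prometheus alerting rule that tolerates a small number of blocked connections per window. The metric name and the threshold are assumptions, not something confirmed in this thread - check the weave-npc /metrics output for the exact counter exposed by metrics.go:

```yaml
groups:
- name: weave-npc
  rules:
  - alert: WeaveNPCBlockedConnections
    # Assumed metric name. Tolerate the handful of SYNs blocked in the window
    # between pod creation and policy convergence; alert only on sustained blocking.
    expr: increase(weavenpc_blocked_connections_total[10m]) > 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "weave-npc blocked more connections than the expected startup race accounts for"
```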
Thanks @bboreham, I just double-checked and realized that as packets enter ulogd, they are consumed by the goroutine in metrics.go, so the metrics check is not parsing from the beginning of the ulogd pcap file each time. Yeah, matching block reports against later successful connections would be painful to implement, and resource intensive. Thanks for your suggestions. It's plain to see that having the pod wait until the network policy is implemented is a much better approach.
I've run into this, or a very similar, issue. We're using the Apache Airflow workflow engine, which runs each "task" of a workflow in a cron-like manner.
Upon startup of a new task/job, Airflow immediately opens a database connection (using Python + SQLAlchemy). We see these connections occasionally fail with DNS resolution errors ("Temporary failure in name resolution") and correlate this with `UDP connection from 100.122.0.4:51308 to 100.126.0.5:53 blocked by Weave NPC` messages in the weave logs.
When Airflow is unable to establish this database connection the process exits, the pod terminates, and the task fails.
To work around this we've added a command to sleep 15 seconds, which works most of the time, but failures tend to pop up under high load. Depending on when the cron schedules align, we launch upwards of 8 pods concurrently.
Note that there are no NetworkPolicies defined in this configuration (perhaps there is an implicit/default policy?).
@kppullin Please see the comment https://github.com/weaveworks/weave/issues/3464#issuecomment-443730137
Yes, there is a race between the application running in the pod and the network policy controller permitting egress traffic from the pod.
It's been a while, but coming back to this one. Ready++ will work for pods that make up a service. But, let's say I have a pod that is a job, and performs just one function. It's not part of a service. When the pod starts, it does its job, and exits. Ready++ doesn't capture this "job" use case.
I believe there needs to be a runningGate feature for k8s pods, much like Ready++, where conditions must be met before the pod's docker process is started - somewhere between the pod getting its IP and the docker process starting. This way, the weave npc pod can receive the event that a pod has been added and the network policies can converge; then the npc can update the condition on the pod to satisfy the runningGate.
This doesn't exist in k8s today, to my knowledge.
I think every CNI implementation that relies upon receiving addPod events from the API Server to converge the overlay networks would have this same issue. It's not particular to weave.
What you expected to happen?
We are running cronjobs in kubernetes, which spin up quickly and access existing services in our kubernetes cluster. We have network policies enabled, where we allow access from the client pods (the jobs) to the services in the same namespace. We expect the clients to be able to use the services correctly, and to see no blocked connections from weave.
What happened?
We do not see any blocked connections from the application's point of view, but the metrics.go prometheus handler is reporting blocked connections. This appears to be a timing issue: weave-npc does add the cronjob pod's IP address to the whitelist for the service, but not before the pod has sent its initial TCP SYN, so that SYN is blocked and never ACK'd. There is a backoff delay because the initial SYN was not ACK'd, but the SYN does eventually get ACK'd when the OS retransmits it. So the client job actually does work, but the metric reports blocked connections.
How to reproduce it?
We can easily reproduce this with a tiny tools job, which will cause a prometheus scrape to generate a blocked-connection metric. You can use any small OS image that has curl to duplicate the issue. Below are the job, the network policy, and the service.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: curl-with-timeout
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 100
  template:
    metadata:
      labels:
        chime-gateway-batch-client: "true"
    spec:
      containers:
```
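The manifest above is cut off at `containers:`; a complete minimal version might look like the sketch below, where the image, command, and curl target are illustrative assumptions (any small image shipping curl works, and the service name and port are inferred from the NetworkPolicy and Service shown next):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: curl-with-timeout
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 100
  template:
    metadata:
      labels:
        chime-gateway-batch-client: "true"
    spec:
      restartPolicy: Never
      containers:
      - name: curl
        image: curlimages/curl   # any small image that ships curl
        # Target URL is an assumption based on the Service and NetworkPolicy below.
        command: ["curl", "-sS", "--max-time", "10", "http://chime-gateway-batch:8081/"]
```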
The network policy:
```yaml
apiVersion: extensions/v1beta1
kind: NetworkPolicy
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"extensions/v1beta1","kind":"NetworkPolicy","metadata":{"annotations":{},"name":"chime-gateway-batch","namespace":"chime"},"spec":{"ingress":[{"ports":[{"port":8081,"protocol":"TCP"}]}],"podSelector":{"matchLabels":{"k8s-app":"chime-gateway-batch"}},"policyTypes":["Ingress"]}}
  creationTimestamp: null
  generation: 1
  name: chime-gateway-batch
  selfLink: /apis/extensions/v1beta1/namespaces/chime/networkpolicies/chime-gateway-batch
spec:
  ingress:
```
The service:
```yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    k8s-app: chime-gateway-batch
    prometheus-app: chime-gateway-batch
  name: chime-gateway-batch
  selfLink: /api/v1/namespaces/chime/services/chime-gateway-batch
spec:
  ports:
```
Anything else we need to know?
Running k8s in AWS using kops