weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

ping directly after pod start fails after update to weave 2.5.0 #3464

Open damoon opened 6 years ago

damoon commented 6 years ago

Use case

I updated my setup from weave 2.4.1 to weave 2.5.0. All tests but one passed. The failing test creates a Kubernetes job and pings a single time.

I tested with weave 2.4.1 and weave 2.5.0 on Kubernetes 1.10, 1.11, and 1.12, all with the same Docker version 17.03.3-ce.

The problem only happens with 2.5.0, not with 2.4.1.

Because of the workaround I found (wait for 1 second), I assume 2.5.0 takes some small additional amount of time to set up the network. Just to be clear, always waiting for 1 second is not something I plan to do :D

What you expected to happen?

Running a job to ping should ping once and succeed.

# kubectl logs google-dns-kw6ss 
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=61 time=12.615 ms

--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 12.615/12.615/12.615 ms

What happened?

Pinging failed.

# kubectl logs google-dns-ftlff 
PING 8.8.8.8 (8.8.8.8): 56 data bytes

--- 8.8.8.8 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss

How to reproduce it?

Running the job reliably produces the same result.

Anything else we need to know?

apiVersion: batch/v1
kind: Job
metadata:
  name: google-dns
spec:
  template:
    metadata:
      name: google-dns
    spec:
      containers:
      - name: dig
        image: azukiapp/dig:0.3.0
        command:
          - sh
          - -c
          - "ping -c 1 8.8.8.8"
      restartPolicy: Never

A kind of workaround is to add an artificial delay before the container starts its actual work:

          - "sleep 1 && ping -c 1 8.8.8.8"

Versions:

working:
$ weave version
weave script 2.4.1
weave 2.4.1

broken:
$ weave version
weave script 2.5.0
weave 2.5.0

$ docker version
Client:
 Version:      17.03.3-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   e19b718
 Built:        Thu Aug 30 01:04:51 2018
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.3-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   e19b718
 Built:        Thu Aug 30 01:04:51 2018
 OS/Arch:      linux/amd64
 Experimental: false

$ uname -a
Linux node1 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.4", GitCommit:"bf9a868e8ea3d3a8fa53cbb22f566771b3f8068b", GitTreeState:"clean", BuildDate:"2018-10-25T19:17:06Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:10:24Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
murali-reddy commented 5 years ago

If I disable weave-npc, or change the weave-npc image from 2.5.0 to 2.4.0, there is no delay and the pod runs to completion, so it appears the weave-npc changes in 2.5 are adding the latency. Will investigate further.

murali-reddy commented 5 years ago

So it turned out to be due to egress network policies, based on running the job shared in this issue. The weave-npc implementation by default drops packets from pods it does not yet know about (desirable from a security perspective). So there is a window of time from a pod starting on a node until the weave-npc pod running on that node receives the update from the Kubernetes API server. Only after weave-npc gets a chance to process the pod start event with respect to network policies will it allow or deny the packets. Egress network policies were implemented in 2.4, so it is possible this can happen on 2.4 as well.

In general, given that this is an eventually consistent system, is this acceptable, or is it possible that an application will bail out early assuming there is no network connectivity?

FYI, the pod that is getting created takes time to get populated in either WEAVE-NPC-EGRESS-DEFAULT or WEAVE-NPC-EGRESS-CUSTOM; in this window of time, the last rule results in packets being dropped.

Chain WEAVE-NPC-EGRESS (2 references)
pkts bytes target     prot opt in     out     source               destination
3193  389K ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED
   0     0 RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0            PHYSDEV match --physdev-in vethwe-bridge
   0     0 RETURN     all  --  *      *       0.0.0.0/0            224.0.0.0/4
   5   420 WEAVE-NPC-EGRESS-DEFAULT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            state NEW
   5   420 WEAVE-NPC-EGRESS-CUSTOM  all  --  *      *       0.0.0.0/0            0.0.0.0/0            state NEW mark match ! 0x40000/0x40000
   5   420 NFLOG      all  --  *      *       0.0.0.0/0            0.0.0.0/0            state NEW mark match ! 0x40000/0x40000 nflog-group 86
   5   420 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x40000/0x40000
damoon commented 5 years ago

We have setups with PHP scripts that start via a cronjob every minute. Most will, directly after startup, ask a database (or, in newer setups, a queue) if there is work to be done. I do not trust the frameworks I have to support to do retries, circuit breaking, or reliable timeouts. I do not like it, but I am fairly sure most developers do not understand a distributed, eventually consistent system well enough not to get frustrated over time.

Is there any (nice) way to delay the container startup until the network is fully set up? I would like to avoid the only workaround I have in mind: create an admission controller that adds an init container to every pod, to sleep until the network looks good.
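
For reference, a minimal sketch of that init-container idea applied to the Job from this issue. The init container name, the busybox image, and the "retry ping until it succeeds" loop are illustrative assumptions only, not something weave provides:

apiVersion: batch/v1
kind: Job
metadata:
  name: google-dns
spec:
  template:
    spec:
      initContainers:
      - name: wait-for-egress          # illustrative name, not a weave component
        image: busybox:1.30            # assumed image; anything with ping works
        command:
          - sh
          - -c
          # block pod startup until one ping succeeds, i.e. until
          # weave-npc has opened the egress rules for this pod
          - "until ping -c 1 -W 1 8.8.8.8; do sleep 1; done"
      containers:
      - name: dig
        image: azukiapp/dig:0.3.0
        command:
          - sh
          - -c
          - "ping -c 1 8.8.8.8"
      restartPolicy: Never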

bboreham commented 5 years ago

There’s a proposed future Kubernetes feature under the name “PodReady++” that is intended to allow waiting for things like network policy.

We might be able to predict which pod is being set up at the time the IP address is requested, and set up network policy at that time. Currently we give out an IP address, then set up policy rules after the pod has been reflected from the api-server.

I will add that generally you will be using TCP, which will do retries, rather than depending on a framework. But you will likely notice DNS requests (over UDP) getting dropped, as they typically have a 5 second timeout.
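
For context, the 5 second figure is the resolver's default per-attempt timeout in the pod's /etc/resolv.conf. A hedged sketch of tuning it per pod via dnsConfig; the pod name, image, and values below are illustrative only, not a recommendation from this thread:

apiVersion: v1
kind: Pod
metadata:
  name: dns-timeout-example      # illustrative name
spec:
  dnsConfig:
    options:
    - name: timeout              # seconds per DNS attempt (resolver default is 5)
      value: "2"
    - name: attempts             # retries before the resolver gives up
      value: "3"
  containers:
  - name: main
    image: azukiapp/dig:0.3.0
    command: ["sh", "-c", "nslookup kubernetes.default"]
  restartPolicy: Never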

murali-reddy commented 5 years ago

As Bryan mentioned, applications using TCP should inherently be able to tolerate this (given TCP SYN retries). If you change the job spec to do curl instead of ping, you should see a better result. But this is not foolproof; application developers should not have to be aware of these constraints.
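
For example, against the Job from the issue description, the command could become the following; the target URL is illustrative and the image is assumed to ship curl:

        command:
          - sh
          - -c
          - "curl -sS -o /dev/null http://www.google.com"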

The weave-npc implementation acts as aggressively as possible (by watching pod update events and reacting as soon as an IP is assigned in the pod spec) to permit pod egress. In case the API server is under heavy load, it is possible there will be latency in receiving the event from the API server.

It is desirable to have a more controlled solution while keeping pod provisioning and network policy configuration asynchronous. Will revert back if there is a clean solution without major trade-offs.

bboreham commented 5 years ago

@damoon could I ask if you have any NetworkPolicies set up in your cluster?

We could change the logic so everything is open when there are none, which would mean only people who actually choose to use NetworkPolicy pay the price of waiting for the rules to open.

damoon commented 5 years ago

We do not use NetworkPolicies yet. Having everything open in this case, and everything closed and secure by default for users of NetworkPolicies, sounds good to me.

faheem-nadeem commented 5 years ago

Got hit by this pretty badly recently. We had Cassandra containers setting some things up by hitting the AWS metadata service on container start. All calls led to timeouts :( Happens on weave 2.5.0, kops 1.11, no network policies. Sleep does not do anything... The remedy for now was just to let the container die and retry; after some restarts we get a connection and proceed forward.

bboreham commented 5 years ago

@faheem-cliqz if "sleep does not do anything" then you are seeing something different. Please open a new issue and provide the information requested by the template.

SharpEdgeMarshall commented 5 years ago

We started being affected by this issue after updating to kops 1.11; it always appears only during a rolling update of the cluster.