weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

Weave fails to restart after running kops execContainer to install GPU requirements #3616

Open morganwalker opened 5 years ago

morganwalker commented 5 years ago

Current Setup

We are using kops to create and manage our Kubernetes clusters on AWS. Within these clusters we're running GPU-specific Instance Groups. To enable GPU support on our nodes, we're running the NVIDIA GPU Driver and DevicePlugin. In short, this runs a kops hook execContainer (a sketch of the hook spec follows the list below) that installs:

  1. NVIDIA device drivers
  2. CUDA libraries v9.1
  3. nvidia-docker
  4. docker-ce
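
For reference, a minimal sketch of the instance group hook; field names follow the kops hook spec, the image is the dcwangmit01/nvidia-device-plugin:0.1.0 hook container referenced later in this issue, and this is illustrative rather than our exact config:

  # Illustrative kops InstanceGroup hook (sketch, not our exact spec): the
  # execContainer image runs at node boot and installs the GPU stack listed above.
  spec:
    hooks:
    - execContainer:
        image: dcwangmit01/nvidia-device-plugin:0.1.0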

A brief series of events:

  1. new GPU node is requested
  2. kops does the initial provisioning of the node installing docker-ce 17.03.2-ce
  3. the node is up and has joined the cluster - weave-net pod and containers are healthy
  4. the execContainer runs, installing the NVIDIA device drivers, CUDA libs, and nvidia-docker, along with the matching version of docker-ce, in this case 18.09.3
  5. Docker Application Container Engine restarts, forcing all the pods to restart
  6. weave-net pods are now in CrashLoopBackOff state

What you expected to happen?

After the kops execContainer scripts have finished and docker restarted, the weave-net pod restarts successfully.

What happened?

The weave-net pod is in a CrashLoopBackOff state, with one of two errors: either weave-npc fails with "Failed to bind metrics server: listen tcp :6781: bind: address already in use", or weaver fails with "[boltDB] Unable to open /weavedb/weave-netdata.db: timeout".

I don't know if one is causing the other. https://github.com/weaveworks/weave/issues/3275#issuecomment-380020578 does mention that restarting docker, which the kops execContainer is doing, could cause some issues. But why? Shouldn't weave-net recover from docker restarting?

How to reproduce it?

Add a GPU node to our cluster.

Anything else we need to know?

Once the weave-net pod is in the CrashLoopBackOff state, deleting it allows the weave containers to start successfully.

Versions:

$ weave version - weaveworks/weave-kube:2.5.1 and weaveworks/weave-npc:2.5.1
$ docker version - 17.03.2-ce installed by kops, then upgraded to 18.09.3 after the dcwangmit01/nvidia-device-plugin:0.1.0 kops execContainer runs
$ uname -a - Linux 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux
$ kops version - Version 1.11.0
$ kubectl version -

Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T21:04:45Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:25:46Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Logs:

Failure logs are provided above. Here's the node syslog of the GPU node, covering initial boot, kops setup, the dcwangmit01/nvidia-device-plugin:0.1.0 kops execContainer execution, and the weave pod failing to start.

bboreham commented 5 years ago

Is it possible the old one is still running? Given the symptoms I suspect rebooting the node is a good thing to try.

morganwalker commented 5 years ago

After the execContainer finishes its scripts, I've confirmed that there is only one weave-net pod running on the GPU node. Deleting that weave-net pod triggers a new weave-net pod to be created, which fires up fine and stays healthy.

Rebooting the GPU node isn't a scalable solution for us. Our GPU nodes are only created when a workload requests them via:

    resources:
      limits:
        nvidia.com/gpu: 1

Once the workload finishes, the GPU node becomes idle and our autoscaler deletes the GPU node. I only mention this to give you some context around the lifetime of our GPU nodes; they are not long-lived.
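
For completeness, a minimal (hypothetical) workload spec that triggers such a scale-up looks roughly like this; the name and image are illustrative:

  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-job                  # hypothetical name
  spec:
    containers:
    - name: cuda
      image: nvidia/cuda:9.1-base  # any GPU-using image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1        # this is what makes the autoscaler add a GPU node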

Why is it that the weave-net pod can't survive a docker restart?

bboreham commented 5 years ago

Under Kubernetes, Weave Net is 100% unaware of Docker.

Both your errors indicate something else is blocking it; e.g.:

FATA: 2019/03/13 13:05:52.327108 Failed to bind metrics server: listen tcp :6781: bind: address already in use

See if you can spot what is using port 6781.

morganwalker commented 5 years ago

After the node joins the cluster, before the hook runs, and with weave-net running:

root@ip-172-20-57-254:~# netstat -pna | grep 6781
tcp6       0      0 :::6781                 :::*                    LISTEN      4576/weave-npc
tcp6       0      0 172.20.57.254:6781      172.20.50.209:51286     ESTABLISHED 4576/weave-npc
root@ip-172-20-57-254:~# netstat -pna | grep 6782
tcp6       0      0 :::6782                 :::*                    LISTEN      3729/weaver
tcp6       0      0 172.20.57.254:6782      172.20.50.209:59450     ESTABLISHED 3729/weaver

During the docker restart:

root@ip-172-20-57-254:~# netstat -pna | grep 6781
tcp6       0      0 :::6781                 :::*                    LISTEN      26437/weave-npc
tcp6       0      0 172.20.57.254:6781      172.20.50.209:51286     TIME_WAIT   -
root@ip-172-20-57-254:~# netstat -pna | grep 6782
tcp6       0      0 :::6782                 :::*                    LISTEN      26484/weaver
tcp6       0      0 172.20.57.254:6782      172.20.50.209:59450     TIME_WAIT   -

After the hook, after docker has restarted, and weave-net in a CrashLoopBackOff:

root@ip-172-20-57-254:~# netstat -pna | grep 6781
tcp6       0      0 :::6781                 :::*                    LISTEN      26437/weave-npc
tcp6       0      0 172.20.57.254:6781      172.20.50.209:52402     ESTABLISHED 26437/weave-npc
root@ip-172-20-57-254:~# netstat -pna | grep 6782
tcp6       0      0 :::6782                 :::*                    LISTEN      26484/weaver
tcp6       0      0 172.20.57.254:6782      172.20.50.209:60562     ESTABLISHED 26484/weaver

With the following errors in the containers:

weave-npc:

INFO: 2019/03/14 19:09:03.142932 Starting Weaveworks NPC 2.5.1; node name "ip-172-20-57-254.ec2.internal"
INFO: 2019/03/14 19:09:03.143131 Serving /metrics on :6781
FATA: 2019/03/14 19:09:03.143309 Failed to bind metrics server: listen tcp :6781: bind: address already in use

weave:

[boltDB] Unable to open /weavedb/weave-netdata.db: timeout

So it doesn't seem that anything other than weave itself is using 6781. It's worth noting that it isn't reproducible every time; it happens on roughly 7 out of 10 node creations.

bboreham commented 5 years ago

I suspect the process reporting that it can't bind 6781 is not the one that already has the port open, as reported by netstat. CrashLoopBackOff means it's dead; it exited after the error, which came 1ms after startup.

What is process 26437?

bboreham commented 5 years ago

I wonder if your install was actually running Docker the first time? If, say, it was running cri-o and you go through the steps to shut down Docker and re-add Docker, that would give you two of everything.

morganwalker commented 5 years ago

kops is definitely installing and running Docker. Our kops cluster config includes:

  docker:
    bridge: ""
    ipMasq: false
    ipTables: false
    logDriver: json-file
    logLevel: warn
    logOpt:
    - max-size=10m
    - max-file=5
    storage: overlay2,aufs

I've verified on non-GPU nodes, which use the same kops cluster configuration and AMI image but don't run the dcwangmit01/nvidia-device-plugin:0.1.0 execContainer scripts, that they're running:

root@ip-172-20-87-129:~# docker version
Client:
 Version:      17.03.2-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:31:19 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.2-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:31:19 2017
 OS/Arch:      linux/amd64
 Experimental: false

It's just that, for the GPU nodes, the dcwangmit01/nvidia-device-plugin:0.1.0 kops execContainer comes along afterwards and installs the nvidia docker runtime and upgrades Docker, forcing a restart.

When weave-net is failing, other DaemonSet pods are running fine, e.g. fluent-bit, kube-proxy, cattle-node-agent, node-exporter, etc. So I don't think there are two of everything running. Maybe weave-npc isn't letting go of its port during the docker restart?

I'll work on getting more info on the process that is holding port 6781 after the docker restart.
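
For reference, this is the kind of check I plan to run on the node (a sketch; PID 26437 is taken from the netstat output above, and the container ID placeholder is whatever shows up in the cgroup path):

# Which binary and parent does the stray listener belong to?
ps -o pid,ppid,lstart,cmd -p 26437
# The cgroup path normally embeds a container ID...
cat /proc/26437/cgroup
# ...which the restarted Docker daemon should know about. If it doesn't, the
# container was started by the old daemon before the upgrade.
docker ps -a --no-trunc | grep <container-id-from-cgroup>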

bboreham commented 5 years ago

Maybe weave-npc isn't letting go of its port during the docker restart?

That's impossible - the kernel cleans up ownership when the process exits.

You have shown in your output from netstat that new weave-npc processes start, and continue running. These are not the processes that are reporting an error. Find out what is running them.

morganwalker commented 5 years ago

@bboreham thanks for all the quick responses. We are still looking into it and, in the meantime, have hacked together a cronjob that looks for failing weave-net pods and deletes them (a sketch is below). This allows the GPU instances to finish coming up and pods to be scheduled on them. I'll report any findings.
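
Roughly, the cleanup the cronjob runs looks like this (a sketch; it assumes the stock weave-net DaemonSet labels in kube-system):

#!/usr/bin/env bash
# Delete any weave-net pod with a container stuck in CrashLoopBackOff;
# the DaemonSet recreates it and, as noted above, it then comes up healthy.
kubectl -n kube-system get pods -l name=weave-net \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
  | awk '/CrashLoopBackOff/ {print $1}' \
  | xargs -r -n1 kubectl -n kube-system delete pod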

priitpaasukene commented 5 years ago

+1 to this issue. Different org, but same scenario when booting GPU nodes in AWS:

weave message:

[boltDB] Unable to open /weavedb/weave-netdata.db: timeout

weave-npc container messages:

INFO: 2019/03/25 16:36:42.135357 Starting Weaveworks NPC 2.5.0; node name "<actual_node_name>"
INFO: 2019/03/25 16:36:42.135510 Serving /metrics on :6781
FATA: 2019/03/25 16:36:42.135701 Failed to bind metrics server: listen tcp :6781: bind: address already in use

When I delete the failing weave-net pod and the DaemonSet recreates it, the problem is gone until the next time.

bboreham commented 5 years ago

Same thing - there is another instance of Weave Net running and locking the file. Please track down what is running that other instance on your computer.
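
One way to track it down, as a sketch (this assumes the default weave-kube manifest, where the weavedb volume is a hostPath at /var/lib/weave on the node):

# Which process holds the boltDB file, and therefore its lock?
fuser -v /var/lib/weave/weave-netdata.db    # or: lsof /var/lib/weave/weave-netdata.db
# Then map that PID back to a container via its cgroup, as above.
cat /proc/<pid>/cgroup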

adrianlyjak commented 5 years ago

FYI @morganwalker: there's an upstream bug report in kops that sounds like the cause of your issue: https://github.com/kubernetes/kops/issues/6728 (the restart for nvidia-docker leaves orphaned containers from the original Docker daemon).

chrissng commented 5 years ago

@morganwalker I forgot to post my fix in the kops issue I reported.

You may consider using the alternative solution as specified in https://github.com/kubernetes/kops/issues/6728#issuecomment-483653258