weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

Node loses pod connectivity after OOM #3695

Closed: semekh closed this issue 4 years ago

semekh commented 5 years ago

In seemingly random situations where the OOM killer gets triggered, the node loses its pod connectivity even after it is back up again (i.e. in the Ready state). Deleting the weave pod (which causes it to be recreated) makes the issue go away.
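As a sketch of that workaround (assuming the stock weave-net DaemonSet in kube-system with the label name=weave-net; c4-b2 is the affected node, and weave-net-xxxxx stands for whatever pod name the first command prints):

$ kubectl -n kube-system get pods -l name=weave-net -o wide | grep c4-b2   # find the weave pod on the affected node
$ kubectl -n kube-system delete pod weave-net-xxxxx                        # the DaemonSet recreates it and connectivity comes back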

What you expected to happen?

I expected the node to eventually recover from the OOM, and/or to report its state as NotReady if it had not.

What happened?

The node reports its network state as ready, but pod IPs cannot be reached from that node or from pods running on it.
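The symptom can be confirmed with something as simple as pinging a pod IP from the affected node and from a pod scheduled on it (10.34.0.5 is a made-up pod IP inside the 10.32.0.0/12 range shown below, and the pod name is a placeholder):

$ ping -c 3 -W 2 10.34.0.5                                      # from the node itself: times out
$ kubectl exec <pod-on-this-node> -- ping -c 3 -W 2 10.34.0.5   # from a pod on the node: also times out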

How to reproduce it?

This is not reliably reproducible, but almost all occurrences have happened after some random pod caused the OOM killer to be triggered. We have successfully quarantined the bug on a node and can examine it if further information is needed.
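For reference, one way a node like this can be fenced off for inspection, nothing weave-specific, just standard kubectl (c4-b2 being the node from the version output below):

$ kubectl cordon c4-b2                       # stop scheduling new pods onto the node
$ kubectl drain c4-b2 --ignore-daemonsets    # optionally evict existing pods, leaving DaemonSets such as weave-net in place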

Versions:

$ weave version
weave 2.5.2
$ docker version
...
Server: Docker Engine - Community
 Engine:
  Version:          18.09.6
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       481bc77
  Built:            Sat May  4 01:59:36 2019
  OS/Arch:          linux/amd64
  Experimental:     false
$ uname -a
Linux c4-b2 4.4.0-119-generic #143-Ubuntu SMP Mon Apr 2 16:08:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ kubectl version
...
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:30:26Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Logs:

Lots of occurrences of the following lines:

...connection shutting down due to error: read tcp4...
...connection deleted...
...connection shutting down due to error during handshake: failed to receive remote protocol header...

These lines only appear during the OOM; after that the log goes back to normal messages (e.g. Discovered remote MAC).

Network:

$ ip route
...
10.32.0.0/12 dev weave  proto kernel  scope link  src 10.33.128.0
$ ip -4 -o addr
...
6: weave    inet 10.33.128.0/12 brd 10.47.255.255 scope global weave\       valid_lft forever preferred_lft forever
$ /home/weave/weave --local status # inside the weave container
        Version: 2.5.2 (up to date; next check at 2019/09/02 21:13:01)

        Service: router
       Protocol: weave 1..2
           Name: 66:cf:c7:9d:f2:00(c4-b2)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 16
    Connections: 16 (15 established, 1 failed)
          Peers: 16 (with 240 established connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.32.0.0/12
  DefaultSubnet: 10.32.0.0/12
bboreham commented 5 years ago

Please send the whole logs.

semekh commented 5 years ago

https://gist.github.com/SeMeKh/e1f6e879fc9b855e48db0e2903ad2e34

semekh commented 5 years ago

@bboreham Any ideas regarding this one? This is causing unexpected outages of our cluster.

bboreham commented 5 years ago

There are many lines like this:

INFO: 2019/09/01 07:50:42.132727 overlay_switch ->[e6:38:26:4c:14:62(c4-b1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/09/01 07:50:44.185786 overlay_switch ->[e6:38:26:4c:14:62(c4-b1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/09/01 07:50:44.591874 overlay_switch ->[56:5f:d0:3e:9e:f9(c4-g1)] sleeve timed out waiting for UDP heartbeat

which indicate a general inability to function: all it does is send a few packets to establish connectivity. The pattern is that it works for a bit, then fails, then a bit later works again. Maybe the node is still overloaded? Or the ones it is talking to?
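A quick way to check that on the node itself, using standard Linux tools rather than anything weave-specific:

$ uptime                                               # load averages well above the core count suggest CPU overload
$ free -m                                              # how much memory is actually available
$ dmesg -T | grep -iE 'out of memory|killed process'   # recent OOM-killer activity, with timestamps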

semekh commented 5 years ago

The node might have been overloaded at first, but after it was isolated there was nothing running on it other than the cluster essentials (e.g. kube-proxy, weave-net). Given the symptoms, should I suspect any specific components when trying to reproduce it?

During my investigation I noticed that if kube-utils -run-reclaim-daemon gets OOM-killed, it does not get recreated. Could that somehow result in this behavior?

bboreham commented 5 years ago

Could that somehow result in this behavior?

No.

Does the Weave Net daemon work better if you restart it again?

Given it seems to be trying and failing to talk to its friends, I would do some packet tracing at different points in the chain and see what's happening.
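As a sketch of that kind of tracing (assuming the default Weave Net ports, TCP 6783 for control and UDP 6783/6784 for the overlay traffic; run on both the affected node and a healthy peer):

$ tcpdump -ni any 'tcp port 6783 or udp port 6783 or udp port 6784'   # is overlay traffic leaving and arriving at all?
$ tcpdump -ni weave icmp                                              # ping a remote pod IP meanwhile: do packets reach the bridge?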

semekh commented 5 years ago

Does the Weave Net daemon work better if you restart it again?

Yes. I did restart it after a couple of days, and it worked perfectly.

Unfortunately, I cannot reproduce it manually, and it only happens in production, where we don't get to analyze it easily since we would rather reduce the downtime by restarting weave.

bboreham commented 5 years ago

Looking at the other side of the problem, do you have metrics to see how big the weaver process was when it got OOM-killed? It shouldn't be that big on a 15-node cluster.
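One way to check current usage (the first command assumes metrics-server is installed; the cgroup path is the usual cgroup-v1 location and may differ on other setups):

$ kubectl -n kube-system top pod -l name=weave-net
$ cat /sys/fs/cgroup/memory/memory.usage_in_bytes   # inside the weave container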

semekh commented 5 years ago

It's currently using 350MiB, with no limits set on it. But I don't have its usage at the time it failed.

bboreham commented 5 years ago

Can you curl 127.0.0.1:6784/debug/pprof/heap into a file and post it here?
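Spelled out, something like this on the affected node (weave-heap.pprof is just an arbitrary output file name; 6784 is the weave status port that serves the endpoint):

$ curl -s -o weave-heap.pprof http://127.0.0.1:6784/debug/pprof/heap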

semekh commented 5 years ago

https://gist.github.com/SeMeKh/e1f6e879fc9b855e48db0e2903ad2e34#file-weave-pprof

bboreham commented 5 years ago

According to that profile the program is currently using 18KB. Did it just restart or something?
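For reference, the in-use heap can be read out of such a dump with the standard Go tooling, e.g. (weave-heap.pprof being whatever the dump above was saved as):

$ go tool pprof -top weave-heap.pprof   # the inuse_space total shows the live heap at dump time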

semekh commented 5 years ago

The issue has occurred again, and I was able to quarantine it on a single node, so it should be possible to inspect it. An interesting aspect of the issue seems to be that the OOM has NOT killed weave itself. It killed another container (in this case Prometheus), but this has still led to the node losing its connectivity.

semekh commented 5 years ago

The symptoms seem to be identical to this issue: https://github.com/weaveworks/weave/issues/3641

gokhansengun commented 4 years ago

We also experienced the issue in a weave pod running on a node that still had 5 GB of memory available.

Similarities in our case:

bboreham commented 4 years ago

I'm going to close this on the basis that Weave Net 2.6 uses far less memory. Please open a new issue rather than commenting on this; the template will request info that is essential to debug.

gokhansengun commented 4 years ago

For future readers: the pod that was OOM-killed was not weave itself, it was another pod. Merely closing the issue does not help here.

choseh commented 4 years ago

We're experiencing the same thing; this should not be closed.

bboreham commented 4 years ago

@choseh please open a new issue and supply the details for your specific case.