Please send the complete logs.
@bboreham Any ideas regarding this one? This is causing unexpected outages of our cluster.
There are many lines like this:
INFO: 2019/09/01 07:50:42.132727 overlay_switch ->[e6:38:26:4c:14:62(c4-b1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/09/01 07:50:44.185786 overlay_switch ->[e6:38:26:4c:14:62(c4-b1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/09/01 07:50:44.591874 overlay_switch ->[56:5f:d0:3e:9e:f9(c4-g1)] sleeve timed out waiting for UDP heartbeat
which indicate a general inability to function. The heartbeat only needs to send a few packets to establish connectivity. The pattern is that it works for a while, then fails, then works again a bit later. Maybe the node is still overloaded? Or the ones it is talking to?
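If you want to see what the router itself thinks of those connections while it is happening, something along these lines should show it (a sketch; the paths assume the standard local weave status endpoint on port 6784):
# On the affected node: ask the local weave router for its overall state
curl -s http://127.0.0.1:6784/status
# and for its per-peer connection list (established vs retrying, sleeve vs fastdp)
curl -s http://127.0.0.1:6784/status/connections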
The node might have been overloaded at first, but after it was isolated, there was nothing running on it other than the cluster essentials (e.g. kube-proxy, weave-net).
Given the symptoms, should I suspect any specific components when trying to reproduce it? During my investigations, I realized that if kube-utils -run-reclaim-daemon gets OOMKilled, it does not get recreated. Could that somehow result in this behavior?
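(For reference, we checked that from the node itself; a rough sketch, relying on container processes being visible in the host's process table:)
# On the node: is the reclaim daemon still alive after the OOM event?
ps aux | grep '[k]ube-utils' | grep 'run-reclaim-daemon'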
Could that somehow result in this behavior?
No.
Does the Weave Net daemon work better if you restart it again?
Given it seems to be trying and failing to talk to its friends, I would do some packet tracing at different points in the chain and see what's happening.
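For example, on the nodes at either end of a failing connection, something like this would be a starting point (a sketch, assuming the default Weave Net ports: TCP 6783 for control, UDP 6783 for sleeve and UDP 6784 for fastdp):
# Run as root on the affected node and on one of its peers, then compare captures
tcpdump -ni any '(tcp port 6783) or (udp port 6783) or (udp port 6784)' -w weave-$(hostname).pcap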
Does the Weave Net daemon work better if you restart it again?
Yes. I did restart it after a couple of days, and it worked perfectly.
Unfortunately, I cannot reproduce it manually. And it only happens in production, where we don't get to analyze it easily, since we'd rather decrease the downtime by restarting weave.
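(When we do restart it, we just delete the weave pod on the affected node and let the DaemonSet recreate it; roughly like this, with placeholders for the node and pod names and assuming the standard name=weave-net label:)
# Find the weave-net pod scheduled on the affected node
kubectl -n kube-system get pods -l name=weave-net -o wide | grep <node-name>
# Delete it; the DaemonSet controller recreates it within seconds
kubectl -n kube-system delete pod <weave-net-pod-from-the-previous-step>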
Looking at the other side of the problem, do you have metrics to see how big the weaver process was when it got OOM-killed? It shouldn't be that big on a 15-node cluster.
It's currently using 350MiB, with no limits set on it. But I don't have its usage at the time it failed.
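(The 350MiB figure is the live usage of the weave container; roughly how we read it, assuming metrics-server is installed and the standard name=weave-net label:)
# Per-container memory usage of the weave-net pods
kubectl -n kube-system top pod -l name=weave-net --containers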
Can you curl 127.0.0.1:6784/debug/pprof/heap into a file and post it here?
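Something along these lines should do it (the output filename is arbitrary, and the pprof step is only needed if you want to inspect it yourself with a Go toolchain):
# Save a heap profile from the local weave status port
curl -s http://127.0.0.1:6784/debug/pprof/heap > weave-heap.pprof
# Optional: summarise the biggest allocators locally
go tool pprof -top weave-heap.pprof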
According to that profile the program is currently using 18KB. Did it just restart or something?
The issue has occurred again, and I was able to quarantine it on a single node. It should be possible to inspect it.
An interesting aspect of the issue seems to be that the OOM has NOT killed weave itself. It has killed another container (in this case Prometheus), but this has led to the node losing its connectivity.
The symptoms seem to be identical to this issue: https://github.com/weaveworks/weave/issues/3641
We also experienced the issue in a weave pod running on a node that still had 5 GB of memory available.
Similarities in our case:
I'm going to close this on the basis that Weave Net 2.6 uses far less memory. Please open a new issue rather than commenting on this; the template will request info that is essential to debug.
For future readers: the pod that was OOM-killed was not weave itself, it was another pod, so closing the issue does not really address the problem here.
We're experiencing the same thing, this should not be closed.
@choseh please open a new issue and supply the details for your specific case.
In random situations of OOMKiller getting triggered, after the node is back up again (i.e. in Ready state), the node loses its pod connectivity. Deleting the weave pod (and consequently having it recreated) makes the issue go away.
What you expected to happen?
I expected the node to eventually recover from the OOM, and/or report its state as NotReady if it hasn't.
What happened?
The node reports its network state as ready, but one cannot access pod IPs from that node or from pods running on it.
How to reproduce it?
This is not fully reproducible, but almost all occurrences have come right after some random pod triggered the OOMKiller. We've successfully quarantined the bug on a node, and can examine it if further information is needed.
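(By "quarantined" we mean the node is cordoned and drained of ordinary workloads, so only DaemonSet pods such as weave-net keep running and the broken state is preserved; roughly:)
# Stop new pods from being scheduled on the broken node
kubectl cordon <node-name>
# Evict ordinary workloads; DaemonSet pods (including weave-net) stay in place
kubectl drain <node-name> --ignore-daemonsets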
Versions:
Logs:
Lots of occurrences of the following lines:
but these only appear during the OOM; after that it just goes back to normal logs (e.g. Discovered remote MAC).
Network: