Closed: carlpett closed this issue 8 years ago
Investigating a bit, it looks like it could be a file descriptor leak:
$ pgrep weaveproxy
25883
$ ls -l /proc/25883/fd | wc -l
514146
$ ls -l /proc/25883/fd | wc -l
514150
$ ls -l /proc/25883/fd | head
total 0
lrwx------. 1 root root 64 Sep 29 17:30 0 -> /dev/null
l-wx------. 1 root root 64 Sep 29 17:30 1 -> pipe:[139593893]
lr-x------. 1 root root 64 Sep 30 10:59 10 -> net:[4026531968]
lr-x------. 1 root root 64 Sep 30 10:59 100 -> net:[4026531968]
lr-x------. 1 root root 64 Sep 30 10:59 1000 -> net:[4026531968]
lr-x------. 1 root root 64 Sep 30 10:59 10000 -> net:[4026533534]
lr-x------. 1 root root 64 Sep 30 10:59 100000 -> net:[4026531968]
lr-x------. 1 root root 64 Sep 30 10:59 100001 -> net:[4026531968]
lr-x------. 1 root root 64 Sep 30 10:59 100002 -> net:[4026531968]
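To put a number on the growth, a small sampling loop (my addition, not part of the original report; it assumes the same root shell and GNU coreutils as the commands above) can log the total fd count alongside how many of those descriptors are network-namespace handles:

PID=$(pgrep weaveproxy)
while sleep 60; do
  # total open fds vs. fds that point at a network namespace (net:[...])
  printf '%s total=%s netns=%s\n' "$(date -Is)" \
    "$(ls /proc/$PID/fd | wc -l)" \
    "$(ls -l /proc/$PID/fd | grep -c 'net:\[')"
done

A steadily climbing netns count would match the leak discussed further down.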
Thanks for the report @carlpett; I'm not familiar with what Helios does.
Could you please try running with extra debugging info:
weave stop-proxy
weave launch-proxy --log-level=debug
Run it like that for a short while and see if it gives any clues, and/or upload the result here. Note that we log the Docker commands received, so please scan the output for any secrets before uploading.
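One possible way to capture that output (my sketch, not part of the original reply; it assumes the proxy runs in a container named weaveproxy, consistent with the pgrep output above) and spot-check it for credentials before attaching it:

# dump the proxy's debug log to a file
docker logs weaveproxy > weaveproxy-debug.log 2>&1
# rough scan for anything that looks like a credential before uploading
grep -inE 'auth|password|secret|token' weaveproxy-debug.log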
it looks like it could be a file descriptor leak:
I can duplicate this if I launch the proxy with --rewrite-inspect; we then leak two netns fds per invocation of docker inspect.
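A rough reproduction sketch (my addition; it assumes the proxy is listening on its default non-TLS TCP address, port 12375, and uses a hypothetical container named test): hammer the proxy with inspect calls, then recount the netns fds.

export DOCKER_HOST=tcp://127.0.0.1:12375    # talk to the weave proxy, not dockerd
for i in $(seq 1 100); do
  docker inspect test > /dev/null           # "test" is a placeholder container name
done
# on an affected version this count should have grown by ~200 (two per inspect)
ls -l /proc/$(pgrep weaveproxy)/fd | grep -c 'net:\['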
@carlpett I've filed a separate bug in case there's more to your issue than this problem. Can you confirm you're using the proxy with --rewrite-inspect?
@carlpett it turns out that the fd leak bug is already fixed in 1.6.2 and 1.7.0 - are you able to upgrade and test?
@awh Yes, we use --rewrite-inspect. Weird how I didn't find that issue when I searched (I also didn't see it in the changelogs?). Anyway, I will do an update and hopefully that will be it!
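For reference, a hedged sketch of what such an update could look like (assuming Weave was installed via the standard script download to /usr/local/bin/weave; adjust the path and the proxy flags to your setup):

# fetch the newer weave script, which pulls matching images on relaunch
sudo curl -L git.io/weave -o /usr/local/bin/weave
sudo chmod a+x /usr/local/bin/weave
# restart the proxy so it runs the fixed version
weave stop-proxy
weave launch-proxy --rewrite-inspect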
Weird how I didn't find that issue when I searched
Not at all - it was listed as a fix for an unreliable integration test; we didn't realise at the time that it would cause a serious issue like this. Thanks for the report!
Yup, that fixed it. After 45 minutes still only ~25 fds open. Also, this forced me to do our first live update of Weave under production load, so that was interesting as well :)
@carlpett we'd really like to hear about the live update - could you drop into our Slack channel some time next week & say hi?
We're seeing issues with weave-proxy causing very high load on our servers (see image). This in turn seems to be caused by our deployment tool (https://github.com/spotify/helios/) polling the socket every 5-10 seconds for all containers it has scheduled on the host. But why does this cause ever-increasing CPU usage from the proxy? Currently we have to restart it every now and then to avoid the host toppling over.
We're running CentOS 7.2 on physical hardware (20 cores, 256 GB RAM), Docker version 1.10.2.
Weave status:
(The failed connection is to localhost; we need to fix our setup script so it doesn't include the host itself in the list of peers at startup...)
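To confirm the ever-increasing CPU pattern between restarts, a simple sampling loop (my addition; pidstat comes from the sysstat package, which may need installing) can record the proxy's CPU usage over time:

PID=$(pgrep weaveproxy)
while sleep 300; do
  date -Is
  pidstat -u -p "$PID" 60 1 | tail -1   # one-minute average CPU for the proxy
done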