weaveworks / weave

Simple, resilient multi-host container networking and more.
https://www.weave.works
Apache License 2.0

Very high load caused by weave-proxy #2532

Closed carlpett closed 8 years ago

carlpett commented 8 years ago

We're seeing issues with weave-proxy causing very high load on our servers (see image). It seems in turn to be caused by our deployment tool (https://github.com/spotify/helios/) polling the socket every 5-10 seconds for all containers it has scheduled on the host. But why would this cause ever-increasing CPU usage in the proxy? Currently we have to restart it every now and then to keep the host from toppling over.

[image: host load graph]

We're running CentOS 7.2 on physical hardware (20 cores, 256 GB RAM), Docker version 1.10.2.
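
For anyone trying to reproduce the access pattern, the polling boils down to roughly this (a sketch only; Helios talks to the socket through its own Docker client, and the container ID and 5-second interval here are placeholders):

# Approximate the deployment tool's polling: one "docker inspect" per
# scheduled container against the weave proxy socket, every few seconds.
# CONTAINER_ID is a placeholder - substitute a container running on this host.
while true; do
  DOCKER_HOST=unix:///var/run/weave/weave.sock docker inspect CONTAINER_ID > /dev/null
  sleep 5
done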

Weave status:

$ weave status

        Version: 1.6.1 (version 1.7.0 available - please upgrade!)

        Service: router
       Protocol: weave 1..2
           Name: 4e:b7:74:b0:14:ca(sap-p-app201.ad.dex.nu)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 3
    Connections: 5 (4 established, 1 failed)
          Peers: 5 (with 20 established connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.36.192.0/19
  DefaultSubnet: 10.36.192.0/21

        Service: proxy
        Address: unix:///var/run/weave/weave.sock

(The failed connection is to localhost; we need to fix our setup script so it excludes the local host from the list of hosts at startup...)

carlpett commented 8 years ago

Investigating a bit, it looks like it could be a file descriptor leak:

$  pgrep weaveproxy
25883
$ ls -l /proc/25883/fd | wc -l
514146
$ ls -l /proc/25883/fd | wc -l
514150
$ ls -l /proc/25883/fd | head
total 0
lrwx------. 1 root root 64 Sep 29 17:30 0 -> /dev/null
l-wx------. 1 root root 64 Sep 29 17:30 1 -> pipe:[139593893]
lr-x------. 1 root root 64 Sep 30 10:59 10 -> net:[4026531968]
lr-x------. 1 root root 64 Sep 30 10:59 100 -> net:[4026531968]
lr-x------. 1 root root 64 Sep 30 10:59 1000 -> net:[4026531968]
lr-x------. 1 root root 64 Sep 30 10:59 10000 -> net:[4026533534]
lr-x------. 1 root root 64 Sep 30 10:59 100000 -> net:[4026531968]
lr-x------. 1 root root 64 Sep 30 10:59 100001 -> net:[4026531968]
lr-x------. 1 root root 64 Sep 30 10:59 100002 -> net:[4026531968]
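
A quick way to break those descriptors down by type (same PID as above; the pattern just separates network-namespace handles from everything else):

# Count netns descriptors vs. the rest among the proxy's open fds
ls -l /proc/25883/fd | grep -c 'net:\['
ls -l /proc/25883/fd | grep -vc 'net:\['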
bboreham commented 8 years ago

Thanks for the report @carlpett; I'm not familiar with what Helios does.

Could you please try running with extra debugging info:

weave stop-proxy
weave launch-proxy --log-level=debug

Run that for a short while and see if it gives any clues, and/or upload the result here. Note that we log the Docker commands received, so please scan for any secrets before uploading.
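
One way to capture that output for upload (assuming the proxy container keeps its default weaveproxy name):

# Follow the proxy container's log and save a copy for attaching here
docker logs -f weaveproxy 2>&1 | tee weave-proxy-debug.log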

awh commented 8 years ago

it looks like it could be a file descriptor leak:

I can reproduce this if I launch the proxy with --rewrite-inspect: we then leak two netns fds per invocation of docker inspect.
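
A minimal loop that shows the effect (illustrative only; SOME_CONTAINER is a placeholder for any container on the host):

# fd count before a burst of inspects through the proxy
PROXY_PID=$(pgrep weaveproxy)
ls /proc/$PROXY_PID/fd | wc -l
for i in $(seq 1 100); do
  DOCKER_HOST=unix:///var/run/weave/weave.sock docker inspect SOME_CONTAINER > /dev/null
done
# fd count after - roughly 200 higher when the proxy runs with --rewrite-inspect
ls /proc/$PROXY_PID/fd | wc -l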

awh commented 8 years ago

@carlpett I've filed a separate bug in case there's more to your issue than this problem. Can you confirm you're using the proxy with --rewrite-inspect?
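
Something like the following should show the flags the proxy was launched with (assuming the default weaveproxy container/process name):

# Check the proxy's command line for --rewrite-inspect
ps -fp "$(pgrep weaveproxy)"
# or via Docker
docker inspect --format '{{.Args}}' weaveproxy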

awh commented 8 years ago

@carlpett it turns out that the fd leak bug is already fixed in 1.6.2 and 1.7.0 - are you able to upgrade and test?

carlpett commented 8 years ago

@awh Yes, we use --rewrite-inspect. Weird how I didn't find that issue when I searched (I also didn't see it in the changelogs?). Anyway, I will do the upgrade and hopefully that will be it!
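
For the record, the upgrade I have in mind is roughly the following (a sketch only; the script location is the one from the install docs, and this only bounces the proxy, which is where the leak is):

# Fetch the newer weave script and relaunch just the proxy with our current flags
sudo curl -L git.io/weave -o /usr/local/bin/weave
sudo chmod a+x /usr/local/bin/weave
weave stop-proxy
weave launch-proxy --rewrite-inspect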

awh commented 8 years ago

Weird how I didn't find that issue when I searched

Not at all - it was listed as a fix for an unreliable integration test; we didn't realise at the time that it would cause a serious issue like this. Thanks for the report!

carlpett commented 8 years ago

Yup, that fixed it. After 45 minutes, there are still only ~25 fds open. Also, this forced me to do our first live update of Weave under production load, so that was interesting as well :)
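
In case it's useful to anyone else, I'm keeping an eye on it with a simple watch (the PID is resolved on each run, so it survives proxy restarts):

# Re-check the proxy's open fd count every minute
watch -n 60 'ls /proc/$(pgrep weaveproxy)/fd | wc -l'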

pidster commented 8 years ago

@carlpett we'd really like to hear about the live update - could you drop into our Slack channel some time next week & say hi?