weaveworks / scope

Monitoring, visualisation & management for Docker & Kubernetes
https://www.weave.works/oss/scope/
Apache License 2.0

Speed issues under Virtualbox #769

Closed neo21670 closed 8 years ago

neo21670 commented 8 years ago

I'm trying to set up Scope in a VirtualBox 4 based environment. The box is set up by a basic Vagrantfile: 2 interfaces, 1.5 GB of RAM, 2 cores with no throttling. The actual host is a recent Core i7; virtualization is supported and enabled, etc. (Basically, multiple VMs can run without a hitch.)

After setting up Scope and launching it with sudo scope launch, the whole VM slows to a crawl. The Scope web UI runs fine and reacts in an instant, and attaching to a container's shell is fluid and instantaneous, but anything else is virtually paused. SSH shells do not time out, but they are unable to respond. Neither the host nor the guest shows excess CPU usage based on top output. Requests to a basic web service running in a test container load after several seconds of delay.

Upon stopping the Scope container from the GUI, the system remains unusable for about 30-35 seconds, then everything returns to normal. CPU usage during the exit shows no change. RAM usage of the whole system + Scope + test container setup tops out around 300 MB of active RAM. Logs show no relevant information: actually nothing at all, besides the normal output of the web service.

tomwilkie commented 8 years ago

Thanks for the bug report. Would you mind giving us a bit more information?

- Guest OS and kernel version (uname -a)
- Docker version
- The Vagrantfile you're using
- Scope version and logs
- top output while the issue occurs

With this info I'll try to reproduce it here.

In 0.11 we've tried to improve the CPU usage of Scope, and we've also found various performance issues in Docker recently (https://github.com/docker/docker/issues/17720), that could be affecting this.

neo21670 commented 8 years ago

Sure, thanks!

Guest OS is Ubuntu 15.04:

vagrant@devbox:~$ uname -a
Linux devbox 3.19.0-37-generic #42-Ubuntu SMP Fri Nov 20 18:22:05 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
vagrant@devbox:~$ docker --version
Docker version 1.9.1, build a34a1d5

Vagrantfile.txt
devboxup relevant.txt

Scope is version 0.11.0.

weavescopelog.txt - I booted up the VM, started Scope with sudo, let it fire up its web interface, confirmed the VM is unresponsive, then proceeded to halt the weavescope container from the GUI. I see two instances of received SIGINT/SIGTERM, even though I only issued the stop command once. (I repeated the experiment to make sure of that.)
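
For reference, the full repro sequence is roughly this (a sketch of the steps described above, using the attached Vagrantfile):

vagrant up
vagrant ssh
sudo scope launch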

topoutput.txt

Issue was present in 0.10.0 as well.

tomwilkie commented 8 years ago

Thanks @neo21670, I'll look at this immediately.

tomwilkie commented 8 years ago

Possibly a dupe of #715?

tomwilkie commented 8 years ago

From the logs it looks like the CPU isn't busy, and I cannot reproduce this on my Vagrant VM with Docker 1.9.1 (but Ubuntu 14.04). I'm going to upgrade to a later Ubuntu and try again.

neo21670 commented 8 years ago

That seems like a good idea. I'm setting up a Trusty VM right now.

neo21670 commented 8 years ago

Yeah, 14.04 LTS is also affected by the issue.

tomwilkie commented 8 years ago

Another report of something similar: #699.

My slow internet is still downloading 15.04.

tomwilkie commented 8 years ago

Is there anything in dmesg? Would you mind posting the output here?

tomwilkie commented 8 years ago

I've run an almost identical Ubuntu 15.04 VirtualBox VM with Docker 1.9.1 and a bunch of containers, and I can't get this to reproduce. I believe there is a problem though (3 reports of this), so would you mind trying a few things for me:

- launching with --probe.conntrack false
- launching with --probe.docker false

And let me know what effect they have. Also, are you running the Weave network too? Just checking. Final question: any sign of the OOM killer in dmesg?

tomwilkie commented 8 years ago

Not sure it's relevant, but I found my VirtualBox VM with 2 vCPUs was significantly slower than with 1. May be anecdotal.

neo21670 commented 8 years ago

On dmesg: nothing special after booting. I downloaded a fresh copy of 0.11.1 and fired it up; the issue still persists. Here's what changed after shutting down Scope from the web GUI.

[  533.426550] cgroup: docker (23568) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
[  533.426554] cgroup: "memory" requires setting use_hierarchy to 1 on the root.
[  533.656410] Netfilter messages via NETLINK v0.30.
[  533.659540] ctnetlink v0.93: registering with nfnetlink.

I haven't seen anything related to OOM or memory at all.

--probe.conntrack false - no improvement. --probe.docker false - no improvement. Obviously, I was unable to shut down the container without Docker probing.
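
For reference, each flag was passed at launch, along these lines (a sketch; scope stop tears down the previous instance before relaunching):

sudo scope stop
sudo scope launch --probe.conntrack false
sudo scope launch --probe.docker false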

And finally: no, I haven't introduced Weave Net to my stack yet.

Also, here's /proc/cpuinfo - I thought the flags might be useful.

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 70
model name  : Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
stepping    : 1
microcode   : 0x19
cpu MHz     : 2498.760
cache size  : 6144 KB
physical id : 0
siblings    : 2
core id     : 1
cpu cores   : 2
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 5
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm
bogomips    : 4997.52
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

neo21670 commented 8 years ago

Also, I tried firing up the same VM with 4 cores: no change. Same VM, double the RAM, 1 core: same issue.

neo21670 commented 8 years ago

I tried the same procedure on a VMware based VM (Ubuntu 14.04 LTS, 1.5 GB, 2 cores): no issues experienced. I guess I'll try to set up a VirtualBox 5 based Vagrant env to see if the issue persists.

neo21670 commented 8 years ago

VirtualBox 4.3.34: same issue. It might be interesting that I chose to force shut down (Close -> Power Off) the VM through the VBox GUI. (So far I've been using vagrant destroy -f.) Although that operation is usually instantaneous on any VM, it takes quite a few seconds when Scope is running. Proceeding with the VBox 5 install now.

tomwilkie commented 8 years ago

Thanks for all the info! Unfortunately it doesn't point to anything obvious. I'm running VirtualBox 4.3 myself; interesting that VMware doesn't experience the issue.

When the issue occurs, what does top report on the machine hosting the VM?

I'll speak with my colleagues and see if they have any suggestions.

2opremio commented 8 years ago

I can reproduce this with Scope 0.11.1, observing similar symptoms to the ones described by @neo21670:

  1. When I have enough containers running other than Scope, the VM freezes (it's completely unresponsive over ssh) as soon as I launch the Scope probe container. (I happen to run the probe and app separately in a Kubernetes cluster.)
  2. During the freeze, the Scope web UI remains responsive.
  3. As soon as I pause the probe container using the Scope UI, everything is responsive again.

The VM has 4 GB of RAM reserved and can use 2 CPUs, i.e.:

    vb.memory = 4096
    vb.cpus = 2

The problem surfaces when spawning Scope while running around 30 containers.

If I spawn Scope while running fewer containers (20), the system freezes for a while (~2 minutes) but then recovers.

I am running Docker 1.8.3.

I will gather more information next week.

2opremio commented 8 years ago

There is a high chance this is a duplicate of https://github.com/weaveworks/scope/issues/812

2opremio commented 8 years ago

@neo21670 If you are still interested in this, could you please try to reproduce with --probe.spy.interval=3s?
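
For example, relaunching along these lines (a sketch, following the same flag-passing pattern as above):

sudo scope launch --probe.spy.interval=3s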

neo21670 commented 8 years ago

Sure, I'll check that in a day; thanks for the heads-up. On this being a duplicate of #812: I doubt that could be it. I'd like to emphasize that neither the host nor the guest running Docker experiences high CPU usage.

2opremio commented 8 years ago

@neo21670 Did you get around to running the test?

2opremio commented 8 years ago

I can now reproduce the problem systematically while running the Kubernetes cluster I mentioned above. The memory/CPU consumption of the probe is only at 1.5%/20%, so resource usage shouldn't be triggering the problem, as @neo21670 mentioned.

As surprising as it may be, doing

sudo bash -c "echo 'nameserver 8.8.8.8' > /etc/resolv.conf"

before launching Scope makes the problem go away. As soon as I restore the original file and restart Scope, the problem comes back.

My VM is using VirtualBox's DNS proxy in NAT mode with the host resolver, which has caused us application-level problems in the past.

In this case, it renders the NAT network interface of the VM unusable. More specifically, it won't allow any external connections from the VM, and it freezes ssh sessions started with vagrant ssh (since that uses port forwarding through the NAT interface).

However, all the other interfaces are unaffected. In particular, ssh sessions and connections to the Scope UI made through a host-only network interface won't be affected, which is why I believe @neo21670 observes that

Scope web UI runs fine and reacts in an instant

I don't know enough about VirtualBox to see how its NAT DNS proxy can cause the NAT interface to fail, and there don't seem to be any upstream tickets covering the issue, but I am pretty sure this is the cause. @neo21670 Can you confirm?

I also don't know why Scope triggers the problem when the number of containers is >20; my guess is that it has something to do with the reverse DNS lookups from Scope. Maybe we are hammering the VirtualBox DNS proxy. I will investigate, but in any case I don't think it should render the NAT interface unusable.
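
(To rule the proxy out, the setting can also be toggled from the host while the VM is powered off; a sketch, assuming the VM is named devbox:)

VBoxManage modifyvm "devbox" --natdnshostresolver1 off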

2opremio commented 8 years ago

It seems that VirtualBox's proxy resolver in NAT mode chokes on IPv6 DNS lookups, also causing the NAT interface to be unresponsive.

I have managed to reproduce the problem, without launching Scope at all, by simply doing:

while true; do host -t AAAA scope.weave.local; sleep 1; done

Every time an IPv6 query is made to VirtualBox's proxy resolver, the whole NAT interface becomes unresponsive for a while. We can see the impact by pinging www.google.com (which resolves and routes through the NAT network) and then executing host -t AAAA scope.weave.local, which causes the NAT interface to be unresponsive for about 8 seconds.

$ ping www.google.com
PING www.google.com (216.58.213.164) 56(84) bytes of data.
64 bytes from 216.58.213.164: icmp_seq=1 ttl=63 time=3.55 ms
64 bytes from 216.58.213.164: icmp_seq=2 ttl=63 time=3.88 ms
64 bytes from 216.58.213.164: icmp_seq=3 ttl=63 time=3.73 ms
64 bytes from 216.58.213.164: icmp_seq=4 ttl=63 time=3.95 ms
64 bytes from 216.58.213.164: icmp_seq=5 ttl=63 time=4.68 ms
64 bytes from 216.58.213.164: icmp_seq=6 ttl=63 time=4.98 ms
64 bytes from 216.58.213.164: icmp_seq=7 ttl=63 time=3.73 ms
64 bytes from 216.58.213.164: icmp_seq=8 ttl=63 time=4.89 ms
64 bytes from 216.58.213.164: icmp_seq=9 ttl=63 time=3.72 ms
64 bytes from 216.58.213.164: icmp_seq=10 ttl=63 time=2.83 ms
64 bytes from 216.58.213.164: icmp_seq=11 ttl=63 time=3.04 ms
64 bytes from 216.58.213.164: icmp_seq=12 ttl=63 time=3.91 ms
64 bytes from 216.58.213.164: icmp_seq=13 ttl=63 time=4.85 ms
64 bytes from 216.58.213.164: icmp_seq=14 ttl=63 time=3.70 ms
64 bytes from 216.58.213.164: icmp_seq=15 ttl=63 time=5.06 ms
64 bytes from 216.58.213.164: icmp_seq=16 ttl=63 time=5.02 ms # <- I execute 'host -t AAAA scope.weave.local'
64 bytes from 216.58.213.164: icmp_seq=17 ttl=63 time=9541 ms
64 bytes from 216.58.213.164: icmp_seq=18 ttl=63 time=8541 ms
64 bytes from 216.58.213.164: icmp_seq=19 ttl=63 time=7533 ms
64 bytes from 216.58.213.164: icmp_seq=21 ttl=63 time=5532 ms
64 bytes from 216.58.213.164: icmp_seq=20 ttl=63 time=6533 ms
64 bytes from 216.58.213.164: icmp_seq=22 ttl=63 time=4525 ms
64 bytes from 216.58.213.164: icmp_seq=23 ttl=63 time=3515 ms
64 bytes from 216.58.213.164: icmp_seq=24 ttl=63 time=2517 ms
64 bytes from 216.58.213.164: icmp_seq=25 ttl=63 time=1517 ms
64 bytes from 216.58.213.164: icmp_seq=26 ttl=63 time=517 ms
64 bytes from 216.58.213.164: icmp_seq=27 ttl=63 time=3.72 ms
64 bytes from 216.58.213.164: icmp_seq=28 ttl=63 time=3.53 ms
64 bytes from 216.58.213.164: icmp_seq=29 ttl=63 time=4.47 ms

2opremio commented 8 years ago

Related:
http://stackoverflow.com/questions/12751597/centos-virtualbox-5-second-latency-on-http-requests/12751862#12751862
http://superuser.com/questions/370559/10-second-delay-for-local-tld-in-mac-os-x-lion

tomwilkie commented 8 years ago

Is there any way we can work around this? Scope does a bunch of reverse lookups; could they be triggering this?
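
(By reverse lookups I mean PTR queries on endpoint IPs, e.g. with an address from the ping output above; a sketch:)

host 216.58.213.164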

2opremio commented 8 years ago

@tomwilkie I am waiting for @neo21670 to confirm, but in my setup it's not the reverse DNS lookups that cause it but simply the IPv6 lookups on the .local domain (see above).

Any way we can work around this?

I believe that disabling AAAA record lookups for Scope targets would circumvent this. But I am not sure we should, since VirtualBox is clearly at fault.
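
For comparison, an A-record-only lookup of the same name should avoid the buggy path, since only the AAAA queries trigger the freeze (a sketch, using the same host tool as above):

host -t A scope.weave.local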

I just confirmed that natdnshostresolver is not enabled by default. Without it, the problem doesn't seem to arise, so I would say the impact of this is rather low.

It can be easily reproduced by doing ...

vagrant up
vagrant ssh
host -t AAAA scope.weave.local

... with this Vagrantfile:

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/ubuntu-15.04-amd64"
  config.vm.box_url = "https://cloud-images.ubuntu.com/vagrant/vivid/current/vivid-server-cloudimg-amd64-vagrant-disk1.box"
  config.vm.provider :virtualbox do |vb|
    vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"]
  end
end

The problem seems to be caused by OS X's DNS resolver (on my host) timing out when looking up AAAA records in the .local domain.

$ host www.google.com # fine
www.google.com has address 216.58.213.164
www.google.com has IPv6 address 2a00:1450:4009:811::2004
$ host -t AAAA www.google.com # fine
www.google.com has IPv6 address 2a00:1450:4009:811::2004
$ host something.local # not found, but fails as expected
Host something.local not found: 3(NXDOMAIN)
$ host -t AAAA something.local # causes a timeout!
;; connection timed out; no servers could be reached

This shouldn't cause the NAT network to collapse, though. I will just stop digging and report it upstream.

2opremio commented 8 years ago

Upstream bug report: https://www.virtualbox.org/ticket/15074

2opremio commented 8 years ago

I will close this in a few days if there's no more activity.

@neo21670 We would appreciate it if you could confirm our diagnosis is correct.

neo21670 commented 8 years ago

Hi, and sorry for the delay; I've been quite busy.

First, I pulled up a fresh install of everything (VirtualBox, Vagrant, Docker, etc.) and tried my usual workflow, which failed just as my original report described.

Next, I commented out the natdnshostresolver part of my Vagrantfile. After provisioning, the VM was able to resolve our currently required domains, so whatever originally made me need that setting got fixed in the background. Most importantly, both Scope and the other processes now work fine, so I guess this confirms that Scope works fine as long as natdnshostresolverN is set to off.

Thanks for your time investigating this issue!

2opremio commented 8 years ago

Thank you for confirming, @neo21670.

What Vagrant box are you using? And are you running OS X? It would help us understand how common this problem is.

I am closing this now since it's a VirtualBox bug, but we may add a workaround if the upstream bug is not fixed soon and a lot of people are affected.

In the meantime, feel free to comment on https://www.virtualbox.org/ticket/15074

neo21670 commented 8 years ago

I'm using the official Ubuntu Trusty and Vivid boxes, usually the latest ones. Same goes for my OS: OS X, latest releases. When it comes to VirtualBox I'm usually a bit more conservative with upgrading, but I always review the changelog and upgrade when I'm affected.

2opremio commented 8 years ago

The upstream ticket doesn't show any progress. However, I don't think we do any AAAA lookups to .local domains after https://github.com/weaveworks/scope/pull/867. @tomwilkie Can you confirm?
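
One way to double-check is to watch the probe's DNS traffic while Scope runs (a sketch; the interface name eth0 is an assumption):

sudo tcpdump -n -i eth0 udp port 53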

tomwilkie commented 8 years ago

It would still be possible for a user to trigger this by providing a .local address as a Scope peer; but yes, by default our scope.weave.local lookups will only look for A records:

https://github.com/weaveworks/scope/blob/master/probe/appclient/resolver.go#L65

2opremio commented 8 years ago

but yes, by default our scope.weave.local lookups will only look for A records

Great, that's what I meant.

2opremio commented 8 years ago

This was marked as fixed upstream: https://www.virtualbox.org/ticket/15074