alex-sherwin opened this issue:
During some debugging it became apparent that weave may be having issues with MTU discovery. Between the client and server exists a firewall whose MTU is set to 1460, but weave is somehow choosing to use an MTU of 490.
When the same test is run from a client/server on the same subnet, there is no issue.
I did a raw tcpdump on the weave interface on both the server and the client in this manner:
This hangs on step 9 as described above.
the raw tcpdump captures for the client/server can be accessed here: https://drive.google.com/folderview?id=0Bxv3aLTVo2eKfkhNSUI3Y2JWb1pnM1VTdVBxdzZoQzBUTTZqTzkwbEVPdldIcTFMNVk4dm8&usp=sharing
Server docker logs weave output:
weave 2015/03/18 16:30:58.466718 Command line options: map[iface:ethwe name:7a:79:3b:63:81:cf wait:20]
weave 2015/03/18 16:30:58.466773 Command line peers: []
weave 2015/03/18 16:30:59.467884 Communication between peers is unencrypted.
weave 2015/03/18 16:30:59.468369 Our name is 7a:79:3b:63:81:cf
weave 2015/03/18 16:30:59.495009 Sniffing traffic on &{380 65535 ethwe 52:4b:83:e4:a3:5a up|broadcast|multicast}
weave 2015/03/18 16:30:59.495038 Discovered our MAC 52:4b:83:e4:a3:5a
weave 2015/03/18 16:30:59.674486 Discovered local MAC 46:0c:6c:8e:73:d7
weave 2015/03/18 16:30:59.730473 Discovered local MAC 7a:79:3b:63:81:cf
weave 2015/03/18 16:31:35.070505 Discovered local MAC 5a:67:c3:4e:d9:9d
weave 2015/03/18 16:31:55.316913 ->[192.168.70.25:57528] connection accepted
weave 2015/03/18 16:31:55.318143 ->[192.168.70.25:57528] completed handshake with 7a:6b:e0:1c:2f:77
weave 2015/03/18 16:31:55.318467 ->[7a:6b:e0:1c:2f:77]: connection added
weave 2015/03/18 16:31:55.322428 ->[7a:6b:e0:1c:2f:77]: connection fully established
weave 2015/03/18 16:31:55.322465 EMSGSIZE on send, expecting PMTU update (IP packet was 60064 bytes, payload was 60056 bytes)
weave 2015/03/18 16:31:55.322481 ->[7a:6b:e0:1c:2f:77]: Effective PMTU set to 1438
weave 2015/03/18 16:31:55.363283 ->[7a:6b:e0:1c:2f:77]: Effective PMTU verified at 1438
weave 2015/03/18 16:31:55.813341 Discovered remote MAC 7a:6b:e0:1c:2f:77 at 7a:6b:e0:1c:2f:77
weave 2015/03/18 16:32:19.511586 Discovered remote MAC 36:45:01:da:ff:89 at 7a:6b:e0:1c:2f:77
Client docker logs weave output:
weave 2015/03/18 16:31:15.005083 Command line options: map[iface:ethwe name:7a:6b:e0:1c:2f:77 wait:20]
weave 2015/03/18 16:31:15.005169 Command line peers: []
weave 2015/03/18 16:31:16.006236 Communication between peers is unencrypted.
weave 2015/03/18 16:31:16.006657 Our name is 7a:6b:e0:1c:2f:77
weave 2015/03/18 16:31:16.061072 Sniffing traffic on &{121 65535 ethwe f6:a6:9a:dc:6e:79 up|broadcast|multicast}
weave 2015/03/18 16:31:16.061114 Discovered our MAC f6:a6:9a:dc:6e:79
weave 2015/03/18 16:31:16.373416 Discovered local MAC 7a:6b:e0:1c:2f:77
weave 2015/03/18 16:31:16.608464 Discovered local MAC 36:45:01:da:ff:89
weave 2015/03/18 16:31:55.520610 ->[192.168.130.95:6783] attempting connection
weave 2015/03/18 16:31:55.523243 ->[192.168.130.95:6783] completed handshake with 7a:79:3b:63:81:cf
weave 2015/03/18 16:31:55.523862 ->[7a:79:3b:63:81:cf]: connection added
weave 2015/03/18 16:31:55.525327 ->[7a:79:3b:63:81:cf]: connection fully established
weave 2015/03/18 16:31:55.525429 EMSGSIZE on send, expecting PMTU update (IP packet was 60064 bytes, payload was 60056 bytes)
weave 2015/03/18 16:31:55.525472 ->[7a:79:3b:63:81:cf]: Effective PMTU set to 1438
weave 2015/03/18 16:31:55.527641 ->[7a:79:3b:63:81:cf]: Effective PMTU verified at 1438
weave 2015/03/18 16:31:59.800015 Discovered remote MAC 7a:79:3b:63:81:cf at 7a:79:3b:63:81:cf
weave 2015/03/18 16:32:03.018063 Discovered remote MAC 5a:67:c3:4e:d9:9d at 7a:79:3b:63:81:cf
weave 2015/03/18 16:32:11.897951 EMSGSIZE on send, expecting PMTU update (IP packet was 927 bytes, payload was 919 bytes)
weave 2015/03/18 16:32:11.897977 ->[7a:79:3b:63:81:cf]: Effective PMTU set to 490
weave 2015/03/18 16:32:11.899324 ->[7a:79:3b:63:81:cf]: Effective PMTU verified at 490
weave 2015/03/18 16:32:12.099969 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:12.100170 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:12.304684 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:12.712316 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:13.528495 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:15.164134 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:18.436008 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:24.981010 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:38.085151 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:38.492605 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:39.308104 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:40.944347 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:44.212285 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:50.756138 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:33:03.844305 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:33:30.052198 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
Thanks for the pcap files.
There's something conspicuously odd in the client-side capture right from the start. Although the client is sending packets to the server at 10.0.9.19, the packets the server sends back always arrive with the destination IP address set to 172.17.0.32. This is probably an address on the docker network bridge. I don't know for sure that this is connected to the MTU issue, but I'm surprised the client and server are able to communicate at all.
That strange address is probably due to a poor interaction between the NAT that docker does in order to allow containers on the docker bridge to talk to the outside world, and how 'weave expose' is implemented.
It's hard to say whether this is really a bug or not, because the way you are trying to use weave seems strange. As I understand it, you want to run a mongo server in a container, and a mongo client in a container, talking over the weave network. Why not do this with 'weave run' on both ends? You already do that on the server side (and there the 'weave expose' and the port mapping option after 'weave run' appear to be redundant). But on the client you are starting a container with 'docker run' and relying on 'weave expose' to allow it to talk to the server container. This is unusual: The intended use of 'weave expose' is to allow things that are not in containers to talk to a weave network.
Does this make sense? Is there a particular reason why you are trying to use 'weave expose' in that way? If you use 'weave run' for both the client and server containers, do the large requests work?
I'm using weave expose on the hosts just so that I can easily confirm, using ping, that the weave network is working.
Also, regardless of whether the mongo client is run from within a weave-enabled container or not, the end result is the same.
However, your notes on that 172.17.0.32 address made me think twice: this is the VMware VM NAT...
I just re-tested everything with these VM networking setups:
So you're right: it's a problem isolated to my VM network configuration (albeit one where the default VMware networking setup of NAT doesn't work).
But that seems logical; how else would the server host know?
The odd thing here is that quite a lot appeared to work, all the way to doing small inserts in mongo. Ideally, weave should at least have detected that something wasn't well. Even better if it could have coped and got things working.
Perhaps 'weave should have detected' could apply equally to #416/#417, which had a similar property: a few things would work even though the network configuration was unhelpful.
@alex-sherwin, thanks for your testing. I'm still very puzzled by this!
Some thoughts:
@alex-sherwin, if you are able to do some more testing for us, could you record pcap files as before, but in the case with bridged networking, where the large queries across the weave network succeed? That will indicate whether the destination address rewriting is really related to the problem or not.
@dpw, I've re-run the exact same test when my VM is running with only a bridged interface. As mentioned before, this all works with no issues. Here are the pcap files for this scenario:
Thanks!
On the client side, the destination IP address is still being replaced, so that is not directly the cause of the problem. The difference in this capture is that there are no ICMP "frag needed" packets generated by weave, suggesting that the effective PMTU on the underlying network was not restricted to 490 as before.
So I have a theory about what is happening when it hangs: the modification of the destination address is not sufficient to prevent the TCP connection from operating (surprisingly), but it is sufficient to prevent the kernel from registering the ICMP "frag needed". When the client sends a packet that is too large (due to the large write), weave drops it and sends a "frag needed" back to the client side, which ignores it; after a while TCP resends the packet, which gets dropped with another "frag needed", and so on, hence the hang. This is visible in the client capture where it hung.
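For reference, here is a rough Go sketch (hypothetical, not weave's actual code) of what such a "frag needed" message contains: ICMP type 3, code 4 carrying the next-hop MTU the sender is being asked to drop to, followed by the IP header and first 8 bytes of the dropped datagram. This is the message behind the "Sending ICMP 3,4 ... PMTU= 490" log lines above.

// Hypothetical sketch of an ICMP "fragmentation needed" message (RFC 1191).
package main

import (
	"encoding/binary"
	"fmt"
)

// icmpChecksum computes the standard ones'-complement checksum over the message.
func icmpChecksum(b []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(b); i += 2 {
		sum += uint32(binary.BigEndian.Uint16(b[i:]))
	}
	if len(b)%2 == 1 {
		sum += uint32(b[len(b)-1]) << 8
	}
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return ^uint16(sum)
}

// fragNeeded builds the ICMP payload: type 3, code 4, checksum, 2 unused bytes,
// 2-byte next-hop MTU, then the original IP header plus first 8 data bytes.
func fragNeeded(nextHopMTU uint16, origIPHeaderAnd8 []byte) []byte {
	msg := make([]byte, 8+len(origIPHeaderAnd8))
	msg[0] = 3                                      // type: destination unreachable
	msg[1] = 4                                      // code: fragmentation needed and DF set
	binary.BigEndian.PutUint16(msg[6:], nextHopMTU) // next-hop MTU field
	copy(msg[8:], origIPHeaderAnd8)
	binary.BigEndian.PutUint16(msg[2:], icmpChecksum(msg)) // checksum over the whole message
	return msg
}

func main() {
	// 490 is the PMTU weave advertises in the log lines above.
	fmt.Printf("%x\n", fragNeeded(490, make([]byte, 28)))
}

If the host that should act on this message never registers it, PMTU discovery stalls in exactly the way described above.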
As I mentioned before, the modification of the destination address suggests an iptables rule. So @alex-sherwin, could you show the output of sudo iptables-save in the client VM? It doesn't matter what kind of networking the client has for this. Also, are you using any firewall software within the client VM?
The PMTU of 490 when VMware NAT is used seems strangely low, but is allowed.
@dpw, here's the output of iptables-save in the VM (currently the VM is back to a single NAT interface):
sudo iptables-save
# Generated by iptables-save v1.4.21 on Fri Mar 20 13:23:21 2015
*nat
:PREROUTING ACCEPT [6:468]
:INPUT ACCEPT [6:468]
:OUTPUT ACCEPT [1022:77063]
:POSTROUTING ACCEPT [922:71063]
:DOCKER - [0:0]
:WEAVE - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -j WEAVE
-A WEAVE ! -s 10.0.1.0/24 -o weave -j MASQUERADE
-A WEAVE -s 10.0.1.0/24 ! -o weave -j MASQUERADE
COMMIT
# Completed on Fri Mar 20 13:23:21 2015
# Generated by iptables-save v1.4.21 on Fri Mar 20 13:23:21 2015
*filter
:INPUT ACCEPT [75982:28474773]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [81854:30034942]
:DOCKER - [0:0]
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j DROP
-A FORWARD -i weave -o weave -j ACCEPT
COMMIT
# Completed on Fri Mar 20 13:23:21 2015
There is no other firewall of any kind in use in the VM or the VM host
Thanks. I don't see anything obviously wrong there. One question though:
...
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
...
That suggests docker is using the 172.17.x.x address range on its internal bridge. Earlier, you said that the same address range is being used by the VMware NAT interface. Could you double-check that?
I must have looked too quickly: they are close but different subnets. VMware NAT is 172.16.x.x and the docker bridge is 172.17.x.x.
So the IPs in the pcap are from the docker bridge, as you originally suspected.
docker0 Link encap:Ethernet HWaddr 56:84:7a:fe:97:99
inet addr:172.17.42.1 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::5484:7aff:fefe:9799/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:38 errors:0 dropped:0 overruns:0 frame:0
TX packets:118 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:2568 (2.5 KB) TX bytes:11810 (11.8 KB)
eth0 Link encap:Ethernet HWaddr 00:0c:29:39:54:c1
inet addr:172.16.158.101 Bcast:172.16.158.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fe39:54c1/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:381947 errors:0 dropped:0 overruns:0 frame:0
TX packets:221838 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:508118046 (508.1 MB) TX bytes:19471935 (19.4 MB)
So, I had been trying to reproduce the 172.17.x.x destination addresses from your capture files, without success: tcpdump on the client-side weave bridge just showed 10.0.x.x addresses. But I was using a Fedora 21 test VM I had lying around to act as the client side.
So today I tried with a fresh Ubuntu 14.04 VM. And it does show the 172.17.x.x destination addresses on the client side, in exactly the same test scenarios where Fedora didn't. The iptables rules look the same, so my guess is that a kernel change or patch in one of the distros is responsible, but I haven't looked into that yet (I have docker 1.5 in both cases).
I haven't been able to reproduce the hangs yet. The PMTU between my weave routers is 1500, so I have to make the "large write" case even larger to trigger a "frag needed". But the write still seems to complete successfully. I'll work more on this, constraining the MTU to 490 in case that matters.
One thing I didn't appreciate until late yesterday is that the 'weave expose' on the client side is significant. Without the 'weave expose', if you use 'weave run' for the client container, what happens is that the mongo client tries the request before 'weave run' has had time to set up the ethwe interface, and so the mongo client's attempt to connect to the server simply fails. With 'weave expose', the ethwe interface still isn't set up in time, so the connection attempt goes to the container's eth0, over the docker bridge onto the host (being masqueraded in the process), and then onto the weave bridge and across the weave network. I don't think this convoluted path is really anticipated to work, and while it doesn't explain the destination address rewriting, it certainly seems like ample opportunity for odd things like that to happen. And so, another approach to fixing the hangs might be to run the client under 'weave run' but with a delay so that it does use the ethwe interface. But I'm going to try to reproduce the hangs before I resort to that.
another approach to fixing the hangs might be to run the client under 'weave run' but with a delay so that it does use the ethwe interface
Instead of a simple delay, the recommended approach is to loop (with a delay) until the interface comes up.
Instead of a simple delay, the recommended approach is to loop (with a delay) until the interface comes up.
Yes, but I imagine that could be hard to achieve in the context of a stripped-down mongodb container, whereas a simple delay looks expressible in the mongo shell.
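For illustration, a minimal Go sketch of the poll-until-up idea (hypothetical; in practice this could equally be a shell loop in the container entrypoint or a delay in the mongo shell):

// Hypothetical sketch: wait for the ethwe interface before starting the client workload.
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForInterface polls until the named interface exists and is up,
// or gives up when the timeout expires.
func waitForInterface(name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if iface, err := net.InterfaceByName(name); err == nil && iface.Flags&net.FlagUp != 0 {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("interface %s did not come up within %s", name, timeout)
}

func main() {
	if err := waitForInterface("ethwe", 20*time.Second); err != nil {
		panic(err)
	}
	// ... start the mongo client here ...
}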
I have found the cause of the hang: although the minimum IPv4 MTU is 68, Linux has a minimum limit on the PMTU of 552: http://cschramm.blogspot.co.uk/2012/12/why-does-linux-enforce-minimum-mtu-of.html
You can change this limit. So @alex-sherwin, if you do
echo 200 >/proc/sys/net/ipv4/route/min_pmtu
in your client VM, your hang should go away.
I'm still trying to get my head around why Linux sets this limit.
It also goes some way to explaining where the "Effective PMTU set to 490" comes from: effectiveOverhead() = 28 + 6 + (6 + 6 + 2) + 14 = 62, and 552 - 62 = 490.
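As a sanity check on that arithmetic, a tiny Go sketch; the constant names are assumptions based on the breakdown above, not weave's exact identifiers:

// Hypothetical sketch of the overhead arithmetic.
package main

import "fmt"

const (
	udpOverhead      = 28        // 20-byte IPv4 header + 8-byte UDP header
	nameOverhead     = 6         // assumed: 6-byte sender name prefixed to each UDP packet
	frameOverhead    = 6 + 6 + 2 // src MAC + dst MAC + frame length
	ethernetOverhead = 14        // Ethernet header of the captured frame
	linuxMinPMTU     = 552       // default /proc/sys/net/ipv4/route/min_pmtu
)

func main() {
	overhead := udpOverhead + nameOverhead + frameOverhead + ethernetOverhead
	fmt.Println(overhead, linuxMinPMTU-overhead) // prints: 62 490
}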
The whole destination address modification thing seems to be a red herring. If I tcpdump on the weave router container's ethwe interface, I don't see that rewriting. And if I tcpdump on the docker0 bridge, everything is fully NATted. tcpdumping on the weave bridge seems to show things in a strange halfway state.
Re-tested with a lower /proc/sys/net/ipv4/route/min_pmtu setting and it works this way.
Re-tested with a lower /proc/sys/net/ipv4/route/min_pmtu setting and it works this way
Well, that at least confirms that I'm reproducing your problem.
I'm still not sure why the PMTU on the underlying network would drop to the default Linux minimum of 552. I wondered if it might be because ICMP "frag needed" packets were being blocked, but I spent some time trying to reproduce that without success.
Also, I'm still trying to think of anything useful weave can do to work around this problem, but I don't have any bright ideas so far. The best might simply be to show the problem in the weave status output.
@dpw, I'd be fine with that... the worst part is always not knowing. I (and, I hope, anyone else) wouldn't be using a VMware NAT in this manner for anything other than development purposes.
@dpw
The best might simply be to show the problem in the weave status output.
How hard is that?
How hard is that?
Detect when PMTU for a connection falls such that the overlay MTU is below 552, stash the fact in the connection, and gossip it around. Sounds doable, but I don't claim to know the gossipy side of things well.
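Something like the following, perhaps (a hypothetical Go sketch with invented type and field names, showing only the detection and stashing; the gossip and status-reporting parts are omitted):

// Hypothetical sketch of detecting a collapsed PMTU on a connection.
package main

import "fmt"

const (
	linuxMinPMTU = 552 // default /proc/sys/net/ipv4/route/min_pmtu
	overhead     = 62  // effectiveOverhead() from the earlier comment
)

// connection stands in for weave's per-peer connection state (invented).
type connection struct {
	peer          string
	effectivePMTU int
	pmtuTooSmall  bool // stashed flag that could be gossiped / shown in status
}

func (c *connection) setEffectivePMTU(pmtu int) {
	c.effectivePMTU = pmtu
	// If the underlying PMTU has fallen to (or below) the Linux minimum of 552,
	// the overlay MTU is at most 552 - 62 = 490, the situation seen in the logs.
	c.pmtuTooSmall = pmtu+overhead <= linuxMinPMTU
}

func main() {
	c := &connection{peer: "7a:79:3b:63:81:cf"}
	c.setEffectivePMTU(490)
	fmt.Println(c.peer, "pmtu-too-small:", c.pmtuTooSmall)
}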
However, I doubt that is the right solution. When the problem occurs, weave will still be broken, and in circumstances that are already fairly clear from docker logs weave (now we know what to look for). So we don't gain much at all.
One important consideration that is still not clear is: what conditions cause the PMTU to fall to 552, and how likely do we think they are to occur? If it seems rare, maybe we don't need to do anything. If it seems potentially common, we should think further about whether weave can reasonably work around the problem.
Last week I talked to a friend who has lots of experience with MTUs and what happens to them when ICMP packets are being blocked, and it sounds like there is a good chance that is the cause. More experimentation is needed though.
I've been working on other things, but I can drop them to return to this if it's a priority.
MongoDB host machine: Dell R620 server (40 cores, 128 GB RAM), Ubuntu 14.04, Docker 1.4.1. uname -a:
Client machine: Ubuntu 14.04 VMware Fusion 7.1.1 VM (hardware version 11, provisioned with 12 GB RAM + 8 cores), Docker 1.5.0. uname -a:
Both are using the latest weave release:
The client is able to insert small documents into MongoDB over the weave network, but is unable to insert large documents; output from
sudo tcpdump -i weave
on the client machine shows "need to frag" messages.
Sample server setup (server physical IP is 192.168.130.95):
Client setup:
Small write over weave network from client:
Large write over the weave network from client (hangs):
Same large write over the physical network (works fine):
Output from client sudo tcpdump -i weave:
Server docker logs weave output:
Client docker logs weave output: