MaximeHeckel closed this issue 9 years ago
Thanks @MaximeHeckel. Some questions:
10.7.255.254 is suspiciously close to the broadcast address for 10.7.0.0/16. Is there anything out of the ordinary about the use of that address?

This one is observed on two hosts, both of them running jboss/wildfly and supposed to form a cluster based on multicast UDP packets. The cluster does not behave as it is supposed to (and it does behave OK in a testing environment without the overlay network), which is why we think these issues could be connected. I believe @karm could have a bit more to say regarding the problems observed and the technologies involved.
Thanks @robcza. On the last point, you've answered already - JBoss WildFly. It's not something I am familiar with - is it enough to run up some instances of jboss/wildfly to trigger the problem, or do I need to deploy an application on it too?
@awh @robcza It's necessary to deploy an application that takes advantage of JGroups UDP communication. I'll prepare a simple, self-contained reproducer and link it here.
I'll prepare a simple, self-contained reproducer and link it here.
That's super helpful - thank you @karm!
230.0.0.4 -> 10.7.255.254
Looking at the code, I think that warning has the source and destination reversed. The message makes more sense when correcting for that: 230.0.0.4 is the default multicast group used by jgroups.
PMTU discovery is not possible for multicast - see #419. So multicast packets will probably get fragmented at the interface's MTU, which may well be higher than the MTU weave can transmit to all peers. It might be worth trying to lower the MTU on the interface; e.g. subtract 1450-1414=36 from the value it is set at. Or look at fragmentation in the application - googling suggests jgroups has a FRAG2 module for that.
Changing the MTU is actually the step I've suggested to @MaximeHeckel. The MTU on the container level is "quite high". 1414 is indeed the MTU I successfully tested between those two containers.
ethwe Link encap:Ethernet HWaddr de:8f:b3:63:70:92
inet addr:10.7.0.33 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::dc8f:b3ff:fe63:7092/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65535 Metric:1
The MTU on the container level is "quite high".
64k is what weave sets it to. The question is where the 1450 comes from that we see in the logs. Probably the kernel somehow.
Perhaps weave should set a "safe" MTU on the multicast route it is adding to containers.
btw, this issue is dead easy to reproduce with https://github.com/Redsift/sockperf; just drop the --net=host from the examples and run via the weave proxy. With a message size (-m) of 1410 (it is not quite clear how this relates to the packet size) it works for me; above that it doesn't, and it produces the warnings in the logs as reported by the OP.
With a modified weave script that sets the mtu of the multicast route to 1438 (which is what pmtu discovery produces for unicast), the test works with larger packets and there are no warnings in the logs.
@dpw what do you think of my suggestion of setting a "safe" MTU on the multicast route? perhaps to the same value as you picked for FDP (1410)? And perhaps make it configurable in the same way?
I'm missing something here. How does setting the MTU on the multicast route actually work around the problem? My initial guess was that it did it by causing multicast packets to be fragmented. But that's not consistent with "dropping too big DF broadcast frame". So what does the before and after picture look like in terms of packets on the weave network?
before:
08:49:07.427175 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 40, options (RA))
10.40.0.0 > igmp.mcast.net: igmp v3 report, 1 group record(s) [gaddr 224.18.7.81 to_ex { }]
08:49:07.535471 IP (tos 0x0, ttl 2, id 16710, offset 0, flags [DF], proto UDP (17), length 2028)
10.40.0.0.57129 > 224.18.7.81.5001: [udp sum ok] UDP, length 2000
08:49:07.535533 IP (tos 0x0, ttl 2, id 16711, offset 0, flags [DF], proto UDP (17), length 2028)
10.40.0.0.57129 > 224.18.7.81.5001: [udp sum ok] UDP, length 2000
after:
08:53:29.095162 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 40, options (RA))
10.40.0.0 > igmp.mcast.net: igmp v3 report, 1 group record(s) [gaddr 224.18.7.81 to_ex { }]
08:53:29.188074 IP (tos 0x0, ttl 2, id 45081, offset 0, flags [none], proto UDP (17), length 2028)
10.40.0.0.34119 > 224.18.7.81.5001: [udp sum ok] UDP, length 2000
08:53:29.188143 IP (tos 0x0, ttl 2, id 45082, offset 0, flags [none], proto UDP (17), length 2028)
10.40.0.0.34119 > 224.18.7.81.5001: [udp sum ok] UDP, length 2000
The plot thickens... setting the multicast route's MTU to 1500 - i.e. smaller than the packet size but larger than the supported MTU - produces the same picture as "after", and the test succeeds.
The plot thickens... setting the multicast route's MTU to 1500 - i.e. smaller than the packet size but larger than the supported MTU - produces the same picture as "after", and the test succeeds.
So probably the critical thing is whether the DF bit is getting set (yes before, no after). But why is the DF bit being set in the first place? I tried taking a look at the sockperf code, but it is hilariously bad, and hard to work out what is going on. Still, I suspect it is the kernel that is setting DF, not the apps. But why?
what do you think of my suggestion of setting a "safe" MTU on the multicast route
That is exactly what I was going to suggest this morning - sounds good to me
Still, I suspect it is the kernel that is setting DF, not the apps.
Agreed.
But why?
Why indeed.
what do you think of my suggestion of setting a "safe" MTU on the multicast route
That is exactly what I was going to suggest this morning - sounds good to me
Perhaps it's obvious to everyone else, but I'm not sure what effect this is supposed to have. If it's just about fragmenting the packet, why is there a problem in the first place?
From Matthias' tcpdumps, it seems the root problem might be that the kernel is deciding to set DF on the packets. If we can get to the bottom of why it does that, we will be in a better position to decide what to do about it. Perhaps there is a good reason why that behaviour is changed by setting the MTU on the multicast route, but it seems a bit like "twiddling the knobs" without understanding what is really going on.
Also note that the FDP's "safe MTU" is not supposed to be a guarantee, and it is based on the overhead of vxlan encapsulation. If the underlying network does not permit that MTU to work, we'll fall back to sleeve. So it is more of a compromise: the largest MTU that will support vxlan encapsulation in environments we care about.
"IP_PMTUDISC_WANT will fragment a datagram if needed according to the path MTU, or will set the don't-fragment flag otherwise."
...might be a clue but still leaves some unanswered questions.
Ok, it looks like the kernel is applying the rules described under IP_MTU_DISCOVER in ip(7) to all packets, including multicast (I don't see any relevant special casing for them). A simple experiment with nc -u while turning ip_no_pmtu_disc on and off seems to confirm this.
That seems dodgy to me, since the kernel has no way to do MTU discovery for multicast, and it doesn't know whether the application is prepared to. We might want to report it.
So I guess ip route ... mtu lock has an impact because it disables MTU discovery on the route, and the MTU value set doesn't actually matter. We could do it with the weave script's $MTU of 65535, and it would still have the desired effect.
I wasn't using lock in my tests.
I wasn't using lock in my tests.
If I understand the ip route man page correctly, without that the kernel can decide to update the MTU value on the route table entry after you set it. So it seems safer to use lock. Although since PMTU discovery won't succeed for multicast routes, it might be academic in this case.
ah, so what's happening in the 'before' case is that the kernel sees that there is no PMTU for the destination, and hence sets DF in order to trigger PMTU discovery?
If I understand the ip route man page correctly, without that the kernel can decide to update the MTU value on the route table entry after you set it.
Right, but why would that be problematic?
ah, so what's happening in the 'before' case is that the kernel sees that there is no PMTU for the destination, and hence sets DF in order to trigger PMTU discovery?
Yes. Which can never succeed for multicast due to the lack of frag-needed responses. So this behaviour seems a bit silly to me.
If I understand the ip route man page correctly, without that the kernel can decide to update the MTU value on the route table entry after you set it.
Right, but why would that be problematic?
I'm not saying it's necessarily problematic. But from the ip-route man page:
If the modifier lock is used, no path MTU discovery will be tried, all packets will be sent without the DF bit in IPv4 case or fragmented to MTU for IPv6.
Which sounds like exactly what we want to happen. The setting of the MTU is incidental, and if there was a cleaner way to say "don't do mtu discovery for multicast packets" I'd recommend that instead.
As far as I can remember, packets that are too big are silently discarded when sent to a multicast group, so the DF bit should be irrelevant here, right?
@inercia as per the discussion above, it's the kernel that sets DF, in order to trigger PMTU discovery. That has the unfortunate effect of causing multicast packets to be dropped when they are too large. So the fix is to stop the kernel from attempting to perform PMTU discovery for multicast destinations.
One of our users got the following logs from weave recently on one host:
and the following ones on the other host (which is of course linked to the first one)
This user is on Ubuntu 14.04 with kernel version 3.19.0-28-generic and uses weave 1.0.3.