MaximeHeckel closed this issue 9 years ago
Thanks @MaximeHeckel. Some questions:
10.7.255.254 is suspiciously close to the broadcast address for 10.7.0.0/16. Is there anything out of the ordinary about the use of that address?

This one is observed on two hosts, both of them running jboss/wildfly and supposed to form a cluster based on multicast UDP packets. The cluster does not behave as it is supposed to (and it does behave OK in a testing environment without the overlay network), which is why we think these issues could be connected. I believe @karm could have a bit more to say regarding the problems observed and the technologies involved.
Thanks @robcza. On the last point, you've answered already - JBoss WildFly. It's not something I am familiar with - is it enough to run up some instances of jboss/wildfly to trigger the problem, or do I need to deploy an application on it too?
@awh @robcza It's necessary to deploy an application that takes advantage of JGroups UDP communication. I'll prepare a simple, self-contained reproducer and link it here.
I'll prepare a simple, self-contained reproducer and link it here.
That's super helpful - thank you @karm!
230.0.0.4 -> 10.7.255.254
Looking at the code, I think that warning has the source and destination reversed. The message makes more sense when correcting for that: 230.0.0.4 is the default multicast group used by jgroups.
PMTU discovery is not possible for multicast - see #419. So multicast packets will probably get fragmented at the interface's MTU, which may well be higher than the MTU weave can transmit to all peers. It might be worth trying to lower the MTU on the interface; e.g. subtract 1450-1414=36 from the value it is set at. Or look at fragmentation in the application - googling suggests jgroups has a FRAG2 module for that.
Changing the MTU is actually the step I've suggested to @MaximeHeckel. The MTU on the container level is "quite high". 1414 is indeed the MTU I successfully tested between those two containers.
ethwe Link encap:Ethernet HWaddr de:8f:b3:63:70:92
inet addr:10.7.0.33 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::dc8f:b3ff:fe63:7092/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65535 Metric:1
The MTU on the container level is "quite high".
64k is what weave sets it to. The question is where the 1450 comes from that we see in the logs. Probably the kernel somehow.
Perhaps weave should set a "safe" MTU on the multicast route it is adding to containers.
btw, this issue is dead easy to reproduce with https://github.com/Redsift/sockperf; just drop the --net=host from the examples and run via the weave proxy. With a message size (-m) of 1410 (it is not quite clear how this relates to the packet size) it works for me; above that it doesn't, and it produces the warnings in the logs as reported by the OP.
With a modified weave script that sets the mtu of the multicast route to 1438 (which is what pmtu discovery produces for unicast), the test works with larger packets and there are no warnings in the logs.
@dpw what do you think of my suggestion of setting a "safe" MTU on the multicast route? perhaps to the same value as you picked for FDP (1410)? And perhaps make it configurable in the same way?
I'm missing something here. How does setting the MTU on the multicast route actually work around the problem? My initial guess was that it did it by causing multicast packets to be fragmented. But that's not consistent with "dropping too big DF broadcast frame". So what does the before and after picture look like in terms of packets on the weave network?
before:
08:49:07.427175 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 40, options (RA))
10.40.0.0 > igmp.mcast.net: igmp v3 report, 1 group record(s) [gaddr 224.18.7.81 to_ex { }]
08:49:07.535471 IP (tos 0x0, ttl 2, id 16710, offset 0, flags [DF], proto UDP (17), length 2028)
10.40.0.0.57129 > 224.18.7.81.5001: [udp sum ok] UDP, length 2000
08:49:07.535533 IP (tos 0x0, ttl 2, id 16711, offset 0, flags [DF], proto UDP (17), length 2028)
10.40.0.0.57129 > 224.18.7.81.5001: [udp sum ok] UDP, length 2000
after:
08:53:29.095162 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 40, options (RA))
10.40.0.0 > igmp.mcast.net: igmp v3 report, 1 group record(s) [gaddr 224.18.7.81 to_ex { }]
08:53:29.188074 IP (tos 0x0, ttl 2, id 45081, offset 0, flags [none], proto UDP (17), length 2028)
10.40.0.0.34119 > 224.18.7.81.5001: [udp sum ok] UDP, length 2000
08:53:29.188143 IP (tos 0x0, ttl 2, id 45082, offset 0, flags [none], proto UDP (17), length 2028)
10.40.0.0.34119 > 224.18.7.81.5001: [udp sum ok] UDP, length 2000
The plot thickens... setting the multicast route's MTU to 1500 - i.e. smaller than the packet size but larger than the supported MTU - produces the same picture as "after", and the test succeeds.
The plot thickens... setting the multicast route's MTU to 1500 - i.e. smaller than the packet size but larger than the supported MTU - produces the same picture as "after", and the test succeeds.
So probably the critical thing is whether the DF bit is getting set (yes before, no after). But why is the DF bit being set in the first place? I tried taking a look at the sockperf code, but it is hilariously bad, and hard to work out what is going on. Still, I suspect it is the kernel that is setting DF, not the apps. But why?
what do you think of my suggestion of setting a "safe" MTU on the multicast route
That is exactly what I was going to suggest this morning - sounds good to me
Still, I suspect it is the kernel that is setting DF, not the apps.
Agreed.
But why?
Why indeed.
what do you think of my suggestion of setting a "safe" MTU on the multicast route
That is exactly what I was going to suggest this morning - sounds good to me
Perhaps it's obvious to everyone else, but I'm not sure what effect this is supposed to have. If it's just about fragmenting the packet, why is there a problem in the first place?
From Matthias' tcpdumps, it seems the root problem might be that the kernel is deciding to set DF on the packets. If we can get to the bottom of why it does that, we will be in a better position to decide what to do about it. Perhaps there is a good reason why that behaviour is changed by setting the MTU on the multicast route, but it seems a bit like "twiddling the knobs" without understanding what is really going on.
Also note that the FDP's "safe MTU" is not supposed to be a guarantee, and it is based on the overhead of vxlan encapsulation. If the underlying network does not permit that MTU to work, we'll fall back to sleeve. So it is more of a compromise: the largest MTU that will support vxlan encapsulation in environments we care about.
"IP_PMTUDISC_WANT will fragment a datagram if needed according to the path MTU, or will set the don't-fragment flag otherwise."
...might be a clue but still leaves some unanswered questions.
Ok, it looks like the kernel is applying the rules described under IP_MTU_DISCOVER in ip(7) to all packets, including multicast (I don't see any relevant special casing for them). A simple experiment with nc -u while turning ip_no_pmtu_disc on and off seems to confirm this.
That seems dodgy to me, since the kernel has no way to do MTU discovery for multicast, and it doesn't know whether the application is prepared to. We might want to report it.
So I guess ip route ... mtu lock has an impact because it disables MTU discovery on the route, and the MTU value set doesn't actually matter. We could do it with the weave script's $MTU of 65535, and it would still have the desired effect.
I wasn't using lock in my tests.
I wasn't using lock in my tests.
If I understand the ip route man page correctly, without that the kernel can decide to update the MTU value on the route table entry after you set it. So it seems safer to use lock. Although since PMTU discovery won't succeed for multicast routes, it might be academic in this case.
ah, so what's happening in the 'before' case is that the kernel sees that there is no PMTU for the destination, and hence sets DF in order to trigger PMTU discovery?
If I understand the ip route man page correctly, without that the kernel can decide to update the MTU value on the route table entry after you set it.
Right, but why would that be problematic?
ah, so what's happening in the 'before' case is that the kernel sees that there is no PMTU for the destination, and hence sets DF in order to trigger PMTU discovery?
Yes. Which can never succeed for multicast due to the lack of frag-needed responses. So this behaviour seems a bit silly to me.
If I understand the ip route man page correctly, without that the kernel can decide to update the MTU value on the route table entry after you set it.
Right, but why would that be problematic?
I'm not saying it's necessarily problematic. But from the ip-route man page:
If the modifier lock is used, no path MTU discovery will be tried, all packets will be sent without the DF bit in IPv4 case or fragmented to MTU for IPv6.
Which sounds like exactly what we want to happen. The setting of the MTU is incidental, and if there was a cleaner way to say "don't do mtu discovery for multicast packets" I'd recommend that instead.
As far as I can remember, packets that are too big are silently discarded when sent to a multicast group, so the DF bit should be irrelevant here, right?
@inercia as per the discussion above, it's the kernel that sets DF, in order to trigger PMTU discovery. That has the unfortunate effect of causing multicast packets to be dropped when they are too large. So the fix is to stop the kernel from attempting to perform PMTU discovery for multicast destinations.
One of our users got the following logs from weave recently on one host:
and the following ones on the other host (which is of course linked to the first one)
This user is on Ubuntu 14.04 with kernel version 3.19.0-28-generic and uses weave 1.0.3.