weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

Don't rely on setting `ethtool tx off` on guest interfaces #1255

Open squaremo opened 8 years ago

squaremo commented 8 years ago

In https://www.weave.works/blog/bridge-over-troubled-weavers/, Bryan says

Linux’ TCP stack will sometimes attempt to send packets with an incorrect checksum, or that are far too large for the network link, with the result that the packet is rejected and TCP has to re-transmit. This slows down network throughput enormously. When using weave as designed, the virtual ethernet device assigned to each container is configured to avoid this, but with --bridge=weave Docker creates the virtual ethernet device and weave doesn’t get a chance to do its configuration.

In configurations where weave does not create veth pairs (e.g., when working with CNI https://www.weave.works/blog/weave-and-rkt/), there is no opportunity to run ethtool to do this configuration. It is hard to see this as a limitation of the rest of the world, rather than a limitation of "weave as designed".
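For reference, the configuration step being discussed is a single ethtool invocation against the guest side of the veth pair. A minimal sketch of that step in Go, under the assumption that shelling out to ethtool is acceptable; the helper name is made up and `ethwe` is only the name weave conventionally gives the container end of the veth:

```go
package main

import (
	"fmt"
	"os/exec"
)

// disableTxOffload shells out to ethtool to turn off transmit offloads on the
// given interface, i.e. the `ethtool tx off` step referred to above. The
// helper and its error handling are illustrative, not weave's actual code.
func disableTxOffload(dev string) error {
	out, err := exec.Command("ethtool", "-K", dev, "tx", "off").CombinedOutput()
	if err != nil {
		return fmt.Errorf("ethtool -K %s tx off: %v (%s)", dev, err, out)
	}
	return nil
}

func main() {
	// "ethwe" is the conventional name for the container end of the veth
	// pair; substitute whichever device is in play.
	if err := disableTxOffload("ethwe"); err != nil {
		fmt.Println(err)
	}
}
```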

So the question is: is this a problem for all users of bridge networking? (For instance, Docker -- a search of issues suggests it's not, or at least not yet recognised.) If it's just a problem with weave, how can it be fixed?

dpw commented 8 years ago

Bryan's statement is correct, although it is easy to read it as though it is describing a bug in the Linux TCP stack. In fact, it is working as designed (and you can replace Bryan's "sometimes" with "always"): the kernel will delegate TCP segmentation and checksumming to the network interface if possible. It doesn't do something different just because you are sniffing the traffic. So if you capture outgoing traffic with a raw socket (e.g. pcap), you see what the kernel sent to the network interface, not what would actually appear on the wire.
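To make that concrete: the TCP checksum is the standard RFC 1071 one's-complement sum over an IPv4 pseudo-header plus the TCP segment, so given the raw bytes of a captured packet you can recompute it and compare it against the value stored in the header. A minimal sketch of that arithmetic (illustrative only, not weave code; it assumes plain IPv4/TCP):

```go
package main

import "fmt"

// tcpChecksum computes the Internet checksum (RFC 1071) over the IPv4
// pseudo-header plus the TCP header and payload. A captured packet whose
// stored checksum differs from this value is one the kernel handed to the
// interface with checksumming still deferred.
func tcpChecksum(srcIP, dstIP [4]byte, tcpSegment []byte) uint16 {
	var sum uint32
	add := func(b []byte) {
		for i := 0; i+1 < len(b); i += 2 {
			sum += uint32(b[i])<<8 | uint32(b[i+1])
		}
		if len(b)%2 == 1 {
			sum += uint32(b[len(b)-1]) << 8
		}
	}

	// Pseudo-header: source address, destination address, protocol 6 (TCP)
	// and the TCP length.
	add(srcIP[:])
	add(dstIP[:])
	sum += 6
	sum += uint32(len(tcpSegment))

	// TCP header and payload, with the checksum field (bytes 16-17) zeroed.
	seg := append([]byte(nil), tcpSegment...)
	seg[16], seg[17] = 0, 0
	add(seg)

	// Fold carries into 16 bits and take the one's complement.
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return ^uint16(sum)
}

func main() {
	// Dummy 20-byte TCP header, just to show the call; real input would be
	// the TCP portion of a packet captured with pcap/tcpdump.
	seg := make([]byte, 20)
	src, dst := [4]byte{10, 0, 1, 1}, [4]byte{10, 0, 1, 2}
	stored := uint16(seg[16])<<8 | uint16(seg[17])
	fmt.Printf("computed 0x%04x, stored 0x%04x\n", tcpChecksum(src, dst, seg), stored)
}
```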

If you run an iperf TCP bandwidth test to a VM over a virtual bridge, and look at the traffic in wireshark, two things can be observed:

- The captured TCP packets have incorrect checksums, since checksumming has been deferred and never actually happens before the capture point.
- The TCP segment size starts off respecting the MTU (of 1500), but doubles with each successfully sent packet, until it reaches a limit at 64K.

The same effects occur in the context of weave.

Weave captures packets via pcap with incorrect checksums, and relays them to the other end with incorrect checksums. But the kernel does not verify the checksums of the injected packets (because they count as locally produced?). So that is not a problem.

On the other hand, the effective segment size is a problem. As soon as the kernel produces an over-large TCP packet (with DF set, as PMTU discovery is routine for TCP), weave drops it and sends back an ICMP fragmentation needed. The kernel sees this and drops the effective segment size on the TCP connection down to the one you might expect. The data is resent, and gets through. But on the next TCP packet it tries to grow the segment size once again, and so on. The data gets through, but it is slow.

I expect there are various ways to influence this kernel behaviour, but if the point is to make weave work well for a virtual bridge in its default state, we need to fix it within the weave router. And finding a simple and clean way to do that seems challenging. A simple hack might be to ignore DF on TCP packets, so that the over-large TCP packets simply pass through (that won't work if Linux checks that injected packets conform to the nominal MTU, but I find that unlikely).
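To spell out the behaviour and the proposed hack, here is a rough sketch of the decision being described; the function and helper names are made up for illustration and do not correspond to the actual router code:

```go
package main

import "fmt"

// forward and sendFragNeeded stand in for the router's relay path and its
// ICMP generation; they are placeholders, not weave functions.
func forward(frame []byte)                 { fmt.Println("forward", len(frame), "bytes") }
func sendFragNeeded(frame []byte, mtu int) { fmt.Println("ICMP fragmentation needed, mtu", mtu) }

// handleCaptured sketches the choice described above for a captured packet
// that exceeds the tunnel MTU. ignoreDFForTCP models the suggested hack of
// letting over-large TCP packets through rather than bouncing them.
func handleCaptured(frame []byte, mtu int, dfSet, isTCP, ignoreDFForTCP bool) {
	if len(frame) <= mtu {
		forward(frame) // fits within the MTU: relay as usual
		return
	}
	if dfSet && !(isTCP && ignoreDFForTCP) {
		// Current behaviour: drop the packet and tell the sender to shrink
		// its segments; the sender then grows them again, hence the slowdown.
		sendFragNeeded(frame, mtu)
		return
	}
	// The hack: pass the over-large packet through, hoping nothing on the
	// path enforces the nominal MTU on injected packets.
	forward(frame)
}

func main() {
	oversized := make([]byte, 4000) // e.g. a GSO-produced TCP packet
	const mtu = 1400                // illustrative tunnel MTU

	handleCaptured(oversized, mtu, true, true, false) // today: bounced
	handleCaptured(oversized, mtu, true, true, true)  // with the hack: relayed
}
```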

Fast datapath is not affected by this issue when the VXLAN encapsulation is handled by the kernel. I need to check whether we receive over-large packets on ODP misses (if so, the issue would re-appear for a connection using the sleeve fallback).

dpw commented 8 years ago

To get back to the actual question:

So the question is: is this a problem for all users of bridge networking? (For instance, Docker -- a search of issues suggests it's not, or at least not yet recognised.)

As discussed, it's not a bug. Bridge networking works as intended, and there is no issue for Docker users to notice.

More generally, you might wonder why flannel's udp backend is not affected. That is because: a) flannel does not bother with MTUs, DF, or generating frag-needed packets; and possibly also b) flannel's udp backend uses a tun device, and this issue may not occur for tun devices (if the kernel does the GSO segmentation step before delivering to the tun recipient).

squaremo commented 8 years ago

Thank you for the analysis @dpw.

On the basis that this is due to how weave operates, and that it prevents weave from working with other tooling, I think this ought to be addressed.

bboreham commented 8 years ago

the kernel will delegate TCP segmentation and checksumming to the network interface if possible

The only interface is a veth, and, from my reading of the kernel code, veths don't implement segmentation and checksumming. From offline discussion, a better interpretation is that, given the expected use of veths, they are self-consistent: they don't care how big the packets are and they don't check checksums for in-memory copies.

dpw commented 8 years ago

Another way of looking at it is that the kernel defers segmentation/checksumming as late as possible before an outgoing packet hits the wire. If the outgoing device hardware supports it, then it is left to the hardware. If the hardware doesn't support it, then the kernel does it just before handing it to the hardware (GSO). But for virtual devices like veths and bridges, it can be deferred entirely: If the packet reaches a physical device, it gets handled then; if it doesn't, why bother?

msackman commented 8 years ago

IIRC, if an injected packet has a non-local MAC then the checksum is inspected and the packet is dropped if it's wrong. Disabling checksum offloading is not about changing things on the capturing side, it's about allowing injection to work by ensuring the checksums are valid.

dpw commented 8 years ago

IIRC, if an injected packet has a non-local MAC then the checksum is inspected and the packet is dropped if it's wrong.

Are you suggesting this happens in the kernel?

Disabling checksum offloading is not about changing things on the capturing side, it's about allowing injection to work by ensuring the checksums are valid.

If this explanation were correct, then surely without tx off no packets would successfully cross the weave network (because their checksums would be wrong)? But that is not what happens: some packets do get through. And it is straightforward to confirm with wireshark that even the packets which get through have incorrect checksums.

Disabling checksum offload necessarily disables segmentation offload (even if you use the more fine-grained options to ethtool, disabling the former disables the latter). But I believe it is segmentation offload that is the cause of low throughput without tx off.

msackman commented 8 years ago

Well, doubtless you know more about this than me.

Yes, I believe it happens in the kernel, but I never had time to go chasing through the source. And yes, I also saw that some packets got through and some didn't, but it was sufficient to cripple performance. I have no idea whether, for example, checksum verification is stochastic in some way for performance reasons (I've no idea how expensive checksum verification really is. Certainly for some traffic like UDP, it's pretty expensive as it can only be done after fragment reassembly. That said, I've no idea what "pretty expensive" really amounts to - could all be small beer). I cannot remember now whether the packets that did get through had correct checksums or not. If you don't disable checksum offloading, and then capture packets on the injected side, does wireshark/tcpdump claim they have correct checksums?

Ahh, fair enough - I never had time to properly dig into the segmentation options. I think you can just turn off the seg offload though, can't you? If that is the case, then you could test with seg offload disabled but with the "wrong" checksums, and see what happens.

rade commented 8 years ago

I am quite sure I spent a fair amount of time narrowing the options down to the minimum, which suggests that disabling checksum offloading is necessary.

awh commented 8 years ago

The TCP segment size starts off respecting the MTU (of 1500), but doubles with each successfully sent packet, until it reaches a limit at 64K.

For my edification - what is the mechanism that implements this doubling? Is it a way for the kernel to dynamically discover the maximum offloadable write that a TSO supporting NIC will accept?

dpw commented 8 years ago

The TCP segment size starts off respecting the MTU (of 1500), but doubles with each successfully sent packet, until it reaches a limit at 64K.

For my edification - what is the mechanism that implements this doubling? Is it a way for the kernel to dynamically discover the maximum offloadable write that a TSO supporting NIC will accept?

I'm not sure, but I suspect it is just the TCP congestion window growing during the "slow start" phase.

dpw commented 8 years ago

I am quite sure I spent a fair amount of time narrowing the options to the minimum. Which suggests that disabling checksum offloading is necessary.

I just tried replacing tx off with tso off gso off ufo off. ethtool -k weave confirms that tcp-segmentation-offload: off, and tx-checksumming: on (i.e. checksumming is being offloaded, i.e. skipped). iperf reports TCP throughput (between weave on my host and weave in a VM) similar to the tx off case. I checked the output of ethtool -k weave on both ends, and I confirmed with wireshark that the TCP checksums on the receiving side are incorrect.

So I continue to believe that it is segmentation offload that is the culprit, and the checksum issue is incidental.
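For anyone wanting to repeat this comparison, the steps above amount to the following command sequence, wrapped here in a small Go sketch for consistency with the other examples; the device name `weave` is as in the comment, the peer address is a placeholder, and an iperf server is assumed to be listening on the remote side:

```go
package main

import (
	"fmt"
	"os/exec"
)

// run executes a command and prints its combined output; a thin helper for
// sketching the experiment, not part of weave.
func run(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	fmt.Printf("$ %s %v\n%s", name, args, out)
	if err != nil {
		fmt.Println("error:", err)
	}
}

func main() {
	// Disable only the segmentation offloads on the weave bridge, leaving
	// transmit checksum offload enabled.
	run("ethtool", "-K", "weave", "tso", "off", "gso", "off", "ufo", "off")

	// Confirm the settings: tcp-segmentation-offload should now read "off"
	// while tx-checksumming stays "on".
	run("ethtool", "-k", "weave")

	// Measure TCP throughput across the weave network (remote address is a
	// placeholder for wherever the iperf server is running).
	run("iperf", "-c", "10.32.0.2")
}
```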

dpw commented 8 years ago

If you don't disable checksum offloading, and then capture packets on the injected side, does wireshark/tcpdump claim they have correct checksums?

If you leave checksum offloading enabled, then tcpdump shows that all injected packets have incorrect TCP checksums. (Well, maybe about 1 in 65536 have a correct checksum; even a stopped clock tells the right time twice a day.)

msackman commented 8 years ago

Ahh cool; not sure where I got the idea that it was about packets being dropped on the injected side, then.

bboreham commented 8 years ago

In most cases we no longer use pcap, so this issue is less important.

I just tested:

bboreham commented 8 years ago

Just to note that, post #2307, we are doing the tx off in all Weave-mediated modes (weave run, Weave docker plugin and Weave CNI plugin).

Doing "get Docker to use the Weave bridge" in WEAVE_NO_FASTDP mode remains an issue.