weaveworks / weave

Simple, resilient multi-host container networking and more.
https://www.weave.works
Apache License 2.0

data compression #204

Open inercia opened 9 years ago

inercia commented 9 years ago

Maybe Weave could support data compression for traffic. Encapsulation packets could include a compression field where we could specify an optional compression algorithm for the payload. The standard compress/lzw algorithm could be used, or maybe something like lz4... A feature like this could be especially helpful for containers running text-based services like Memcache, Redis, etc...

rade commented 9 years ago

We would need to see evidence that this provides some tangible benefits in realistic use cases.

Implementation wise this shouldn't be too hard. The main potential stumbling blocks I can see are interaction with PMTU discovery and packet coalescing.

Would we compress per captured packet or per UDP packet sent over the network (the latter potentially containing multiple packets, due to the aforementioned coalescing)? The former is more efficient for multi-hop since intermediaries don't need to decompress/recompress. The latter is likely to give better compression ratios.

Regarding compression algorithms...Ideally I would avoid making that yet another option, i.e. hopefully we can pick one that is good enough across a broad spectrum of use cases.

inercia commented 9 years ago

As you said, it is not clear to me what to compress: the captured packet or the UDP packet. I guess it would depend on the average number of intermediate hops: if packets usually traverse a couple of hops, maybe UDP packets should be compressed...

I agree that compression should not be a user option. In fact, I think I would enable it by default and apply an adaptive algorithm that could switch it off. I would apply the same technique proposed in this RFC (even though it focuses on IP payloads, it gives some interesting hints on packet compression), where they propose:

1. For a (source, destination) packet, try to compress the payload and compare the output length with the original length.
2. If things get worse after compression for N consecutive packets, disable compression for a while.
3. If we have disabled compression too many times for (source, destination), disable it for good.
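The three steps above amount to a small per-(source, destination) state machine. A sketch of that heuristic in Go (all threshold constants are illustrative, deliberately small for the demo; nothing here is Weave's implementation):

```go
package main

import "fmt"

// adaptiveState tracks compression effectiveness for one
// (source, destination) pair, following the RFC-style heuristic:
// back off after N consecutive useless compressions, and give up
// for good after too many back-offs.
type adaptiveState struct {
	badRun    int  // consecutive packets where compression did not help
	disables  int  // how many times we have backed off so far
	skipLeft  int  // packets remaining in the current back-off window
	permanent bool // compression disabled for good
}

const (
	maxBadRun   = 5 // N consecutive failures before backing off
	skipWindow  = 3 // packets sent uncompressed per back-off (tiny, for demo)
	maxDisables = 2 // back-offs tolerated before disabling permanently
)

// shouldCompress says whether to attempt compression on the next packet.
func (s *adaptiveState) shouldCompress() bool {
	if s.permanent {
		return false
	}
	if s.skipLeft > 0 {
		s.skipLeft--
		return false
	}
	return true
}

// record feeds back the result of one compression attempt.
func (s *adaptiveState) record(compressedLen, originalLen int) {
	if compressedLen < originalLen {
		s.badRun = 0 // compression helped: reset the failure run
		return
	}
	s.badRun++
	if s.badRun >= maxBadRun {
		s.badRun = 0
		s.disables++
		if s.disables >= maxDisables {
			s.permanent = true // step 3: disable for good
			return
		}
		s.skipLeft = skipWindow // step 2: disable for a while
	}
}

func main() {
	var s adaptiveState
	for i := 0; i < 20; i++ {
		if s.shouldCompress() {
			s.record(120, 100) // simulate incompressible traffic
		}
	}
	fmt.Println(s.permanent) // incompressible flow ends up disabled for good
}
```

With consistently incompressible traffic the state walks through both back-offs and lands in the permanent-off state, so the CPU cost of futile compression attempts is bounded.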

I think I would add a 2-bit field to the UDP packet for indicating possible compression schemes for the packet (i.e., no compression, algorithm-1, etc.). Ideally, Weave should have at least two possible compression schemes: low-CPU and high-CPU compression. High-CPU would be used when the number of active, compressed connections is below a given threshold-1. Above that value, new connections would use the low-CPU algorithm, and compression could even be disabled completely for new connections when a threshold-2 is reached...
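Packing a 2-bit scheme identifier into a header flags byte, plus the threshold-based scheme choice described above, could look roughly like this (field layout, scheme values, and both thresholds are assumptions for illustration only):

```go
package main

import "fmt"

// Hypothetical 2-bit compression-scheme field in a per-packet flags byte.
const (
	SchemeNone uint8 = 0 // payload sent uncompressed
	SchemeLow  uint8 = 1 // low-CPU algorithm (e.g. an LZ4-style coder)
	SchemeHigh uint8 = 2 // high-CPU algorithm (e.g. DEFLATE)
	// value 3 reserved for a future scheme

	schemeMask uint8 = 0x03 // low two bits of the flags byte
)

// setScheme stores a scheme identifier in the flags byte.
func setScheme(flags, scheme uint8) uint8 {
	return (flags &^ schemeMask) | (scheme & schemeMask)
}

// getScheme extracts the scheme identifier from the flags byte.
func getScheme(flags uint8) uint8 {
	return flags & schemeMask
}

// chooseScheme mirrors the proposed policy: high-CPU compression while
// few connections are compressed, low-CPU above threshold1, and none at
// all above threshold2.
func chooseScheme(activeCompressed, threshold1, threshold2 int) uint8 {
	switch {
	case activeCompressed < threshold1:
		return SchemeHigh
	case activeCompressed < threshold2:
		return SchemeLow
	default:
		return SchemeNone
	}
}

func main() {
	flags := setScheme(0, chooseScheme(2, 5, 10)) // few compressed conns
	fmt.Println(getScheme(flags) == SchemeHigh)   // true
	fmt.Println(chooseScheme(7, 5, 10) == SchemeLow)   // true
	fmt.Println(chooseScheme(12, 5, 10) == SchemeNone) // true
}
```

Keeping the scheme in the packet itself (rather than negotiating it per connection) means each hop can decode a packet without any out-of-band state, at the cost of two header bits per packet.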

rade commented 9 years ago

enable it by default and apply an adaptive algorithm that could switch it off

Nice idea, but it makes this issue an order of magnitude more complex. So best left to a follow-on.

low-CPU and high-CPU compressions

In most deployments, weave is going to be CPU-bound, so I reckon low-CPU compression is all we need.

As for how we determine whether to compress/decompress...

  1. make that an all-or-nothing choice, like encryption, i.e. all peers in a weave network must be configured with the same choice, and will refuse to establish connections to peers otherwise.
  2. make compression a per-router choice, i.e. when selected with, say, weave launch --compressed, a router will compress all outbound traffic. The flag gets exchanged with peers on connection establishment, so they know whether inbound UDP packets require decompression. Alternatively, we can add a flag to the UDP packet itself, which is less fragile and more amenable to packet inspection.
  3. make compression a per-link choice. Now, we wouldn't want a user to have to specify that per-link, since there are potentially O(peer_count^2) links. The zones idea from #82 would work well here, I think.
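For option 2, the handshake variant boils down to each side advertising one boolean at connection establishment. A toy sketch of that bookkeeping (the `peerFeatures`/`connection` types are invented for illustration and are not Weave's actual handshake structures):

```go
package main

import "fmt"

// peerFeatures is a hypothetical set of capabilities a router advertises
// when a connection is established.
type peerFeatures struct {
	Compressed bool // this peer compresses all its outbound traffic
}

// connection pairs what we advertised with what the remote peer advertised.
type connection struct {
	local  peerFeatures
	remote peerFeatures
}

// needsDecompression: inbound UDP packets on this connection need
// decompressing iff the remote side declared compression at handshake time.
func (c *connection) needsDecompression() bool {
	return c.remote.Compressed
}

func main() {
	c := connection{
		local:  peerFeatures{Compressed: true}, // we launched with --compressed
		remote: peerFeatures{Compressed: false},
	}
	fmt.Println(c.needsDecompression()) // false: peer sends uncompressed
}
```

The fragility rade mentions is visible here: the receiver's behaviour depends entirely on remembered handshake state, whereas a per-packet flag (as in the 2-bit field idea) makes each packet self-describing.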
inercia commented 9 years ago

I think that, ideally, compression should be per link, as it depends on where packets go to/come from. For example, it does not make sense to compress packets going to peers in the same LAN, but compressing packets that go to the WAN could be a big performance gain.

So I would leave the option of enabling compression to the user (with weave launch --enable-compress or something like that), but leave the decision of where/when to use it to Weave...

rade commented 9 years ago

it does not make sense to compress packets going to peers in the same LAN

We don't know that. With compression, less data crosses between kernel and user space.

should be per link [...] I would leave the option to the user for enabling compression (with weave launch --enable-compress or something like that)

If compression is per-link then I don't see the point of enabling it per host. See my option 3, in particular the zones idea.

inercia commented 9 years ago

Not sure about the compression benefits for userspace-kernel copies. If we could measure the cost of these operations, I would bet most of it comes from the syscall, and only a small fraction depends on the data length (unless you are moving big chunks of data)... not sure, though... I guess it would depend on the nature of the traffic.

Regarding issue #82, it would be a good addition, but I think it would involve some difficulties. I will write some questions in that issue...

rade commented 9 years ago

Not sure about the compression benefits for userspace-kernel copies

Me neither. Experiments/measurements of where compression yields benefits are very much part of this issue.