Stream compression: lz4

keith-dev commented 7 years ago

I'd like to have a stream pass compressed data using the lz4 option. This would offer improved performance in applications passing text messages across network boundaries in a simple way.

A new socket option, ZMQ_LZ4 (70), would be used to identify the feature. setsockopt/getsockopt would be used to set/query the compression state.

The feature would be a build option, set by autoconfig. I initially envisage the implementing the feature in options_t/socket_base_t and only touching actual sockets if necessary.

I would be grateful to hear the thoughts of the community on this. I need such a facility on a project I am currently working on as we handle potentially large JSON payloads. It seems consistent to have zmq do the compression, and lz4 is publicly exposed by zfs and other systems.

bluca commented 7 years ago

Isn't this something the application should do if required? It's very dependent on the actual data being passed, which only the application can know about

Perhaps a better place would be CZMQ, in the zstr class, as a new API

bbdb68 commented 7 years ago

One of core ZMQ assumptions is to be agnostic about message content, I feel this would break it. As @bluca says, CZMQ is a bit less agnostic : it has specific APIs to handle string messages, so I would agree with his suggestion.

keith-dev commented 7 years ago

The idea of this change is to increase the network throughput of zmq by using good general purpose compression.

In response to comments:

The application can do it and it is true that text compresses very well, but this does not solely apply to text payloads. This is necessary to improve throughput over networks. ZFS handles general purpose data and is faced with the same problem, optimising the physical media, and use the same solution with one caveat. ZFS checks if it's worth compressing before using compression. ZFS does not push the problem back into the application layer.
This is a transport not an API matter. Moving the functionality into an different library isn't right in my view because it's a transport matter, not application functionality. zmq has a security facility, arguably, that's an application requirement too. I think czmq is the wrong place for it as all the czmq object can benefit if the feature is recognised to be a transport matter.
zmq would remain payload agnostic, and still just process binary blocks. Compressing those blocks doesn't bind it in any way to anything besides the compression algorithm.

bluca commented 7 years ago

The problem is that this would not increase throughput across the board, it would actually decrease it significantly by increasing memory and cpu usage for a lot of data types where compression just doesn't make sense

It makes absolutely no difference where the compression happens, so having it as a new CZMQ zstr API where the application can choose to use it if the payload is suitable makes the most sense

keith-dev commented 7 years ago

I am proposing adding the feature as an option. This would not change the behaviour of zmq across the board because its use has to be explicitly enabled in the build and switched on using setsockopt() by an application.

Can you explain why this would decrease throughput at all? And why have you described the decrease as significant?

Increased CPU or memory use does not necessarily correspond to lower performance. If your bottleneck is the wire, using less of it will increase performance. And if you have to run a compressor anyway, it doesn't change system performance at all.

It does make a difference to where the compression happens, in much the same way as it matters where the communication happens, or where sockets are handled and so on.

I can't see how implementing this in czmq will help someone using zactor for example. In any event, why force someone to use czmq when all they want to do is pass messages over zmq sockets?

Rather than just disagreeing with my idea, can you please explain in a bit more detail your objections. That will save a lot of time and frustration on my part.

rodgert commented 7 years ago

I am proposing adding the feature as an option. This would not change the behaviour of zmq across the board because its use has to be explicitly enabled in the build and switched using setsockopt() by an application.

So, both sides of a connection would have to agree to use compression, and then explicitly set the socket option? Or, do you anticipate changing ZMTP to indicate compressed payload, method, etc?

On Wed, May 10, 2017 at 5:30 AM, keith-dev notifications@github.com wrote:

I am proposing adding the feature as an option. This would not change the behaviour of zmq across the board because its use has to be explicitly enabled in the build and switched using setsockopt() by an application.

Can you explain why this would decrease throughput at all? And why have you described the decrease significant?

Increased CPU or memory use does not necessarily correspond to lower performance. If your bottleneck is the wire, using less of it will increase performance. And if you have to do run compressor anyway, it doesn't change system performance at all.

It does make a difference to where the compression happens, in much the same way as it matters where the communication happens, or where sockets are handled and so on.

I can't see how implementing this in czmq will help someone using zactor for example. In any event, why force someone to use czmq when all they want to do is pass messages over zmq sockets?

Rather than just disagreeing with my idea, can you please explain in a bit more detail your objections. That will save a lot of time and frustration on my part.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/zeromq/libzmq/issues/2568#issuecomment-300442248, or mute the thread https://github.com/notifications/unsubscribe-auth/AAHYBySXaF9Zd2VydykItCukiHH9ytgTks5r4ZHWgaJpZM4NWHWF .

bluca commented 7 years ago

I'm not saying this use case is not realistic or not important - on the contrary, I think it can be very useful, so your idea is very good and interesting.

I think the issue is that adding a compressor would benefit only a very specific use case: when handling a large payload with very good compression rate (eg: text) and CPU/memory are very over-provisioned and the network is under-provisioned.

If the payload is small or not good for compression then it won't make any meaningful difference, and if the CPU/memory are not very overprovisioned with respect to the network then sacrificing the zero-copy feature will most likely cause a regression rather than an improvement (large malloc&copies are a bottleneck even before considering compression algorithms).

By the nature of the use case, having a socket option isn't really the best place to implement it in my opinion: it naturally fits as a per-message feature. So that applications can decide when to use it or not in a simple, straigthforward way following a correct pattern, depending on the message.

Having a socket option would for example require juggling with it depending on the message type. Or worse, it would encourage the very bad anti-pattern of creating and destroying different sockets per different messages.

On the contrary we already have a lot of similar use cases and patterns implemented in CZMQ, specifically for strings. This would be a perfect match for the zstr class.

As a bonus, implementing in in zstr would be very easy (and fun!) and risk-free - it's a new API, so if there's a bug it won't cause a regression for everyone else. I recon you could have a fully working solution in a couple hours of work, if that. On the other hand messing with the internal of libzmq is extremely tricky. Let's face it, it's not the most simple and straightforward code base... And extending the ZMTP protocol would be even more of a headache.

keith-dev commented 7 years ago

Both sides would agree on whether to use compression or not. There's no change to ZMTP (I think). Further more, devices can decide to compress between themselves. For example, a router/dealer may handle compression internally without affecting the outer rep/req.

In the case of a network device such as router/dealer, it does require juggling the compression state, but that is contained with the device and it becomes a component issue, not a system wide issue. For example, the routine information is passed without compression and it's switched on to handle the payload messages, then switched off.

I implemented this feature in a local fork and have been test driving it for some months now. This particular implementation is handy because the feature is available to different languages in the standard way. Putting the feature in czmq/zstr doesn't quite help as I don't use the higher level interface across multiple languages. I've used libzmq to implement component devices. We could keep the lz4 compression external, but you could say the same about curve and gss.

jimklimov commented 7 years ago

@bluca, Regarding overheads - lz4 was chosen as default in ZFS because of little overhead with incompressible data. It is not as pathological as say bzip2 or gzip that generally compress better ;)

Inclusion in the common engine is also a good position to keep statistics - e.g. this dialogue didn't compress well, or this message is too small, to do or not do compression as chosen at runtime per-message (so, some flag is needed whether to decompress this message at all when received).

Of course, since generally the endpoints can be running any version and build-variant of libzmq, feature like this should be negotiated, and lack of support should be expected of the counterpart, to start with.

Sent from my Xiaomi Redmi Note 4 using FastHub

bluca commented 7 years ago

Here's an implementation of zstr APIs for compression: https://github.com/zeromq/czmq/pull/1747

packages builds: https://github.com/zeromq/czmq/pull/1747

As I already said it doesn't make sense in the engine - small payloads would have huge drops in performance and no space savings. Even if there's no overhead, for uncompressable data it would still get the performance penalty with no gain. And very small messages would even get larger!

And the engine can't just send compressed data without negotiation, it would have to be added to the handshake to avoid incompatibility.

bjovke commented 7 years ago

@keith-dev Since this requires significant changes of libzmq and possibly ZMTP we need some code for this and tests done with small/big message size and slow/fast network while we observe CPU and latency. I think this should really be a place for issues for the libzmq instead of development discussion. Because of that it's best if you can start this subject on libzmq-dev mailing lists. This discussion could potentially drag on for a substantial time period and list of issues is becoming larger and larger. I think we should concentrate here on solving really urgent existing problems. Thank you for your understanding.

sigiesec commented 7 years ago

@bjovke I disagree here. IMO it is ok to have issues with the "Feature Request" label applied for a longer time, we should somehow put effort in getting rid of the heap of non-Feature Request issues: https://github.com/zeromq/libzmq/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aopen%20%20-label%3A%22Feature%20Request%22%20

On the mailing list, feature ideas/requests become more of less invisible/inaccessible when they are not in a hot discussion mode. The current mailing list archive is barely usable. Regardless of my personal opinion, there should be some agreed policy for the use of the issue tracker, maybe in form of a ZeroMQ RFC. I would really like to reopen this issue, but it would make no sense if it is permanently closed and reopened because of different preferences ;)

bjovke commented 7 years ago

@sigiesec Ok. Will reopen. I'm just afraid that a lot of time will pass until this feature request gets a usable completed code.

jimklimov commented 7 years ago

I'm just afraid that a lot of time will pass until this feature request gets a usable completed code.

I think that's not a problem in itself. Someone trawling the open (unsolved) issues looking for a fun project would find and perhaps fix it - more likely than if the issue is closed-but-unsolved. It might takw a while to happen, true ;)

zeromq / libzmq

Stream compression: lz4 #2568