zeromq / libzmq

ZeroMQ core engine in C++, implements ZMTP/3.1
https://www.zeromq.org
Mozilla Public License 2.0

Significant performance decrease compared to 4.1.5 #1947

Closed AxelVoitier closed 8 years ago

AxelVoitier commented 8 years ago

Hello,

I noticed a drop of performance between 4.1.5 and the current 4.2.0 HEAD.

I ran the performance tests on the two versions with a 1 MB message size (to see it clearly, but it seems to happen with 100 B messages as well, with about a 2x difference factor in throughput), repeated 1000 times. The local and remote are on the same physical machine, but in two separate Docker containers (I can provide all the Dockerfiles & co. if you have trouble reproducing it).

Version 4.1.5 throughput test: local_thr tcp://*:7000 1048576 1000 remote_thr tcp://local:7000 1048576 1000

ZMQ Version: 4.1.5
message size: 1048576 [B]
message count: 1000
mean throughput: 1614 [msg/s]
mean throughput: 13539.213 [Mb/s]

real    0m0.954s
user    0m0.064s
sys 0m0.620s

Version 4.2.0 throughput test: local_thr tcp://*:7000 1048576 1000 remote_thr tcp://local:7000 1048576 1000

ZMQ Version: 4.2.0
message size: 1048576 [B]
message count: 1000
mean throughput: 227 [msg/s]
mean throughput: 1904.214 [Mb/s]

real    0m4.693s
user    0m0.144s
sys 0m3.916s

Version 4.1.5 latency test: local_lat tcp://*:7000 1048576 1000 remote_lat tcp://local:7000 1048576 1000

ZMQ Version: 4.1.5

real    0m2.598s
user    0m0.068s
sys 0m0.800s
message size: 1048576 [B]
roundtrip count: 1000
average latency: 646.348 [us]

Version 4.2.0 latency test: local_lat tcp://*:7000 1048576 1000 remote_lat tcp://local:7000 1048576 1000

ZMQ Version: 4.2.0

real    0m10.371s
user    0m0.248s
sys 0m5.004s
message size: 1048576 [B]
roundtrip count: 1000
average latency: 4517.717 [us]

I notice this performance drop in the real application as well, and in that application's own performance tests (which use different socket types than the zmq perf testers).

Any idea which change could have caused such a drop?

Cheers, Axel

somdoron commented 8 years ago

It might be the thread-safe features. They are not enabled for regular sockets, but they do add an if statement at the beginning of every method; still, it doesn't make sense that that alone would cause such a drop.

Anyway, I think we should do some kind of binary search for the commit causing the performance penalty.

AxelVoitier commented 8 years ago

That can be done in a dichotomic way if we know from which date/revision the current 4.1.5 departed from the current 4.2.0.

Are the binary releases produced by CI stored somewhere? That could be faster than recompiling for every commit.

somdoron commented 8 years ago

You don't have to do it for every commit; we don't even need to know the date of 4.1.5.

Take the GitHub commit log, jump to page 5 in the history and take the first commit hash.

In a terminal, run

git reset --hard HASH

Run the performance test; if there is no issue, go to page 10 and take the first commit. Once you find the performance drop, try to narrow it down to the specific commit.

This is my plan; I can execute it later today.

AxelVoitier commented 8 years ago

Yes, that would also work :). I can also do it later today (either in 2h, or this evening).

scemama commented 8 years ago

git bisect does a binary search and is scriptable.

hitstergtd commented 8 years ago

@AxelVoitier What platform is this happening on? So that it can be tagged appropriately.

AxelVoitier commented 8 years ago

x86, desktop computer.

hitstergtd commented 8 years ago

Sorry, I should have been more specific. OS?

AxelVoitier commented 8 years ago

Ubuntu 14.04 (both host and within containers). Gcc 4.8.4.

hitstergtd commented 8 years ago

@AxelVoitier Thanks. Hopefully we can get to the bottom of it quickly! 👍

AxelVoitier commented 8 years ago

Progressing: It is somewhere between c15817879889f5ea994fb649cf18886345fb539f (good) and 42ab88e486ed52cba0d3d05896e27157bdd8913c (bad).

Basically, it happened within the commits of the 9th of Feb, on page 10.

hitstergtd commented 8 years ago

Perhaps (in that order): 7470c00d4de82fd18bba40e7ed7e390c7a43ebd0, 62c66ae7f764b2cfc24e7b8594549bd4c08064cb, 273b54715e3b0ed0da67df6d863ac44293239640 (although I don't think it would be the last one).

AxelVoitier commented 8 years ago

Got it: 7470c00d4de82fd18bba40e7ed7e390c7a43ebd0 (the one just before, 884c7f78e984f99389a9cd3ed3784b56452c5fa0, is good).

(By the way, I'm only realising now that "binary search" is the English way of saying "dichotomic"... sorry for the confusion earlier.)

bluca commented 8 years ago

Nice job tracking it down! Could it be that the buffer size before was not actually 8192?

hitstergtd commented 8 years ago

It was 4096 if I am reading that commit correctly.

AxelVoitier commented 8 years ago

That seems to be it, indeed. Before, the default used to be 0, which means the OS default (== /proc/sys/net/core/wmem_default (rmem_default) == 212992 on my system).

If I set the snd/rcv buf options to 8192 when using 4.1.5, I get the same performance drop as on 4.2.0. And if I set the snd/rcv buf options to 212992 on 4.2.0, I get back the performance of 4.1.5 :).
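
For reference, here is roughly how I set the options in a test client (a minimal sketch; the endpoint and the 212992 value are just from my setup, and error handling is reduced to asserts):

#include <assert.h>
#include <zmq.h>

int main (void)
{
    void *ctx = zmq_ctx_new ();
    void *push = zmq_socket (ctx, ZMQ_PUSH);

    /* Set the sndbuf/rcvbuf options back to what the OS default used to
       give us (212992 == /proc/sys/net/core/wmem_default on my system).
       Must be set before connect/bind. */
    int bufsize = 212992;
    int rc = zmq_setsockopt (push, ZMQ_SNDBUF, &bufsize, sizeof bufsize);
    assert (rc == 0);
    rc = zmq_setsockopt (push, ZMQ_RCVBUF, &bufsize, sizeof bufsize);
    assert (rc == 0);

    rc = zmq_connect (push, "tcp://local:7000");
    assert (rc == 0);

    /* ... send/receive as usual ... */

    zmq_close (push);
    zmq_ctx_term (ctx);
    return 0;
}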

However, there seem to be some issues with what the default values should be:

And obviously, this case raises the concern that we may want to ship 4.2.0 with a different value than 8192 for these options, as it may change the out-of-the-box performance (some applications may rely on the defaults and see the performance impact).

AxelVoitier commented 8 years ago

Also, if an application that used 4.1.5 enforced the then-default value of 0, with 4.2.0 that now seems to produce zero-length buffers and you don't send/receive anything anymore!

hitstergtd commented 8 years ago

Excellent write-up Axel. I just wanted to add, it seems that the relative performance drop is narrower on OS X.

AxelVoitier commented 8 years ago

;) It could be that on OS X the default OS values are different (see the Mac documentation for the SO_SNDBUF and SO_RCVBUF socket options).

I checked, and with 4.2.0 I cannot even set the supposed default value of -1 for SNDBUF and RCVBUF! zmq_setsockopt() returns -1. Same with 4.1.5 (so, no backward compatibility :S).
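
For anyone wanting to reproduce it, a minimal sketch of the failing call (the PUSH socket type is arbitrary; I just print whatever error the library reports):

#include <stdio.h>
#include <zmq.h>

int main (void)
{
    void *ctx = zmq_ctx_new ();
    void *sock = zmq_socket (ctx, ZMQ_PUSH);

    /* Try to set the supposed default of -1 explicitly. */
    int bufsize = -1;
    int rc = zmq_setsockopt (sock, ZMQ_SNDBUF, &bufsize, sizeof bufsize);
    if (rc == -1)
        printf ("ZMQ_SNDBUF = -1 rejected: %s\n", zmq_strerror (zmq_errno ()));

    zmq_close (sock);
    zmq_ctx_term (ctx);
    return 0;
}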

hitstergtd commented 8 years ago

Just for reference to anyone else reading this (particularly the excerpt in the answer): http://stackoverflow.com/questions/2031109/understanding-set-getsockopt-so-sndbuf

hitstergtd commented 8 years ago

Agreed. I am looking into it now.

On a side note - the auto-tuning will naturally result in different values, but it would be good to do a comparative study. Whilst I would understand the "it's a library and not shrink-wrapped COTS" argument, providing a comparative case study for library users would be a good thing, I think. It is also interesting that Linux implicitly doubles the SO_SNDBUF / SO_RCVBUF values provided by the setsockopt caller.

AxelVoitier commented 8 years ago

That could be an additional page on http://zeromq.org/area:results, or an extension of the base one, http://zeromq.org/results:perf-howto, with a paragraph on the influence of certain options.

The Linux kernel doubles it to make space for internal structures. That is important to know in terms of memory consumption (in particular when creating lots of sockets), but it doesn't affect the application much: you would only see the different value if you ever read these options back from the native socket.
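
For the curious, a standalone sketch with plain POSIX sockets (nothing ZeroMQ-specific) that shows the doubling on read-back:

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main (void)
{
    int fd = socket (AF_INET, SOCK_STREAM, 0);

    /* Ask for 8192 bytes of send buffer... */
    int requested = 8192;
    setsockopt (fd, SOL_SOCKET, SO_SNDBUF, &requested, sizeof requested);

    /* ...and read back what the kernel actually booked. On Linux this
       reports roughly double the requested size, the extra being kernel
       bookkeeping overhead. */
    int effective = 0;
    socklen_t len = sizeof effective;
    getsockopt (fd, SOL_SOCKET, SO_SNDBUF, &effective, &len);
    printf ("requested %d, kernel reports %d\n", requested, effective);

    close (fd);
    return 0;
}

On a typical Linux box this reports 16384, i.e. exactly double.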

hitstergtd commented 8 years ago

Ack, and ack.

bluca commented 8 years ago

Nice analysis. I'm especially worried about the change in the API. We should not break compatibility if possible.

somdoron commented 8 years ago

My opinion is that we should make the default -1 again (or 0, I'm not sure). If SNDBUF/RCVBUF is -1 (or 0), all the places that set the OS buffers should ignore it.

This was the previous behavior, I think.

hitstergtd commented 8 years ago

Ack. Anything that restores the API behaviour would be good IMO. It would also be good to see what effect, if any, this has had on the other transports, aside from TCP and INPROC.

AxelVoitier commented 8 years ago

The default used to be 0 in 4.1: http://api.zeromq.org/4-1:zmq-setsockopt#toc38

Also, commit 7470c00d4de82fd18bba40e7ed7e390c7a43ebd0 did not actually make the switch to -1 default values. A previous one seems to have done it, as I see in the diff of that commit that the 8192 values are replacing -1 values.

But as 7470c00d4de82fd18bba40e7ed7e390c7a43ebd0 puts it, these options now modify two buffer settings: one for SO_SNDBUF/SO_RCVBUF, and one for the syscall buffers.

Which means, for the syscall buffers, the special value of 0 will have to be treated as something other than a literal zero-length buffer (which is what seems to happen right now: if you set these options to 0, you receive nothing). The question remains: what should 0 mean for these syscall buffers? 8192 maybe?
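
To make the two-buffer distinction concrete, here is a conceptual sketch (my own illustration, not libzmq's actual engine code) of the two knobs involved:

/* Conceptual sketch only, not libzmq internals. */
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

static void read_loop (int fd, int so_rcvbuf, size_t batch_size)
{
    /* Knob 1: the kernel socket buffer -- what SO_RCVBUF has always meant. */
    setsockopt (fd, SOL_SOCKET, SO_RCVBUF, &so_rcvbuf, sizeof so_rcvbuf);

    /* Knob 2: the user-space buffer handed to each recv() call (the
       "syscall buffer"). If this is 0, every recv() asks for 0 bytes and
       nothing is ever read, no matter how big the kernel buffer is --
       which would explain the "receive nothing" behaviour when the
       option is set to 0. */
    char *batch = malloc (batch_size);
    for (;;) {
        ssize_t n = recv (fd, batch, batch_size, 0);
        if (n <= 0)
            break;
        /* ... feed n bytes to the decoder ... */
    }
    free (batch);
}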

AxelVoitier commented 8 years ago

Commit f7b933f570545def0d654b9408c92b82ac0e54f4 introduced the -1 default value. The commit message references a Windows bug, about which it says:

Set the SO_SNDBUF socket option to 0 (zero) to allow the stack to send from your application buffer directly.

So, changing these options back to a default value of 0 would break this fix for Windows...

AxelVoitier commented 8 years ago

The unraveling continues :). That commit f7b933f570545def0d654b9408c92b82ac0e54f4 from last year was to fix an issue reported in 2011 concerning a WinXP setup. As the Microsoft case mentions, it affects only very old systems.

In the Jira case Martin mentioned:

Setting the buffers to 0 doesn't seem to be a good idea. AFAIU it means that send function then blocks until the data is actually pushed to the network. I.e. it block any other processing going on in the same I/O thread.

I don't know if that still holds true today. If so, then maybe commit f7b933f570545def0d654b9408c92b82ac0e54f4 shouldn't have allowed 0 values to be passed at all. (And that might actually be why, if I set these options to 0 on 4.2.0, I get no messages sent/received.)

AxelVoitier commented 8 years ago

It seems that Windows bug was also hitting the Win7 system of the committer of f7b933f570545def0d654b9408c92b82ac0e54f4. The associated PR is actually #1472, which is also linked to #1471 by the Jira case number.

I would suggest the following behaviour, but it is a bit convoluted in my opinion:

somdoron commented 8 years ago

The following commit is the beginning of this mess, I think:

https://github.com/zeromq/libzmq/commit/cdeec4c115b3d2f62417ae57c2cd5f5dcdf348f4

somdoron commented 8 years ago

The Windows fix is fine I think, and changing the default to -1 makes sense. I am making a PR with the following:

  1. Allow -1 and 0 for sndbuf & rcvbuf
  2. Default to -1
  3. Ignore sndbuf & rcvbuf when set to -1
  4. Encoder & decoder use in_batch_size/out_batch_size again (reverting the commit from the following comment)
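
In code terms, the intended behaviour is roughly this (a sketch of the idea, not the actual PR):

/* Sketch of the intended option handling, not the actual patch. */
#include <sys/socket.h>

static void apply_buffer_options (int fd, int sndbuf, int rcvbuf)
{
    /* Defaults are -1 again; -1 means "leave the OS default alone".
       0 still goes through, since the Windows fix relies on SO_SNDBUF == 0
       meaning "send from the application buffer directly". */
    if (sndbuf != -1)
        setsockopt (fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof sndbuf);
    if (rcvbuf != -1)
        setsockopt (fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof rcvbuf);

    /* The encoder/decoder meanwhile go back to the fixed batch sizes from
       config.hpp (in_batch_size / out_batch_size), independent of these
       options. */
}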

hitstergtd commented 8 years ago

@AxelVoitier, Just replying to one of your earlier messages about socket buffer defaults:

Incidentally, the default send/recv buffers are huge on OS X. Quick test reveals the following:

OS X:

$ ./test_sockbuf_osx
send buffer size = 131072
$ get-nettune-sysctls 
net.inet.tcp.mssdflt: 1460
net.inet.tcp.minmss: 216
kern.ipc.maxsockbuf: 6291456
net.inet.tcp.tso: 1
net.inet.tcp.sendspace: 131072
net.inet.tcp.recvspace: 131072
kern.ipc.maxsockbuf: 6291456

Linux 4.2.0-35-generic:

$ ./test_sockbuf_linux
send buffer size = 16384
sets the send buffer to 98304
send buffer size = 196608
$ cat /proc/sys/net/ipv4/tcp_wmem
4096    16384   4194304

Hopefully it helps any on-lookers who are interested in tuning ZeroMQ performance.

somdoron commented 8 years ago

The author of https://github.com/zeromq/libzmq/commit/cdeec4c115b3d2f62417ae57c2cd5f5dcdf348f4 is not recognized (by GitHub). Anyway, I'm removing this feature, so there will be no way to set the encoder/decoder batch size; it will use the values from config.hpp.

If someone wants this in the future, I suggest moving it from config to options, but keeping the name and not calling it tcp_recv_buffer, so that someone won't change it again as happened now.
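
For reference, the relevant constants in config.hpp are along these lines (quoted from memory, so check the file in your checkout):

//  From src/config.hpp (approximately).
//  Engine-side batch sizes the encoder/decoder will use again.
enum
{
    //  Maximal batching size for engines with receiving functionality.
    in_batch_size = 8192,

    //  Maximal batching size for engines with sending functionality.
    out_batch_size = 8192
};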

hitstergtd commented 8 years ago

Also, Linux floors the minimum socket send buffer size to 4068 when using SO_SNDBUF, which includes the original size of 2048, doubled, plus some other alignment-related magic. So it is not possible to set anything less than that, even if SO_SNDBUFFORCE is used.

hitstergtd commented 8 years ago

@somdoron it is @jens-auer. PR #1638. The rationale behind it is explained there.

somdoron commented 8 years ago

@jens-auer due to some misunderstanding around your commit, ZeroMQ performance got worse. Over time the option you added got merged with sndbuf and rcvbuf, which seems incorrect. To fix this I'm changing the encoder/decoder to use the config constant values again. If you need your change, I suggest doing it again; this time it might be better to call it in_batch_size/out_batch_size instead of the previous names, which caused the confusion.

[1] https://github.com/zeromq/libzmq/commit/cdeec4c115b3d2f62417ae57c2cd5f5dcdf348f4

hitstergtd commented 8 years ago

@somdoron Also, I think that if the same behaviour is required, it can be achieved by using ZMQ_SRC_FD and then setting the buffer sizes manually through setsockopt, if there really is a need.

somdoron commented 8 years ago

The change is not to the underlying socket; that is not being changed.

The change is to the decoder/encoder.

You can still set the buffers of the underlying socket through sndbuf/rcvbuf.

Also, src_fd is buggy: a zmq socket usually manages more than one underlying socket, and src_fd only works when you have one.

hitstergtd commented 8 years ago

Ah, OK.

hitstergtd commented 8 years ago

If it is buggy, as you say, should we think about deprecating it, moving forward?

somdoron commented 8 years ago

Yes, it also takes up precious space in msg.

SRC_FD only works when you have one socket connected.

Remember the accepted socket: srcfd just returns the last accepted socket, which the message didn't necessarily come from: https://github.com/zeromq/libzmq/blob/b65fc903cdbcc52b0a73f2f82e5d440c81bf7c1c/src/tcp_listener.cpp#L109

However, it was part of 4.1 so we cannot just delete it (I really want to); maybe mark it as deprecated.

We can just return -1 when someone tries to receive SRCFD, which will not break the API but just won't work.

We could ask the author of SRCFD if he is using it and whether it even works.

somdoron commented 8 years ago

@AxelVoitier can you confirm performance is good now?

hitstergtd commented 8 years ago

Yes. It is pointless carrying forward broken features, especially if we can gracefully deprecate with -1.

hitstergtd commented 8 years ago

@somdoron - the difference is day and night. Para-virtualised VM running Linux (Ubuntu 14):

Before your change:

$ ./bench.sh 1000
message size: 1048576 [B]
message count: 1000
mean throughput: 398 [msg/s]
mean throughput: 3338.666 [Mb/s]

After your change:

./bench.sh 1000
message size: 1048576 [B]
message count: 1000
mean throughput: 3213 [msg/s]
mean throughput: 26952.598 [Mb/s]

Thanks to @AxelVoitier and @somdoron - 👍 💯 💯 💯

bluca commented 8 years ago

Awesome stuff folks!

AxelVoitier commented 8 years ago

Yes, performance is back :)

Throughput:

ZMQ Version: 4.2.0
message size: 1048576 [B]
message count: 1000
mean throughput: 1481 [msg/s]
mean throughput: 12423.528 [Mb/s]

real    0m0.971s
user    0m0.048s
sys 0m0.684s

Latency:

ZMQ Version: 4.2.0

real    0m2.499s
user    0m0.080s
sys 0m0.736s
message size: 1048576 [B]
roundtrip count: 1000
average latency: 605.311 [us]

Also affected positively my application performance :).

I just noticed that if you set the sndbuf and rcvbuf options to 0 (as an application trying to enforce the old default would do), it works but the performance drop is back.

I don't think it is a big problem though (it is not an issue for my own use cases).

AxelVoitier commented 8 years ago

And as documentation for the onlookers interested in tuning performance, and reacting to your earlier post @hitstergtd: according to the man page on my system (and http://man7.org/linux/man-pages/man7/socket.7.html), on Linux the SO_SNDBUF and SO_RCVBUF default values are said to come from the /proc filesystem.

I actually didn't try changing them manually. Maybe it requires using sysctl to change them?

hitstergtd commented 8 years ago

Yes, the default values come from two sysctls when it's a TCP socket, i.e. net.ipv4.tcp_wmem for the socket send buffer and net.ipv4.tcp_rmem for the socket receive buffer, respectively. But what it means is this: if you really want your TCP socket to get the maximum that is allottable to it, you have to set wmem_max to the third value in tcp_wmem, otherwise you will not be able to set the socket send buffer in your application to the maximum allowed by the TCP stack.

It comes down to this: by default, wmem_max and wmem_default are nowhere near the third value of the net.ipv4.tcp_wmem sysctl. It is easy to overlook this point if you are going to set both of those sysctls to the same value.

jens-auer commented 8 years ago

Hi,

I am the author of the original patch that added the new options. The options themselves are not that important to me, because I won't be able to update my project to the new ZeroMQ version anyway. However, when I evaluated ZeroMQ, I was able to achieve significant performance improvements by increasing the buffers in config.hpp. My application sends messages of 1129 bytes at a rate of up to 32000 messages per second. Larger encoder/decoder buffers reduce the number of syscalls significantly, and since the best size depends on the message size and rate, I added the options.
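
A back-of-the-envelope illustration using those numbers (my own arithmetic; it assumes every batch is filled before being written, so treat it as a best case):

/* Rough arithmetic: how many write() calls per second the encoder needs
   for a 1129-byte / 32000 msg/s stream, as a function of batch size.
   The batch sizes below are just illustrative values. */
#include <stdio.h>

int main (void)
{
    const double msg_size = 1129.0;                     /* bytes/message  */
    const double msg_rate = 32000.0;                    /* messages/sec   */
    const double bytes_per_sec = msg_size * msg_rate;   /* ~36 MB/s       */

    const double batch_sizes[] = { 8192.0, 212992.0, 1048576.0 };
    for (int i = 0; i < 3; i++) {
        double writes_per_sec = bytes_per_sec / batch_sizes[i];
        printf ("batch %8.0f B -> ~%6.0f writes/s\n",
                batch_sizes[i], writes_per_sec);
    }
    return 0;
}

With the default 8192-byte batches that is roughly 4400 writes per second; a 212992-byte batch cuts it to roughly 170.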

If anybody else is interested, I can probably implement the options on the new master over the weekend.

Cheers, Jens