Closed AxelVoitier closed 8 years ago
It might be the thread-safe features. Those are not enabled for regular sockets, but they do add an if statement at the beginning of every method. Still, that alone doesn't seem enough to explain it.
Anyway, I think we should do some kind of binary search for the commit causing the performance penalty.
That can be done in a dichotomic way if we know from which date/rev the current 4.1.5 and the current 4.2.0 diverged.
Are the binary releases produced by CI stored somewhere? That could be faster than recompiling for every commit.
You don't have to do it for every commit; we don't even need to know the date of 4.1.5.
Take the GitHub commit log, jump to page 5 in the history, and take the first commit hash.
In a terminal, run
git reset --hard HASH
Run the performance test. If there is no issue, go to page 10 and take its first commit. Once you have bracketed the performance drop, narrow it down to the exact commit.
This is my plan, I can execute it later today
Yes, that would also work :). I can also do it later today (either in 2h, or this evening).
git bisect
does a binary search and is scriptable.
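For anyone following along, `git bisect run` can automate the whole search. Here is a minimal sketch in a throwaway repository; the `test` predicate on a marker file stands in for the real performance test, and the repo/commit names are made up for illustration:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email bisect@example.com
git config user.name bisect
# Six commits; pretend the regression landed in "c4".
for i in 1 2 3 4 5 6; do
    echo "$i" > perf_marker
    git add perf_marker
    git commit -qm "c$i"
done
# Mark HEAD bad and the root commit good, then let bisect drive the search.
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"
# Stand-in for the real perf test: exit 0 (good) while the marker is below 4.
git bisect run sh -c 'test "$(cat perf_marker)" -lt 4' > /dev/null
echo "first bad commit: $(git show -s --format=%s refs/bisect/bad)"
```

With a real build you would replace the predicate with a script that compiles the tree and runs `local_thr`/`remote_thr`, exiting non-zero when throughput is below a threshold.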
@AxelVoitier What platform is this happening on? So that it can be tagged appropriately.
x86, desktop computer.
Sorry, I should have been more specific. OS?
Ubuntu 14.04 (both host and within containers). Gcc 4.8.4.
@AxelVoitier Thanks. Hopefully we can get to the bottom of it quickly! 👍
Progressing: It is somewhere between c15817879889f5ea994fb649cf18886345fb539f (good) and 42ab88e486ed52cba0d3d05896e27157bdd8913c (bad).
Basically, it happened within the commits of the 9th of Feb, on page 10.
Perhaps (in that order): 7470c00d4de82fd18bba40e7ed7e390c7a43ebd0, 62c66ae7f764b2cfc24e7b8594549bd4c08064cb, 273b54715e3b0ed0da67df6d863ac44293239640 (although I don't think it would be the last one).
Got it: 7470c00d4de82fd18bba40e7ed7e390c7a43ebd0 (the one just before, 884c7f78e984f99389a9cd3ed3784b56452c5fa0, is good).
(By the way, I am only realising now that "binary search" is the English way of saying "dichotomic"... sorry for the confusion earlier.)
Nice job tracking it down! Could it be that the buffer size before was not actually 8192?
It was 4096 if I am reading that commit correctly.
That seems to be it, indeed. Before, the default used to be 0, which means the OS default (== /proc/sys/net/core/wmem_default (rmem_default) == 212992 on my system).
If I set the snd/rcv buf options to 8192 when using 4.1.5, then I get the same performance drop as on 4.2.0. And if I set the snd/rcv buf options to 212992 on 4.2.0, then I get back the performance of 4.1.5 :).
However, there seem to be some issues with what the default values should be:
And obviously, this case raises the concern that we may want to ship 4.2.0 with a different value than 8192 for these options, as it may change the out-of-the-box performance (some applications may rely on the defaults and see the performance impact).
Also, an application that used 4.1.5 and enforced the then-default value 0 will, with 4.2.0, apparently get zero-length buffers, and then you don't send/receive anything anymore!
Excellent write-up Axel. I just wanted to add, it seems that the relative performance drop is narrower on OS X.
;) It could be that on OS X the default OS values are different (look in the Mac docs for the SO_SNDBUF and SO_RCVBUF socket options).
I checked, and with 4.2.0 I cannot even set the supposed default value of -1 for SNDBUF and RCVBUF! zmq_setsockopt() returns -1. Same with 4.1.5 (so, no backward compatibility :S).
Just for reference to anyone else reading this (particularly the excerpt in the answer): http://stackoverflow.com/questions/2031109/understanding-set-getsockopt-so-sndbuf
Agreed. I am looking into it now.
On a side note - the auto-tuning will result in different values naturally, but it would be good to do a comparative study. Whilst I would understand the "it's a library and not a shrink-wrapped COTS" argument, providing a relative case study for library users would be a good thing, I think. It is also interesting that Linux implicitly doubles the SO_SNDBUF / SO_RCVBUF values provided by the setsockopt/getsockopt caller.
That could be an additional page on http://zeromq.org/area:results, or extending the base one http://zeromq.org/results:perf-howto with a paragraph on the influence of certain options.
The Linux kernel doubles it to make space for internal structures. That is important to know in terms of memory consumption (in particular when creating lots of sockets), but it doesn't impact the application much: you would only see a different value if you ever read these options back from the native socket.
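That doubling is easy to observe from userspace with a plain stdlib socket. A minimal sketch (on Linux the kernel typically reports back 16384 for an 8192 request; other platforms may return the value unchanged):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 8192)
effective = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
# On Linux the kernel doubles the requested value to leave room for its
# internal bookkeeping structures, so this typically reports 16384.
print("requested 8192, kernel reports", effective)
s.close()
```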
Ack, and ack.
Nice analysis. I'm especially worried about the change in the API. We should not break compatibility if possible.
My opinion is that we should make the default -1 again (or 0, I'm not sure). If SNDBUF/RCVBUF is -1 (or 0), all the places that set the OS buffers should ignore it.
This was the previous behaviour, I think.
Ack. Anything that restores API behaviour would be good IMO. It would be also good to see what, if any, effect has transpired on the other transports, aside from TCP and INPROC.
Default used to be 0 in 4.1: http://api.zeromq.org/4-1:zmq-setsockopt#toc38
Also, commit 7470c00d4de82fd18bba40e7ed7e390c7a43ebd0 did not actually make the switch to -1 default values. A previous one seems to have done it, as I can see in the diff of that commit that the 8192 values replace -1 values.
But as 7470c00d4de82fd18bba40e7ed7e390c7a43ebd0 puts it, these options now modify two buffer settings: one for SO_SNDBUF/SO_RCVBUF, and one for the syscall buffers.
Which means, for the syscall buffers, the special value of 0 will have to be treated as something other than a zero-length buffer (which it apparently is right now, since if you set these options to 0 you receive nothing). The question remains: what should 0 mean for these syscall buffers? 8192, maybe?
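To make the discussion concrete, here is a hypothetical sketch (not libzmq code) of how the special values could be normalized for the internal syscall buffers, using 8192 as the fallback suggested above:

```python
DEFAULT_BATCH_SIZE = 8192  # hypothetical fallback, matching the 8192 discussed above

def effective_buffer_size(option_value):
    """Map the user-facing sndbuf/rcvbuf option onto a usable internal size.

    -1 means "not set" and 0 historically meant "OS default", so neither
    should ever be used literally as a zero-length syscall buffer.
    """
    if option_value in (-1, 0):
        return DEFAULT_BATCH_SIZE
    return option_value

print(effective_buffer_size(-1))    # -> 8192
print(effective_buffer_size(0))     # -> 8192
print(effective_buffer_size(4096))  # -> 4096
```

The function name and constant are illustrative only; the real fix would live in the encoder/decoder setup code.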
Commit f7b933f570545def0d654b9408c92b82ac0e54f4 introduced the -1 default value. The commit message references a Windows bug, about which it says:
Set the SO_SNDBUF socket option to 0 (zero) to allow the stack to send from your application buffer directly.
So, changing these options back to a 0 default value would break this fix for Windows...
The unraveling continues :). That commit f7b933f570545def0d654b9408c92b82ac0e54f4 from last year fixed an issue reported in 2011 concerning a WinXP setup. As the Microsoft case mentions, it affects only very old systems.
In the Jira case Martin mentioned:
Setting the buffers to 0 doesn't seem to be a good idea. AFAIU it means that send function then blocks until the data is actually pushed to the network. I.e. it block any other processing going on in the same I/O thread.
I don't know if that still holds true today. If so, then maybe commit f7b933f570545def0d654b9408c92b82ac0e54f4 shouldn't have allowed 0 values to be passed at all. (And that might actually be the reason why, if I set these options to 0 on 4.2.0, I get no messages sent/received.)
It seems that Windows bug was also hitting on the Win7 system of the committer of f7b933f570545def0d654b9408c92b82ac0e54f4. Associated PR is actually #1472, which is also linked to #1471 by the Jira case number.
I would suggest the following behaviour, but it is a bit convoluted in my opinion:
The following commit is the beginning of this mess, I think:
https://github.com/zeromq/libzmq/commit/cdeec4c115b3d2f62417ae57c2cd5f5dcdf348f4
The Windows fix is fine I think, and changing the default to -1 makes sense. I am making a PR with the following:
@AxelVoitier, Just replying to one of your earlier messages about socket buffer defaults and wmem_default.
The default socket send/recv buffers in Linux are set from the tcp_wmem/tcp_rmem sysctls, respectively, upon socket initialisation. For more information on this, see #L312 and #L401 in net/ipv4/tcp_input.c and net/ipv6/tcp.c, and my points below.
The rmem_max and wmem_max sysctls are only honoured upon calls to setsockopt(2) after socket creation, not as an outright default, at least not for TCP sockets. The only way around that is to set the rmem_max/wmem_max sysctls to the same values as tcp_rmem/tcp_wmem[2], respectively. It seems the setsockopt(2) logic doesn't check the per-protocol buffer maxima, and it is not even possible to override this with SO_SNDBUFFORCE or SO_RCVBUFFORCE; maybe this is behaviour in the Linux kernel socket layer that needs to be changed, so that SO_SNDBUF checks its max bound against tcp_wmem, and likewise SO_RCVBUF against tcp_rmem.
Incidentally, the default send/recv buffers are huge on OS X. A quick test reveals the following:
OS X:
$ ./test_sockbuf_osx
send buffer size = 131072
$ get-nettune-sysctls
net.inet.tcp.mssdflt: 1460
net.inet.tcp.minmss: 216
kern.ipc.maxsockbuf: 6291456
net.inet.tcp.tso: 1
net.inet.tcp.sendspace: 131072
net.inet.tcp.recvspace: 131072
kern.ipc.maxsockbuf: 6291456
Linux 4.2.0-35-generic:
$ ./test_sockbuf_linux
send buffer size = 16384
sets the send buffer to 98304
send buffer size = 196608
$ cat /proc/sys/net/ipv4/tcp_wmem
4096 16384 4194304
Hopefully it helps any on-lookers who are interested in tuning ZeroMQ performance.
The author of https://github.com/zeromq/libzmq/commit/cdeec4c115b3d2f62417ae57c2cd5f5dcdf348f4 is not recognized. Anyway, I'm removing this feature, so there will be no way to set the encoder/decoder batch size anymore; the values from config.hpp will be used again.
If someone wants this in the future, I suggest moving it from config to options, but keeping the name and not calling it tcp_recv_buffer, so that nobody changes it again as happened now.
Also, Linux floors the minimum socket send buffer size to 4068 when using SO_SNDBUF, which includes the original size of 2048, doubled, plus some other alignment-related magic. So it is not possible to set anything less than that, even if SO_SNDBUFFORCE is used.
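That floor is also visible from userspace. A minimal stdlib sketch (the exact floored value varies by kernel version, so the comment only describes typical Linux behaviour):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1)  # absurdly small request
floored = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
# Linux silently clamps the request up to its minimum (and doubles it),
# so this typically reports a few KB rather than 1.
print("requested 1 byte, kernel reports", floored)
s.close()
```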
@somdoron it is @jens-auer. PR #1638. The rationale behind it is explained in there.
@jens-auer, due to a misunderstanding around your commit, ZeroMQ performance got worse. Over time the option you added got merged with sndbuf and rcvbuf, which seems incorrect. To fix this, I'm changing the encoder/decoder to use the config constant values again. If you need your change, I suggest doing it again; this time it might be better to call it in_batch_size/out_batch_size instead of the previous names, which caused the confusion.
[1] https://github.com/zeromq/libzmq/commit/cdeec4c115b3d2f62417ae57c2cd5f5dcdf348f4
@somdoron Also, I think that if the same behaviour is required, it can be achieved by using ZMQ_SRC_FD and then setting the buffer sizes manually through setsockopt, if there really was a need.
The change is not to the underlying socket; that is not being changed.
The change is to the decoder/encoder.
You can still set the buffers of the underlying socket through sndbuf/rcvbuf.
Also, src_fd is buggy. Usually a zmq socket manages more than one underlying socket, and src_fd only works when you have one.
Ah, OK.
If it is buggy, as you say, should we think about deprecating it, moving forward?
Yes, and it also takes up precious space in msg.
SRC_FD only works when you have one socket connected.
Remember the accept socket: srcfd just returns the last accepted socket, which the message did not necessarily come from: https://github.com/zeromq/libzmq/blob/b65fc903cdbcc52b0a73f2f82e5d440c81bf7c1c/src/tcp_listener.cpp#L109
However, it was part of 4.1, so we cannot just delete it (I really want to); maybe mark it as deprecated.
We can just return -1 when someone tries to receive SRCFD, which will not break the API; it will just not work.
We can ask the author of SRCFD if he is using it and if it even works
@AxelVoitier can you confirm performance is good now?
Yes. It is pointless carrying forward broken features, especially if we can gracefully deprecate with -1.
@somdoron - the difference is day and night. Para-virtualised VM running Linux (Ubuntu 14):
Before your change:
$ ./bench.sh 1000
message size: 1048576 [B]
message count: 1000
mean throughput: 398 [msg/s]
mean throughput: 3338.666 [Mb/s]
After your change:
$ ./bench.sh 1000
message size: 1048576 [B]
message count: 1000
mean throughput: 3213 [msg/s]
mean throughput: 26952.598 [Mb/s]
Thanks to @AxelVoitier and @somdoron - 👍 💯 💯 💯
Awesome stuff folks!
Yes, performance is back :)
Throughput:
ZMQ Version: 4.2.0
message size: 1048576 [B]
message count: 1000
mean throughput: 1481 [msg/s]
mean throughput: 12423.528 [Mb/s]
real 0m0.971s
user 0m0.048s
sys 0m0.684s
Latency:
ZMQ Version: 4.2.0
real 0m2.499s
user 0m0.080s
sys 0m0.736s
message size: 1048576 [B]
roundtrip count: 1000
average latency: 605.311 [us]
Also affected positively my application performance :).
I just noticed that if you set these sndbuf and rcvbuf options to 0 (as an application trying to enforce the old default would), then it works, but the performance drop is back.
I don't think it is a big problem though (it is not an issue for my own use cases).
And for documentation for the on-lookers interested in tuning performance, and reacting to your earlier post @hitstergtd: according to the man page on my system (and http://man7.org/linux/man-pages/man7/socket.7.html), on Linux the SO_SNDBUF and SO_RCVBUF default values are said to come from the /proc filesystem.
I actually didn't try changing them manually. Maybe it does require using sysctl to change them?
Yes, the default values come from two sysctls when it's a TCP socket, i.e. net.ipv4.tcp_wmem for the socket send buffer and net.ipv4.tcp_rmem for the socket receive buffer, respectively. But what it means is this: if you really want your TCP socket to get the available maximum that is allotable to it, then you have to set wmem_max to the third value in tcp_wmem; otherwise you will not be able to set the socket send buffer in your application to the maximum allowed by the TCP stack.
It comes down to this: by default, wmem_max and wmem_default are nowhere near the 3rd value in the net.ipv4.tcp_wmem sysctl. It is easy to overlook this point if you are going to set both of those sysctls to the same value.
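For readers who want to check this on their own box, here is a small stdlib sketch that reads the relevant sysctls via /proc (the paths are Linux-specific; it simply skips them elsewhere):

```python
from pathlib import Path

def read_sysctl(rel):
    """Return the whitespace-separated values of a /proc/sys entry, or None."""
    p = Path("/proc/sys") / rel
    return p.read_text().split() if p.exists() else None

tcp_wmem = read_sysctl("net/ipv4/tcp_wmem")  # min, default, autotune max
wmem_max = read_sysctl("net/core/wmem_max")  # cap honoured by setsockopt()
if tcp_wmem and wmem_max:
    # setsockopt(SO_SNDBUF) is capped by wmem_max, even though TCP
    # autotuning can grow buffers up to tcp_wmem[2] on its own.
    print("tcp_wmem autotune max:", tcp_wmem[2])
    print("setsockopt() cap (wmem_max):", wmem_max[0])
```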
Hi,
I am the author of the original patch that added the new options. The options themselves are not that important for me, because I won't be able to update to the new ZeroMQ version in my project anyway. However, when I evaluated ZeroMQ, I was able to achieve significant performance improvements by increasing the buffers in config.hpp. My application sends messages of 1129 bytes at a rate of up to 32000 messages per second. Larger encoder/decoder buffers reduce the number of syscalls significantly, and since the best size depends on the message size and rate, I added the options.
If anybody else is interested, I can probably implement the options on the new master during the weekend.
Cheers, Jens
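Jens's point about larger encoder batches reducing syscalls can be put into rough numbers with a simple model, assuming (as a simplification) one write() per full encoder batch; the batch sizes below are arbitrary examples, not libzmq defaults:

```python
from math import ceil

def writes_per_second(msg_size, msg_rate, batch_size):
    """Rough estimate of write() syscalls per second, assuming messages are
    packed into encoder batches and each full batch is flushed with one
    write(). This ignores framing overhead and partial flushes."""
    msgs_per_batch = max(1, batch_size // msg_size)
    return ceil(msg_rate / msgs_per_batch)

# Jens's workload: 1129-byte messages at up to 32000 msg/s.
for batch in (8192, 65536, 262144):
    print(batch, "->", writes_per_second(1129, 32000, batch), "writes/s")
```

Even this crude model shows why the best batch size depends on the message size and rate, which was the rationale for making it configurable.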
Hello,
I noticed a drop of performance between 4.1.5 and the current 4.2.0 HEAD.
I ran the performance tests on the two versions with a 1 MB message size (to see it clearly, but it seems to happen with 100 B messages as well, with about a 2x difference factor in throughput), repeated 1000 times. The local and remote are on the same physical machine, but in two separate docker containers (I can provide all the dockerfiles & co. if you have trouble reproducing it).
Version 4.1.5 throughput test:
local_thr tcp://*:7000 1048576 1000
remote_thr tcp://local:7000 1048576 1000
Version 4.2.0 throughput test:
local_thr tcp://*:7000 1048576 1000
remote_thr tcp://local:7000 1048576 1000
Version 4.1.5 latency test:
local_lat tcp://*:7000 1048576 1000
remote_lat tcp://local:7000 1048576 1000
Version 4.2.0 latency test:
local_lat tcp://*:7000 1048576 1000
remote_lat tcp://local:7000 1048576 1000
I notice this performance drop in the real application too, as well as in that application's own performance tests (using different socket types than the zmq perf testers).
Any idea which change could have brought such a drop?
Cheers, Axel