sigiesec opened this issue 7 years ago
We have multiple implementations for Linux too, so IMHO it's fine if you want to add a new one
I just noticed that there already is an implementation using WSAWaitForMultipleEvents, but it is only used when a poller contains sockets from multiple address families.
It also seems to be broken: after calling WSAWaitForMultipleEvents, it just falls through to the select code... Instead, WSAEnumNetworkEvents would probably need to be called for the socket whose event was signalled.
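For illustration, here is a minimal sketch of that pattern (hypothetical names, not the actual libzmq code): wait on the event array, then ask Winsock which network events fired on the socket behind the signalled event, instead of falling back to select.

```cpp
//  Sketch only: 'events' and 'sockets' are parallel arrays maintained by the
//  poller (hypothetical layout, not libzmq's actual members).
#include <winsock2.h>
#include <cassert>

void wait_and_enumerate (WSAEVENT *events, SOCKET *sockets, DWORD count,
                         DWORD timeout_ms)
{
    DWORD rc = WSAWaitForMultipleEvents (count, events, FALSE, timeout_ms, FALSE);
    if (rc == WSA_WAIT_TIMEOUT || rc == WSA_WAIT_FAILED)
        return;

    DWORD i = rc - WSA_WAIT_EVENT_0;
    assert (i < count);

    //  WSAEnumNetworkEvents also resets the event object, so the next wait
    //  only wakes up on new activity.
    WSANETWORKEVENTS ne;
    if (WSAEnumNetworkEvents (sockets[i], events[i], &ne) == 0) {
        if (ne.lNetworkEvents & (FD_READ | FD_ACCEPT | FD_CLOSE)) {
            //  handle readability, e.g. dispatch the engine's in_event ()
        }
        if (ne.lNetworkEvents & (FD_WRITE | FD_CONNECT)) {
            //  handle writability, e.g. dispatch the engine's out_event ()
        }
    }
}
```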
@Kentzo since you added this code with 538e5d47421f45d88a8b3b005ed9ee4026623a7b, can you say something about whether you tested this? It is not tested on appveyor at the moment, unfortunately.
@sigiesec I did not benchmark it, because the goal was to make it work at all. We run this code every day on thousands of machines, mostly Windows.
I thought about WSAEnumNetworkEvents, but decided to limit my intrusion, as there was no WSAPoll implementation for Windows. Probably both changes can be unified with WSAEnumNetworkEvents. One thing I recall is the constant global limit on the number of sockets WSAEnumNetworkEvents can take care of.
OK, I now had another look and I think I understand how it works. I think a different implementation might do without select completely and outperform this one.
What is suboptimal about the current implementation is that the events are created and configured on every call to loop, and even within each while iteration. It would be better to create them only once and reconfigure them when the poll set changes.
In addition, the events were created even when there was only one address family, in which case they were never used.
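As a rough sketch of what "create once, reconfigure on change" could look like (hypothetical structure and helper names, not libzmq's actual fd_entry handling):

```cpp
#include <winsock2.h>

struct poll_entry_t          //  hypothetical, not a libzmq type
{
    SOCKET fd;
    WSAEVENT event;          //  created once when the fd is added
    long registered_events;  //  FD_* mask currently registered with Winsock
};

void add_entry (poll_entry_t &entry, SOCKET fd)
{
    entry.fd = fd;
    entry.event = WSACreateEvent ();   //  the expensive call, done only once
    entry.registered_events = 0;
}

void set_interest (poll_entry_t &entry, long wanted_events)
{
    if (entry.registered_events == wanted_events)
        return;                        //  nothing changed, avoid the syscall
    WSAEventSelect (entry.fd, entry.event, wanted_events);
    entry.registered_events = wanted_events;
}

void remove_entry (poll_entry_t &entry)
{
    WSAEventSelect (entry.fd, NULL, 0); //  deregister before closing
    WSACloseEvent (entry.event);
}
```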
I thought WSAEVENT is a simple C struct and the time it takes to create one is negligible. I'd suggest benchmarking it before adding any caches.
WSAEVENT is, but the wsa_events_t ctor calls WSACreateEvent, which is expensive: https://github.com/zeromq/libzmq/commit/538e5d47421f45d88a8b3b005ed9ee4026623a7b#diff-872ce26b9fb528e0ec0abd474883ca8aR457
I did benchmark/profile it; that is how the expensive WSACreateEvent turned up. What is also expensive is get_fd_family, getsockname in particular.
loop, set_poll and reset_poll are among the most frequently called functions in libzmq, so their performance is critical.
Something like an LRU cache could be used for get_fd_family.
If you can implement it via WSAEnumNetworkEvents, I think select and the associated methods can be avoided.
I have added a cache in 37914d1be23b89f7bd747d02ee5a56a18a12d7c3. It is not LRU, it just randomly overwrites entries, so this could still be improved.
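For reference, a minimal sketch of that caching idea (this is not the code from that commit; the names and the cache size are made up):

```cpp
#include <winsock2.h>
#include <cstdlib>
#include <utility>

enum { fd_family_cache_size = 8 };
//  Note: value 0 doubles as "empty slot" here; a real implementation would
//  use an explicit sentinel for unused entries.
static std::pair<SOCKET, int> fd_family_cache[fd_family_cache_size];

static int determine_fd_family (SOCKET fd)
{
    //  The expensive part: ask the OS for the local address of the socket.
    sockaddr_storage addr = {};
    int addr_size = sizeof addr;
    if (getsockname (fd, reinterpret_cast<sockaddr *> (&addr), &addr_size) != 0)
        return AF_UNSPEC;
    return addr.ss_family;
}

int get_fd_family_cached (SOCKET fd)
{
    for (int i = 0; i < fd_family_cache_size; ++i)
        if (fd_family_cache[i].first == fd)
            return fd_family_cache[i].second;

    const int family = determine_fd_family (fd);
    //  Random replacement: simpler than LRU, no bookkeeping on cache hits.
    fd_family_cache[rand () % fd_family_cache_size] =
      std::make_pair (fd, family);
    return family;
}
```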
I attempted an implementation, wsa_event_select_t, in https://github.com/sigiesec/libzmq/tree/add-wsa-eventselect-poller, but I got stuck somehow. I believe the problem is that FD_WRITE is only edge-triggered, i.e. it is only signalled again after a send failed with WSAEWOULDBLOCK, which is not quite compatible with how the poller is used in libzmq (though that is probably not the whole story, since epoll also behaves this way). If anyone has an idea whether this really is the problem and/or how to solve it, that would be great.
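To illustrate the FD_WRITE semantics: WSAEventSelect signals FD_WRITE once after registration (or on connect/accept) and then only again after a send has failed with WSAEWOULDBLOCK, so a poller emulating level-triggered writability has to keep writing until the buffer really is full before it waits again. A small sketch of that rule (hypothetical helper, not code from the branch above):

```cpp
#include <winsock2.h>

//  Returns true if the caller should wait for the next FD_WRITE event,
//  false if all data was flushed without filling the send buffer.
bool drain_send_queue (SOCKET fd, const char *data, int size)
{
    int written = 0;
    while (written < size) {
        const int rc = send (fd, data + written, size - written, 0);
        if (rc == SOCKET_ERROR) {
            //  WSAEWOULDBLOCK means the send buffer is full: only now is a
            //  future FD_WRITE notification guaranteed, so it is safe to wait.
            //  (Real error handling is out of scope for this sketch.)
            return WSAGetLastError () == WSAEWOULDBLOCK;
        }
        written += rc;
    }
    return false;   //  everything sent; no FD_WRITE is pending
}
```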
What might be even better in terms of performance/scalability would be to use I/O completion ports, which is what NetMQ does: http://somdoron.com/2014/11/netmq-iocp/ However, I think this requires more extensive changes to libzmq.
@somdoron Are you planning to come to the ZeroMQ Pre-FOSDEM Hackathon in February? Maybe we could work on porting your approach to the native libzmq then.
@sigiesec Could you provide a simple benchmark that uses getsockopt on Windows alone? E.g. creating 1,000,000 sockets in a loop and getting the family vs. just creating them?
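Something along these lines could serve as a starting point (a hypothetical micro-benchmark, not code from this thread; it uses getsockname for the family lookup, since that is what get_fd_family is said to call above, binds the sockets so getsockname has an address to report, and needs to be linked against ws2_32.lib):

```cpp
#include <winsock2.h>
#include <windows.h>
#include <cstdio>

static SOCKET make_bound_socket ()
{
    SOCKET s = socket (AF_INET, SOCK_STREAM, 0);
    sockaddr_in local = {};
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    local.sin_port = 0;   //  any free port
    bind (s, reinterpret_cast<sockaddr *> (&local), sizeof local);
    return s;
}

int main ()
{
    WSADATA wsa;
    WSAStartup (MAKEWORD (2, 2), &wsa);

    const int iterations = 100000;
    LARGE_INTEGER freq, t0, t1, t2;
    QueryPerformanceFrequency (&freq);

    //  Pass 1: socket creation (plus bind) only.
    QueryPerformanceCounter (&t0);
    for (int i = 0; i < iterations; ++i) {
        SOCKET s = make_bound_socket ();
        closesocket (s);
    }
    QueryPerformanceCounter (&t1);

    //  Pass 2: same, plus the family lookup via getsockname.
    for (int i = 0; i < iterations; ++i) {
        SOCKET s = make_bound_socket ();
        sockaddr_storage addr = {};
        int addr_size = sizeof addr;
        getsockname (s, reinterpret_cast<sockaddr *> (&addr), &addr_size);
        closesocket (s);
    }
    QueryPerformanceCounter (&t2);

    printf ("creation only:          %.3f s\n",
            double (t1.QuadPart - t0.QuadPart) / freq.QuadPart);
    printf ("creation + getsockname: %.3f s\n",
            double (t2.QuadPart - t1.QuadPart) / freq.QuadPart);

    WSACleanup ();
    return 0;
}
```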
An even faster way appears to be the Winsock Registered I/O extensions, which are supported from Windows 8.1 (RIOCloseCompletionQueue, RIOCreateCompletionQueue, RIOCreateRequestQueue, RIODequeueCompletion, RIODeregisterBuffer, RIONotify, RIOReceive, RIOReceiveEx, RIORegisterBuffer, RIOResizeCompletionQueue, RIOResizeRequestQueue, RIOSend, RIOSendEx).
@sigiesec Not sure if it is helpful, but NetMQ is using I/O Completion Ports. It required some work to get it working.
You can read more here: http://somdoron.com/2014/11/netmq-iocp/
@somdoron Thanks, I already read your post ;) If I understand it correctly, all users of polling within libzmq must be changed to the reactive model, and then there can be an adapter that maps the existing polling mechanisms to the current poller_t API. That is very high level, but would you agree so far?
Only the internal poller has to change; the external one (aka zmq_poll and zmq_poller) should stay as is.
The internal poller and the internal threading have to change. I can dig up the NetMQ commit if it will help you. It is kind of a big change, and I am not sure it will give the same performance for non-Windows OSes.
Yes, a link to the relevant changes would be great.
Of course the internal and external pollers can be changed independently, but why do you suggest changing only the internal poller?
The external poller polls over very few FDs, usually 2 or 3. Even on Linux, ZeroMQ uses poll and not epoll for the external poller. So there is no need to invest in IOCP for the external poller; you will not see any performance gain.
Ok good to know that, thanks!
this is the main commit:
https://github.com/zeromq/netmq/commit/99abdf8e84b3e341fffc1a2d8cd20741882eb1d0
Also, I created a library for that called AsyncIO, which wraps IOCP with a nice API that fits .NET.
I think it would also make your life easier to create such a library that wraps kqueue, epoll, IOCP, poll and select and provides a simple API to ZeroMQ.
It might be much easier to integrate https://github.com/piscisaureus/wepoll
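Very roughly, a wepoll-based backend could look much like the existing epoll one, since wepoll exposes (nearly) the Linux epoll API on Windows. A sketch only, assuming wepoll.h/wepoll.c from that repository and its SOCKET-typed member in the epoll_data union:

```cpp
#include <winsock2.h>
#include "wepoll.h"   //  from https://github.com/piscisaureus/wepoll

int poll_once (HANDLE ephnd, SOCKET fd)
{
    //  Register the socket for readability, like EPOLL_CTL_ADD on Linux.
    epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.sock = fd;
    if (epoll_ctl (ephnd, EPOLL_CTL_ADD, fd, &ev) < 0)
        return -1;

    //  Wait up to 100 ms, same shape as the Linux epoll_wait loop.
    epoll_event ready[16];
    const int n = epoll_wait (ephnd, ready, 16, 100);
    for (int i = 0; i < n; ++i) {
        if (ready[i].events & EPOLLIN) {
            //  dispatch in_event () for ready[i].data.sock
        }
    }
    return n;
}

//  Usage: HANDLE ephnd = epoll_create1 (0); ... epoll_close (ephnd);
```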
That sounds quite promising!
This might be even more interesting: https://github.com/truebiker/epoll_windows/commit/32442e432b3376cb98f5c0bda5a9a0a5e832b857 It is a fork based on a quite old version of wepoll, which adds support for eventfd-like abilities.
I tested the performance of the "old" select-based zmq_poller_poll implementation vs. the "new" WSAPoll-based zmq_poller_poll implementation on a Windows 7 x64 machine, and found the "old" select-based variant to perform about 10% better in an overall test scenario that uses libzmq fairly deep inside.
Is this as expected? Is WSAPoll known to perform better in some scenarios?
These results I found speak against that: https://groups.google.com/forum/#!topic/openpgm-dev/9qA1u-aTIKs
In fact, they seem to indicate that using WSAWaitForMultipleEvents would perform even better than select, so maybe it would make sense to add another implementation using that? Any thoughts?