zeromq / libzmq

ZeroMQ core engine in C++, implements ZMTP/3.1
https://www.zeromq.org
Mozilla Public License 2.0

Performance of WSAPoll vs. select vs. WSAWaitForMultipleEvents on Windows #2805

Open sigiesec opened 7 years ago

sigiesec commented 7 years ago

I tested the performance of the "old" select-based zmq_poller_poll implementation vs. the "new" WSAPoll-based zmq_poller_poll implementation on a Windows 7 x64 machine, and found that the "old" select-based variant performs about 10% better in an overall test scenario that uses libzmq deep inside.

Is this as expected? Is WSAPoll known to perform better in some scenarios?

The results I found speak against that: https://groups.google.com/forum/#!topic/openpgm-dev/9qA1u-aTIKs

In fact, they seem to indicate that using WSAWaitForMultipleEvents would even perform better than select, so maybe it would make sense to add another implementation using that? Any thoughts?

bluca commented 7 years ago

We have multiple implementations for Linux too, so IMHO it's fine if you want to add a new one

sigiesec commented 7 years ago

I just noticed that there already is an implementation using WSAWaitForMultipleEvents, but it is only used when there are sockets from multiple address families in a poller.

It also seems to be broken: after calling WSAWaitForMultipleEvents, it just falls through to the select code... Instead, WSAEnumNetworkEvents would probably need to be called on the event that was signalled.
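
Roughly, I imagine the fix would look something like this (just a sketch of the pattern, not tested against the libzmq code; error handling mostly omitted):

    #include <winsock2.h>

    //  Sketch only: wait for any of the registered events, then ask
    //  WSAEnumNetworkEvents which network events fired on that socket,
    //  instead of falling back to a second select () pass.
    int wait_and_dispatch (SOCKET *sockets_, WSAEVENT *events_, DWORD count_,
                           DWORD timeout_ms_)
    {
        const DWORD rc =
          WSAWaitForMultipleEvents (count_, events_, FALSE, timeout_ms_, FALSE);
        if (rc == WSA_WAIT_TIMEOUT || rc == WSA_WAIT_FAILED)
            return -1;

        const DWORD index = rc - WSA_WAIT_EVENT_0;

        //  Also resets the event object, so the next wait starts clean.
        WSANETWORKEVENTS network_events;
        if (WSAEnumNetworkEvents (sockets_[index], events_[index],
                                  &network_events) != 0)
            return -1;

        if (network_events.lNetworkEvents & (FD_READ | FD_ACCEPT | FD_CLOSE)) {
            //  ... handle readability for sockets_[index] ...
        }
        if (network_events.lNetworkEvents & (FD_WRITE | FD_CONNECT)) {
            //  ... handle writability for sockets_[index] ...
        }
        return static_cast<int> (index);
    }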

sigiesec commented 7 years ago

@Kentzo since you added this code in 538e5d47421f45d88a8b3b005ed9ee4026623a7b, can you say whether you tested this? It is not covered on AppVeyor at the moment, unfortunately.

Kentzo commented 7 years ago

@sigiesec I did not benchmark it, because the goal was to make it work at all. We run this code on thousands of machines, mostly Windows, every day.

I thought about WSAEnumNetworkEvents, but decided to limit my intrusion, as there was no WSAPoll implementation for Windows. Probably both changes can be unified with WSAEnumNetworkEvents. One thing I recall is the constant global limit on the number of sockets WSAEnumNetworkEvents can take care of.

sigiesec commented 7 years ago

Ok, I now had another look and I think I understand how it works. I think a different implementation might do without select completely and outperform this one.

What is suboptimal about the current implementation is that the events are created and configured on every call to loop, and even within each while iteration. It would be better to create them only once and reconfigure them only when the poll set changes.

In addition, the events were created even when there was only one address family, in which case they were never used.
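
Something along these lines is what I have in mind (a sketch with made-up names, not the actual code): create the event once, re-associate it only when the poll set changes, and reset it between iterations instead of recreating it every time.

    #include <winsock2.h>

    //  Sketch: one long-lived event per poller instead of a fresh
    //  WSACreateEvent on every loop () pass.
    struct cached_wsa_event_t
    {
        cached_wsa_event_t () : event (WSACreateEvent ()) {}   //  paid once
        ~cached_wsa_event_t () { WSACloseEvent (event); }

        //  Only needed when the poll set or the requested event mask changes.
        void rebind (SOCKET socket_, long mask_)
        {
            WSAEventSelect (socket_, event, mask_);
        }

        //  Cheap, done at the top of each iteration instead of re-creating.
        void rearm () { WSAResetEvent (event); }

        WSAEVENT event;
    };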

Kentzo commented 7 years ago

I thought WSAEVENT is a simple C struct and the time it takes to create one is negligible. I'd suggest benchmarking it before adding any caches.

sigiesec commented 7 years ago

WSAEVENT is, but the wsa_events_t ctor calls WSACreateEvent, which is expensive: https://github.com/zeromq/libzmq/commit/538e5d47421f45d88a8b3b005ed9ee4026623a7b#diff-872ce26b9fb528e0ec0abd474883ca8aR457

sigiesec commented 7 years ago

I did benchmark/profile it; that is how the expensive WSACreateEvent turned up. What is also expensive is get_fd_family, in particular the getsockname call.

loop, set_poll and reset_poll are among the most frequently called functions in libzmq, so their performance is critical.

Kentzo commented 7 years ago

Something like an LRU cache could be used for get_fd_family.

If you can implement it via WSAEnumNetworkEvents, I think select and the associated methods can be avoided.

sigiesec commented 7 years ago

I have added a cache in 37914d1be23b89f7bd747d02ee5a56a18a12d7c3. It is not LRU, but just randomly overwrites entries, so this could still be improved.
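
For reference, the idea is roughly the following (an illustrative sketch, not the committed code):

    #include <winsock2.h>
    #include <cstdlib>

    //  Sketch: a small fixed-size map from SOCKET to address family, with
    //  random eviction, so getsockname () is only paid on a cache miss.
    enum { fd_family_cache_size = 8 };
    static SOCKET cached_socket[fd_family_cache_size];
    static int cached_family[fd_family_cache_size];
    static size_t cache_entries = 0;

    static int query_fd_family (SOCKET s_)
    {
        sockaddr_storage addr;
        int addr_len = sizeof addr;
        //  This is the expensive call the cache is meant to avoid.
        if (getsockname (s_, reinterpret_cast<sockaddr *> (&addr), &addr_len))
            return AF_UNSPEC;
        return addr.ss_family;
    }

    int get_fd_family_cached (SOCKET s_)
    {
        for (size_t i = 0; i < cache_entries; ++i)
            if (cached_socket[i] == s_)
                return cached_family[i];

        const int family = query_fd_family (s_);
        //  Overwrite a random slot once the cache is full; an LRU policy
        //  would be the obvious refinement.
        const size_t slot = cache_entries < fd_family_cache_size
                              ? cache_entries++
                              : std::rand () % fd_family_cache_size;
        cached_socket[slot] = s_;
        cached_family[slot] = family;
        return family;
    }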

I attempted an implementation wsa_event_select_t in https://github.com/sigiesec/libzmq/tree/add-wsa-eventselect-poller, but I got stuck. I believe the problem is that FD_WRITE is only edge-triggered, i.e. it is only triggered again after a send has failed with WSAEWOULDBLOCK, which is not quite compatible with how the poller is used in libzmq (although that is probably not the real problem, since epoll also behaves this way). If anyone has an idea whether this is really the problem and/or how to solve it, that would be great.
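
To illustrate the semantics I mean (as I understand the WSAEventSelect documentation; sketch only, made-up names): writability has to be tracked by the caller, because FD_WRITE is only re-signalled after a send has failed with WSAEWOULDBLOCK.

    #include <winsock2.h>

    //  Sketch: level-triggered "writable" state the poller has to keep
    //  itself, because FD_WRITE only fires again after a send blocks.
    static bool writable = false;

    void on_fd_write ()
    {
        //  FD_WRITE fired: the socket stays writable until a send blocks.
        writable = true;
    }

    int send_some (SOCKET s_, const char *buf_, int len_)
    {
        const int rc = send (s_, buf_, len_, 0);
        if (rc == SOCKET_ERROR && WSAGetLastError () == WSAEWOULDBLOCK)
            //  Only now will another FD_WRITE be generated, once buffer
            //  space frees up again.
            writable = false;
        return rc;
    }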

sigiesec commented 7 years ago

What might be even better in terms of performance/scalability would be to use I/O completion ports, which is what NetMQ does: http://somdoron.com/2014/11/netmq-iocp/ However, I think this requires more extensive changes to libzmq.
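
Very roughly, the completion-port model looks like the following (assumed names, just to illustrate the difference from readiness polling; not NetMQ or libzmq code):

    #include <winsock2.h>
    #include <windows.h>

    //  Sketch: I/O is submitted up front (overlapped WSARecv/WSASend) and a
    //  worker thread dequeues completions, instead of polling for readiness.
    void completion_loop (HANDLE port_)
    {
        for (;;) {
            DWORD bytes = 0;
            ULONG_PTR key = 0;
            OVERLAPPED *ov = NULL;
            //  Blocks until a previously submitted operation completes.
            if (!GetQueuedCompletionStatus (port_, &bytes, &key, &ov, INFINITE))
                break;
            //  ... dispatch the completion to the engine identified by key ...
        }
    }

    HANDLE attach_socket (HANDLE port_, SOCKET s_, ULONG_PTR key_)
    {
        //  Associate the socket with the port; completions for overlapped
        //  operations on it will then arrive in completion_loop ().
        return CreateIoCompletionPort (reinterpret_cast<HANDLE> (s_), port_,
                                       key_, 0);
    }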

@somdoron Are you planning to come to the ZeroMQ Pre-FOSDEM Hackathon in February? Maybe we could work on porting your approach to the native libzmq then.

Kentzo commented 7 years ago

@sigiesec Could you provide a simple benchmark that uses getsockopt on Windows alone? E.g. creating 1,000,000 sockets in a loop and getting the family vs. just creating them?
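
Something along these lines is what I have in mind (a sketch; it uses getsockname for the family lookup, since that is the call you said was expensive):

    #include <winsock2.h>
    #include <windows.h>
    #include <cstdio>

    int main ()
    {
        WSADATA wsa_data;
        WSAStartup (MAKEWORD (2, 2), &wsa_data);

        const int iterations = 1000000;   //  adjust as needed

        //  Pass 1: create (and bind) only.
        DWORD start = GetTickCount ();
        for (int i = 0; i < iterations; ++i) {
            const SOCKET s = socket (AF_INET, SOCK_STREAM, 0);
            sockaddr_in local = {};
            local.sin_family = AF_INET;
            local.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
            bind (s, reinterpret_cast<sockaddr *> (&local),
                  static_cast<int> (sizeof local));
            closesocket (s);
        }
        const DWORD create_only_ms = GetTickCount () - start;

        //  Pass 2: the same, plus the family lookup via getsockname.
        start = GetTickCount ();
        for (int i = 0; i < iterations; ++i) {
            const SOCKET s = socket (AF_INET, SOCK_STREAM, 0);
            sockaddr_in local = {};
            local.sin_family = AF_INET;
            local.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
            bind (s, reinterpret_cast<sockaddr *> (&local),
                  static_cast<int> (sizeof local));
            sockaddr_storage addr;
            int addr_len = sizeof addr;
            getsockname (s, reinterpret_cast<sockaddr *> (&addr), &addr_len);
            closesocket (s);
        }
        const DWORD with_family_ms = GetTickCount () - start;

        printf ("create+bind: %lu ms, create+bind+family: %lu ms\n",
                static_cast<unsigned long> (create_only_ms),
                static_cast<unsigned long> (with_family_ms));

        WSACleanup ();
        return 0;
    }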

sigiesec commented 6 years ago

An even faster way appears to be the Winsock Registered I/O extensions, which are supported from Windows 8.1 (RIOCloseCompletionQueue, RIOCreateCompletionQueue, RIOCreateRequestQueue, RIODequeueCompletion, RIODeregisterBuffer, RIONotify, RIOReceive, RIOReceiveEx, RIORegisterBuffer, RIOResizeCompletionQueue, RIOResizeRequestQueue, RIOSend, RIOSendEx).

somdoron commented 6 years ago

@sigiesec not sure if it is helpful, but NetMQ is using I/O Completion Ports. It required some work to get it working.

You can read more here: http://somdoron.com/2014/11/netmq-iocp/

sigiesec commented 6 years ago

@somdoron Thanks, I already read your post ;) If I understand it correctly, all users of polling within libzmq must be changed to the reactive model, and then, for the existing polling mechanisms, there can be an adapter that maps to the current poller_t API. Very high level, but would you agree so far?

somdoron commented 6 years ago

Only the internal poller has to change; the external one (i.e. zmq_poll and zmq_poller) should stay as is.

The internal poller and the internal threading have to change; I can dig up the NetMQ commit if it would help you. It is kind of a big change, and I am not sure it would give the same performance on non-Windows OSes.

sigiesec commented 6 years ago

Yes, a link to the relevant changes would be great.

Of course the internal and external pollers can be changed independently, but why do you suggest changing only the internal poller?

somdoron commented 6 years ago

The external poller polls over very few FDs, usually 2 or 3. Even on Linux, ZeroMQ uses poll and not epoll for the external poller. So there is no need to invest in IOCP for the external poller; you will not see any performance gain.

sigiesec commented 6 years ago

Ok good to know that, thanks!

somdoron commented 6 years ago

this is the main commit:

https://github.com/zeromq/netmq/commit/99abdf8e84b3e341fffc1a2d8cd20741882eb1d0

Also, I created a library for that called AsyncIO, which wraps IOCP with a nice API that fits .NET.

I think it would also make your life easier to create such a library to wrap kqueue, epoll, IOCP, poll and select and provide a simple API to ZeroMQ.

sigiesec commented 6 years ago

It might be much easier to integrate https://github.com/piscisaureus/wepoll

bluca commented 6 years ago

That sounds quite promising!

sigiesec commented 6 years ago

This might be even more interesting: https://github.com/truebiker/epoll_windows/commit/32442e432b3376cb98f5c0bda5a9a0a5e832b857 It is a fork of a fairly old version of wepoll that adds support for eventfd-like capabilities.