On our ZYRE production server, we occasionally observe sockets that are never closed.
In some cases, up to 200 such sockets accumulate for the same remote ZYRE node.
Environment
libzmq version (commit hash if unreleased): 3.4
OS: reproduced on
Linux CentOS (32 & 64 bits - x86 and ARM),
Rocky (64 bits) (x86)
Minimal test code / Steps to reproduce the issue
Start ZYRE node A
Start ZYRE node B
On node A, 2 TCP sockets are seen with Node B:
Node A connected to Node B (used to send data to B).
Node B connected to Node A (used to receive data from B).
Node B goes offline (out of Wi-Fi coverage, Ethernet cable unplugged, Windows hibernation, ...)
On node A, after some time, the ZYRE layer detects that node B is no longer present and destroys peer B along with the socket to it (node A to B).
What's the actual result? (include assertion message & call stack if applicable)
Socket from node B to node A is never closed, even if
node B application is restarted or
node B is rebooted.
Note:
This is not visible if the application on node B is stopped properly (thanks to the TCP layer sending a TCP RESET).
What's the expected result?
Sockets from remote nodes should be closed automatically when the remote disappears:
Either the ZYRE peer destruction should close them,
or enabling TCP KEEPALIVE from the ZYRE application should.
I failed to get a working implementation in either of these two cases.
Possible solution
I dug into LIBZMQ and ZYRE for quite some time.
I tried different approaches, but I never managed to get access to the ACCEPT()ed socket
in this particular scenario.
Finally, I have a 'draft' possible workaround that enables TCP KEEPALIVE right after a particular ACCEPT() in tcp_listener.cpp.
Basically, the idea is like:
sock = accept(s_);
...
tune_tcp_keepalives(sock, x, y, y);
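To make the idea above concrete outside of libzmq, here is a minimal sketch of what enabling keepalive on an accepted descriptor could look like with plain socket options. The helper name `enable_keepalive` and its parameters are hypothetical and are not part of libzmq or ZYRE; the option names `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` are the Linux ones, so this assumes a Linux target.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper (not a libzmq or ZYRE function): turn on TCP
   keepalive for an already accept()ed socket, using Linux option names.
   idle_s:  seconds of silence before the first probe is sent
   intvl_s: seconds between unanswered probes
   cnt:     unanswered probes before the kernel drops the connection
   Returns 0 on success, -1 if any setsockopt() call fails. */
int enable_keepalive (int fd, int idle_s, int intvl_s, int cnt)
{
    int on = 1;

    assert (fd >= 0);

    if (setsockopt (fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) != 0)
        return -1;
    if (setsockopt (fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) != 0)
        return -1;
    if (setsockopt (fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof intvl_s) != 0)
        return -1;
    if (setsockopt (fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt) != 0)
        return -1;
    return 0;
}
```

With, for example, `enable_keepalive (sock, 60, 10, 5)`, a peer that vanished without closing the connection would be detected after roughly 60 + 5 * 10 = 110 seconds of silence, at which point the kernel reports an error on the socket and its owner can close it.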