Issue description

On our ZYRE production server, we occasionally observe sockets that are never closed. Sometimes up to 200 such sockets accumulate towards the same remote ZYRE node.

Environment

libzmq version (commit hash if unreleased): 3.4
OS: reproduced on
- Linux CentOS (32 & 64 bits - x86 and ARM),
- Rocky (64 bits) (x86)

Minimal test code / Steps to reproduce the issue

Start ZYRE node A
Start ZYRE node B
On node A, 2 TCP sockets are seen with Node B:
- Node A connected to Node B (used to send data to B).
- Node B connected to Node A (used to receive data from B).
Node B goes offline (out of Wi-Fi coverage, Ethernet cable unplugged, Windows hibernation, ...)
On node A, after some time, the ZYRE layer detects that node B is no longer present, and PEER B is destroyed together with the socket to it (node A to B).

What's the actual result? (include assertion message & call stack if applicable)

Socket from node B to node A is never closed, even if

node B application is restarted or
node B is rebooted.

Note: This is not visible if the application on node B is stopped properly (thanks to the TCP layer sending a TCP RESET).

What's the expected result?

Sockets from remote nodes should be closed automatically when the remote node disappears:

Either the ZYRE peer destruction should close them,
or enabling TCP KEEPALIVE from the ZYRE application should.

I could not get a working implementation in either of those 2 cases.

Possible solution

I dug into LIBZMQ and ZYRE for quite some time. I tried different approaches, but I never managed to get access to the ACCEPT()ed socket in this particular scenario.

Finally, I have a 'draft' workaround that enables TCP KEEPALIVE right after a particular ACCEPT() in tcp_listener.cpp. The idea is roughly:

  sock = accept(s_);
  ...
  tune_tcp_keepalives(sock, x, y, y);
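For illustration, here is a minimal sketch of what enabling keepalive on an accepted socket could look like in plain C, outside of libzmq. It assumes Linux, since TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux option names. The helper name enable_keepalive and its parameters are hypothetical and are not part of libzmq or ZYRE.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper (not a libzmq function): enable TCP keepalive on fd.
 * idle_s  : seconds of inactivity before the first probe is sent
 * intvl_s : seconds between two probes
 * cnt     : number of unanswered probes before the connection is dropped
 * Returns 0 on success, -1 if any setsockopt call fails. */
int enable_keepalive(int fd, int idle_s, int intvl_s, int cnt)
{
    int on = 1;

    /* Turn keepalive probing on for this socket. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) != 0)
        return -1;
    /* Linux-specific per-socket tuning of the probe timing. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof intvl_s) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt) != 0)
        return -1;
    return 0;
}
```

With example values such as enable_keepalive(sock, 60, 10, 5), a peer that vanished without closing the connection would be detected after roughly 60 + 10 * 5 = 110 seconds of silence, at which point the kernel reports an error on the socket and its owner can close it.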
