zeromq / libzmq

ZeroMQ core engine in C++, implements ZMTP/3.1
https://www.zeromq.org
Mozilla Public License 2.0

Never closed sockets when remote ZYRE node goes offline #4729

Open stephan57160 opened 3 months ago

stephan57160 commented 3 months ago

Issue description

On our ZYRE production server, we occasionally observe sockets that are never closed. Sometimes the count goes up to 200 sockets to the same remote ZYRE node.

Environment

Minimal test code / Steps to reproduce the issue

  1. Start ZYRE node A
  2. Start ZYRE node B
  3. On node A, 2 TCP sockets are seen with Node B:
    • Node A connected to Node B (used to send data to B).
    • Node B connected to Node A (used to receive data from B).
  4. Node B goes offline (out of Wi-Fi coverage, Ethernet cable unplugged, Windows hibernation, ...)
  5. On node A, after some time, the ZYRE layer detects that node B is no longer present, and peer B is destroyed together with the socket to it (node A to B).

What's the actual result? (include assertion message & call stack if applicable)

Socket from node B to node A is never closed, even if

Note: This is not visible if the application on node B is stopped properly (thanks to the TCP layer sending a TCP RESET).

What's the expected result?

Sockets from remote nodes should be closed automatically when the remote node disappears:

I failed to get a working implementation in either of those 2 cases.

Possible solution

I dug into LIBZMQ and ZYRE for quite some time. I tried different approaches, but I never managed to get access to the accept()ed socket in this particular scenario.

Finally, I have a 'draft' of a possible workaround that enables TCP keepalive right after a particular accept() call in tcp_listener.cpp. The basic idea is:

  sock = accept(s_);
  ...
  tune_tcp_keepalives(sock, x, y, y);