zeromq / jeromq

JeroMQ is a pure Java implementation of the ZeroMQ messaging library, offering high-performance asynchronous messaging for distributed or concurrent applications.
https://zeromq.org
Mozilla Public License 2.0
2.36k stars 485 forks source link

ZMQ.Context.term() hangs even if setLinger(0) in presence of unstable network #598

Open lultimouomo opened 6 years ago

lultimouomo commented 6 years ago

The following program expects a hartbeat from a PUB socket once a second (actually waits 2 seconds for margin); if it doesn't receive it, it disconnects and reconnects the SUB socket. If I disconnect and reconnect the WiFi on the computer running it (with PUB on a separate computer) I can get the closing of the ZMQ.Context to hang forever. The logs also shows that somehow the socket connection and subscribing thinkgs it succeeds even though the network is disconnected (obviously the heartbeat is not received).

package it.awtech.awdoc.log_listener;

import org.zeromq.ZMQ;
import org.zeromq.ZMQ.Context;
import org.zeromq.ZMQ.Socket;

import java.util.Arrays;
import java.util.concurrent.CountDownLatch;

public class App2
{
    private static volatile boolean shutDown;
    private static final CountDownLatch shutDownLatch = new CountDownLatch(1);

    public static void main( String[] args ) {
        System.out.println("SUB started with args " + Arrays.toString(args));

        try (final Context context = ZMQ.context(1)) {
             try (final Socket subscriber = context.socket(ZMQ.SUB)) {
                 subscriber.setLinger(0);
                 subscriber.setSendTimeOut(2000);
                 subscriber.setReceiveTimeOut(2000);
                 Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                     shutDown = true;
                     try {
                         shutDownLatch.await();
                     } catch (InterruptedException e) {
                         e.printStackTrace();
                     }
                 }));
                 boolean connected = false;
                 while (!shutDown) {
                     try {
                         if (!connected) {
                             try {
                                 subscriber.connect(args[0]);
                                 System.out.println("Connected to server");
                                 subscriber.subscribe("");
                                 System.out.println("Subscribed topic");
                                 connected = true;
                             } catch (Exception e) {
                                 System.out.println("Error connecting socket");
                                 e.printStackTrace();
                                 Thread.sleep(1000);
                                 continue;
                             }
                         }
                         // Ignore organization
                         String topic = subscriber.recvStr();
                         if (topic == null) {
                             System.out.println("Didn't get heartbeat, disconnecting.");
                             subscriber.disconnect(args[0]);
                             connected = false;
                         }
                     } catch (Exception e) {
                         if (shutDown) {
                             System.out.println("Exiting ZMQ thread");
                             break;
                         }
                         System.out.println("Error receiving message");
                         e.printStackTrace();
                     }
                 }
                 System.out.println("Exiting socket try with resources");
             }
            System.out.println("Exiting context try with resources");
        }
        System.out.println("Exited try with resources");
        shutDownLatch.countDown();
    }
}

This is the log for a run, lines starting with "###" are comments about when I connected or disconnected the wifi or stopped the program.

SUB started with args [tcp://devel.awdoc.com:10082]
Connected to server
Subscribed topic
### DISCONNECT WIFI ###
Didn't get heartbeat, disconnecting.
Connected to server
Subscribed topic
Didn't get heartbeat, disconnecting.
Connected to server
Subscribed topic
Didn't get heartbeat, disconnecting.
Connected to server
Subscribed topic
Didn't get heartbeat, disconnecting.
Connected to server
Subscribed topic
Didn't get heartbeat, disconnecting.
Connected to server
Subscribed topic
Didn't get heartbeat, disconnecting.
Connected to server
Subscribed topic
### RECONNECT WIFI ###
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Didn't get heartbeat, disconnecting.
Error connecting socket
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
Error connecting socket
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Error connecting socket
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
Error connecting socket
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
Error connecting socket
Error connecting socket
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
Error connecting socket
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
Error connecting socket
Error connecting socket
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
Connected to server
Subscribed topic
### DISCONNECT WIFI ###
Didn't get heartbeat, disconnecting.
Connected to server
Subscribed topic
Didn't get heartbeat, disconnecting.
Connected to server
Subscribed topic
Didn't get heartbeat, disconnecting.
Connected to server
Subscribed topic
Didn't get heartbeat, disconnecting.
Connected to server
Subscribed topic
Didn't get heartbeat, disconnecting.
Error connecting socket
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
Error connecting socket
Error connecting socket
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
Error connecting socket
### RECONNECT WIFI ###
Error connecting socket
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
Error connecting socket
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
Error connecting socket
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
Error connecting socket
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
Error connecting socket
java.lang.IllegalArgumentException: java.net.UnknownHostException: devel.awdoc.com
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:119)
    at zmq.io.net.tcp.TcpAddress.<init>(TcpAddress.java:35)
    at zmq.io.net.Address.resolve(Address.java:96)
    at zmq.SocketBase.connect(SocketBase.java:543)
    at org.zeromq.ZMQ$Socket.connect(ZMQ.java:2531)
    at it.awtech.awdoc.log_listener.App2.main(App2.java:36)
Caused by: java.net.UnknownHostException: devel.awdoc.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at zmq.io.net.tcp.TcpAddress.resolve(TcpAddress.java:110)
    ... 5 more
Connected to server
Subscribed topic
### STOP PROGRAM WITH SIGTERM ###
Exiting socket try with resources
Exiting context try with resources
### HANGS FOREVER HERE ###
lultimouomo commented 6 years ago

Please not that I get the same result if instead of terminating the program with SIGTERM I replace the shutdown hook with this code to cause the main loop to exit by pressing enter in the console:

                 new Thread(() -> {
                     try {
                         System.in.read();
                         shutDown = true;
                     } catch (Exception e) {

                     }
                 }).start();
                 Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                     try {
                         shutDownLatch.await();
                     } catch (InterruptedException e) {
                         e.printStackTrace();
                     }
                 }));
lultimouomo commented 6 years ago

Also note that this doesn't seem to happen with the JNI bindings

fredoboulo commented 6 years ago

Hi, I tried your code and it works fine on my machine with latest snapshot.

Which version are you using? What is your OS, java version, ... ?

lultimouomo commented 6 years ago

jeromq version 0.4.3, Debian sid, linux 4.14.0, openjdk version 1.8.0_181 and version 10.0.1 2018-04-17. To reliably reproduce this I need to turn off and back on the WiFi 3-4 times (though it can happen with one only sometimes)

lultimouomo commented 6 years ago

Also happens with openjdk version 10.0.2 2018-07-17

fredoboulo commented 6 years ago

Thanks for the details!

I still cannot reproduce it even with 0.4.3 and openjdk version "1.8.0_162". Maybe the PUB side is needed for the bug to appear. I suspect that issue will be difficult to unit test...

ØMQ sockets are supposed to perform automagic reconnection, so disconnecting and reconnecting seems redundant, unless you need it for application purposes. The connection process is also asynchronous: for most transports and socket types the connection is not performed immediately but as needed by ØMQ; a successful call to connect(String) does not mean that the connection was or could actually be established.

Last point, the latest snapshot includes heartbeating within the socket itself.

lultimouomo commented 6 years ago

The PUB side uses libzmq version 4.1.4.

I implemented heartbeating because in the trials I experienced a hung connection (not receiving messages and not reconnecting) when connecting between SUB and PUB through a VPN; if the VPN drops the TCP keeps on living, and when the VPN comes up again it the zmq socket does not resume receiving messages. As I understand it this is the nature of TCP, and there needs to be some form of heartbeating to fix the problem.

Is the heartbeating included in the latest snapshot compatible with libzmq? Do you have pointers to some documentation on how to enable it?

fredoboulo commented 6 years ago

... I remember the hearbeats were implemented in libzmq 4.2.x. If you have no access to the code of the PUB side it might be a deadend.

Anyhow, you can get some doc about it there: http://api.zeromq.org/4-2:zmq-setsockopt, look for ZMQ_HEARTBEAT_IVL, ZMQ_HEARTBEAT_TIMEOUT, ZMQ_HEARTBEAT_TTL.

You can also have a look at the HeartbeatsTest class. It's unit test for zmq, not for org.zeromq but it might provide some light.