zeromq / libzmq

ZeroMQ core engine in C++, implements ZMTP/3.1
https://www.zeromq.org
Mozilla Public License 2.0

Loss of subscription inside a Docker container whose host had a network outage #4476

Open ConfusedMerlin opened 1 year ago


Issue description

A Node.js-powered ZeroMQ subscriber loses its connection to the broker/publisher. The subscriber runs inside a Docker container whose host just had a network event that caused a disconnect/reconnect of the network interface through which the Docker host reaches the ZeroMQ broker/publisher. From then on, messages for the subscriber's topics no longer reach the subscriber until the Docker container it is running in is restarted.

If the subscriber does not run inside a Docker container but directly on the Docker host, the network outage does not affect its ability to receive messages for the topics it subscribed to.

Manipulating the TCP timeouts (inside the container and on the Docker host) does not solve this.
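Separate from the TCP timeouts, libzmq also has heartbeats on the ZMTP level (ZMQ_HEARTBEAT_IVL, ZMQ_HEARTBEAT_TIMEOUT, ZMQ_HEARTBEAT_TTL). The following is only a minimal sketch, assuming that zeromq 6 for Node.js exposes them as the socket options heartbeatInterval, heartbeatTimeout and heartbeatTimeToLive. The helper name heartbeatOptions and the factors 2 and 3 are made up for illustration, and whether this changes anything in the Docker scenario is untested.

```javascript
// Hypothetical helper: derives ZMTP heartbeat settings from one interval.
// Assumption: zeromq 6 for Node.js accepts these option names, which map
// to ZMQ_HEARTBEAT_IVL, ZMQ_HEARTBEAT_TIMEOUT and ZMQ_HEARTBEAT_TTL.
function heartbeatOptions(intervalMs) {
  if (!Number.isInteger(intervalMs) || intervalMs <= 0) {
    throw new RangeError("intervalMs must be a positive integer");
  }
  return {
    heartbeatInterval: intervalMs,       // send a PING every intervalMs
    heartbeatTimeout: intervalMs * 2,    // give up on the connection if nothing comes back in time
    heartbeatTimeToLive: intervalMs * 3, // ask the remote peer to time this connection out as well
  };
}

// Usage in worker.js (not executed here, it needs the zeromq package):
// const sock = new zmq.Subscriber(heartbeatOptions(5000));
```

With an interval of 5000 this yields a timeout of 10000 ms and a time-to-live of 15000 ms; the values are arbitrary and would need tuning.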

Environment

Minimal test code / Steps to reproduce the issue

You need Docker for that and two VMs (works with physical devices too).

By re-using a tutorial from dev.to for the upcoming zeromq 6 on Node.js, two js files were created. Note that this is the code for 6.0.0b; you can make it work for 5.2.8 if you switch out the socket creation (there is a commented line with the code for that).

Beware, the worker has a fixed IP inserted for the VM where the broker/publisher/server is running. You should adjust that to fit your setup:

// @/server.js
const Fastify = require("fastify");
const zmq = require("zeromq");

const app = Fastify();
// const sock = zmq.socket("pub"); // zeromq 5.2.8
const sock = new zmq.Publisher(); // zeromq 6

app.post("/", async (request, reply) => {
  await sock.send(["dev.to", JSON.stringify({ ...request.body })]);
  return reply.send("Sent to the subscriber/worker.");
});

const main = async () => {
  try {
    await sock.bind("tcp://*:7890");
    await app.listen(3000, '0.0.0.0');
  } catch (err) {
    console.error(err);
    process.exit(1);
  }
};
main();

// @/worker.js
const zmq = require("zeromq");

// zeromq 5.2.8:
// const sock = zmq.socket("sub");
// zeromq 6:
const sock = new zmq.Subscriber();

const main = async () => {
  try {
    sock.connect("tcp://10.0.2.4:7890"); // IP of the VM running server.js
    sock.subscribe("dev.to");
    for await (const [topic, msg] of sock) {
      console.log("Received message from " + topic + " channel:");
      console.log(JSON.parse(msg));
    }
    // zeromq 5.2.8, instead of the loop above:
    // sock.on("message", function (topic, message) {
    //   console.log("message:", message, "topic", topic);
    // });
  } catch (err) {
    console.error(err);
    process.exit(1);
  }
};
main();
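
One thing to keep in mind with the worker above: a payload that is not valid JSON makes JSON.parse throw, which ends up in the catch block and terminates the process. That is a different failure than the silent loss of messages described here, but the two are easy to mix up while testing. A minimal sketch of a more forgiving decoder (the name decodeMessage and the returned object shape are made up for illustration):

```javascript
// Hypothetical helper: turns the two frames of a published message into
// a plain object without throwing on payloads that are not JSON.
function decodeMessage([topicFrame, payloadFrame]) {
  const topic = topicFrame.toString("utf8");
  const raw = payloadFrame.toString("utf8");
  try {
    return { topic, body: JSON.parse(raw), raw };
  } catch (err) {
    // Keep the raw text so the message still shows up in the log.
    return { topic, body: null, raw };
  }
}
```

Inside the receive loop this could replace the destructuring plus JSON.parse(msg): take the frames array as a whole and log decodeMessage(frames).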

Next, a Dockerfile is used to create an image containing both files (you can make two images with one file each too, but this way is faster):

FROM ubuntu:22.04

RUN apt-get update && apt-get install -y curl
RUN curl -sL https://deb.nodesource.com/setup_14.x -o /tmp/nodesource_setup.sh
RUN bash /tmp/nodesource_setup.sh
RUN apt-get update && apt-get install -y nodejs net-tools
RUN node -v && npm -v
RUN npm install zeromq fastify
COPY server.js /var/tmp/server.js
COPY worker.js /var/tmp/worker.js
WORKDIR /var/tmp/

Build that with

docker build ./ -t zeromq

which should result in an image with the tag zeromq:latest being built. That can be started with a docker-compose.yml:

version: '3'

networks:
  zero:

services:
  worker:
    image: zeromq:latest
    networks:
      zero:
    command: node worker.js

  server:
    image: zeromq:latest
    networks:
      zero:
    command: node server.js
    ports:
      - 3000:3000
      - 7890:7890

As you can see, this docker-compose.yml defines two services: one runs server.js, the other runs worker.js. Start the server on VM A and the worker on VM B with

docker-compose up server (on VM A) and docker-compose up worker (on VM B)

We do not detach from these, so we can see the log immediately. You should start a second worker on VM A, next to the broker, so you can compare the output of both workers.

And now you can send stuff to the broker (adjust the IP to the address of VM A) with

curl -X POST 10.4.0.2:3000 -d "test"

or using Postman, if you like. Either way, the worker containers should now log whatever was in -d.

Now... go to VM B, which runs the worker container, and disconnect it from your network. I am using VirtualBox, which enables me to just un-tick and re-tick the network device.

Now POST a message again and... the worker on VM B will not receive it, while the one on VM A still does. B will not receive any messages until you restart its Docker container.

EDIT: There is another way of recreating the connection... kind of. If you go inside the container (docker-compose exec worker /bin/bash) and start the worker again / a second time (node worker.js), then this new instance will be able to receive published messages. But this instance will also stop receiving them if you do the dis/reconnect again afterwards.

So, whatever ZeroMQ does to hold the connection breaks apart once the host of its Docker container has a network hiccup.
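
Since a freshly started worker receives messages again, one possible workaround (not a fix) would be to let the worker notice the silence itself and recreate its socket. Below is a minimal sketch of the bookkeeping for that. It assumes the publisher sends something at least every few seconds, for example on a dedicated heartbeat topic, which the server.js above does not do. The class name SilenceWatchdog and its API are made up for illustration.

```javascript
// Hypothetical helper: remembers when the last message arrived and tells
// whether the subscription has been silent for longer than allowed.
class SilenceWatchdog {
  constructor(maxSilenceMs, now = Date.now) {
    this.maxSilenceMs = maxSilenceMs;
    this.now = now; // injectable clock, handy for tests
    this.lastSeen = now();
  }

  // Call this for every received message.
  touch() {
    this.lastSeen = this.now();
  }

  // True when nothing was received for more than maxSilenceMs.
  isSilent() {
    return this.now() - this.lastSeen > this.maxSilenceMs;
  }
}
```

In the worker, touch() would go into the receive loop and isSilent() would be checked from a setInterval; once it returns true, the old socket is closed and a new Subscriber is created with the same connect and subscribe calls.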

What's the actual result? (include assertion message & call stack if applicable)

I am sorry, the result is "nothing happens at the subscribers side"... so I do not have anything to put here.
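
To have at least something to put here, the connection events of the socket could be logged. The following is a minimal sketch, assuming zeromq 6 for Node.js, where a socket has an events observer that can be iterated asynchronously and yields objects with a type and, for most events, an address. The function names and the log format are made up for illustration.

```javascript
// Formats one event object from the socket's event observer as a log line.
// Assumed event shape (zeromq 6 for Node.js): { type, address?, interval?, error? }.
function formatSocketEvent(event, now = new Date()) {
  const parts = [now.toISOString(), event.type];
  if (event.address) parts.push(event.address);
  if (event.interval !== undefined) parts.push(`retry in ${event.interval} ms`);
  if (event.error) parts.push(`error: ${event.error.message}`);
  return parts.join(" | ");
}

// Logs connection events for as long as the observer yields them.
// `sock` is expected to be a zeromq 6 socket, e.g. the Subscriber in worker.js.
async function watchSocketEvents(sock, log = console.log) {
  for await (const event of sock.events) {
    log(formatSocketEvent(event));
  }
}
```

Calling watchSocketEvents(sock) without awaiting it, before the receive loop in main(), should print events such as connect or disconnect. If nothing at all is logged during and after the network dis/reconnect, that would at least hint that the socket never noticed the connection was gone.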

What's the expected result?

The subscriber either manages to reconnect after a while, or it does not lose the connection at all. Published messages will reach the subscriber on the dis/reconnected Docker host.

How was this noticed?

This project: https://github.com/DeviceFarmer/stf. It uses zeromq 5.2.8 on Node.js 17.4 to allow communication between the different services that make it up. If one of the secondary servers encounters a network outage of some kind (however short), the Android phones connected to it cannot be used any more, as the ZeroMQ subscriber running there will no longer receive messages from the main server.

At least, we were able to replicate the issue there with the network dis/reconnect, and in the end it turned out to be caused by ZeroMQ, as it seems.