zeromq / libzmq

ZeroMQ core engine in C++, implements ZMTP/3.1
https://www.zeromq.org
Mozilla Public License 2.0

Messages are never received on xpub socket after the message queue becomes full #4538

Closed GergoTot closed 1 year ago

GergoTot commented 1 year ago

Issue description

We are facing an issue with sending messages on 'xpub' sockets (publish-subscribe pattern implementation). Here are some details:

We are using cpp-zmq over the zmq library and IPC sockets under Linux. 

We have a Broker service which manages different communication patterns between multiple services. One of the supported patterns is pub-sub: Publisher services publish messages on topics, the Broker service receives them and forwards them through xpub sockets to the Subscriber services (if they are subscribed to the given topic). We use the 'xpub' socket_type for the Broker and the 'sub' socket_type for the Subscriber services.

We have implemented a heartbeat mechanism between the Broker and Subscriber services:

The Broker publishes a special heartbeat message (with a specific topic) at a regular interval. If a Subscriber service doesn't receive this heartbeat message within 3x the heartbeat interval, it considers the connection to the Broker lost.
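The Subscriber-side timeout rule described above can be sketched as follows. This is a minimal illustration using only the C++ standard library, not our actual code: the class name `HeartbeatMonitor` and its methods are hypothetical, and the choice to treat exactly 3x the interval as "lost" is an assumption.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not part of libzmq or cppzmq): tracks when the last
// heartbeat arrived and reports the connection as lost after 3x the interval.
class HeartbeatMonitor {
public:
    using Clock = std::chrono::steady_clock;

    HeartbeatMonitor(std::chrono::milliseconds interval, Clock::time_point now)
        : timeout_(interval * 3), last_seen_(now) {}

    // Call whenever a message on the heartbeat topic is received.
    void on_heartbeat(Clock::time_point now) { last_seen_ = now; }

    // True once at least 3x the heartbeat interval has elapsed
    // since the last heartbeat (assumed boundary: inclusive).
    bool connection_lost(Clock::time_point now) const {
        return now - last_seen_ >= timeout_;
    }

private:
    std::chrono::milliseconds timeout_;
    Clock::time_point last_seen_;
};
```

In a receive loop, the Subscriber would call `on_heartbeat(Clock::now())` for each heartbeat message and check `connection_lost(Clock::now())` after each poll timeout. Passing the time in explicitly keeps the logic testable without sleeping.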

The problem occurs only occasionally (it is not deterministic at all) when we run performance tests under heavy load. After a while the Broker can no longer send heartbeat messages because the message queue becomes full. From this point on, our Subscriber services never receive heartbeat messages again (sending the heartbeat message in the Broker always fails with the EAGAIN error code).

All other messages published by the Publisher services (on different topics) are successfully forwarded by the Broker (using the same socket) to the Subscriber services; only the heartbeat messages fail.

We have also patched the zmq library and cpp-zmq with some additional logs, but these only showed that the message queue is full (HWM reached). We use the default HWM of 1000. 2000 messages have been written but only 1000 have been read, and these values never change once it gets stuck.
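To make the counters above concrete, here is a toy model of a bounded queue that refuses writes once the gap between written and read messages reaches the HWM. It is only a sketch under assumptions: `PipeModel` and its methods are hypothetical names, it is not libzmq's actual pipe implementation, and a refused write merely stands in for the EAGAIN we see on send.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only (hypothetical, not libzmq code): a queue is "full" when
// the number of written-but-unread messages reaches the high-water mark.
class PipeModel {
public:
    explicit PipeModel(std::uint64_t hwm) : hwm_(hwm) {}

    bool full() const { return written_ - acked_reads_ >= hwm_; }

    // Non-blocking write: returns false (think EAGAIN) when the queue is full.
    bool try_write() {
        if (full()) return false;
        ++written_;
        return true;
    }

    // The reader reports that it has consumed n more messages.
    void ack_reads(std::uint64_t n) {
        acked_reads_ += n;
        if (acked_reads_ > written_) acked_reads_ = written_;  // cannot read more than was written
    }

    std::uint64_t written() const { return written_; }
    std::uint64_t acked_reads() const { return acked_reads_; }

private:
    std::uint64_t hwm_;
    std::uint64_t written_ = 0;
    std::uint64_t acked_reads_ = 0;
};
```

With an HWM of 1000, the state "2000 written, 1000 read" is exactly full in this model: every further write is refused until the reader side reports more reads, which would match a queue whose read counter has stopped advancing.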

At this point we tried to find a way out of this situation. First we restarted only the Subscriber services one by one, but this didn't solve the issue. When we restarted the Broker service, the heartbeat mechanism started to work again.

I was not able to reproduce this situation with simple example code. I tried to 'overload' the socket with lots of messages in a short timeframe, but after some time the subscriber sockets started to receive messages again (as they should).

Do you see any conceptual problem with our heartbeat mechanism? Do you have any idea how this situation can happen? How is it possible that only one topic gets stuck?

Environment

Minimal test code / Steps to reproduce the issue

We couldn't reproduce it yet.

What's the actual result? (include assertion message & call stack if applicable)

What's the expected result?

bluca commented 1 year ago

https://zguide.zeromq.org/docs/chapter5/

GergoTot commented 1 year ago

I have already read chapter 5 and I didn't find any explanation of why messages for only one specific topic get stuck in an xpub message queue forever, while messages belonging to other topics are forwarded correctly on the same socket from the same publisher to the same subscriber. Maybe I missed something. Can you please explain which part of chapter 5 you mean exactly?

For example, it's not a slow subscriber situation, since processing these messages takes only a few milliseconds. Could you explain in more detail why messages belonging to only one specific topic are never received, even though the subscriber is able to receive messages belonging to other topics, so it is available and responsive?