zeromq / libzmq

ZeroMQ core engine in C++, implements ZMTP/3.1
https://www.zeromq.org
Mozilla Public License 2.0
9.64k stars 2.35k forks source link

sending data on client DEALER socket hangs after 30 seconds in 4.3.3 unless I set ZMQ_HANDSHAKE_IVL to 0 #4147

Open vashek opened 3 years ago

vashek commented 3 years ago

Issue description

TLDR: Sending data on a DEALER mode client socket hangs after 30 seconds. The workaround is to set ZMQ_HANDSHAKE_IVL to 0. This started in 4.3.3 (it didn't occur in 4.3.2).

Long version: I have a Python script using pyzmq that connects to a proprietary system using a DEALER mode socket, keeps the socket open and occasionally exchanges some messages. It runs on Windows and works fine with pyzmq 19.0.2 (with libzmq 4.3.2) but fails with pyzmq 20.0 (with libzmq 4.3.3). However, another similar script on Linux on ARM works fine with pyzmq 18.0.1 and 4.3.2 but fails with the same pyzmq after updating to libzmq 4.3.3, so that seems to leave only libzmq 4.3.3 as the apparent culprit. The counterpart is written in Java and uses jeromq version 0.3.4.

The symptom is that apparently after 30 seconds after the socket is opened, it stops sending any data. socket.send_multipart hangs for 30 seconds (which is my SNDTIMEO) and then throws zmq.error.Again (EAGAIN).

Inspired by #4076 which describes a similar issue, I tried setting ZMQ_HANDSHAKE_IVL to 0 and the script started working again. Again, though, it works fine without setting this socket option with libzmq 4.3.2.

I am by no means an expert on ZeroMQ. For all I know, maybe setting ZMQ_HANDSHAKE_IVL to 0 was always the correct thing to do. But, why did it it work in 4.3.2 and stopped working in 4.3.3? I found no mention of anything apparently related in the change log.

Environment

Minimal test code / Steps to reproduce the issue

import zmq
context = zmq.Context.instance()
socket = context.socket(zmq.DEALER)
socket.setsockopt(zmq.IDENTITY, "some_guid")
socket.setsockopt(zmq.IMMEDIATE, 1)
socket.setsockopt(zmq.LINGER, 0)
# no auto reconnection because we need to re-establish the session on the application level
socket.setsockopt(zmq.RECONNECT_IVL, -1)
socket.setsockopt(zmq.SNDTIMEO, 30000)
socket.setsockopt(zmq.RCVTIMEO, 5000)
socket.setsockopt(zmq.CONNECT_TIMEOUT, 15000)
socket.setsockopt(zmq.HANDSHAKE_IVL, 0)  # fix on 4.3.3
socket.connect("tcp://localhost:5555")
...do some comms...
socket.send_multipart((b"", b"some data", b"some more data"))
...wait up to 30 seconds...
socket.send_multipart((b"", b"some data", b"some more data"))
...results in zmq.error.Again after 30 seconds...

What's the actual result? (include assertion message & call stack if applicable)

Sending stuck.

What's the expected result?

Sending works.

bluca commented 3 years ago

Can you do a bisect to figure out the commit that introduced a regression and send a fix/revert?

vashek commented 3 years ago

Yep, found it, it's this one: c04f6581e0cd07794e00941ccf5b8b2e9b6a528d

Here's my git bisect log:

git bisect start
# good: [a84ffa12b2eb3569ced199660bac5ad128bff1f0] Finalize changelog for 4.3.2
git bisect good a84ffa12b2eb3569ced199660bac5ad128bff1f0
# bad: [04f5bbedee58c538934374dc45182d8fc5926fa3] Finalize changelog for 4.3.3
git bisect bad 04f5bbedee58c538934374dc45182d8fc5926fa3
# good: [81d522bb661470ba415d3e42376c90475d7f6499] Refer to zmq_socket because there are more thread-safe sockets than just client/server
git bisect good 81d522bb661470ba415d3e42376c90475d7f6499
# good: [60ccf54fa632a9ff1547302f5dfe2ca6af5af3f7] Problem: sub/cancel broken with CURVE
git bisect good 60ccf54fa632a9ff1547302f5dfe2ca6af5af3f7
# bad: [1ddfeb56ec4fb355160daa0dc750d05494e6d54e] Problem: formatting errors
git bisect bad 1ddfeb56ec4fb355160daa0dc750d05494e6d54e
# good: [afacdbeccf3598e2409c4be24ffc62067aa5b6f0] Merge pull request #3913 from somdoron/delay_bind
git bisect good afacdbeccf3598e2409c4be24ffc62067aa5b6f0
# good: [2e87390656dbc66309501dbaf6fcd7521234ef37] Initialize some CMake variables as empty before build
git bisect good 2e87390656dbc66309501dbaf6fcd7521234ef37
# good: [c8b141d9649b0a4e3ba33654372234aa09d7be94] Merge pull request #3955 from bluca/thread_doc
git bisect good c8b141d9649b0a4e3ba33654372234aa09d7be94
# good: [522abc73766364d176d7f97ba544cf38fd3f79bb] Merge pull request #3959 from bluca/fuzzers
git bisect good 522abc73766364d176d7f97ba544cf38fd3f79bb
# good: [d0d23446f5797364c885c7fdc982e5a1efe616fa] Merge pull request #3962 from sigiesec/fix-clang-tidy-analyze
git bisect good d0d23446f5797364c885c7fdc982e5a1efe616fa
# bad: [c04f6581e0cd07794e00941ccf5b8b2e9b6a528d] rebase reconnect-redux on master (#3960)
git bisect bad c04f6581e0cd07794e00941ccf5b8b2e9b6a528d
# good: [c7aef56048e5221fb5de047df04f282409c1eef7] - resolve FPE when ZMQ_RECONNECT_IVL == 0
git bisect good c7aef56048e5221fb5de047df04f282409c1eef7
# first bad commit: [c04f6581e0cd07794e00941ccf5b8b2e9b6a528d] rebase reconnect-redux on master (#3960)
jamescasbon commented 3 years ago

I also see this problem. The server is running zmq 3.2.5.