Open lukaszsamson opened 4 years ago
Possibly related to https://github.com/zeromq/libzmq/issues/3596
Strangely the issue stopped happening with no changes in code. I remember that around that time our administrators reconfigured/updated our company VPN. My guess is that some bug in OpenVPN was the culprit here.
The issue reappeared on our production environment (Linux). The test environment, which has slightly less load and fewer connections to other zmq endpoints, is not affected. The crash happens once or twice per day, so it's pretty rare.
I can reproduce it reliably on current master. Steps:
My initial investigation indicates that when the heartbeat timeout fires, zmq::session_base_t::engine_error is called and a new instance of zmq::stream_engine_base_t is created with _input_stopped set to false. I can see that zmq::stream_engine_base_t::in_event_internal is called regularly, but it never reaches the line _input_stopped = true;.
Later on, when the process is resumed and it calls recv, zmq::stream_engine_base_t::restart_input is attempted, which triggers the assert.
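To make the suspected sequence easier to follow, here is a toy model of it. This is not libzmq source code: ToyEngine, ToySession, and the restart_pending flag are hypothetical names invented for illustration, and the idea that a restart request outlives the engine it was meant for is only my reading of the behaviour described above.

```cpp
// Toy model of the suspected race. NOT libzmq code; all names are hypothetical.
#include <memory>

// Stand-in for zmq::stream_engine_base_t, reduced to the one flag in question.
struct ToyEngine {
    bool input_stopped = false;  // mirrors "_input_stopped = false" in the ctor

    // Models the engine pausing reads because the session cannot accept more data.
    void stop_input() { input_stopped = true; }

    // Models restart_input(): returns false where the real code would assert.
    bool restart_input() {
        if (!input_stopped) return false;  // "Assertion failed: _input_stopped"
        input_stopped = false;
        return true;
    }
};

// Stand-in for the session that owns the engine.
struct ToySession {
    std::unique_ptr<ToyEngine> engine = std::make_unique<ToyEngine>();
    bool restart_pending = false;  // assumption: a restart request can outlive its engine

    void engine_stops_input() {
        engine->stop_input();
        restart_pending = true;
    }

    // Models engine_error() after a heartbeat timeout: a fresh engine is created.
    void replace_engine() { engine = std::make_unique<ToyEngine>(); }

    // Models recv() freeing space and asking the current engine to resume reading.
    bool recv_triggers_restart() {
        if (!restart_pending) return true;
        restart_pending = false;
        return engine->restart_input();
    }
};
```

In this model, stopping input and then calling recv works, but stopping input, replacing the engine, and then calling recv hits the failing branch, which matches the order of events I see in the logs.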
The bug is definitely related to #3596
Edit: The above steps reproduce the bug 100% of the time in a simple Erlang app using erlzmq bindings. However, I wasn't able to reproduce it in a simple C program. In the Erlang program, when I hit Ctrl+C I see the following sequence of calls:
BREAK: (a)bort (A)bort with dump (c)ontinue (p)roc info (i)nfo
(l)oaded (v)ersion (k)ill (D)b-tables (d)istribution
cmd 5 (zmq::object_t::process_command)
recv returns success
calling recv
recv returns success
cmd 6 (zmq::object_t::process_command)
setting _input_stopped = true
(and 5 second later)
cmd 1 (zmq::object_t::process_command)
cmd 2 (zmq::object_t::process_command)
setting _input_stopped = false in ctor
cmd 3 (zmq::object_t::process_command)
cmd 11 (zmq::object_t::process_command)
cmd 12 (zmq::object_t::process_command)
cmd 13 (zmq::object_t::process_command)
(after resume)
calling recv
recv returns success
calling recv
cmd 6 (zmq::object_t::process_command)
Assertion failed: _input_stopped (src/stream_engine_base.cpp:419)
In the C program, the part after (and 5 second later) is never printed. It may be that the Erlang virtual machine does something nasty that breaks zmq assumptions. Any ideas how I can debug it further @bluca @brettviren?
Is there any update or plan to fix this issue? I am facing the same error after enabling zmq heartbeats on both the pub and sub sides.
I'm also experiencing this on 4.3.5 with heartbeats on both sides of pub/sub
Pub/sub sockets are unusable with heartbeats (in our case with the Erlang bindings). Since I opened this issue 4 years ago, we have disabled heartbeats and instead implemented a custom heartbeat protocol over ordinary pub messages. The pub server emits heartbeat messages every few seconds. The clients listen for heartbeats and, if they don't receive them in time, close the socket and reconnect. This workaround made the connections much more stable, and zmq asserts are no longer crashing our production servers.
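For anyone wanting to try the same workaround, the client-side timing logic can be sketched roughly as below. This is a minimal, transport-agnostic sketch and not our production code: HeartbeatWatchdog and its methods are hypothetical names, and it does not touch any zmq API. The caller passes in the current time, which keeps the logic easy to test.

```cpp
// Hypothetical helper for an application-level heartbeat; not part of libzmq or erlzmq.
#include <chrono>

class HeartbeatWatchdog {
public:
    using Clock = std::chrono::steady_clock;

    // timeout: how long the client tolerates silence before reconnecting.
    HeartbeatWatchdog(std::chrono::milliseconds timeout, Clock::time_point now)
        : timeout_(timeout), last_seen_(now) {}

    // Call whenever a heartbeat message arrives on the sub socket.
    void on_heartbeat(Clock::time_point now) { last_seen_ = now; }

    // True once no heartbeat has been seen for longer than the timeout.
    bool expired(Clock::time_point now) const { return now - last_seen_ > timeout_; }

    // Call after closing the socket and reconnecting, to start a fresh window.
    void reset(Clock::time_point now) { last_seen_ = now; }

private:
    std::chrono::milliseconds timeout_;
    Clock::time_point last_seen_;
};
```

In the receive loop, the client would call on_heartbeat for each heartbeat message and check expired after every poll timeout; when it returns true, the client closes the socket, reconnects, and calls reset. The timeout should be a few times the server's heartbeat interval so that a single delayed message does not force a reconnect.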
Issue description
Assertion failed: _input_stopped (stream_engine.cpp:467)
Environment
Minimal test code / Steps to reproduce the issue
The issue happens nondeterministically, so there are no direct reproduction steps. I have a process with several threads, each running one dedicated DEALER socket. Every socket is in the same zmq context, is connected to the same endpoint (ROUTER) over TCP transport, and uses an asynchronous request-response pattern. Each socket thread calls zmq_poll (on that one socket) and alternates zmq_send/zmq_recv with ZMQ_DONTWAIT. The endpoint is generally slow at consuming messages and sending responses, and the responses may be produced in a different order or not arrive at all. Each socket is created and used on only one thread.
What's the actual result? (include assertion message & call stack if applicable)
Crashing thread stack trace
All threads
What's the expected result?
No crash