antonymayi closed 6 years ago
@eponsko any idea what could be the issue?
same problem with just REQ-REP...
Hm, I think you've run into a kernel bug: there was a race condition that occurred if the latency between the two TIPC sockets was too low (it should be fixed in kernel 4.16). In kernels 4.8 through 4.15, the internal handling in the TIPC module does not mark the underlying TIPC socket as readable (or maybe it was writable) when the TIPC handshake completes too quickly. There are two ways to check if this is the cause:
Try a kernel newer than 4.15; mainline builds can be downloaded from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16/
Apply this patch to libzmq, compile libzmq, and try again. I'm not sure whether the pyzmq bindings will pick up the freshly compiled library, however. The patch adds a 10 ms delay to the handshake regardless of whether it's needed, so it's not a nice solution.
--- a/src/tipc_connecter.cpp
+++ b/src/tipc_connecter.cpp
@@ -220,6 +220,10 @@ int zmq::tipc_connecter_t::open ()
     int rc = ::connect (s, addr->resolved.tipc_addr->addr (),
                         addr->resolved.tipc_addr->addrlen ());
+    // TODO: Figure out why this happens in the kernel and remove this!
+    // Try to handle race-condition in TIPC kernel > v4.10
+    usleep(10*1000);
+
     // Connect was successful immediately.
     if (rc == 0)
         return 0;
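On the pyzmq question: a quick way to see which libzmq the bindings actually load is to ask pyzmq itself. This is only a sketch. `describe_libzmq` is a name made up here, and it assumes a pyzmq recent enough to provide `zmq.has`. Prebuilt pyzmq wheels usually bundle their own libzmq, so a patched system library would not be picked up unless pyzmq is built against it.

```python
def describe_libzmq():
    """Return version info for the libzmq that pyzmq loads, or None without pyzmq."""
    try:
        import zmq  # imported lazily so the sketch also runs where pyzmq is missing
    except ImportError:
        return None
    return {
        "pyzmq": zmq.pyzmq_version(),   # version of the Python bindings
        "libzmq": zmq.zmq_version(),    # version of the library actually loaded
        "tipc": zmq.has("tipc"),        # True if that libzmq was built with TIPC support
    }


if __name__ == "__main__":
    print(describe_libzmq())
```

If the reported libzmq version does not match the one you just compiled, the bindings are still using a different copy of the library.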
The delayed handshake is not nice, so IMHO if the kernel bug is confirmed and has already been fixed, we can simply document it. We could also ask that the fix be backported to the 4.9 and 4.14 LTS branches.
Yeah, I guess that's the better solution. I can try to find the commit that fixed it; unfortunately I no longer have access to the email discussion, so it will take some time.
great, thanks!
There are also ongoing efforts to change the TIPC address structure from the current hierarchical Z.C.N format to a flat 128-bit value (IIRC). I think it will be backwards compatible, but that's another issue that might come up here soon.
Here's the commit that fixed the issue: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/tipc?id=517d7c79bdb39864e617960504bdc1aa560c75c6
tried with 4.16 kernel, works great, thanks!
That commit has also been backported to the 4.14 LTS. 4.9 LTS is not affected.
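For anyone landing here later, a rough way to tell whether a machine runs a kernel in the range discussed above (4.8 through 4.15) is sketched below. The helper names are made up for illustration, and the result is only a hint: as noted, the fix was backported to 4.14 LTS and 4.9 LTS is reported as not affected, so a release string inside the range does not prove the bug is present.

```python
import platform
import re

# Range taken from this thread; treat it as a hint, not a guarantee.
SUSPECT_FIRST = (4, 8)
SUSPECT_LAST = (4, 15)


def parse_release(release):
    """Extract (major, minor) from a kernel release string such as '4.13.0-36-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def in_suspect_range(release):
    """Return True if the release falls inside the 4.8 to 4.15 range from this thread."""
    version = parse_release(release)
    if version is None:
        return False
    return SUSPECT_FIRST <= version <= SUSPECT_LAST


if __name__ == "__main__":
    release = platform.release()
    print(release, "suspect" if in_suspect_range(release) else "outside suspect range")
```

The final check remains the one used in this thread: rerun the failing test on a 4.16 or newer kernel.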
Issue description
TIPC multicast should work within a single host (tipcutils includes a demo that proves it), but it doesn't seem to work, or there is no way to enable it, with ZeroMQ PUB/SUB: it works fine over the network but not locally between two processes.
Environment
Minimal test code / Steps to reproduce the issue
What's the actual result? (include assertion message & call stack if applicable)
The subscriber hangs without receiving anything.
What's the expected result?
The subscriber receives the foo message.