minrk closed this issue 4 years ago.
Which socket type are you using?
There will be several SUB and DEALER sockets in the dying process.
Can it be something like firewall? Is it a virtual machine or PC?
It could be a firewall, I don't have access to the failing machine. I believe it is a real machine (cc @jveitchmichaelis for more details).
It's a PC - local notebook.
A few questions:
Sorry - I meant the Jupyter notebook is running as a local server. It's a desktop PC. The Windows firewall is on.
The time varies - I'll try measuring it properly though. It happens both when idle and in the middle of running code (at least so it seems). My desktop is set not to sleep, so I don't think it's that.
I'll try:
1. Jupyter with no notebook open
2. Idle notebook
3. Notebook in an infinite loop
I don't think the particular piece of code matters; the crashes seem arbitrary.
Can you try with the firewall off as well?
Yep will do, I'll see if turning on more verbose debugging in Jupyter throws up anything as well.
Hi, we have the same situation.
Environment:
- Windows 10 x64
- both ZMQ 4.0.4 and 4.1.4 x86
- Compiler: VS2013 x86 build
Reproduction:
- Always crashes 2 hours after server and client connect.
- Error messages:
- Assertion failed: Connection reset by peer (......\src\signaler.cpp:298) or
- Assertion failed: ok (......\src\mailbox.cpp:82)
We have tested this on three computers, two of which show this crash. Can anyone help us?
Recently someone changed the signaler to use a random port; this might make the firewall work harder. I'm not sure whether it's part of 4.1.4 or 4.0.4, but I suggest compiling with the following commit reverted:
https://github.com/zeromq/libzmq/commit/7e09306cb369f4c345850627f3969be153eaa3cf
Or, bottom line: in config.hpp make sure signaler_port is set to 5905 and not 0.
Also, in the firewall, make sure to allow TCP connections on port 5905 for the application.
Let me know if this helps.
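For reference, the setting in question looks roughly like this in config.hpp (an illustrative excerpt; the exact surrounding code may differ between versions):

```cpp
// config.hpp (illustrative excerpt, not the exact libzmq source).
// This is the port the signaler's loopback TCP connection binds to on
// Windows. 0 means "let the OS pick an ephemeral port"; pinning it to
// 5905 restores the pre-7e09306 behaviour and gives you a fixed port
// to whitelist in the firewall.
enum
{
    signaler_port = 5905
};
```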
OK! Thanks for your suggestion, we will try it.
We also found another problem: when we ran only the zmq::socket->connect side of the ZMQ program, it didn't crash after 2 hours (because it was not connected?), but it crashed when we closed the zmq::socket and zmq::context. If we close the program before the 2-hour mark, it works fine. It showed the error: Assertion failed: Connection reset by peer (......\src\signaler.cpp:252)
We need to make this configurable, since reusing the same port leads to different problems. Hence that patch.
[pic1] [pic2]
Hi, we only ran the built-in "local_thr.exe" test program and didn't run the client program "remote_thr.exe". After 2 hours, we found that two TCP port states had changed from "ESTABLISHED" to "FIN_WAIT1", as you can see in [pic1]. A while later, the "local_thr.exe" test program crashed, as [pic2] shows. Why does it always happen after 2 hours? Thanks a lot.
Did you try the same test with the firewall disabled?
Yes, I have run some tests with the firewall disabled, but it still crashed after 2 hours. I haven't tested "local_thr.exe" with the firewall disabled yet; I'll try it afterward.
We have tested "local_thr.exe" with firewall disabled, and it crashed after 2 hours. (2016/2/26)
@JimChenTaiwan are you using Windows 10 as well?
Yes, I tested it on Windows 10 x64.
I'm seeing what I think is identical behaviour here as well.
Running the following:
f9c8687
After almost exactly two hours the following assert fires:
Assertion failed: Connection reset by peer (..\..\..\..\src\signaler.cpp:351)
Similar behaviour occurs if compiled with poll enabled rather than select:
Assertion failed: pfd.revents & POLLIN (..\..\..\..\src\signaler.cpp:248)
Running the same build of our software on Windows 7 doesn't demonstrate this issue.
I'm using inproc connections. Running inproc_thr (with a Sleep call added at the bottom of the for loop in worker to slow things down enough to hit the 2-hour mark), invoked as inproc_thr 64 1000000, reproduces the issue: after two hours the assert above fires.
Additionally:
@mseagrief does it matter if the socket is idle or not?
It seems like an issue with Windows 10. I tried to google it without much success; can you open an issue with Microsoft?
@somdoron I'll do some tests today with the socket idle and report back.
As for reporting to Microsoft, I'm not sure what I'd be reporting; I'm not sufficiently familiar with the ZeroMQ internals to make a useful bug report. That said, I'll spend some time over the next few days doing so and see if I can isolate the issue.
ZeroMQ 3.2.5 doesn't appear to exhibit this behaviour.
Just running zmq_init and then sleeping for 3 hours doesn't crash. However, 5 minutes after starting to use the sockets (via inproc_thr), the assert fires.
Running zmq_init, then creating and binding a PULL socket, and then sleeping asserts after 2 hours.
It looks like the code added to address #1608 causes this behaviour to be triggered under Windows 10. Both the poll and select cases appear to work without triggering the assert with the code block in signaler.cpp commented out.
As for what the underlying issue is I'm unsure. I tried that code as it was the most significant change I could see in signaler.cpp comparing 3.2.5 against the current git master.
Having the same issue. An easy way to reproduce this problem is to have a large number of subscribers active and then simulate a congested/broken network using a tool called clumsy (I used ver 0.2). In the tool, enable "drop" and "out of order" and increase the percentages. Depending on your network load, it should crash eventually.
I am seeing a lot of these: Assertion failed: nbytes == sizeof (dummy) in signaler.cpp:364.
We use jupyter qtconsole as an embedded Python console in our application, and many of our users have faced the same bug, leading to application crashes. It happens even when the application is idle.
Affected OS:
Problem encountered with ZeroMQ 4.1.5 (PyZMQ 15.4.0) and ZeroMQ 4.1.6 (PyZMQ 16.0.2).
Error message: Connection reset by peer
Stacktrace:
libzmq.cp35-win_amd64.pyd!zmq::zmq_abort(const char * errmsg_) Line 85 C++
libzmq.cp35-win_amd64.pyd!zmq::signaler_t::recv() Line 303 C++
libzmq.cp35-win_amd64.pyd!zmq::mailbox_t::recv(zmq::command_t * cmd_, int timeout_) Line 93 C++
libzmq.cp35-win_amd64.pyd!zmq::socket_base_t::process_commands(int timeout_, bool throttle_) Line 1054 C++
libzmq.cp35-win_amd64.pyd!zmq::socket_base_t::recv(zmq::msg_t * msg_, int flags_) Line 955 + 0xd bytes C++
libzmq.cp35-win_amd64.pyd!zmq_msg_recv(zmq_msg_t * msg_, void * s_, int flags_) Line 601 + 0xe bytes C++
Is there any progress?
@PolarNick239 for my use case I reverted out the changes in #1608 and compiled up my own binaries. I've been successfully using them for some time now, albeit with fairly light workloads.
It's on my list to re-visit this and bump up the version of ZeroMQ we're using to see if it's resolved, but not high priority at the moment as it is working.
I spent a fair while digging around with various low-level tracing tools in Windows, but found nothing conclusive; they're not tools I've used before, so I wasn't 100% certain I was looking in the correct place.
@mseagrief Thanks! I will run a test with #1629 reverted and write my results here when I have time.
Help! Ports are occupied; who is connecting? Port 12580 is occupied: https://github.com/zeromq/libzmq/issues/2598
I am having this problem too: consistent crashes at around 2 hours, running PUB/SUB exclusively. It may be associated with simultaneous use of websockets for a WSS connection to a remote host in the app running the ZMQ PUB server (a C# app using websocket-sharp and the ZMQ C# CLR binding). Disabling websocket-sharp reduces the frequency of events; websocket-sharp with no ZMQ operates 24/7 with no events.
ZMQ is unusable unless this problem is resolved; our apps need to run 24/7 with no unresolvable events. I'm currently looking for alternative methods of interprocess communication to take the place of the ZMQ functionality. This thread is 1.5 years old with no suggested resolution, which is very unacceptable for a problem so consistently reproduced.
Other info: Windows 10 64-bit, Windows 7 64-bit, VS2017/NuGet. The assertion error always occurs in signaler.cpp, at a variety of line numbers, on both the PUB and SUB sides simultaneously, and all apps crash to the OS simultaneously. There is no way to encapsulate it in a try-catch: the assertion error is not escalated up the call stack and as such cannot be contained.
As a follow-up to the comment I made 5 days ago: I have completed my search for an alternate MQ and have successfully implemented the same process with RabbitMQ. I have not had to encapsulate anything in a try-catch; it simply works as expected, with no assertion errors. I now have 4 simultaneous applications running, with 1 PUB and 3 SUB endpoints in each, all intercommunicating at 5 ms intervals for a period of days and still running, even through random externally induced TCP communication dropouts. I urge everyone to stop wasting their time with ZeroMQ. ZeroMQ is damaged with no prospect of resolution. ZeroMQ has excellent documentation, but that's as far as it goes. Regretfully, I'm abandoning ZeroMQ.
Thanks for the tip re: RabbitMQ. I'll take a look, I've been putting off upgrading our ZMQ binaries for a long while now.
We're still running locally built binaries with #1629 removed without issue. But I wholly agree about assertions like these firing in production code being deeply broken.
The move to RabbitMQ I reported 17 days ago has been so successful I feel compelled to report it: ZERO, truly ZERO, errors since the last report. It's been nearly a month of 24/7 messaging between all processes, with millions, perhaps billions, of flawless object-messaging transactions.
The code locations of the reported assertions are:
- https://github.com/zeromq/zeromq4-1/blob/575da3ec7ad2cdb85386d7ec2459b550b884c332/src/signaler.cpp#L181 corresponding to https://github.com/zeromq/libzmq/blob/9c8844fd0840b699194ac89dca9967b6e8d32dec/src/signaler.cpp#L192
- https://github.com/zeromq/zeromq4-1/blob/575da3ec7ad2cdb85386d7ec2459b550b884c332/src/signaler.cpp#L298 corresponding to https://github.com/zeromq/libzmq/blob/9c8844fd0840b699194ac89dca9967b6e8d32dec/src/signaler.cpp#L313
- https://github.com/zeromq/zeromq4-1/blob/575da3ec7ad2cdb85386d7ec2459b550b884c332/src/signaler.cpp#L303 corresponding to https://github.com/zeromq/libzmq/blob/9c8844fd0840b699194ac89dca9967b6e8d32dec/src/signaler.cpp#L318
- https://github.com/zeromq/zeromq4-1/blob/575da3ec7ad2cdb85386d7ec2459b550b884c332/src/mailbox.cpp#L94 corresponding to https://github.com/zeromq/libzmq/blob/9c8844fd0840b699194ac89dca9967b6e8d32dec/src/mailbox.cpp#L99
The first location, in signaler_t::send, was changed twice ca. 6 months ago, by #2360 and #2362. The resulting state is quite contradictory, as #2362 was meant to revert #2360 but left unreachable code.
Could you please send a fix for it?
@bluca: I can send a fix to remove the unreachable code, but I think there was a point in the original #2360 PR that is related to the issue reported in the ticket. Probably, the send loop should actually be continued at least for some error codes.
@UniversalAE It's great to hear that you found a working solution for your use case. Please be aware that ZeroMQ and RabbitMQ are not easily comparable: they have different strengths and weaknesses for a particular problem and operate at different levels of abstraction. RabbitMQ is a message broker, while ZeroMQ is a message queuing library (which can be used to build a broker). Of course, it would be great to resolve the apparent bug in ZeroMQ reported in this issue as soon as possible.
Sigiesec, I agree about the difference in abstraction between the two products, and I actually prefer the ZeroMQ approach. But in its current state ZeroMQ is a severely damaged product. I need it to operate without error 24/7, ad infinitum; ZeroMQ operates for 2 hours and then fails with an unrecoverable crash to the OS, all emanating from signaler.cpp. For over a year and a half others have reported the same problem. ZeroMQ is simply unusable in this state; I am not prepared, nor do I have time, to debug other people's code, and I have moved on. The fact that this has been a problem for so long, reported by so many, is in itself deplorable.
I had the same issue but resolved it. I have anaconda 5.0.1 (pyzmq 16.0.2) installed on Windows 10 x64. After running Jupyter notebook, it always crashed in 2 hours with assertion failure errors.
The cause in my case seems to be related to a TDI filter (though I don't fully understand what that is). I uninstalled Networx and now it's working perfectly. This comment helped me identify the cause.
Hope this helps other people.
Uninstalling Networx has fixed this issue for me too. Thank you!
This happens also in libzmq-4.2.2, Windows build.
Assertion failed: Connection reset by peer (........\src\signaler.cpp:352)
Assertion failed: Connection reset by peer (........\src\signaler.cpp:352)
Assertion failed: Connection reset by peer (........\src\signaler.cpp:352)
Unfortunately, a library like this that terminates an application because of a connection reset by peer is unusable.
Is this behavior the same on posix platforms?
@crosscode-nl try the latest code from master; there have been a few fixes in that area. Also check for that Networx program others are mentioning, as it seems to be messing with the platform.
The latest master branch does not solve this issue, and no software like Networx is installed.
It is installed in a VMware environment, though. I noticed that localhost connections sometimes get killed (or even blackholed).
To test zeromq after these issues I now kill all tcp connections with sysinternals tcpview. With the knowledge that (even local/loopback) TCP connections can get removed by the OS or other software, I believe it should be able to recover from this, or give an error that can be handled. It should not terminate the application.
I'm also wondering why it uses local TCP connections internally?
> To test zeromq after these issues I now kill all tcp connections with sysinternals tcpview. With the knowledge that (even local/loopback) TCP connections can get removed by the OS or other software, I believe it should be able to recover from this, or give an error that can be handled. It should not terminate the application.
Sorry, but if the internal event pipe becomes unreliable, all hell breaks loose; there is no recovery possible from that situation. The entire command-based architecture between the I/O threads and the application threads is based on the assumption that the command channel is reliable, which is a perfectly reasonable assumption given it's in-process communication. The question is, why is Windows killing random loopback TCP connections? Can that be stopped, and if not, why?
I'm also wondering why it uses local TCP connections internally?
It's TCP only on Windows. POSIX systems have decent, usable in-process pipes (socket pairs, IIRC), but at the time this was coded Windows did not. I don't know whether that has changed, as I am not familiar with the Windows APIs. If there is a usable mechanism, pull requests to add support for it would be very, very welcome, since using TCP for this purpose has been a constant pain on Windows.
I agree, loopback connections should be reliable and not get killed, but it happens. We are currently investigating why, but different sources on the internet mention the same issues; most of the time some virtualization is involved (VMware or Hyper-V).
I think I would have used named pipes on Windows, but maybe you already investigated that and deemed them unsuitable; the ZeroMQ codebase is not yet familiar to me.
If my customer allows it, I will try to change the internal event pipe(s) to use named pipes, unless you already know that that is not going to work.
It has been proposed multiple times, but a working implementation has never been done, AFAIK. I wasn't around at that time, so I can't really add any insight.
From what I can read, the problem is that a new polling mechanism, IOCP, would have to be added alongside the current epoll/kqueue/etc., but it doesn't fit well with the model, so it's not trivial to do.
See:
- https://lists.zeromq.org/pipermail/zeromq-dev/2010-September/005941.html
- https://github.com/zeromq/libzmq/issues/153
- https://lists.zeromq.org/pipermail/zeromq-dev/2011-May/011092.html
- http://somdoron.com/2014/11/netmq-iocp/
If someone were to implement a working version of this it would be great of course.
I've looked into it a bit and found that nanomsg does use IOCP; however, even that project sometimes uses a loopback TCP connection (though less often than ZeroMQ).
NetMQ also uses loopback TCP connections, unfortunately.
The biggest problem seems to be that both libraries are written around sockets, and Windows and Linux have different concepts of what a socket can be. So the need for a high-performance, reliable messaging library for Windows remains.
I'm thinking it might be easier to convince my customer to switch to Linux.
Environment
Reproduction
...is difficult, but for at least one user, a Jupyter notebook left running can die with any of the following asserts:
The mailbox assert is longstanding and still open at #1108. Any ideas on what might be the cause of the connection resets or the size mismatch?