zeromq / libzmq

ZeroMQ core engine in C++, implements ZMTP/3.1
https://www.zeromq.org
Mozilla Public License 2.0

4.1.2: Assertion failed: Connection reset by peer (zeromq\src\signaler.cpp:298) #1808

Closed minrk closed 4 years ago

minrk commented 8 years ago

Environment

...is difficult, but for at least one user, leaving a Jupyter notebook running can die with any of the following asserts:

Assertion failed: Connection reset by peer (src\signaler.cpp:181)
Assertion failed: Connection reset by peer (src\signaler.cpp:298)
Assertion failed: nbytes == sizeof (dummy) (src\signaler.cpp:303)
Assertion failed: ok (src\mailbox.cpp:94)

The mailbox assert is longstanding and still open at #1108. Any ideas on what might be the cause of the connection resets or the size mismatch?

somdoron commented 8 years ago

Which socket type are you using?

minrk commented 8 years ago

There will be several SUB and DEALER sockets in the dying process.

somdoron commented 8 years ago

Could it be something like a firewall? Is it a virtual machine or a physical PC?

minrk commented 8 years ago

It could be a firewall, I don't have access to the failing machine. I believe it is a real machine (cc @jveitchmichaelis for more details).

jveitchmichaelis commented 8 years ago

It's a PC - local notebook.

somdoron commented 8 years ago

Few questions:

jveitchmichaelis commented 8 years ago

Sorry - I meant the Jupyter notebook is running as a local server. It's a desktop PC. The Windows firewall is on.

The time varies - I'll try measuring it properly though. It happens both when idle and in the middle of running code (at least so it seems). My desktop is set not to sleep, so I don't think it's that.

I'll try:

1) Jupyter with no notebook open
2) Idle notebook
3) Notebook in an infinite loop

I don't think the particular piece of code matters, the crashes seem arbitrary.

somdoron commented 8 years ago

Can you try with the firewall off as well?

jveitchmichaelis commented 8 years ago

Yep will do, I'll see if turning on more verbose debugging in Jupyter throws up anything as well.

JimChenTaiwan commented 8 years ago

Hi, we have the same situation.

Environment:

  • Windows 10 x64
  • both ZMQ 4.0.4 and 4.1.4 x86
  • Compiler: VS2013 x86 build

Reproduction:

  • Always crashed 2 hours after server & client connected.
  • error code:
  • Assertion failed: Connection reset by peer (......\src\signaler.cpp:298) or
  • Assertion failed: ok (......\src\mailbox.cpp:82)

We have tested this on three computers, and two of them show this crash. Can anyone help us?

somdoron commented 8 years ago

Someone recently changed the signaler to use a random port, which might make the firewall's job harder. I'm not sure whether that change is part of 4.1.4 or 4.0.4, but I suggest compiling with the following commit reverted:

https://github.com/zeromq/libzmq/commit/7e09306cb369f4c345850627f3969be153eaa3cf

Or, bottom line, make sure that signaler_port in config.hpp is set to 5905 and not 0.

Also make sure the firewall allows TCP connections on port 5905 to the application.

Let me know if this helps.
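To make the difference between a signaler port of 0 and a fixed port such as 5905 concrete, here is a minimal Python sketch using a plain loopback listener. It is not libzmq code: `bind_signaler_listener` is a hypothetical helper, and it only illustrates that binding to port 0 lets the OS pick an ephemeral port, which a per-port firewall rule cannot anticipate.

```python
import socket


def bind_signaler_listener(port=0):
    """Bind a loopback TCP listener (hypothetical helper, not a libzmq API).

    port=0 asks the OS for an ephemeral port; a fixed value such as 5905
    always requests that exact port and fails if it is already taken.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", port))
    listener.listen(1)
    return listener


if __name__ == "__main__":
    # With port 0 the chosen port differs from run to run.
    with bind_signaler_listener(0) as listener:
        print("OS-assigned port:", listener.getsockname()[1])
```

A fixed port is easy to whitelist but can collide when several processes on one machine want it, which matches the trade-off raised in this thread.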

JimChenTaiwan commented 8 years ago

OK! Thanks for your suggestion, We will try it.

We also found another problem. When we ran only the "zmq:socket->connect" side of the zmq program, it didn't crash after 2 hours (because it is not connected?), but it crashed when we closed the "zmq:socket" and "zmq::context". If we close the program before 2 hours have passed, it behaves correctly. It showed the error code: Assertion failed: Connection reset by peer (......\src\signaler.cpp:252)

hintjens commented 8 years ago

We need to make this configurable, since reusing the same port leads to different problems. Hence that patch.

JimChenTaiwan commented 8 years ago

[pic1] 1456385320653 [pic2] 1456385365827

Hi, we only ran the built-in "local_thr.exe" as the test program and didn't run the client program "remote_thr.exe". After 2 hours, we found that two TCP port states had changed from "ESTABLISHED" to "FIN_WAIT1", as you can see in [pic1]. After a while, the "local_thr.exe" test program crashed, as [pic2] shows. Why does it always happen after 2 hours? Thanks a lot.

somdoron commented 8 years ago

Did you try same test with firewall disabled?

JimChenTaiwan commented 8 years ago

Yes, I have run some tests with the firewall disabled, and it still crashed after 2 hours. I haven't tested "local_thr.exe" with the firewall disabled yet; I'll try it afterward.

We have tested "local_thr.exe" with firewall disabled, and it crashed after 2 hours. (2016/2/26)

somdoron commented 8 years ago

@JimChenTaiwan are you using windows 10 as well?

JimChenTaiwan commented 8 years ago

Yes, I test it on Windows 10 x64 ver.

ghost commented 8 years ago

I'm seeing what I think is identical behaviour here as well.

Running the following:

After almost exactly two hours the following assert fires:

Assertion failed: Connection reset by peer (..\..\..\..\src\signaler.cpp:351)

Similar behaviour occurs if compiled with poll enabled rather than select:

Assertion failed: pfd.revents & POLLIN (..\..\..\..\src\signaler.cpp:248)

Running the same build of our software on Windows 7 doesn't demonstrate this issue.

I'm using inproc connections and running inproc_thr (with a Sleep call added at the bottom of the for loop in worker to slow things down to hit 2 hours) reproduces the issue. Invoked as inproc_thr 64 1000000. After two hours the assert above fires.

Additionally:

somdoron commented 8 years ago

@mseagrief does it matter if the socket is idle or not?

somdoron commented 8 years ago

It seems like an issue with Windows 10. I tried to google it without much success. Can you open an issue with Microsoft?

ghost commented 8 years ago

@somdoron I'll do some tests today with the socket idle and report back.

As to reporting to Microsoft, I'm not sure what I'd be reporting. I'm not sufficiently familiar with the ZeroMQ internals to make a useful bug report. That said I'll spend some time in the next few days doing so and see if I can isolate the issue

ghost commented 8 years ago

ZeroMQ 3.2.5 doesn't appear to exhibit this behaviour.

Just running zmq_init and then sleeping for 3 hours doesn't crash. However 5 minutes after starting to use the sockets (using inproc_thr) the assert fires.

Running zmq_init and then creating and binding a PULL socket and then sleeping asserts after 2hrs.

ghost commented 8 years ago

It looks like the code added to address #1608 causes this behaviour to be triggered under Windows 10. Both the poll and select cases appear to work without triggering the assert with the code block in signaler.cpp commented out.

As for what the underlying issue is I'm unsure. I tried that code as it was the most significant change I could see in signaler.cpp comparing 3.2.5 against the current git master.

tnthao commented 7 years ago

Having the same issue. An easy way to reproduce this problem is to have a large number of subscribers active and then simulate a congested/broken network using a tool called clumsy (I used ver 0.2). In the tool, enable "drop" and "out of order" and increase the percentages. Depending on your network load, it should crash eventually.

lytboris commented 7 years ago

#2334 is about connection reset by peer too. Take a look.

SylvainCorlay commented 7 years ago

I am seeing a lot of these Assertion failed: nbytes == sizeof (dummy) in signaler.cpp:364.

PolarNick239 commented 7 years ago

We use jupyter qtconsole as an embedded Python console in our application, and many of our users have hit the same bug, which crashes the application. It happens even when the application is idle.

Affected OS:

Problem encountered with ZeroMQ 4.1.5 (PyZMQ 15.4.0) and ZeroMQ 4.1.6 (PyZMQ 16.0.2).

Error message: Connection reset by peer

Stacktrace:

libzmq.cp35-win_amd64.pyd!zmq::zmq_abort(const char * errmsg_) Line 85 C++
libzmq.cp35-win_amd64.pyd!zmq::signaler_t::recv() Line 303 C++
libzmq.cp35-win_amd64.pyd!zmq::mailbox_t::recv(zmq::command_t * cmd_, int timeout_) Line 93 C++
libzmq.cp35-win_amd64.pyd!zmq::socket_base_t::process_commands(int timeout_, bool throttle_) Line 1054 C++
libzmq.cp35-win_amd64.pyd!zmq::socket_base_t::recv(zmq::msg_t * msg_, int flags_) Line 955 + 0xd bytes C++
libzmq.cp35-win_amd64.pyd!zmq_msg_recv(zmq_msg_t * msg_, void * s_, int flags_) Line 601 + 0xe bytes C++

Is there any progress?

ghost commented 7 years ago

@PolarNick239 for my use case I reverted out the changes in #1608 and compiled up my own binaries. I've been successfully using them for some time now, albeit with fairly light workloads.

It's on my list to re-visit this and bump up the version of ZeroMQ we're using to see if it's resolved, but not high priority at the moment as it is working.

I spent a fair while digging around using various of the low level tracing tools in Windows, but nothing conclusive and they're not tools I've used before so I wasn't 100% certain I was looking in the correct place.

PolarNick239 commented 7 years ago

@mseagrief Thanks! I will run a test with #1629 reverted and post my results here when I have time.

caidongyun commented 7 years ago

Help! Ports are occupied; who is connecting? Port 12580 is occupied: https://github.com/zeromq/libzmq/issues/2598

UniversalAE commented 7 years ago

I am having this problem too. Consistent crashes at around 2 hours. Running PUB/SUB exclusively. May be associated with simultaneous use of websockets for a WSS connection to a remote host in the app running the ZMQ PUB server. C# app with WEBSOCKET-sharp and ZMQ C# CLR. I have disabled WEBSOCKET-sharp and this reduces the frequency of events. ZMQ is unusable unless this problem is resolved. I need apps to run 24/7 with no unresolvable events. WEBSOCKET-sharp with no ZMQ operates 24/7 with no events. I'm currently looking for alternative methods for interprocess communication that will take the place of ZMQ functionality. This thread is 1.5 years old with no suggested resolution. That is unacceptable for a problem so consistently reproduced.

Other info: Windows 10 64-bit, Windows 7 64-bit. VS2017/NuGet. The error event always occurs in signaler.cpp, at a variety of line numbers, on both the PUB and SUB side simultaneously. All apps crash to the OS simultaneously. There is no way to wrap this in a try-catch: the assertion error does not appear to be propagated up the call stack and so cannot be contained.
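Why a try-catch cannot contain this kind of failure can be shown with a small Python sketch. It assumes, as the `zmq::zmq_abort` frame in the stack trace above suggests, that a failed assertion ends in a process-level abort rather than a language-level exception. `os.abort()` here is only a stand-in for that native abort, and `run_child` is a hypothetical helper.

```python
import subprocess
import sys

# Child program: the abort happens inside a try block, yet the handler
# never runs because abort terminates the process immediately.
CHILD = """
import os
try:
    os.abort()  # stand-in for a failed native assertion
except BaseException:
    print("caught")
"""


def run_child():
    """Run the child in a separate interpreter and capture its result."""
    return subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    result = run_child()
    print("exit status:", result.returncode)
    print("handler ran:", "caught" in result.stdout)
```

Isolating the messaging code in a child process, as this sketch does, is one way a supervisor can at least observe and restart after such an abort.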

UniversalAE commented 7 years ago

As a followup to the issue comment I made 5 days ago. I have completed my search for an alternate MQ. I have successfully implemented the same process with RabbitMQ. I have not had to encapsulate anything with try-catch. It simply works as expected with no assertion errors. I now have 4 simultaneous applications running with 1 PUB and 3 SUB end points in each all intercommunicating to each other at 5ms intervals for a period of days and still running, even through random externally induced TCP communication drop out. I urge everyone to stop wasting their time with ZeroMQ. ZeroMQ is damaged with no prospect for resolution. ZeroMQ has excellent documentation. That's as far as it goes. Regretfully, I'm abandoning ZeroMQ.

ghost commented 7 years ago

Thanks for the tip re: RabbitMQ. I'll take a look, I've been putting off upgrading our ZMQ binaries for a long while now.

We're still running locally built binaries with #1629 removed without issue. But I wholly agree about assertions like these firing in production code being deeply broken.

UniversalAE commented 7 years ago

The move to RabbitMQ I reported 17 days ago has been so successful I feel compelled to report it. ZERO, truly ZERO, errors since the last report. It's been nearly a month of 24/7 messaging between all processes. Millions of transactions. Perhaps billions of flawless object messaging transactions.

sigiesec commented 7 years ago

The code locations of the reported assertions are: https://github.com/zeromq/zeromq4-1/blob/575da3ec7ad2cdb85386d7ec2459b550b884c332/src/signaler.cpp#L181 corresponding to https://github.com/zeromq/libzmq/blob/9c8844fd0840b699194ac89dca9967b6e8d32dec/src/signaler.cpp#L192

https://github.com/zeromq/zeromq4-1/blob/575da3ec7ad2cdb85386d7ec2459b550b884c332/src/signaler.cpp#L298 corresponding to https://github.com/zeromq/libzmq/blob/9c8844fd0840b699194ac89dca9967b6e8d32dec/src/signaler.cpp#L313

https://github.com/zeromq/zeromq4-1/blob/575da3ec7ad2cdb85386d7ec2459b550b884c332/src/signaler.cpp#L303 corresponding to https://github.com/zeromq/libzmq/blob/9c8844fd0840b699194ac89dca9967b6e8d32dec/src/signaler.cpp#L318

https://github.com/zeromq/zeromq4-1/blob/575da3ec7ad2cdb85386d7ec2459b550b884c332/src/mailbox.cpp#L94 corresponding to https://github.com/zeromq/libzmq/blob/9c8844fd0840b699194ac89dca9967b6e8d32dec/src/mailbox.cpp#L99

sigiesec commented 7 years ago

The first location, in signaler_t::send was changed twice ca. 6 months ago, by #2360 and #2362. The resulting state is quite contradictory, as #2362 meant to revert #2360, but left unreachable code.

bluca commented 7 years ago

Could you please send a fix for it?

sigiesec commented 7 years ago

@bluca: I can send a fix to remove the unreachable code, but I think there was a point in the original #2360 PR that is related to the issue reported in the ticket. Probably, the send loop should actually be continued at least for some error codes.
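The idea of continuing the send loop for some error codes could look roughly like the sketch below. It is written in Python against a plain socket rather than libzmq's C++ signaler_t::send, and the choice of which errors count as transient (interruption and would-block here) is an assumption for illustration, not what #2360 or #2362 actually did. `send_signal` and `max_attempts` are hypothetical names.

```python
import socket


def send_signal(sock, max_attempts=3):
    """Send one dummy byte, retrying only on errors assumed to be transient.

    Returns True once the byte is sent, False if every attempt hit a
    transient error. Any other error (for example ConnectionResetError)
    propagates to the caller unchanged.
    """
    dummy = b"\0"
    for _ in range(max_attempts):
        try:
            nbytes = sock.send(dummy)
        except (InterruptedError, BlockingIOError):
            continue  # assumed transient: try again
        if nbytes == len(dummy):
            return True
    return False


if __name__ == "__main__":
    a, b = socket.socketpair()
    print("sent:", send_signal(a))
    print("received:", b.recv(1))
    a.close()
    b.close()
```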

sigiesec commented 7 years ago

@UniversalAE It's great to hear that you found a working solution for your use case. Please be aware that ZeroMQ and RabbitMQ are not easily comparable, they have different strengths and weaknesses for a particular problem, and operate on different levels of abstraction. RabbitMQ is a message broker, while ZeroMQ is a message queuing library (which can be used to build a broker). Of course, it would be great to resolve the apparent bug in ZeroMQ reported in this issue as soon as possible.

UniversalAE commented 7 years ago

Sigiesec, I agree that the two products sit at different levels of abstraction. I actually prefer the ZeroMQ approach. But in its current state of operation ZeroMQ is a severely damaged product. I need it to operate without error 24/7, indefinitely. ZeroMQ operates for 2 hours and then fails with an unresolvable crash to the OS, all emanating from signaler.cpp. For over a year and a half others have reported the same problem. ZeroMQ is simply unusable in this state. I am not prepared to debug other people's code, nor do I have the time, and I have moved on. The fact that this has been a problem for so long, reported by so many, is in itself deplorable.

yoshi4cs commented 6 years ago

I had the same issue but resolved it. I have anaconda 5.0.1 (pyzmq 16.0.2) installed on Windows 10 x64. After running Jupyter notebook, it always crashed in 2 hours with assertion failure errors.

My cause should be related to TDI filter (but I don't fully understand what it is). In my case, I uninstalled Networx and now it's working perfectly. This comment helped me identifying the cause.

Hope this helps other people.

anonEmoss commented 6 years ago

Uninstalling Networx has fixed this issue for me too. Thank you!

crosscode-nl commented 6 years ago

This happens also in libzmq-4.2.2, Windows build.

Assertion failed: Connection reset by peer (........\src\signaler.cpp:352)
Assertion failed: Connection reset by peer (........\src\signaler.cpp:352)
Assertion failed: Connection reset by peer (........\src\signaler.cpp:352)

Unfortunately, a library like this that terminates an application because of a connection reset by peer is unusable.

Is this behavior the same on posix platforms?

bluca commented 6 years ago

@crosscode-nl try with the latest code from master; there were a few fixes in that area. Also check for the Networx program others are mentioning, as it seems to interfere with the platform.

crosscode-nl commented 6 years ago

The latest master branch does not solve this issue.

No software like Networx is installed.

It is installed in a VMWare environment though. I noticed localhost connections get killed sometimes. (Or even blackholed.)

To test zeromq after these issues I now kill all tcp connections with sysinternals tcpview. With the knowledge that (even local/loopback) TCP connections can get removed by the OS or other software, I believe it should be able to recover from this, or give an error that can be handled. It should not terminate the application.

I'm also wondering why it uses local TCP connections internally?
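As an illustration of "an error that can be handled", the Python sketch below builds a loopback TCP pair, forcibly resets one end, and shows the reset arriving at the other end as an ordinary exception. This is plain socket code, not libzmq, and all function names are hypothetical. Forcing the reset through SO_LINGER with a zero timeout is an assumption made to get a reproducible reset; it is not a claim about what Windows, VMWare or tcpview do.

```python
import socket
import struct
import sys


def loopback_pair():
    """Create a connected (reader, writer) TCP pair over 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    writer = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    writer.connect(listener.getsockname())
    reader, _ = listener.accept()
    listener.close()
    return reader, writer


def abort_connection(sock):
    """Close with linger on and timeout 0, so the peer sees a reset."""
    fmt = "hh" if sys.platform == "win32" else "ii"
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack(fmt, 1, 0))
    sock.close()


def read_signal(reader):
    """Return 'signal', 'closed' or 'reset' instead of aborting."""
    try:
        data = reader.recv(1)
    except ConnectionResetError:
        return "reset"
    return "signal" if data else "closed"


if __name__ == "__main__":
    reader, writer = loopback_pair()
    writer.send(b"\0")
    print(read_signal(reader))
    abort_connection(writer)
    print(read_signal(reader))
    reader.close()
```

Detecting the reset is the easy part; what the caller can safely do afterwards depends on what state was riding on that channel.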

bluca commented 6 years ago

To test zeromq after these issues I now kill all tcp connections with sysinternals tcpview. With the knowledge that (even local/loopback) TCP connections can get removed by the OS or other software, I believe it should be able to recover from this, or give an error that can be handled. It should not terminate the application.

Sorry, but if the internal event pipe becomes unreliable, all hell breaks loose. There is no recovery possible from that situation. The entire command-based architecture between the I/O threads and the application threads is based on the assumption that the command channel is reliable - which is a perfectly reasonable assumption, given it's in-process communication. The question is, why is Windows killing random loopback TCP connection? Can that be stopped, and if not, why?

I'm also wondering why it uses local TCP connections internally?

It's TCP only on Windows. Posix systems have decent and usable inproc pipes (socket pairs IIRC), but at the time this was coded Windows did not. I do not know if that has changed as I am not familiar with Windows APIs. If there is a usable mechanism, pull requests to add support for it would be very, very, very welcome, since using TCP for this purpose has been a constant pain on Windows.
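The signalling pattern described here can be sketched in a few lines of Python. This is a simplified model, not the libzmq implementation: the `Signaler` class and its method names are hypothetical, and `socket.socketpair()` stands in for whichever in-process channel the platform provides. The check on the number of bytes received mirrors the `nbytes == sizeof (dummy)` assertion quoted at the top of this issue.

```python
import select
import socket


class Signaler:
    """One-byte wake-up channel between two threads (illustrative model)."""

    def __init__(self):
        # A connected pair of sockets living inside one process.
        self._reader, self._writer = socket.socketpair()

    def send(self):
        """Wake the other side by writing a single dummy byte."""
        dummy = b"\0"
        nbytes = self._writer.send(dummy)
        if nbytes != len(dummy):
            raise RuntimeError("short write on signal channel")

    def wait(self, timeout):
        """Return True if a signal is pending within timeout seconds."""
        readable, _, _ = select.select([self._reader], [], [], timeout)
        return bool(readable)

    def recv(self):
        """Consume exactly one pending signal."""
        dummy = self._reader.recv(1)
        if len(dummy) != 1:
            raise RuntimeError("signal channel closed")

    def close(self):
        self._reader.close()
        self._writer.close()


if __name__ == "__main__":
    signaler = Signaler()
    signaler.send()
    print("pending:", signaler.wait(1.0))
    signaler.recv()
    print("pending after recv:", signaler.wait(0))
    signaler.close()
```

Everything above the channel assumes that a byte written is a byte delivered, which is why a reset on this pair is treated as fatal rather than retried.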

crosscode-nl commented 6 years ago

I agree, loopback connections should be reliable, and not get killed, but it happens. We are currently investigating why it happens, but different sources on the internet mention the same issues. Most of the times some virtualization is involved. (VMWare or Hyper-V)

I think I would have used named pipes on Windows, but maybe you investigated that already and deemed them not suitable. The codebase of zeromq is not known to me yet.

If I'm allowed by my customer then I will try to change the internal event pipe(s) to use named pipes, unless you already know that that is not going to work.

bluca commented 6 years ago

It has been proposed multiple times but a working implementation has never been done AFAIK. I wasn't around at that time so I can't really add any insight.

From what I can read, the problem is that a new polling mechanism should be added together with the current epoll/kqueue/etc, IOCP, but it doesn't fit well with the model so it's not trivial to do.

See:

https://lists.zeromq.org/pipermail/zeromq-dev/2010-September/005941.html
https://github.com/zeromq/libzmq/issues/153
https://lists.zeromq.org/pipermail/zeromq-dev/2011-May/011092.html
http://somdoron.com/2014/11/netmq-iocp/

If someone were to implement a working version of this it would be great of course.

crosscode-nl commented 6 years ago

I've looked into it a bit and found that nanomsg does use IOCP, however, even that project uses a loopback TCP connection sometimes (but less than ZeroMQ).

Also netmq uses loopback TCP connections, unfortunately.

The biggest problem seems to be that both libraries are written around sockets, and Windows and Linux have different concepts of what can be sockets. So basically the need for a high performance and reliable messaging library for Windows exists.

I'm thinking it might be easier to convince my customer to switch to Linux.