zeromq / zeromq4-1

ZeroMQ 4.1.x stable release branch - bug fixes only
GNU General Public License v3.0
125 stars 137 forks source link

Backport fix libzmq #1584 #135

Open msune opened 8 years ago

msune commented 8 years ago

Backport fix to 4.1.X "fixed zmq assertion in signaler.cpp under ubuntu #1584"

somdoron commented 8 years ago

Anyway to test it?

msune commented 8 years ago

I am not sure.

This assertion was happening in a not-so-fast system with a very heavy load of messages. I applied (manually): https://github.com/zeromq/libzmq/pull/1584 solves the issue.

It is a pretty straight back-port. However, after leaving the system for some time, another assertion is hit (immediately on the return of that function):

(gdb) bt
#0  0x000000000242c07b in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00000000024a28d8 in abort ()
#2  0x00000000023779cb in zmq::zmq_abort (errmsg_=0x2e360a2 "ok") at /home/marc/volta/builder/libzmq/src/err.cpp:83
#3  0x0000000002378304 in zmq::mailbox_t::recv (this=0x7fffdc009da0, cmd_=0x7fffe7ffe7b0, timeout_=0) at /home/marc/volta/builder/libzmq/src/mailbox.cpp:94
#4  0x0000000002388998 in zmq::socket_base_t::process_commands (this=0x7fffdc0099e0, timeout_=0, throttle_=true) at /home/marc/volta/builder/libzmq/src/socket_base.cpp:1044
#5  0x000000000238820e in zmq::socket_base_t::send (this=0x7fffdc0099e0, msg_=0x7fffe7ffe970, flags_=1) at /home/marc/volta/builder/libzmq/src/socket_base.cpp:829
#6  0x000000000236af19 in s_sendmsg (s_=0x7fffdc0099e0, msg_=0x7fffe7ffe970, flags_=1) at /home/marc/volta/builder/libzmq/src/zmq.cpp:346

That is:

68│ int zmq::mailbox_t::recv (command_t *cmd_, int timeout_)
69│ {
70│     //  Try to get the command straight away.
71│     if (active) {
72│         if (cpipe.read (cmd_))
73│             return 0;
74│
75│         //  If there are no more commands available, switch into passive state.
76│         active = false;
77│     }
78│
79│     //  Wait for signal from the command sender.
80│     const int rc = signaler.wait (timeout_);
81│     if (rc == -1) {
82│         errno_assert (errno == EAGAIN || errno == EINTR);
83│         return -1;
84│     }
85│
86│     //  Receive the signal.
87│     signaler.recv ();
88│
89│     //  Switch into active state.
90│     active = true;
91│
92│     //  Get a command.
93│     const bool ok = cpipe.read (cmd_);
94├>    zmq_assert (ok);

So I guess fix is not complete and problems are of a bigger scope... I am not sure if moving to 4.2.X (but I was experiencing similar issues, that's why I downgraded to 4.1.6).

For the original assertion, I think EAGAIN and EINTR, at least should be treated, as read() can fail.

You can reject the PR if you wish. But please let me know if there is some fix available somwhere for this, or if I have to implement a retrial to handle these 2 error codes and contribute back.

sourcedelica commented 7 years ago

Did you end up successfully porting this to 4.1.x? We've run into #1583 on 4.1.x too. I tried building my app with with 4.2.2 but ran into lots of problems right off the bat. Trying to decide now whether to forge ahead with 4.2.x or try to get this working in 4.1.x.

bluca commented 7 years ago

What problems are you having with 4.2.x? It's API/ABI compatible so it should be a smooth upgrade from 4.1.x

sourcedelica commented 7 years ago

Hi Luca - thanks for checking with me on this.

I ran the the tests for our library that is a layer on top of ZMQ using 4.1.6 plus the patch from this PR and using 4.2.2.

I manually applied the change from this PR to 4.1.6 and our tests ran OK though I've only run them a few times.

For 4.2.2, I've found that most of the ZMQ-related tests pass, but there are a few that crash each time they run. Here are the backtraces cut off after the first custom class.

It's quite possible that the bug is in our code but is latent in 4.1.6.

#0  0x00007ffff1925a98 in raise () from /lib64/libc.so.6
#1  0x00007ffff192772a in abort () from /lib64/libc.so.6
#2  0x0000000000e90029 in zmq::zmq_abort (errmsg_=errmsg_@entry=0x7ffff1a778a2 "Bad address")
    at src/err.cpp:87
#3  0x0000000000eb1495 in zmq::fq_t::recvpipe (this=this@entry=0x7fffdedb2fb8, 
    msg_=msg_@entry=0x7fffdedb3040, pipe_=pipe_@entry=0x7fffda5d49b8) at src/fq.cpp:92
#4  0x0000000000eb8ed2 in zmq::router_t::xhas_in (this=0x7fffdedb2b00) at src/router.cpp:385
#5  0x0000000000e9bb06 in has_in (this=0x7fffdedb2b00) at src/socket_base.cpp:1281
#6  zmq::socket_base_t::getsockopt (this=0x7fffdedb2b00, option_=option_@entry=15, 
    optval_=optval_@entry=0x7fffda5d4a54, optvallen_=optvallen_@entry=0x7fffda5d4a58)
    at src/socket_base.cpp:417
#7  0x0000000000e83d50 in zmq::socket_poller_t::wait (this=0x7fffdedaffc8, 
    events_=0x7fffdedabf40, n_events_=6, timeout_=-1) at src/socket_poller.cpp:455
#8  0x0000000000e8326d in zmq_poller_poll (timeout_=<optimized out>, nitems_=<optimized out>, 
    items_=0x7fffda5d4b80) at src/zmq.cpp:816
#9  zmq_poll (items_=0x7fffda5d4b80, nitems_=<optimized out>, timeout_=-1) at src/zmq.cpp:866
#10 0x0000000000a2fb55 in zmq::poll (items_=0x7fffda5d4b80, nitems_=6, timeout_=-1)
    at /media/sf_work/myapp/third-party/zmq/include/zmq.hpp:144
#11 0x0000000000c4a4b4 in myapp::service::common::ZmqLoop::poll (this=0x7fffef33f100, 
    items=0x7fffda5d4ca0, nitems=5, timeout=-1) at ZmqLoop.cpp:126
#0  0x0000000000e9b886 in zmq::socket_base_t::process_commands (this=this@entry=0x7fffe0900f00, 
    timeout_=timeout_@entry=0, throttle_=throttle_@entry=false) at src/socket_base.cpp:1355
#1  0x0000000000e9ba6f in zmq::socket_base_t::getsockopt (this=0x7fffe0900f00, 
    option_=option_@entry=15, optval_=optval_@entry=0x7fffdac5ea54, 
    optvallen_=optvallen_@entry=0x7fffdac5ea58) at src/socket_base.cpp:409
#2  0x0000000000e83d50 in zmq::socket_poller_t::wait (this=0x7fffe08dbd28, 
    events_=0x7fffe08e7f40, n_events_=6, timeout_=-1) at src/socket_poller.cpp:455
#3  0x0000000000e8326d in zmq_poller_poll (timeout_=<optimized out>, nitems_=<optimized out>, 
    items_=0x7fffdac5eb80) at src/zmq.cpp:816
#4  zmq_poll (items_=0x7fffdac5eb80, nitems_=<optimized out>, timeout_=-1) at src/zmq.cpp:866
#5  0x0000000000a2fb55 in zmq::poll (items_=0x7fffdac5eb80, nitems_=6, timeout_=-1)
    at /media/sf_work/myapp/third-party/zmq/include/zmq.hpp:144
#6  0x0000000000c4a4b4 in myapp::service::common::ZmqLoop::poll (this=0x7ffff7f87380, 
    items=0x7fffdac5eca0, nitems=5, timeout=-1) at ZmqLoop.cpp:126
#0  0x00007ffff1925a98 in raise () from /lib64/libc.so.6
#1  0x00007ffff192772a in abort () from /lib64/libc.so.6
#2  0x0000000000e90029 in zmq::zmq_abort (errmsg_=errmsg_@entry=0xf61969 "pfd.revents & POLLIN")
    at src/err.cpp:87
#3  0x0000000000e99ce9 in zmq::signaler_t::wait (this=this@entry=0x7fffd58ce568, timeout_=0)
    at src/signaler.cpp:249
#4  0x0000000000e90765 in zmq::mailbox_t::recv (this=0x7fffd58ce500, cmd_=0x7fffd8256940, 
    timeout_=<optimized out>) at src/mailbox.cpp:81
#5  0x0000000000e9b88c in zmq::socket_base_t::process_commands (this=this@entry=0x7fffe07ef200, 
    timeout_=timeout_@entry=0, throttle_=throttle_@entry=false) at src/socket_base.cpp:1355
#6  0x0000000000e9ba6f in zmq::socket_base_t::getsockopt (this=0x7fffe07ef200, 
    option_=option_@entry=15, optval_=optval_@entry=0x7fffd8256a54, 
    optvallen_=optvallen_@entry=0x7fffd8256a58) at src/socket_base.cpp:409
#7  0x0000000000e83d50 in zmq::socket_poller_t::wait (this=0x7fffd5997fc8, 
    events_=0x7fffd58d3f40, n_events_=6, timeout_=-1) at src/socket_poller.cpp:455
#8  0x0000000000e8326d in zmq_poller_poll (timeout_=<optimized out>, nitems_=<optimized out>, 
    items_=0x7fffd8256b80) at src/zmq.cpp:816
#9  zmq_poll (items_=0x7fffd8256b80, nitems_=<optimized out>, timeout_=-1) at src/zmq.cpp:866
#10 0x0000000000a2fb55 in zmq::poll (items_=0x7fffd8256b80, nitems_=6, timeout_=-1)
    at /media/sf_work/myapp/third-party/zmq/include/zmq.hpp:144
#11 0x0000000000c4a4b4 in myapp::service::common::ZmqLoop::poll (this=0x7fffd5913880, 
    items=0x7fffd8256ca0, nitems=5, timeout=-1) at ZmqLoop.cpp:126
bluca commented 7 years ago

Are those tests by any chance using/creating/closing the same socket from multiple threads?

The first abort is because the (internal) message is trying to close is invalid (most likely closed from somewhere else already), the second one has an invalid poll

sourcedelica commented 7 years ago

Thanks! It turned out to be a race condition that closed sockets from another code path. Due to the race it didn't manifest under 4.1.x. I've fixed it on our side and now all of our tests pass using 4.2.2. I am going to recommend to my team that we upgrade to 4.2.2 to get this patch (for #1584) plus all of the other fixes that came with 4.2.x.

bluca commented 7 years ago

Good to hear. Incidentally, did you manage to reproduce this issue on 4.1, and can you confirm whether or not this PR fixes it?

sourcedelica commented 7 years ago

The problem (Resource Temporarily Unavailable in signaler.cpp) has only come up once in QA testing. We have not seen in in developer testing. Based on the assertion message we found the bug and the PR in the main libzmq repo and then this backport PR.

I applied the backport and our test suite ran fine. I was going to go ahead and upgrade to 4.2.2, but I ran into an issue that 4.2.2 does not compile with Visual Studio 8 (2005). We primarily use GCC 5.2 on Linux but have some apps using VS8 on Windows, and 4.1.x compiles with VS8. It sucks having to support VS8 but I may end up needing to go with 4.1.x anyway.

bluca commented 7 years ago

What's the problem with VS8?

@somdoron given the test done by @sourcedelica I would say it's enough for merging this, what do you think?

sourcedelica commented 7 years ago

Here's some of the errors building 4.2.2 using VS8 (2005):

Target Build:
    Target ZERO_CHECK:
        C:\Program Files (x86)\Microsoft Visual Studio 8\Common7\IDE\..\..\vc\vc
packages\vcbuild.exe C:\dev\zeromq-4.2.2\build-2005\ZERO_CHECK.vcproj "Debug|Win
32"
    Target libzmq:
        C:\Program Files (x86)\Microsoft Visual Studio 8\Common7\IDE\..\..\vc\vc
packages\vcbuild.exe C:\dev\zeromq-4.2.2\build-2005\libzmq.tmp_Debug_Win32.vcpro
j "Debug|Win32"
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(61): error C2061: syntax error : ide
ntifier 'uint32_t'
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C2146: syntax error : mis
sing ';' before identifier 'zmq_msg_routing_id'
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type speci
fier - int assumed. Note: C++ does not support default-int
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type speci
fier - int assumed. Note: C++ does not support default-int
        c:\dev\zeromq-4.2.2\src\precompiled.hpp(64): fatal error C1083: Cannot o
pen include file: 'stdint.h': No such file or directory
    Done building target "libzmq" in project "ZeroMQ.sln" -- FAILED.
    Target RUN_TESTS:
        The project "RUN_TESTS" is not selected for building in solution configu
ration "Debug|Win32".
    Target libzmq-static:
        C:\Program Files (x86)\Microsoft Visual Studio 8\Common7\IDE\..\..\vc\vc
packages\vcbuild.exe C:\dev\zeromq-4.2.2\build-2005\libzmq-static.tmp_Debug_Win3
2.vcproj "Debug|Win32"
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(61): error C2061: syntax error : ide
ntifier 'uint32_t'
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C2146: syntax error : mis
sing ';' before identifier 'zmq_msg_routing_id'
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type speci
fier - int assumed. Note: C++ does not support default-int
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type speci
fier - int assumed. Note: C++ does not support default-int
        c:\dev\zeromq-4.2.2\src\precompiled.hpp(64): fatal error C1083: Cannot o
pen include file: 'stdint.h': No such file or directory
    Done building target "libzmq-static" in project "ZeroMQ.sln" -- FAILED.
Done building target "Build" in project "ZeroMQ.sln" -- FAILED.

Done building project "ZeroMQ.sln" -- FAILED.

Build FAILED.
c:\dev\zeromq-4.2.2\src\zmq_draft.h(61): error C2061: syntax error : identifier
'uint32_t'
c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C2146: syntax error : missing ';'
 before identifier 'zmq_msg_routing_id'
c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type specifier - i
nt assumed. Note: C++ does not support default-int
c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type specifier - i
nt assumed. Note: C++ does not support default-int
c:\dev\zeromq-4.2.2\src\precompiled.hpp(64): fatal error C1083: Cannot open incl
ude file: 'stdint.h': No such file or directory
c:\dev\zeromq-4.2.2\src\zmq_draft.h(61): error C2061: syntax error : identifier
'uint32_t'
c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C2146: syntax error : missing ';'
 before identifier 'zmq_msg_routing_id'
c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type specifier - i
nt assumed. Note: C++ does not support default-int
c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type specifier - i
nt assumed. Note: C++ does not support default-int
c:\dev\zeromq-4.2.2\src\precompiled.hpp(64): fatal error C1083: Cannot open incl
ude file: 'stdint.h': No such file or directory
    0 Warning(s)
    10 Error(s)
bluca commented 7 years ago

I would suggest checking for existing issues on the libzmq repository, and if there's none opening one - sorry but I'm not a windows person so can't help too much with that

sourcedelica commented 7 years ago

Me neither :) I'll check in the libzmq issues. It looks like most of the errors are localized to a couple places.

Thanks a lot for your help!

bluca commented 7 years ago

@somdoron ping - you think these tests are enough for a merge?