zeromq / libzmq

ZeroMQ core engine in C++, implements ZMTP/3.1
https://www.zeromq.org
Mozilla Public License 2.0

Problem: test_security_curve/zap hang on slow machines with no DRAFT APIs #2733

bluca opened this issue 7 years ago (status: Open)

bluca commented 7 years ago

Issue description

test_security_curve and test_security_zap hang roughly 1 run in 5 in slow environments, such as under Valgrind or on overcrowded build VMs on OBS.

Environment

Minimal test code / Steps to reproduce the issue

while valgrind --tool=memcheck --track-origins=yes --expensive-definedness-checks=yes tests/.libs/test_security_curve; do echo; echo; done

This assumes a build with autotools. Adding --vgdb=yes --vgdb-error=0 allows attaching gdb from another terminal to inspect the running process. Compiling with -O0 -ggdb3 is recommended.

What's the actual result? (include assertion message & call stack if applicable)

The test that actually hangs may vary, but the problem is always the same: the server socket (a DEALER with identity "IDENT") does not close properly and the context hangs on zmq_ctx_term.

It's stuck terminating the ctx here: https://github.com/zeromq/libzmq/blob/9c8844fd0840b699194ac89dca9967b6e8d32dec/tests/test_security_curve.cpp#L662

As has happened before, it's waiting to receive on the mailbox at https://github.com/zeromq/libzmq/blob/9c8844fd0840b699194ac89dca9967b6e8d32dec/src/ctx.cpp#L179

(gdb) bt
#0  0x00000000060636ad in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x0000000004e7c5b2 in zmq::signaler_t::wait (this=this@entry=0x693d7a8, timeout_=timeout_@entry=-1)
    at src/signaler.cpp:233
#2  0x0000000004e61376 in zmq::mailbox_t::recv (this=this@entry=0x693d740, cmd_=cmd_@entry=0xffefffbc0, 
    timeout_=timeout_@entry=-1) at src/mailbox.cpp:81
#3  0x0000000004e5060d in zmq::ctx_t::terminate (this=this@entry=0x693d6a0) at src/ctx.cpp:179
#4  0x0000000004e9a62b in zmq_ctx_term (ctx_=0x693d6a0) at src/zmq.cpp:175
#5  0x000000000010af03 in shutdown_context_and_server_side (ctx=0x693d6a0, zap_thread=0x69443a0, server=0x6944400, 
    server_mon=<optimized out>) at tests/test_security_curve.cpp:423
#6  0x000000000010bb1d in main () at tests/test_security_curve.cpp:697
(gdb) p *sockets.items[0]
$5 = {<zmq::own_t> = {<zmq::object_t> = {_vptr.object_t = 0x50b69a0 <vtable for zmq::dealer_t+16>, ctx = 0x6a1bfd0, 
      tid = 6}, options = {sndhwm = 1000, rcvhwm = 1000, affinity = 0, identity_size = 5 '\005', 
      identity = "IDENT", '\000' <repeats 250 times>, rate = 100, recovery_ivl = 10000, multicast_hops = 1, 
      multicast_maxtpdu = 1500, sndbuf = -1, rcvbuf = -1, tos = 0, type = 5, linger = 0, connect_timeout = 0, 
      tcp_maxrt = 0, reconnect_ivl = 100, reconnect_ivl_max = 0, backlog = 100, maxmsgsize = -1, rcvtimeo = 250, 
      sndtimeo = 250, ipv6 = false, immediate = 0, filter = false, invert_matching = false, recv_identity = false, 
      raw_socket = false, raw_notify = true, socks_proxy_address = "", tcp_keepalive = -1, tcp_keepalive_cnt = -1, 
      tcp_keepalive_idle = -1, tcp_keepalive_intvl = -1, tcp_accept_filters = std::vector of length 0, capacity 0, 
      zap_ipc_creds = false, ipc_uid_accept_filters = std::set with 0 elements, 
      ipc_gid_accept_filters = std::set with 0 elements, ipc_pid_accept_filters = std::set with 0 elements, 
      mechanism = 2, as_server = 1, zap_domain = "", plain_username = "", plain_password = "", 
      curve_public_key = '\000' <repeats 31 times>, 
      curve_secret_key = "N\301̫[j\254.\002\313RDJ\304cڪ\274\245\063#\272\215\257\217\230\357R\260\\J\257", 
      curve_server_key = '\000' <repeats 31 times>, gss_principal = "", gss_service_principal = "", 
      gss_principal_nt = 0, gss_service_principal_nt = 0, gss_plaintext = false, socket_id = 9, conflate = false, 
      handshake_ivl = 30000, connected = true, heartbeat_ttl = 0, heartbeat_interval = 0, heartbeat_timeout = -1, 
      use_fd = -1, bound_device = ""}, terminating = true, sent_seqnum = {value = 2}, processed_seqnum = 2, 
    owner = 0x0, owned = std::set with 0 elements, term_acks = 2}, <zmq::array_item_t<0>> = {
    _vptr.array_item_t = 0x50b6af0 <vtable for zmq::dealer_t+352>, array_index = 0}, <zmq::i_poll_events> = {
    _vptr.i_poll_events = 0x50b6b10 <vtable for zmq::dealer_t+384>}, <zmq::i_pipe_events> = {
    _vptr.i_pipe_events = 0x50b6b48 <vtable for zmq::dealer_t+440>}, connect_rid = "", 
  endpoints = std::multimap with 1 elements = {["tcp://127.0.0.1:46807"] = {first = 0x6a2e390, second = 0x0}}, 
  inprocs = std::multimap with 0 elements, tag = 3735928559, ctx_terminated = true, destroyed = false, 
  mailbox = 0x6a2dda0, pipes = {items = std::vector of length 1, capacity 1 = {0x6a44310}}, poller = 0x6a1ee10, 
  handle = 0x6a52ea0, last_tsc = 4530394302479, ticks = 1, rcvmore = false, clock = {last_tsc = 4530391173258, 
    last_time = 1312581}, monitor_socket = 0x0, monitor_events = 0, last_endpoint = "tcp://127.0.0.1:46807", 
  thread_safe = false, reaper_signaler = 0x0, sync = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, 
        __nusers = 0, __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 16 times>, "\001", '\000' <repeats 22 times>, __align = 0}, attr = {
      __size = "\001\000\000", __align = 1}}, monitor_sync = {mutex = {__data = {__lock = 0, __count = 0, 
        __owner = 0, __nusers = 0, __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 16 times>, "\001", '\000' <repeats 22 times>, __align = 0}, attr = {
      __size = "\001\000\000", __align = 1}}}

(gdb) p sockets.items[0].processed_seqnum 
$17 = 2
(gdb) p sockets.items[0].sent_seqnum 
$18 = {value = 2}
(gdb) p sockets.items[0].term_acks 
$19 = 2

So it never goes through the if at https://github.com/zeromq/libzmq/blob/9c8844fd0840b699194ac89dca9967b6e8d32dec/src/own.cpp#L196 - there is one missing term_ack from an owned object. I have not identified which object yet.
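
To illustrate the mechanism (a minimal, self-contained sketch with hypothetical names, not the actual own_t code): every owner counts outstanding termination acks, and shutdown only completes once that counter drains to zero, so a single child that never acks is enough to stall zmq_ctx_term forever.

#include <cassert>
#include <cstdio>

// Toy model of libzmq's term-ack bookkeeping (hypothetical names). A
// parent sends `term` to each owned child and expects exactly one ack
// back; the counter below mirrors the term_acks field in the gdb dump.
struct owner_t {
    bool terminating = false;
    int term_acks = 0;          // acks still outstanding

    void register_term_acks (int count) { term_acks += count; }

    void process_term_ack () {
        assert (term_acks > 0);
        --term_acks;
        check_term_acks ();
    }

    void check_term_acks () {
        if (terminating && term_acks == 0)
            std::printf ("all children acked; safe to destroy\n");
    }
};

int main () {
    owner_t socket;
    socket.terminating = true;
    socket.register_term_acks (2);  // e.g. a listener and a session
    socket.process_term_ack ();     // first child acks
    // If the second child never acks (the bug here), term_acks stays
    // at 1, check_term_acks never fires and zmq_ctx_term blocks.
    socket.process_term_ack ();     // second child acks -> shutdown done
}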

What's the expected result?

The test should not hang.

bluca commented 7 years ago

Opening an issue rather than discussing this across various PRs, @sigiesec.

This is a serious issue since it happens without DRAFT APIs, which means it blocks the next release.

The symptom started appearing after the recent work to implement security/curve socket events, of which the first PR was https://github.com/zeromq/libzmq/pull/2645. I believe it is not due to that work itself; the changes to the tests were merely enough to bring an existing issue to the surface.

bluca commented 7 years ago

Hangs in the same place as reported here: https://github.com/zeromq/libzmq/issues/2586

And this issue looks familiar: https://github.com/zeromq/libzmq/issues/792

I suspect it's due to the inproc ZAP sockets, but can't prove it yet.

sigiesec commented 7 years ago

I also think the problem was not introduced by #2645, since the changes in the production code only changed the types/parameters of monitoring events. However, the PR added several test cases, which may be more likely to trigger the underlying issue.

I will also keep trying to reproduce the issue under Windows. What I have seen so far is that increasing the load increases the frequency of EPIPE/ECONNABORTED/... errors when running test_security_zap.

sigiesec commented 7 years ago

I still had no "luck" with test_security_zap under Windows: not a single hang in more than 1500 runs. I will now try with test_security_curve. Maybe the problem occurs only in the optimized/release build; I have not tried that yet.

bluca commented 7 years ago

I can reproduce with -O0 -ggdb3, so unfortunately I don't think it's a compiler optimisation issue.

I guess one big difference between *nix and Windows is that the internal pipe is TCP on the latter, so maybe that's enough to make a difference. Not sure how, though.

sigiesec commented 7 years ago

It appears so. I had no luck with test_security_curve on Windows either.

bluca commented 7 years ago

Ok, thank you for checking anyway. I'll have more time to dive into this issue next week.

sigiesec commented 7 years ago

On appveyor, there was one test assertion failure, but not a hang: https://ci.appveyor.com/project/zeromq/libzmq/build/build-1737/job/8de3sw1w4fviq9g9#L1022

jimklimov commented 7 years ago

https://travis-ci.org/zeromq/libzmq/jobs/275664038#L1216

Assertion failed: (events_received <= 1), function test_zap_unsuccessful_status_500, file ../../tests/test_security_zap.cpp, line 153.
FAIL tests/test_security_zap (exit status: 134)
sigiesec commented 7 years ago

I created a 4.2.3 milestone and added it to this issue. Are there any other issues that need to be resolved before 4.2.3?

bluca commented 7 years ago

Fortunately, I don't think so. Unfortunately, I've not had much time to deal with this one yet.

There's another sporadic failure, but it only occurs with DRAFT APIs, so it's less important:

FAIL: tests/test_security_zap
=============================
NULL mechanism
test_zap_protocol_error wrong_version
test_zap_protocol_error wrong_request_id
test_zap_protocol_error wrong_status_invalid
test_zap_protocol_error too_many_parts
test_zap_unsuccessful status 300
test_zap_unsuccessful status 500
Assertion failed: (events_received <= 1), function test_zap_unsuccessful_status_500, file ../../tests/test_security_zap.cpp, line 153.
FAIL tests/test_security_zap (exit status: 134)

https://travis-ci.org/zeromq/libzmq/jobs/275664038

bluca commented 7 years ago

A bit more debugging: the socket_base_t object cannot terminate because its owned tcp_listener_t cannot terminate, which in turn is because the session_base_t it owns does not terminate. Strangely, the session is not listed as an owned object of the listener, despite being launched by it and having it set as its owner:

https://github.com/zeromq/libzmq/blob/master/src/tcp_listener.cpp#L124

I'll keep digging tomorrow.

(gdb) bt
#0  0x00007ffff69956ad in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007ffff7b73ea5 in zmq::signaler_t::wait (this=0x55555576f658, timeout_=-1) at src/signaler.cpp:233
#2  0x00007ffff7b4e0c7 in zmq::mailbox_t::recv (this=0x55555576f5f0, cmd_=0x7fffffffdc40, timeout_=-1)
    at src/mailbox.cpp:81
#3  0x00007ffff7b2fc2f in zmq::ctx_t::terminate (this=0x55555576f550) at src/ctx.cpp:180
#4  0x00007ffff7b9bce1 in zmq_ctx_term (ctx_=0x55555576f550) at src/zmq.cpp:175
#5  0x0000555555557a48 in shutdown_context_and_server_side (ctx=0x55555576f550, zap_thread=0x555555770dc0, 
    server=0x555555772a10, server_mon=0x0, zap_control=0x5555557711f0, zap_handler_stopped=false)
    at tests/testutil_security.hpp:612
#6  0x0000555555558334 in test_zap_errors (
    server_socket_config_=0x5555555566d9 <socket_config_plain_server(void*, void*)>, server_socket_config_data_=0x0, 
    client_socket_config_=0x55555555663c <socket_config_plain_client(void*, void*)>, client_socket_config_data_=0x0)
    at tests/test_security_zap.cpp:314
#7  0x0000555555558690 in main () at tests/test_security_zap.cpp:406
(gdb) up 3
#3  0x00007ffff7b2fc2f in zmq::ctx_t::terminate (this=0x55555576f550) at src/ctx.cpp:180
180         int rc = term_mailbox.recv (&cmd, -1);
(gdb) p sockets
$45 = {items = std::vector of length 1, capacity 8 = {0x555555772a10}}
(gdb) p *(zmq::socket_base_t *)0x555555772a10
$46 = {<zmq::own_t> = {<zmq::object_t> = {_vptr.object_t = 0x7ffff7dd35e8 <vtable for zmq::dealer_t+16>, 
      ctx = 0x55555576f550, tid = 6}, options = {sndhwm = 1000, rcvhwm = 1000, affinity = 0, 
      routing_id_size = 5 '\005', 
      routing_id = "IDENT\000\000\000\000\000\000\000\000\000\000X*wUUU\000\000\a\000\000\000\000\000\000\000ZAPTEST\000\000\000\000\000\000\000\000\000x*wUUU", '\000' <repeats 26 times>, "\230*wUUU", '\000' <repeats 122 times>..., 
      rate = 100, recovery_ivl = 10000, multicast_hops = 1, multicast_maxtpdu = 1500, sndbuf = -1, rcvbuf = -1, 
      tos = 0, type = 5, linger = 0, connect_timeout = 0, tcp_maxrt = 0, reconnect_ivl = 100, reconnect_ivl_max = 0, 
      backlog = 100, maxmsgsize = -1, rcvtimeo = 250, sndtimeo = 250, ipv6 = false, immediate = 0, filter = false, 
      invert_matching = false, recv_routing_id = false, raw_socket = false, raw_notify = true, 
      socks_proxy_address = "", tcp_keepalive = -1, tcp_keepalive_cnt = -1, tcp_keepalive_idle = -1, 
      tcp_keepalive_intvl = -1, tcp_accept_filters = std::vector of length 0, capacity 0, 
      ipc_uid_accept_filters = std::set with 0 elements, ipc_gid_accept_filters = std::set with 0 elements, 
      ipc_pid_accept_filters = std::set with 0 elements, mechanism = 1, as_server = 1, zap_domain = "ZAPTEST", 
      plain_username = "", plain_password = "", curve_public_key = '\000' <repeats 31 times>, 
      curve_secret_key = '\000' <repeats 31 times>, curve_server_key = '\000' <repeats 31 times>, 
      gss_principal = "", gss_service_principal = "", gss_principal_nt = 0, gss_service_principal_nt = 0, 
      gss_plaintext = false, socket_id = 69, conflate = false, handshake_ivl = 30000, connected = true, 
      heartbeat_ttl = 0, heartbeat_interval = 0, heartbeat_timeout = -1, use_fd = -1, bound_device = "", 
      zap_enforce_domain = false}, terminating = true, sent_seqnum = {value = 2}, processed_seqnum = 2, owner = 0x0, 
    owned = std::set with 0 elements, term_acks = 2}, <zmq::array_item_t<0>> = {
    _vptr.array_item_t = 0x7ffff7dd3748 <vtable for zmq::dealer_t+368>, array_index = 0}, <zmq::i_poll_events> = {
    _vptr.i_poll_events = 0x7ffff7dd3768 <vtable for zmq::dealer_t+400>}, <zmq::i_pipe_events> = {
    _vptr.i_pipe_events = 0x7ffff7dd37a0 <vtable for zmq::dealer_t+456>}, connect_routing_id = "", 
  endpoints = std::multimap with 1 elements = {["tcp://127.0.0.1:46643"] = {first = 0x5555557734d0, second = 0x0}}, 
  inprocs = std::multimap with 0 elements, tag = 3735928559, ctx_terminated = true, destroyed = false, 
  mailbox = 0x555555771900, pipes = {items = std::vector of length 1, capacity 1 = {0x7fffec00a1a0}}, 
  poller = 0x55555576f7e0, handle = 0x7ffff0000e00, last_tsc = 89336189467941, ticks = 1, rcvmore = false, clock = {
    last_tsc = 89336189455654, last_time = 26313214}, monitor_socket = 0x0, monitor_events = 0, 
  last_endpoint = "tcp://127.0.0.1:46643", thread_safe = false, reaper_signaler = 0x0, sync = {mutex = {__data = {
        __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 1, __spins = 0, __elision = 0, __list = {
          __prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\001", '\000' <repeats 22 times>, 
      __align = 0}, attr = {__size = "\001\000\000", __align = 1}}, monitor_sync = {mutex = {__data = {__lock = 0, 
        __count = 0, __owner = 0, __nusers = 0, __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, 
          __next = 0x0}}, __size = '\000' <repeats 16 times>, "\001", '\000' <repeats 22 times>, __align = 0}, 
    attr = {__size = "\001\000\000", __align = 1}}}
(gdb) p *(zmq::tcp_listener_t *)0x5555557734d0
$47 = {<zmq::own_t> = {<zmq::object_t> = {_vptr.object_t = 0x7ffff7dd72c8 <vtable for zmq::tcp_listener_t+16>, 
      ctx = 0x55555576f550, tid = 2}, options = {sndhwm = 1000, rcvhwm = 1000, affinity = 0, 
      routing_id_size = 5 '\005', 
      routing_id = "IDENT\000\000\000\000\000\000\000\000\000\000X*wUUU\000\000\a\000\000\000\000\000\000\000ZAPTEST\000\000\000\000\000\000\000\000\000x*wUUU", '\000' <repeats 26 times>, "\230*wUUU", '\000' <repeats 122 times>..., 
      rate = 100, recovery_ivl = 10000, multicast_hops = 1, multicast_maxtpdu = 1500, sndbuf = -1, rcvbuf = -1, 
      tos = 0, type = 5, linger = -1, connect_timeout = 0, tcp_maxrt = 0, reconnect_ivl = 100, 
      reconnect_ivl_max = 0, backlog = 100, maxmsgsize = -1, rcvtimeo = -1, sndtimeo = -1, ipv6 = false, 
      immediate = 0, filter = false, invert_matching = false, recv_routing_id = false, raw_socket = false, 
      raw_notify = true, socks_proxy_address = "", tcp_keepalive = -1, tcp_keepalive_cnt = -1, 
      tcp_keepalive_idle = -1, tcp_keepalive_intvl = -1, tcp_accept_filters = std::vector of length 0, capacity 0, 
      ipc_uid_accept_filters = std::set with 0 elements, ipc_gid_accept_filters = std::set with 0 elements, 
      ipc_pid_accept_filters = std::set with 0 elements, mechanism = 1, as_server = 1, zap_domain = "ZAPTEST", 
      plain_username = "", plain_password = "", curve_public_key = '\000' <repeats 31 times>, 
      curve_secret_key = '\000' <repeats 31 times>, curve_server_key = '\000' <repeats 31 times>, 
      gss_principal = "", gss_service_principal = "", gss_principal_nt = 0, gss_service_principal_nt = 0, 
      gss_plaintext = false, socket_id = 69, conflate = false, handshake_ivl = 30000, connected = false, 
      heartbeat_ttl = 0, heartbeat_interval = 0, heartbeat_timeout = -1, use_fd = -1, bound_device = "", 
      zap_enforce_domain = false}, terminating = true, sent_seqnum = {value = 2}, processed_seqnum = 2, 
    owner = 0x555555772a10, owned = std::set with 0 elements, 
    term_acks = 1}, <zmq::io_object_t> = {<zmq::i_poll_events> = {
      _vptr.i_poll_events = 0x7ffff7dd7388 <vtable for zmq::tcp_listener_t+208>}, poller = 0x5555557703c0}, 
  address = {_vptr.tcp_address_t = 0x7ffff7dd2af0 <vtable for zmq::tcp_address_t+16>, address = {generic = {
        sa_family = 2, sa_data = "\000\000\177\000\000\001\000\000\000\000\000\000\000"}, ipv4 = {sin_family = 2, 
        sin_port = 0, sin_addr = {s_addr = 16777343}, sin_zero = "\000\000\000\000\000\000\000"}, ipv6 = {
        sin6_family = 2, sin6_port = 0, sin6_flowinfo = 16777343, sin6_addr = {__in6_u = {
            __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 
              0}}}, sin6_scope_id = 0}}, source_address = {generic = {sa_family = 0, 
        sa_data = '\000' <repeats 13 times>}, ipv4 = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, 
        sin_zero = "\000\000\000\000\000\000\000"}, ipv6 = {sin6_family = 0, sin6_port = 0, sin6_flowinfo = 0, 
        sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, 
            __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}}, _has_src_addr = false}, s = -1, 
  handle = 0x7fffec00aa50, socket = 0x555555772a10, endpoint = "tcp://127.0.0.1:0"}
(gdb) p *(zmq::session_base_t *)0x7fffec0014f0
$48 = {<zmq::own_t> = {<zmq::object_t> = {_vptr.object_t = 0x7ffff7dd6700 <vtable for zmq::session_base_t+16>, 
      ctx = 0x55555576f550, tid = 2}, options = {sndhwm = 1000, rcvhwm = 1000, affinity = 0, 
      routing_id_size = 5 '\005', 
      routing_id = "IDENT\000\000\000\000\000\000\000\000\000\000X*wUUU\000\000\a\000\000\000\000\000\000\000ZAPTEST\000\000\000\000\000\000\000\000\000x*wUUU", '\000' <repeats 26 times>, "\230*wUUU", '\000' <repeats 122 times>..., 
      rate = 100, recovery_ivl = 10000, multicast_hops = 1, multicast_maxtpdu = 1500, sndbuf = -1, rcvbuf = -1, 
      tos = 0, type = 5, linger = -1, connect_timeout = 0, tcp_maxrt = 0, reconnect_ivl = 100, 
      reconnect_ivl_max = 0, backlog = 100, maxmsgsize = -1, rcvtimeo = -1, sndtimeo = -1, ipv6 = false, 
      immediate = 0, filter = false, invert_matching = false, recv_routing_id = false, raw_socket = false, 
      raw_notify = true, socks_proxy_address = "", tcp_keepalive = -1, tcp_keepalive_cnt = -1, 
      tcp_keepalive_idle = -1, tcp_keepalive_intvl = -1, tcp_accept_filters = std::vector of length 0, capacity 0, 
      ipc_uid_accept_filters = std::set with 0 elements, ipc_gid_accept_filters = std::set with 0 elements, 
      ipc_pid_accept_filters = std::set with 0 elements, mechanism = 1, as_server = 1, zap_domain = "ZAPTEST", 
      plain_username = "", plain_password = "", curve_public_key = '\000' <repeats 31 times>, 
      curve_secret_key = '\000' <repeats 31 times>, curve_server_key = '\000' <repeats 31 times>, 
      gss_principal = "", gss_service_principal = "", gss_principal_nt = 0, gss_service_principal_nt = 0, 
      gss_plaintext = false, socket_id = 69, conflate = false, handshake_ivl = 30000, connected = false, 
      heartbeat_ttl = 0, heartbeat_interval = 0, heartbeat_timeout = -1, use_fd = -1, bound_device = "", 
      zap_enforce_domain = false}, terminating = false, sent_seqnum = {value = 2}, processed_seqnum = 2, 
    owner = 0x5555557734d0, owned = std::set with 0 elements, 
    term_acks = 0}, <zmq::io_object_t> = {<zmq::i_poll_events> = {
      _vptr.i_poll_events = 0x7ffff7dd67f8 <vtable for zmq::session_base_t+264>}, 
    poller = 0x5555557703c0}, <zmq::i_pipe_events> = {
    _vptr.i_pipe_events = 0x7ffff7dd6830 <vtable for zmq::session_base_t+320>}, active = false, 
  pipe = 0x7fffec00a0b0, zap_pipe = 0x0, terminating_pipes = std::set with 0 elements, incomplete_in = false, 
  pending = true, engine = 0x0, socket = 0x555555772a10, io_thread = 0x55555576f910, has_linger_timer = false, 
  addr = 0x0}
bluca commented 7 years ago

Strangely, the session is not listed as an owned object of the listener, despite being launched by it and having it set as its owner:

That's because the listener has already gone through zmq::own_t::process_term.

The problem is that the session_base_t has not, as shown by the fact that its own_t::terminating flag is still false. But given that the listener_t has, the term command must have been sent.

So the issue appears to be that the session is not processing the term command it gets from its owner.

bluca commented 6 years ago

Running under Valgrind (drd/helgrind) showed that there's a race condition: immediately before closing the socket, the application thread sets linger to 0, but closing the socket then causes the I/O thread to read that option value. Setting the linger to 0 at the beginning instead appears to work as a band-aid - I can no longer reproduce the hang, either on my machine or on OBS.

The race itself needs to be properly fixed though.
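
For illustration, the band-aid amounts to setting the option right after socket creation rather than just before close (a minimal sketch against the public API; the socket type and endpoint are placeholders):

#include <assert.h>
#include <zmq.h>

int main (void)
{
    void *ctx = zmq_ctx_new ();
    void *server = zmq_socket (ctx, ZMQ_DEALER);

    /* Band-aid: set linger to 0 immediately after creation, before any
       I/O-thread activity. Setting it right before zmq_close() races
       with the I/O thread reading the option during termination. */
    int linger = 0;
    int rc = zmq_setsockopt (server, ZMQ_LINGER, &linger, sizeof (linger));
    assert (rc == 0);

    rc = zmq_bind (server, "tcp://127.0.0.1:*");
    assert (rc == 0);

    /* ... exercise the socket as the tests do ... */

    zmq_close (server);
    zmq_ctx_term (ctx);   /* no longer hangs with the early linger */
    return 0;
}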

sigiesec commented 6 years ago

@bluca: This was closed by #2839. Since, as you say, the original race condition is not solved, should we reopen this one or create a new issue?

bluca commented 6 years ago

Let's reopen it, no need to duplicate all the info.

bluca commented 6 years ago

I'm trying to add instrumentation to run through helgrind and drd - there are a LOT of false positives due to the custom atomic operations and lock-free queues. That should help find more race conditions.

I think ultimately we know one of the problems - we are freely setting and reading socket options from both the application thread and the I/O thread without any kind of protection. We'll need to bite the bullet and fix it at some point - perhaps by making all accesses go through the zmq::options_t class as critical sections, and then overriding in socket_base_t the options used in the hot path to use the command queue instead, to avoid killing performance.
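
A minimal sketch of that direction (hypothetical class and member names, not a proposed patch): guard the generic option accessors with a mutex, leaving the hot-path options to be refreshed through the command queue.

#include <map>
#include <mutex>

// Hypothetical mutex-guarded option storage. The slow-path accessors
// (what zmq_setsockopt/zmq_getsockopt would hit) take a lock; hot-path
// readers on the I/O thread would instead keep a private snapshot that
// is only updated by commands processed on that thread.
class guarded_options_t
{
  public:
    void set (int option_, int value_)
    {
        std::lock_guard<std::mutex> guard (sync);
        values[option_] = value_;
        //  Here libzmq would also queue a command so that I/O threads
        //  refresh their snapshots, instead of reading this map.
    }

    int get (int option_) const
    {
        std::lock_guard<std::mutex> guard (sync);
        std::map<int, int>::const_iterator it = values.find (option_);
        return it == values.end () ? -1 : it->second;
    }

  private:
    mutable std::mutex sync;
    std::map<int, int> values;
};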

WallStProg commented 6 years ago

FWIW, I've tried using Thread Sanitizer for a similar purpose, but have not been able to get it to work. What follows is a write-up that may be helpful:

Using TSAN w/ZeroMQ

ZeroMQ presents particular challenges w/TSAN, mostly because ZeroMQ uses non-standard synchronization techniques, which cause TSAN to report lots of false positives.

Some of those false positives can be eliminated by building ZeroMQ to use compiler intrinsics for atomic operations (-DZMQ_HAVE_ATOMIC_INTRINSICS=1), which TSAN understands (https://github.com/google/sanitizers/wiki/ThreadSanitizerAtomicOperations).
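
For example, one possible autotools invocation combining that define with TSAN (the TSAN flags are standard compiler usage, not taken from this write-up):

./configure CPPFLAGS="-DZMQ_HAVE_ATOMIC_INTRINSICS=1" \
            CXXFLAGS="-O1 -g -fsanitize=thread" LDFLAGS="-fsanitize=thread"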

Unfortunately, that isn't enough to reduce the number of false positives to a manageable level. The approach to eliminating false positives with other sanitizers (e.g., ASAN) has been to simply not instrument the problematic code. However, that approach doesn't work w/TSAN -- it breaks TSAN's stack-unwinding code, so stack traces are incorrect. (See https://github.com/google/sanitizers/issues/941 for more information.)

The best compromise appears to be to disable all TSAN instrumentation when building ZeroMQ except for function entry/exit code -- this preserves proper stack frames. Then, at runtime, a called_from_lib:libzmq.so suppression is also required to prevent errors being reported from TSAN's interceptors (for functions like memcmp).
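
That runtime suppression is a one-line entry in a suppressions file passed via TSAN_OPTIONS (the file name here is arbitrary):

# tsan-suppressions.txt
called_from_lib:libzmq.so

TSAN_OPTIONS="suppressions=tsan-suppressions.txt" ./your_test_binary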

Unfortunately, even that isn't enough -- some interceptors are called by libzmq indirectly via libstdc++. According to this (https://github.com/google/sanitizers/issues/949), the only way to eliminate those is with specific suppressions. Some of these false positives are eliminated via filtering in ctc.pl, but many remain.

One possibility is to use the (undocumented) ignore_noninstrumented_modules=1 flag, but doing so also suppresses reports of many legitimate races (https://github.com/google/sanitizers/issues/949).

NOTE: As of this writing, ignore_noninstrumented_modules is not implemented on Linux.

bluca commented 6 years ago

I have a WIP helgrind suppression file, but it's far from complete and I didn't have time to finish it: https://github.com/bluca/libzmq/commit/fb9ee9da7631f9506cbfcd6db29a284ae6e9651e

WallStProg commented 6 years ago

Thanks @bluca. Although to be clear, I haven't been trying to verify the thread-safety of libzmq -- I've simply been trying to exclude libzmq from TSAN's thread-checking in order to get the benefit of TSAN's checking in my own code, without a huge number of false(?) positives from libzmq.

As you point out, verifying the thread-safety of libzmq is a significant undertaking, and as such is not on my radar at this point.

I also haven't used helgrind at all yet, although from the docs it sounds like TSAN might be a better choice for that, given that TSAN at least understands compiler intrinsics for atomic ops (which can be configured for libzmq), while helgrind apparently does not?