zeromq / malamute

The ZeroMQ Enterprise Messaging Broker
Mozilla Public License 2.0
324 stars 77 forks source link

Malamute broker crash on service worker exiting #311

Closed isra17 closed 6 years ago

isra17 commented 6 years ago

Using malamute built from master.

How to reproduce:

Service provider:

from malamute import MalamuteClient
client = MalamuteClient()
client.connect(b'tcp://localhost:9999', 5000, b'service')
client.set_worker(b'poc', b'.*')
while True:
    print(client.recv())

Publisher:

import time
from malamute import MalamuteClient
client = MalamuteClient()
client.connect(b'tcp://localhost:9999', 5000, b'client')
while True:
    client.sendfor(b'poc', b'foo', content=[b'bar']*10)
    time.sleep(1)

malamute.cfg

server
    timeout = "5000"
    background = "0"
    workdir = "."
    verbose = "4"

mlm_server
    security
        mechanism = "null"
    echo = binding Malamute service to 'tcp://*:9999'
    bind
        endpoint = tcp://*:9999
    service
        queue
            size-limit = max     # Maximum size of service queue (in bytes)
            size-warn = max      # Warn if a service queue reaches this size
    mailbox
        size-limit = max        # Maximum mailbox size (integer or "max")
        size-warn = max         # Warn if a mailbox reaches this size

Run malamute first: malamute malamute.cfg Then simply start the service provider and publisher. If it doesn't crash right away, simply restart the service provider until it does. Eventually, malamute will abort with a failed assertion:

malamute: src/zmsg.c:213: zmsg_content_size: Assertionself' failed.`

Stacktrace:

> bt full
#0  0x00007ffff7596860 in raise () from /usr/lib/libc.so.6
No symbol table info available.
#1  0x00007ffff7597ec9 in abort () from /usr/lib/libc.so.6
No symbol table info available.
#2  0x00007ffff758f0bc in __assert_fail_base () from /usr/lib/libc.so.6
No symbol table info available.
#3  0x00007ffff758f133 in __assert_fail () from /usr/lib/libc.so.6
No symbol table info available.
#4  0x00007ffff7951da5 in zmsg_content_size () from /usr/lib/libczmq.so.4
No symbol table info available.
#5  0x00007ffff7bcf1f0 in s_dequeue (self=0x7fffe400efc0, cursor=0x7fffe400f160) at src/mlm_msgq.c:123
        msg_content_size = 140737018654448
        res = 0x7fffe400ef80
#6  0x00007ffff7bcf28e in mlm_msgq_dequeue_cursor (self=0x7fffe400efc0) at src/mlm_msgq.c:144
No locals.
#7  0x00007ffff7bbc944 in s_service_dispatch (self=0x7fffe400f060) at src/mlm_server.c:299
        message = 0x7fffe400ef80
#8  0x00007ffff7bbd528 in dispatch_the_service (self=0x7fffe4012060) at src/mlm_server.c:623
        service = 0x7fffe400f060
        __PRETTY_FUNCTION__ = "dispatch_the_se"...
#9  0x00007ffff7bb9dac in s_client_execute (self=0x7fffe4012060, event=service_offer_event) at src/mlm_server_engine.inc:862
No locals.
#10 0x00007ffff7bbbf38 in s_server_handle_protocol (loop=0x7fffe4003120, reader=0x7fffe4000bc0, argument=0x7fffe4000b20) at src/mlm_server_engine.inc:1564
        rc = 0
        hashkey = 0x7fffe4012510 ".*"
        client = 0x7fffe4012060
        self = 0x7fffe4000b20
#11 0x00007ffff7950e65 in zloop_start () from /usr/lib/libczmq.so.4
No symbol table info available.
#12 0x00007ffff7bbc0e6 in mlm_server (pipe=0x555555772990, args=0x555555555523) at src/mlm_server_engine.inc:1611
        self = 0x7fffe4000b20
        __PRETTY_FUNCTION__ = "mlm_server"
#13 0x00007ffff792d377 in ?? () from /usr/lib/libczmq.so.4
No symbol table info available.
#14 0x00007ffff637308c in start_thread () from /usr/lib/libpthread.so.0
No symbol table info available.
#15 0x00007ffff7657e7f in clone () from /usr/lib/libc.so.6
No symbol table info available.

Note that interrupting the service worker also result in an assert abort from the Python process:

python: src/zmsg.c:200: zmsg_size: Assertion `self' failed.
Aborted (core dumped)

The publisher doesn't abort on SIGINT (It actually require multiples SIGINT before it actually exit).

isra17 commented 6 years ago

Some more logs:

I: 18-04-15 14:12:01 loading configuration from '/tmp/malamute.cfg'...
N: 18-04-15 14:12:01 server is using NULL security
N: 18-04-15 14:12:01 binding Malamute service to 'tcp://*:9999'
D: 18-04-15 14:12:03    597:Malamute                         : start:
D: 18-04-15 14:12:03    597:Malamute                         :     CONNECTION_OPEN
D: 18-04-15 14:12:03    597:Malamute                         :         $ register new client
I: 18-04-15 14:12:03 client 597 address='client' - registering
D: 18-04-15 14:12:03    597:Malamute                         :         $ send OK
D: 18-04-15 14:12:03    597:Malamute                         :         $ check for mailbox messages
D: 18-04-15 14:12:03    597:Malamute                         :         > connected
D: 18-04-15 14:12:03    597:Malamute                         : connected:
D: 18-04-15 14:12:03    597:Malamute                         :     SERVICE_SEND
D: 18-04-15 14:12:03    597:Malamute                         :         $ write message to service
D: 18-04-15 14:12:03    597:Malamute                         :         > connected
D: 18-04-15 14:12:04    597:Malamute                         : connected:
D: 18-04-15 14:12:04    597:Malamute                         :     CONNECTION_PING
D: 18-04-15 14:12:04    597:Malamute                         :         $ send CONNECTION_PONG
D: 18-04-15 14:12:04    597:Malamute                         :         > connected
D: 18-04-15 14:12:04    597:Malamute                         : connected:
D: 18-04-15 14:12:04    597:Malamute                         :     SERVICE_SEND
D: 18-04-15 14:12:04    597:Malamute                         :         $ write message to service
D: 18-04-15 14:12:04    597:Malamute                         :         > connected
D: 18-04-15 14:12:05    597:Malamute                         : connected:
D: 18-04-15 14:12:05    597:Malamute                         :     CONNECTION_PING
D: 18-04-15 14:12:05    597:Malamute                         :         $ send CONNECTION_PONG
D: 18-04-15 14:12:05    597:Malamute                         :         > connected
D: 18-04-15 14:12:05    597:Malamute                         : connected:
D: 18-04-15 14:12:05    597:Malamute                         :     SERVICE_SEND
D: 18-04-15 14:12:05    597:Malamute                         :         $ write message to service
D: 18-04-15 14:12:05    597:Malamute                         :         > connected
D: 18-04-15 14:12:06    597:Malamute                         : connected:
D: 18-04-15 14:12:06    597:Malamute                         :     CONNECTION_PING
D: 18-04-15 14:12:06    597:Malamute                         :         $ send CONNECTION_PONG
D: 18-04-15 14:12:06    597:Malamute                         :         > connected
D: 18-04-15 14:12:06    597:Malamute                         : connected:
D: 18-04-15 14:12:06    597:Malamute                         :     SERVICE_SEND
D: 18-04-15 14:12:06    597:Malamute                         :         $ write message to service
D: 18-04-15 14:12:06    597:Malamute                         :         > connected
D: 18-04-15 14:12:07    597:Malamute                         : connected:
D: 18-04-15 14:12:07    597:Malamute                         :     CONNECTION_PING
D: 18-04-15 14:12:07    597:Malamute                         :         $ send CONNECTION_PONG
D: 18-04-15 14:12:07    597:Malamute                         :         > connected
D: 18-04-15 14:12:07    597:Malamute                         : connected:
D: 18-04-15 14:12:07    597:Malamute                         :     SERVICE_SEND
D: 18-04-15 14:12:07    597:Malamute                         :         $ write message to service
D: 18-04-15 14:12:07    597:Malamute                         :         > connected
D: 18-04-15 14:12:08    597:Malamute                         : connected:
D: 18-04-15 14:12:08    597:Malamute                         :     CONNECTION_PING
D: 18-04-15 14:12:08    597:Malamute                         :         $ send CONNECTION_PONG
D: 18-04-15 14:12:08    597:Malamute                         :         > connected
D: 18-04-15 14:12:08    597:Malamute                         : connected:
D: 18-04-15 14:12:08    597:Malamute                         :     SERVICE_SEND
D: 18-04-15 14:12:08    597:Malamute                         :         $ write message to service
D: 18-04-15 14:12:08    597:Malamute                         :         > connected
D: 18-04-15 14:12:09    597:Malamute                         : connected:
D: 18-04-15 14:12:09    597:Malamute                         :     CONNECTION_PING
D: 18-04-15 14:12:09    597:Malamute                         :         $ send CONNECTION_PONG
D: 18-04-15 14:12:09    597:Malamute                         :         > connected
D: 18-04-15 14:12:09    597:Malamute                         : connected:
D: 18-04-15 14:12:09    597:Malamute                         :     SERVICE_SEND
D: 18-04-15 14:12:09    597:Malamute                         :         $ write message to service
D: 18-04-15 14:12:09    597:Malamute                         :         > connected
D: 18-04-15 14:12:09    598:Malamute                         : start:
D: 18-04-15 14:12:09    598:Malamute                         :     CONNECTION_OPEN
D: 18-04-15 14:12:09    598:Malamute                         :         $ register new client
I: 18-04-15 14:12:09 client 598 address='service' - registering
D: 18-04-15 14:12:09    598:Malamute                         :         $ send OK
D: 18-04-15 14:12:09    598:Malamute                         :         $ check for mailbox messages
D: 18-04-15 14:12:09    598:Malamute                         :         > connected
D: 18-04-15 14:12:09    598:Malamute                         : connected:
D: 18-04-15 14:12:09    598:Malamute                         :     SERVICE_OFFER
D: 18-04-15 14:12:09    598:Malamute                         :         $ store service offer
D: 18-04-15 14:12:09    598:Malamute                         :         $ send OK
D: 18-04-15 14:12:09    598:Malamute                         :         $ dispatch the service
D: 18-04-15 14:12:09    598:Malamute                         : connected:
D: 18-04-15 14:12:09    598:Malamute                         :     service_message
D: 18-04-15 14:12:09    598:Malamute                         :         $ get message to deliver
D: 18-04-15 14:12:09    598:Malamute                         :         $ send SERVICE_DELIVER
D: 18-04-15 14:12:09    598:Malamute                         :         > connected
lt-malamute: src/zmsg.c:213: zmsg_content_size: Assertion `self' failed.
Aborted (core dumped)
isra17 commented 6 years ago

Sorry for the lots of comments, first time diffing into zmq/malamute sources, trying to pin point the issue as I would love to run this in production with Python :| Looking at the msg triggering the assert, it looks like a use after free issue:

$14 = {
  sender = 0x7fffe4010420 "",
  address = 0x7fffe4010520 "\260\v\001\344\377\177",
  subject = 0x7fffe400e820 " \005\001\344\377\177",
  tracker = 0x7fffe400df80 <incomplete sequence \350>,
  expiry = 1523816818792,
  content = 0x0,
  refcount = 0x0
}
isra17 commented 6 years ago

Valgrind report confirming the issue:

==19750== Thread 5:
==19750== Invalid read of size 8
==19750==    at 0x4E60CBC: mlm_msg_content (mlm_msg.c:106)
==19750==    by 0x4E641E7: s_dequeue (mlm_msgq.c:124)
==19750==    by 0x4E6428D: mlm_msgq_dequeue_cursor (mlm_msgq.c:144)
==19750==    by 0x4E51943: s_service_dispatch (mlm_server.c:299)
==19750==    by 0x4E52527: dispatch_the_service (mlm_server.c:623)
==19750==    by 0x4E4EDAB: s_client_execute (mlm_server_engine.inc:862)
==19750==    by 0x4E50F37: s_server_handle_protocol (mlm_server_engine.inc:1564)
==19750==    by 0x50A4E64: zloop_start (in /usr/lib/libczmq.so.4.1.1)
==19750==    by 0x4E510E5: mlm_server (mlm_server_engine.inc:1611)
==19750==    by 0x5081376: ??? (in /usr/lib/libczmq.so.4.1.1)
==19750==    by 0x668E08B: start_thread (in /usr/lib/libpthread-2.26.so)
==19750==    by 0x5678E7E: clone (in /usr/lib/libc-2.26.so)
==19750==  Address 0x7b61408 is 40 bytes inside a block of size 56 free'd
==19750==    at 0x4C2E10B: free (vg_replace_malloc.c:530)
==19750==    by 0x4E60C7D: mlm_msg_destroy (mlm_msg.c:74)
==19750==    by 0x4E60E71: mlm_msg_unlink (mlm_msg.c:153)
==19750==    by 0x4E525F5: get_message_to_deliver (mlm_server.c:668)
==19750==    by 0x4E4F34F: s_client_execute (mlm_server_engine.inc:977)
==19750==    by 0x4E4D580: engine_send_event (mlm_server_engine.inc:278)
==19750==    by 0x4E51892: s_service_dispatch_message (mlm_server.c:281)
==19750==    by 0x4E5192F: s_service_dispatch (mlm_server.c:298)
==19750==    by 0x4E52527: dispatch_the_service (mlm_server.c:623)
==19750==    by 0x4E4EDAB: s_client_execute (mlm_server_engine.inc:862)
==19750==    by 0x4E50F37: s_server_handle_protocol (mlm_server_engine.inc:1564)
==19750==    by 0x50A4E64: zloop_start (in /usr/lib/libczmq.so.4.1.1)
==19750==  Block was alloc'd at
==19750==    at 0x4C2EEF5: calloc (vg_replace_malloc.c:711)
==19750==    by 0x4E60A0A: safe_malloc (czmq_prelude.h:635)
==19750==    by 0x4E60AD6: mlm_msg_new (mlm_msg.c:44)
==19750==    by 0x4E52376: write_message_to_service (mlm_server.c:584)
==19750==    by 0x4E4EC6D: s_client_execute (mlm_server_engine.inc:838)
==19750==    by 0x4E50F37: s_server_handle_protocol (mlm_server_engine.inc:1564)
==19750==    by 0x50A4E64: zloop_start (in /usr/lib/libczmq.so.4.1.1)
==19750==    by 0x4E510E5: mlm_server (mlm_server_engine.inc:1611)
==19750==    by 0x5081376: ??? (in /usr/lib/libczmq.so.4.1.1)
==19750==    by 0x668E08B: start_thread (in /usr/lib/libpthread-2.26.so)
==19750==    by 0x5678E7E: clone (in /usr/lib/libc-2.26.so)
==19750==