zeromq / libzmq

ZeroMQ core engine in C++, implements ZMTP/3.1
https://www.zeromq.org
Mozilla Public License 2.0

Problem: msg_t ownership inconsistent, with some leaks #2481

Closed evoskuil closed 7 years ago

evoskuil commented 7 years ago

Given that there is no msg_t destructor, there are only two options for resource release: either the owner (creator) of the message must free it based on the result code, or the code that generates and/or propagates the error must handle it.

A casual look at the code left me with the impression that the intent is for the owner to clean up in the case of a send failure (but the sender in the case of success), and for the reader to clean up in the case of a read failure (but the owner in the case of success). It will take a little time to pin it down, since there are 52 message constructions, passed through various deep call stacks. But the first path I traced led to this:

bool zmq::pipe_t::read (msg_t *msg_)
{
    if (unlikely (!in_active))
        return false;
    if (unlikely (state != active && state != waiting_for_delimiter))
        return false;

read_message:
    if (!inpipe->read (msg_)) {
        in_active = false;
        return false;
    }

    //  If this is a credential, save a copy and receive next message.
    if (unlikely (msg_->is_credential ())) {
        const unsigned char *data = static_cast <const unsigned char *> (msg_->data ());
        credential = blob_t (data, msg_->size ());
        const int rc = msg_->close ();
        zmq_assert (rc == 0);
        goto read_message;
    }

    //  If delimiter was read, start termination process of the pipe.
    if (msg_->is_delimiter ()) {
        process_delimiter ();
        return false;
    }

    if (!(msg_->flags () & msg_t::more) && !msg_->is_identity ())
        msgs_read++;

    if (lwm > 0 && msgs_read % lwm == 0)
        send_activate_write (peer, msgs_read);

    return true;
}

Notice that this section returns false with a message that has been read:

    //  If delimiter was read, start termination process of the pipe.
    if (msg_->is_delimiter ()) {
        process_delimiter ();
        return false;
    }

Popping back up the stack we are here:

int zmq::session_base_t::read_zap_msg (msg_t *msg_)
{
    if (zap_pipe == NULL) {
        errno = ENOTCONN;
        return -1;
    }

    if (!zap_pipe->read (msg_)) {
        errno = EAGAIN;
        return -1;
    }

    return 0;
}

Notice that this read failure results in the return of a -1 (failure) code. Popping up the stack again, we are here:

int zmq::curve_server_t::receive_and_process_zap_reply ()
{
    int rc = 0;
    msg_t msg [7];  //  ZAP reply consists of 7 frames

    //  Initialize all reply frames
    for (int i = 0; i < 7; i++) {
        rc = msg [i].init ();
        errno_assert (rc == 0);
    }

    for (int i = 0; i < 7; i++) {
        rc = session->read_zap_msg (&msg [i]);
        if (rc == -1)
            break;
        if ((msg [i].flags () & msg_t::more) == (i < 6? 0: msg_t::more)) {
            //  Temporary support for security debugging
            ////puts ("CURVE I: ZAP handler sent incomplete reply message");
            errno = EPROTO;
            rc = -1;
            break;
        }
    }

    if (rc != 0)
        goto error;

[...]

error:
    for (int i = 0; i < 7; i++) {
        const int rc2 = msg [i].close ();
        errno_assert (rc2 == 0);
    }

    return rc;
}

So in this case a read call failure requires the owner to close the message that was reported as not read. Yet in the same code path a failure of the read in this line:

    if (!inpipe->read (msg_)) {

does not require that the message be closed by the caller. This implies that it must always be safe to close a message whether or not it has already been closed, though the zmq_msg_close documentation is unclear on this point.

The documentation also states that a close is not required after a successful zmq_msg_send, but it is silent about whether one must be called after a failed zmq_msg_recv (it appears that it must be), a failed send, or a successful receive.
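
For illustration, the pattern in question at the public API level (a sketch; socket stands for any ZMQ socket, and whether the close on the failure path is required is exactly what the documentation leaves open):

zmq_msg_t msg;
int rc = zmq_msg_init (&msg);
assert (rc == 0);

rc = zmq_msg_recv (&msg, socket, 0);
if (rc == -1) {
    //  Unclear from the docs whether this close is required; it is at
    //  least safe, since the message is still in an initialized state.
    zmq_msg_close (&msg);
    return -1;
}

//  ... use the message ...
rc = zmq_msg_close (&msg);
assert (rc == 0);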

In another section the behavior appears to leak on certain read failures:

void zmq::pgm_receiver_t::drop_subscriptions ()
{
    msg_t msg;
    msg.init ();
    while (session->pull_msg (&msg) == 0)
        msg.close ();
}
...
int zmq::session_base_t::pull_msg (msg_t *msg_)
{
    if (!pipe || !pipe->read (msg_)) {
        errno = EAGAIN;
        return -1;
    }

    incomplete_in = msg_->flags () & msg_t::more ? true : false;

    return 0;
}

Notice that, unlike in the previous example, the same pipe->read (msg_) failure in this case does not lead to the caller performing any cleanup. Now maybe certain types of read failures don't require the caller to clean up while others do, so it's OK (i.e. not a leak) for the reader to successfully read, not close the message, and yet return a failure (such as for a delimiter message). But that's not obvious (or commented) in the code.

IMO this C-style resource management (as opposed to RAII) makes the code fragile. By that I mean it's very hard to spot errors, as resource management is a side concern spread across a large amount of code. Are there tests designed to detect leaks in failure conditions, or only in the case of success?

I did go through each code path that allocates msg_t. This one appeared to be a potential problem.

int zmq::proxy (
    class socket_base_t *frontend_,
    class socket_base_t *backend_,
    class socket_base_t *capture_,
    class socket_base_t *control_)
{
    msg_t msg;
    int rc = msg.init ();
    if (rc != 0)
        return -1;

    //  The algorithm below assumes ratio of requests and replies processed
    //  under full load to be 1:1.

    int more;
    size_t moresz;
    zmq_pollitem_t items [] = {
        { frontend_, 0, ZMQ_POLLIN, 0 },
        { backend_, 0, ZMQ_POLLIN, 0 },
        { control_, 0, ZMQ_POLLIN, 0 }
    };
    int qt_poll_items = (control_ ? 3 : 2);
    zmq_pollitem_t itemsout [] = {
        { frontend_, 0, ZMQ_POLLOUT, 0 },
        { backend_, 0, ZMQ_POLLOUT, 0 }
    };

    //  Proxy can be in these three states
    enum {
        active,
        paused,
        terminated
    } state = active;

    while (state != terminated) {
        //  Wait while there are either requests or replies to process.
        rc = zmq_poll (&items [0], qt_poll_items, -1);
        if (unlikely (rc < 0))
            return -1;

        //  Get the pollout separately because when combining this with pollin it maxes the CPU
        //  because pollout shall most of the time return directly.
        //  POLLOUT is only checked when frontend and backend sockets are not the same.
        if (frontend_ != backend_) {
            rc = zmq_poll (&itemsout [0], 2, 0);
            if (unlikely (rc < 0)) {
                return -1;
            }
        }

        //  Process a control command if any
        if (control_ && items [2].revents & ZMQ_POLLIN) {
            rc = control_->recv (&msg, 0);
            if (unlikely (rc < 0))
                return -1;

            moresz = sizeof more;
            rc = control_->getsockopt (ZMQ_RCVMORE, &more, &moresz);
            if (unlikely (rc < 0) || more)
                return -1;

            //  Copy message to capture socket if any
            rc = capture(capture_, msg);
            if (unlikely (rc < 0))
                return -1;
[...]
        }
        //  Process a request
        if (state == active
        &&  items [0].revents & ZMQ_POLLIN
        &&  (frontend_ == backend_ || itemsout [1].revents & ZMQ_POLLOUT)) {
            rc = forward(frontend_, backend_, capture_,msg);
            if (unlikely (rc < 0))
                return -1;
        }
        //  Process a reply
        if (state == active
        &&  frontend_ != backend_
        &&  items [1].revents & ZMQ_POLLIN
        &&  itemsout [0].revents & ZMQ_POLLOUT) {
            rc = forward(backend_, frontend_, capture_,msg);
            if (unlikely (rc < 0))
                return -1;
        }
    }
    return 0;
}

After a message is read successfully there are four conditions in which it will not be closed upon return.
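
For illustration, one way to patch these early returns (a sketch with a hypothetical helper, not libzmq code):

//  Hypothetical helper: close the message while preserving errno, then
//  return the given error code.
static int close_and_return (zmq::msg_t *msg_, int echo_)
{
    const int err = errno;
    const int rc = msg_->close ();
    errno_assert (rc == 0);
    errno = err;
    return echo_;
}

//  Each early return after the message is initialized then becomes:
rc = control_->recv (&msg, 0);
if (unlikely (rc < 0))
    return close_and_return (&msg, -1);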

evoskuil commented 7 years ago

This method appears to leak a msg_t on send failure:

int zmq::sub_t::xsetsockopt (int option_, const void *optval_,
    size_t optvallen_)
{
    if (option_ != ZMQ_SUBSCRIBE && option_ != ZMQ_UNSUBSCRIBE) {
        errno = EINVAL;
        return -1;
    }

    //  Create the subscription message.
    msg_t msg;
    int rc = msg.init_size (optvallen_ + 1);
    errno_assert (rc == 0);
    unsigned char *data = (unsigned char*) msg.data ();
    if (option_ == ZMQ_SUBSCRIBE)
        *data = 1;
    else
    if (option_ == ZMQ_UNSUBSCRIBE)
        *data = 0;
    //  We explicitly allow a NULL subscription with size zero
    if (optvallen_) {
        assert (optval_);
        memcpy (data + 1, optval_, optvallen_);
    }
    //  Pass it further on in the stack.
    int err = 0;
    rc = xsub_t::xsend (&msg);
    if (rc != 0)
        err = errno;
    int rc2 = msg.close ();
    errno_assert (rc2 == 0);
    if (rc != 0)
        errno = err;
    return rc;
}
evoskuil commented 7 years ago

This method ignores the write failure (without commentary). If the write fails the message presumably leaks.

void zmq::xpub_t::xattach_pipe (pipe_t *pipe_, bool subscribe_to_all_)
{
    zmq_assert (pipe_);
    dist.attach (pipe_);

    //  If subscribe_to_all_ is specified, the caller would like to subscribe
    //  to all data on this pipe, implicitly.
    if (subscribe_to_all_)
        subscriptions.add (NULL, 0, pipe_);

    // if welcome message exist
    if (welcome_msg.size() > 0)
    {
        msg_t copy;
        copy.init();
        copy.copy(welcome_msg);

        pipe_->write(&copy);
        pipe_->flush();
    }

    //  The pipe is active when attached. Let's read the subscriptions from
    //  it, if any.
    xread_activated (pipe_);
}
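
A sketch of one way to handle the failure, assuming pipe_t::write returns false when it does not take ownership of the message:

    // if welcome message exist
    if (welcome_msg.size () > 0)
    {
        msg_t copy;
        copy.init ();
        copy.copy (welcome_msg);

        if (pipe_->write (&copy))
            pipe_->flush ();
        else {
            //  The pipe did not take the message; close it to avoid the leak.
            const int rc = copy.close ();
            errno_assert (rc == 0);
        }
    }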
evoskuil commented 7 years ago

Assuming that not closing a received delimiter message is not a leak, the above three methods are the only ones, out of the 52 msg_t constructions, that I could see causing a leak. Most of the calls that utilize a message are asserted for success, crashing the process on failure. I'm not familiar enough with the code base to say whether those are valid assumptions, although in the case of ZAP message sending they were not.

somdoron commented 7 years ago

Did you experience a leak, or are those assumptions?

Also, in some of your examples the msg was empty inside, so not closing it did not really leak anything (like with the delimiter). msg is always on the stack, never dynamically allocated; only the bytes inside are on the heap. So empty messages should be closed, but don't have to be. Anyway, I will take a deeper look later today.
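
To illustrate the stack-vs-heap point (a conceptual sketch; the field names are illustrative, not libzmq's actual layout):

//  The msg_t object itself always lives on the stack; only large
//  payloads own heap memory that close () must free.
class msg_sketch_t
{
private:
    union {
        //  Very small messages: payload inline, close () frees nothing.
        struct { unsigned char data [29]; unsigned char size; } vsm;
        //  Large messages: payload on the heap, close () must release it.
        struct { void *content; } lmsg;
        //  Delimiters and other control messages: nothing to free.
        struct { unsigned char type; } delimiter;
    } u;
};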


evoskuil commented 7 years ago

If this allocates (or assumes ownership of) memory:

rc = control_->recv (&msg, 0);

then there is a real leak. This is not an assumption but a conclusion.

This is public API:

int zmq_proxy (void *frontend_, void *backend_, void *capture_)
{
    if (!frontend_ || !backend_) {
        errno = EFAULT;
        return -1;
    }
    return zmq::proxy (
        (zmq::socket_base_t*) frontend_,
        (zmq::socket_base_t*) backend_,
        (zmq::socket_base_t*) capture_);
}

The recv call above therefore is not in control of the type of message transferred by the proxy. As such the failure to free the message in an error case after a successful read is a leak.

I speculated above that some messages may not allocate and therefore may not result in actual leaks, but this is horribly fragile. The caller cannot be expected to know whether the message payload is stack or heap allocated.

somdoron commented 7 years ago
rc = control_->recv (&msg, 0);

This doesn't leak memory, because msg is a function-local variable; the next time you call receive, it will close the current msg and create a new one. Receive always closes the msg and creates a new one. Send always returns an empty, initialized msg. With that said, however, zmq_proxy does leak one msg, because when the method exits we don't close the msg. That is a bug.

somdoron commented 7 years ago

I looked at your examples; the code is indeed complicated, but there is no leak.

  1. Msg lives only on the stack and is always passed as a pointer (not copied); only when passing between threads is the content copied and the original msg re-initialized.

     Copied here, notice the * when calling write: https://github.com/zeromq/libzmq/blob/master/src/pipe.cpp#L228

     Initialized here: https://github.com/zeromq/libzmq/blob/master/src/lb.cpp#L146

  2. Receive expects to get an initialized msg structure (https://github.com/zeromq/libzmq/blob/master/src/socket_base.cpp#L1178).

  3. Receive will call close on the initialized msg before accepting a new msg: https://github.com/zeromq/libzmq/blob/master/src/fq.cpp#L91

  4. On success, receive returns the msg.

  5. On failure, receive returns an initialized but empty msg (no heap, only stack), so you should call close on it, but you don't have to: https://github.com/zeromq/libzmq/blob/master/src/fq.cpp#L128

  6. When sending, on success you get an initialized empty msg: https://github.com/zeromq/libzmq/blob/master/src/lb.cpp#L146

  7. When sending fails, the msg is not changed and must be closed.

However, you usually won't see msg.close all over the place: because of this pattern, where receive and send close and initialize the msg, you usually see it initialized once and closed once. So in a function it will be initialized at the beginning and closed at the end, and in a class it will be initialized in the ctor and closed in the dtor.
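
Condensed, the contract just described looks like this (a sketch; in and out stand for any pair of socket_base_t pointers, and the loop condition is illustrative):

msg_t msg;
int rc = msg.init ();
errno_assert (rc == 0);

while (true) {
    //  A successful recv closes the previous content before filling msg.
    if (in->recv (&msg, 0) != 0)
        break;  //  on failure msg is left initialized but empty
    //  A successful send leaves msg initialized and empty.
    if (out->send (&msg, 0) != 0)
        break;  //  on send failure msg keeps its content
}

//  One close covers every exit path.
rc = msg.close ();
errno_assert (rc == 0);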

evoskuil commented 7 years ago

On success, receive returns the msg... On failure, receive returns an initialized but empty msg

When a receive is successful a message payload may be allocated. If the message is not subsequently closed, it is a leak. You have not considered this case, which is the issue I've shown above in proxy (at least).

somdoron commented 7 years ago

The msg in the proxy example does get used again, so the next call to receive or send will close it.

But, as I said, proxy does (maybe) leak one msg for the entire process of the proxy: when we exit the proxy we don't close the msg, so the last msg might be leaked. This should be fixed. However, chances are that the msg is empty. Anyway, it is a bug and the msg should be closed when the proxy exits.

evoskuil commented 7 years ago

chances are that the msg is empty

There is no way to predict whether the message will be empty. The caller controls the socket type.

proxy does (maybe) leak one msg

One message leak is a leak. It is not a maybe.

for the entire process of the proxy

The proxy lifetime is not the same as the process lifetime.

the msg in proxy declared and initialized at the beginning of the method:

This is irrelevant, the message is not always closed after a successful read.

somdoron commented 7 years ago

The msg in proxy gets declared and initialized at the beginning of the method:

https://github.com/zeromq/libzmq/blob/master/src/proxy.cpp#L115

We only need to close the msg once, when exiting the method (which we currently don't do). Taking your example: if we closed the msg after the call to control_->recv, we would need to immediately init it again anyway, because send and receive must work with an initialized msg.

So it would look like this:

rc = control_->recv (&msg, 0);
msg.close ();
msg.init ();

Because of this pattern, both calls are unnecessary.

somdoron commented 7 years ago

This is irrelevant, the message is not always closed after a successful read.

But it will get closed in the next call to send or recv

evoskuil commented 7 years ago

Not in the cases where the method exits for another failure, which is the problem I raised (4 possible locations).

somdoron commented 7 years ago

agreed.

following is a fix:

https://github.com/somdoron/libzmq/commit/0530168601e93b74450a2f0c0447d2fae0f1bf33

evoskuil commented 7 years ago

For example, the second return leaks the message content allocation.

            rc = control_->recv (&msg, 0);
            if (unlikely (rc < 0))
                return -1;

            moresz = sizeof more;
            rc = control_->getsockopt (ZMQ_RCVMORE, &more, &moresz);
            if (unlikely (rc < 0) || more)
                return -1;
evoskuil commented 7 years ago

following is a fix...

yes, of course

evoskuil commented 7 years ago

I'm less concerned with how to patch each instance, which is pretty straightforward once isolated, and more concerned with the fragility of the pattern employed. This is C++, we can use destructors.
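
For instance, a minimal RAII wrapper is conceivable (hypothetical, not libzmq code):

//  Hypothetical sketch of an RAII wrapper around msg_t; not libzmq code.
//  The destructor guarantees close () on every exit path, including
//  early error returns.
class scoped_msg_t
{
public:
    scoped_msg_t () { errno_assert (msg.init () == 0); }
    ~scoped_msg_t () { errno_assert (msg.close () == 0); }

    msg_t *get () { return &msg; }

private:
    msg_t msg;

    //  Non-copyable (pre-C++11 idiom, matching the codebase).
    scoped_msg_t (const scoped_msg_t &);
    const scoped_msg_t &operator= (const scoped_msg_t &);
};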

somdoron commented 7 years ago

actually we cannot...

msg is a structure; it is allocated by every binding in other languages, which don't support destructors. So we must depend on the close method. I agree, it is complicated and delicate.

somdoron commented 7 years ago

Another way to put it: because zeromq exposes a C API, msg is not really C++, it is C, and C doesn't support destructors.

evoskuil commented 7 years ago

The vast majority of these message allocations never see the public API.

somdoron commented 7 years ago

Not true. Every time you send a message you are creating the msg structure; take a look at all the bindings, CZMQ for example:

struct _zframe_t {
    uint32_t tag;               //  Object tag for runtime detection
    zmq_msg_t zmsg;             //  zmq_msg_t blob for frame
    int more;                   //  More flag, from last read
    uint32_t routing_id;        //  Routing ID back to sender, if any
};

https://github.com/zeromq/czmq/blob/master/src/zframe.c#L37

It is declared on the stack. zmq_msg_t is actually not even a typedef for class msg_t; it is a char array of the same size as msg_t (don't ask me why...), so there is not even a constructor.

The correct way to look at msg is as a C structure, not even C++. That is why it doesn't use any ctors or dtors.

evoskuil commented 7 years ago

I don't follow this response. It doesn't seem related to my observation.

somdoron commented 7 years ago

We cannot use a ctor and dtor, as msg_t is exposed by a C API, which doesn't support those. Other bindings, like the C, Python, and Java ones, allocate msg_t themselves, which means no constructor or destructor is being used; they cannot call C++ code, only C. This is why msg uses close and init.

evoskuil commented 7 years ago

Yes, I got that part. My point is that very few uses of the msg_t class are actually exposed to the public API. Most are entirely encapsulated in zmq internals. As such, it is worth considering assigning the public struct to/from a safe internal message object. Just because there is a C API does not mean the C++ code has to work like C code.

I realize this is a stylistic issue within zmq, and I'm ok with it. But one should be aware of the risks of C style coding. The responsibilities and scope for error handling get spread out broadly. Just a casual walk through the code turned up a number of such issues for me today.

somdoron commented 7 years ago

The problem is that all over libzmq we only have a reference to the msg; almost no place allocates it. Bindings are the ones that allocate, so a ctor and dtor won't save us from bugs.

In the proxy case, yes, it would help, and maybe in some other cases, but they are not in the critical path of zeromq.

Also, the mechanism of when to call close and when to call init is complicated, not only when going out of scope. I'm not sure it is possible to design such a class. But maybe.

evoskuil commented 7 years ago

This is really not the case. There are 43 calls to msg_t.close() in zmq, and most of them are invoked as the message object goes out of scope. The rest are called in the case of a loop over the message, or as the result of an error prior to a return. In each case these would be unnecessary in a typical class-oriented design. At best, a small number of these calls accelerate deallocation by a tiny amount of time. Freeing memory is the most typical use case for an object destructor, and there is nothing at all special happening here.

Loss of scope is where object-oriented approaches are most powerful, since function scope becomes irrelevant to the lifetime of a resource (as class scope takes over the resource). Only in the case of crossing the C API boundary is there a distinction.

I realize this approach infects the entire library and would be hard to change.

somdoron commented 7 years ago

In the core of the library, msg_t is almost always passed by reference; the cases where msg_t is not by reference are very few, and in those you actually don't want to call close (proxy is not the core of the library; I'm talking about the send & recv flow). Almost everywhere else it is by reference. Do you want to convert those to the stack? All of them? Part of them?

Also take a look at the following:

https://github.com/zeromq/libzmq/blob/master/src/fq.cpp#L88

The msg is closed at the beginning of the scope and initialized at the end; you cannot use a class here, you must use a reference.
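
Condensed, the pattern being referenced looks like this (a paraphrase; the real function in src/fq.cpp carries more logic):

int zmq::fq_t::recvpipe (msg_t *msg_, pipe_t **pipe_)
{
    //  Deallocate the old content of the message up front.
    int rc = msg_->close ();
    errno_assert (rc == 0);

    //  ... try the active pipes; on success return the fetched message ...

    //  No message available: hand back a valid, empty msg so the caller
    //  still owns something that is safe to close.
    rc = msg_->init ();
    errno_assert (rc == 0);
    errno = EAGAIN;
    return -1;
}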

I'm not sure where you want to have a stack msg (which will invoke the dtor); at what point? In the core path, there is none. For the non-core paths, like proxy, we can have some wrapper.

evoskuil commented 7 years ago

This is not a question of reference vs. pointer or stack vs. heap allocation, and there is nothing about a class that prevents reassignment of resources.

evoskuil commented 7 years ago

This is getting a little far off course for me. I'm not advocating rewriting the message object. I was merely pointing out that its use is fragile as a matter of design. I spent the day scanning the code for problems that I anticipated would arise from such a design, and found some. It's very hard to tell whether there are more. So it's a caution. Also, as I remarked, documentation regarding resource responsibility could be clearer.

somdoron commented 7 years ago

I agree that this is fragile. I don't think we can use a ctor and dtor to solve it. You are welcome to send a pull request.

evoskuil commented 7 years ago

I will issue a PR to patch up whatever actual issues exist above. When I get time to spend on some major zeromq work it'll probably be the UTF-8 Everywhere implementation that I promised @hintjens.

evoskuil commented 7 years ago

Specific issues above resolved by https://github.com/zeromq/libzmq/pull/2486

evoskuil commented 7 years ago

@somdoron I did a little checking into this question of the lack of msg_t destruction. This class is private to the implementation (whether it is a struct or a class is not the relevant issue, since they are structurally the same). The reason it is constrained is that it is cast to and from a zmq_msg_t instance. This is done to avoid copying costs.

AFAICT there is nothing that prevents this technique from working in the presence of a destructor on msg_t. All that would be required is a change to the alignment of zmq_msg_t (in its definition). This may add a small amount of platform-dependent alignment padding (such as 4-8 bytes).

typedef struct zmq_msg_t {
#if defined (__GNUC__) || defined ( __INTEL_COMPILER) || \
        (defined (__SUNPRO_C) && __SUNPRO_C >= 0x590) || \
        (defined (__SUNPRO_CC) && __SUNPRO_CC >= 0x590)
    unsigned char _ [64] __attribute__ ((aligned (sizeof (void *))));
#elif defined (_MSC_VER) && (defined (_M_X64) || defined (_M_ARM64))
    __declspec (align (8)) unsigned char _ [64];
#elif defined (_MSC_VER) && (defined (_M_IX86) || defined (_M_ARM_ARMV7VE))
    __declspec (align (4)) unsigned char _ [64];
#else
    unsigned char _ [64];
#endif
} zmq_msg_t;

I'm not in a rush to do this, but IMO it seems entirely reasonable and would not only make the code safer, it would also make it much smaller and more readable.
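
For reference, the cast technique in question, with the compile-time checks one would want before adding a destructor (a sketch assuming C++11 for static_assert and alignof; libzmq itself uses C-style casts):

static_assert (sizeof (zmq_msg_t) >= sizeof (zmq::msg_t),
    "zmq_msg_t buffer too small to hold msg_t");
static_assert (alignof (zmq_msg_t) >= alignof (zmq::msg_t),
    "zmq_msg_t under-aligned for msg_t");

int zmq_msg_init (zmq_msg_t *msg_)
{
    //  The public struct is raw storage; the C++ object lives inside it.
    return (reinterpret_cast<zmq::msg_t *> (msg_))->init ();
}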

bluca commented 7 years ago

Please please please be extremely careful with that... last time it was changed, it took weeks of work to get it right without breaking the ABI.

evoskuil commented 7 years ago

Of course. Any change to the zmq_msg_t structure requires validation on all supported platforms/compilers, since it's fundamentally a set of compiler implementation assumptions (i.e. a scary hack).