zeromq / libzmq

ZeroMQ core engine in C++, implements ZMTP/3.1
https://www.zeromq.org
Mozilla Public License 2.0

Occasional crash in `KERNELBASE.dll` #4581

Open DiscipleOfEris opened 11 months ago

DiscipleOfEris commented 11 months ago

Issue description

I am currently using CPPZMQ version 4.8.1 and libzmq 4.3.4. The folks at CPPZMQ suggested this would be a better support channel.

If there is a better support channel, please let me know.

My application is running on Windows Server 2016. The latest Windows updates and drivers are installed. I have run System File Checker and it reports no issues.

My application occasionally crashes when attempting to read from the socket. The error message (posted as a screenshot, not reproduced here) appears when calling `int zmq_msg_recv(zmq_msg_t *msg_, void *s_, int flags_)`.

`flags_` is `zmq::recv_flags::none`.

Call stack: (posted as a screenshot, not reproduced here)

Last source code in the call stack I have (I highlighted the offending line with >>>):

ZMQ_NODISCARD
recv_result_t recv(message_t &msg, recv_flags flags = recv_flags::none)
{
>>> const int nbytes =
        zmq_msg_recv(msg.handle(), _handle, static_cast<int>(flags));
    if (nbytes >= 0) {
        assert(msg.size() == static_cast<size_t>(nbytes));
        return static_cast<size_t>(nbytes);
    }
    if (zmq_errno() == EAGAIN)
        return {};
    throw error_t();
}

void listen()
{
    while (true)
    {
        if (!zSocket)
        {
            return;
        }

        try
        {
            chat_message_t message;
>>>         if (!zSocket->recv(message.type, zmq::recv_flags::none))
            {
                send_queue();
                continue;
            }

            int more = zSocket->get(zmq::sockopt::rcvmore);
            if (more)
            {
                std::ignore = zSocket->recv(message.data);
                more        = zSocket->get(zmq::sockopt::rcvmore);
                if (more)
                {
                    std::ignore = zSocket->recv(message.packet);
                }
            }

            parse(message);
        }
        catch (zmq::error_t& e)
        {
            // Context was terminated (ETERM, numerically 156384765)
            // Exit loop
            if (!zSocket || e.num() == ETERM)
            {
                return;
            }
            ShowError("Message: %s\n", e.what());
            continue;
        }
    }
}
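
For readers wondering where 156384765 in the catch block comes from: as far as I can tell from `zmq.h`, libzmq defines its own error codes as offsets from a base value `ZMQ_HAUSNUMERO` (156384712), and `ETERM` is base + 53. The sketch below only mirrors that arithmetic so it can be checked without the library. The names `kZmqHausnumero`, `kEterm` and `is_context_terminated` are made up for illustration; real code should include `<zmq.h>` and use the `ETERM` macro.

```cpp
#include <cassert>

// Values mirrored from zmq.h for illustration only (assumption: they match
// the header shipped with your libzmq version; prefer the ETERM macro itself).
constexpr int kZmqHausnumero = 156384712;   // ZMQ_HAUSNUMERO
constexpr int kEterm = kZmqHausnumero + 53; // ETERM

// Compile-time check that the magic number in the post is ETERM.
static_assert(kEterm == 156384765, "magic number should equal ETERM");

// Hypothetical helper: true when an error number means "context terminated".
constexpr bool is_context_terminated(int err)
{
    return err == kEterm;
}
```

With `<zmq.h>` included, the comparison in the catch block can simply be written against `ETERM`, which avoids carrying the number around.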

Only one thread in the process creates and interacts with the zSocket. However, there is a companion process (an entirely separate application) that also has its own zSocket.
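
Since ZeroMQ sockets are not meant to be shared between threads, one cheap way to double-check the single-thread claim above in a release build is a small ownership guard. This is only a sketch: `ThreadOwnerCheck` is a hypothetical helper, not part of cppzmq or libzmq, and it assumes the guard is constructed on the same thread that creates the socket.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: remembers which thread constructed it and reports
// whether later calls come from that same thread.
class ThreadOwnerCheck
{
public:
    ThreadOwnerCheck() : owner_(std::this_thread::get_id()) {}

    // True when the calling thread is the one that constructed this object.
    bool called_from_owner() const
    {
        return std::this_thread::get_id() == owner_;
    }

    // Debug-build variant: aborts on a cross-thread call.
    void assert_owner() const { assert(called_from_owner()); }

private:
    const std::thread::id owner_;
};
```

Calling `called_from_owner()` before every `recv`, `send` and `get` on the socket, and logging when it returns false, would show whether some other code path (a timer, a shutdown handler) ever touches the socket from a different thread.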

I'm not sure what steps I can take to tackle this problem. Our server has 20-70 simultaneous users (only three sessions are permitted from the same IP address). As far as I can tell, ZMQ doesn't provide a way to track the IP address behind individual sockets, and because the crash happens while reading from the socket, I can't inspect the message that might have caused it. I'm not sure what else I can do to track or log this issue so that I can find a pattern or a culprit, and we haven't noticed a pattern in user behavior that might be triggering it.
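
One possible way to build up a pattern even though the crashing message itself can't be inspected is to keep a small in-memory ring buffer describing the last few messages that were received successfully, and write it to the log periodically or look for it in a crash dump. The sketch below is only that buffer; `RecvRecord` and `RecentMessages` are hypothetical names. The `peer` field is an assumption: libzmq documents a `Peer-Address` metadata property readable through `zmq_msg_gets`, but whether it is populated for a given transport and version needs to be verified.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of one received multipart message.
struct RecvRecord
{
    std::string peer;        // peer address, if one can be obtained (assumption)
    std::size_t type_size;   // size of the first frame
    std::size_t data_size;   // size of the second frame, 0 if absent
    std::size_t packet_size; // size of the third frame, 0 if absent
};

// Fixed-capacity ring buffer that keeps the N most recent records.
template <std::size_t N>
class RecentMessages
{
    static_assert(N > 0, "capacity must be positive");

public:
    // Overwrites the oldest record once the buffer is full.
    void push(RecvRecord record)
    {
        slots_[next_ % N] = std::move(record);
        ++next_;
    }

    // Returns the stored records, oldest first.
    std::vector<RecvRecord> snapshot() const
    {
        std::vector<RecvRecord> out;
        const std::size_t count = next_ < N ? next_ : N;
        for (std::size_t i = next_ - count; i < next_; ++i)
            out.push_back(slots_[i % N]);
        return out;
    }

private:
    std::array<RecvRecord, N> slots_{};
    std::size_t next_ = 0; // total number of records pushed so far
};
```

In `listen()`, a record could be pushed right after `parse(message)`. The buffer won't contain the message that triggers the crash, but the records leading up to it (sizes, peers, ordering) might still reveal what the crashes have in common.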

We get this crash on average 1-2 times a day. We were affected by this exact same crash about five months ago, but it seemed to resolve itself for about 3 months before it cropped up again early last month. It is highly intermittent, but frequent enough to be severely disruptive. Server chugs along quite happily for ~12 hours and then crashes. There doesn't seem to be any specific pattern to the timing of the crash.

Environment

Minimal test code / Steps to reproduce the issue

The crash is highly intermittent. I have no specific steps for reproducing it, beyond running my live server with its real user base until the crash happens.

I am unable to reproduce the issue in a controlled environment.

Running the live application in debug mode is not a viable option.

What's the actual result? (include assertion message & call stack if applicable)

See above

What's the expected result?

Does not intermittently crash.

I'm looking for advice on how to better approach this issue.

axelriet commented 10 months ago

Not sure this is still current, but you should consider building the library with symbols so the stack trace makes sense.