zeromq / libzmq

ZeroMQ core engine in C++, implements ZMTP/3.1
https://www.zeromq.org
Mozilla Public License 2.0

High latency when messages are not being sent repeatedly #4673

Open plied opened 3 months ago

plied commented 3 months ago

Issue description

While building an ultra-low-latency application that uses ZeroMQ over IPC, I noticed that even though the benchmarks run with perf/local_lat and perf/remote_lat achieve sub-50us latencies on the same host, these results are not reproducible in production. After hours of research I found that if there is any pause between calls to zmq_sendmsg, the following message is sent with a very high latency (>200us).

This means that if the application sends messages back to back it behaves well; however, if there are breaks between messages (which is the more realistic use case), the latency of the next message increases dramatically.

This issue was partially identified in #3577 and #3560, but neither came up with a reliable way to reproduce it, so it was never solved.

I wonder if this could be caused by the sender thread losing priority when zmq_sendmsg is not called for some time. Note that I'm calling the deprecated zmq_sendmsg function only because the original perf/remote_lat also calls it.

I was able to replicate the same issue with the Python client, the Rust client and C++ itself.

Environment

Minimal test code / Steps to reproduce the issue

  1. Slightly modify the perf/remote_lat.cpp file so that it measures the roundtrip time of each message independently, and adds a delay after each roundtrip is finished and measured:
    
```cpp
/* SPDX-License-Identifier: MPL-2.0 */

#include "../include/zmq.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <thread>  // Added this
#include <chrono>  // Added this

int main (int argc, char *argv[])
{
    const char *connect_to;
    int roundtrip_count;
    size_t message_size;
    void *ctx;
    void *s;
    int rc;
    int i;
    zmq_msg_t msg;
    void *watch;
    unsigned long elapsed = 0;  // Added this
    double latency;

    if (argc != 4) {
        printf ("usage: remote_lat <connect-to> <message-size> "
                "<roundtrip-count>\n");
        return 1;
    }
    connect_to = argv[1];
    message_size = atoi (argv[2]);
    roundtrip_count = atoi (argv[3]);

    ctx = zmq_init (1);
    if (!ctx) {
        printf ("error in zmq_init: %s\n", zmq_strerror (errno));
        return -1;
    }

    s = zmq_socket (ctx, ZMQ_REQ);
    if (!s) {
        printf ("error in zmq_socket: %s\n", zmq_strerror (errno));
        return -1;
    }

    rc = zmq_connect (s, connect_to);
    if (rc != 0) {
        printf ("error in zmq_connect: %s\n", zmq_strerror (errno));
        return -1;
    }

    rc = zmq_msg_init_size (&msg, message_size);
    if (rc != 0) {
        printf ("error in zmq_msg_init_size: %s\n", zmq_strerror (errno));
        return -1;
    }
    memset (zmq_msg_data (&msg), 0, message_size);

    for (i = 0; i != roundtrip_count; i++) {
        watch = zmq_stopwatch_start ();  // Moved the time measurement into the loop

        rc = zmq_sendmsg (s, &msg, 0);
        if (rc < 0) {
            printf ("error in zmq_sendmsg: %s\n", zmq_strerror (errno));
            return -1;
        }
        rc = zmq_recvmsg (s, &msg, 0);
        if (rc < 0) {
            printf ("error in zmq_recvmsg: %s\n", zmq_strerror (errno));
            return -1;
        }
        if (zmq_msg_size (&msg) != message_size) {
            printf ("message of incorrect size received\n");
            return -1;
        }

        elapsed += zmq_stopwatch_stop (watch);  // Measures time at the end of each roundtrip

        std::this_thread::sleep_for (std::chrono::milliseconds (1));  // Sleeps for 1 ms after each roundtrip
    }

    rc = zmq_msg_close (&msg);
    if (rc != 0) {
        printf ("error in zmq_msg_close: %s\n", zmq_strerror (errno));
        return -1;
    }

    latency = (double) elapsed / (roundtrip_count * 2);

    printf ("message size: %d [B]\n", (int) message_size);
    printf ("roundtrip count: %d\n", (int) roundtrip_count);
    printf ("average latency: %.3f [us]\n", (double) latency);

    rc = zmq_close (s);
    if (rc != 0) {
        printf ("error in zmq_close: %s\n", zmq_strerror (errno));
        return -1;
    }

    rc = zmq_ctx_term (ctx);
    if (rc != 0) {
        printf ("error in zmq_ctx_term: %s\n", zmq_strerror (errno));
        return -1;
    }

    return 0;
}
```

2. Build the first stage of the Dockerfile only (to keep the perf folder):
```Dockerfile
FROM debian:buster-slim AS builder
LABEL maintainer="ZeroMQ Project <zeromq@imatix.com>"
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update -qq \
    && apt-get install -qq --yes --no-install-recommends \
        autoconf \
        automake \
        build-essential \
        git \
        libkrb5-dev \
        libsodium-dev \
        libtool \
        pkg-config \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /opt/libzmq
COPY . .
RUN ./autogen.sh \
    && ./configure --prefix=/usr/local --with-libsodium --with-libgssapi_krb5 \
    && make \
    && make check \
    && make install
```

  3. Run the resulting image, open a shell inside it, and run the performance test commands:
    ./perf/local_lat ipc:///zmq.sock 100 10000

    and

    ./perf/remote_lat ipc:///zmq.sock 100 10000

What's the actual result? (include assertion message & call stack if applicable)

We keep getting very high latencies:

message size: 100 [B]
roundtrip count: 10000
average latency: 277.231 [us]

What's the expected result?

The original code, which is identical except that there is no sleep between messages, produced average latencies below 50us every single time. I would expect latencies to behave similarly whether messages are sent frequently or not; otherwise the entire latency benchmark is misleading and useless.

Lounarok commented 2 months ago

Hi @plied, just came across this issue. Have you tried ZMQ_AFFINITY (a socket option set with zmq_setsockopt()) and playing with the ZMQ_THREAD_SCHED_POLICY and ZMQ_THREAD_PRIORITY context options (zmq_ctx_set())? There are several options there for binding a socket to a specific I/O thread with a defined priority.

Btw, zmq_init() is deprecated in favour of zmq_ctx_new().
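
For reference, here is a minimal sketch (not from the original report) of how those options could be wired up, assuming libzmq >= 4.1. The endpoint and values are only illustrative; note that the two context options must be set before the first socket is created, since they are applied when the I/O threads are spawned:

```cpp
#include <zmq.h>
#include <sched.h>   // SCHED_FIFO
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

int main ()
{
    void *ctx = zmq_ctx_new ();

    //  Context options: apply to the I/O thread pool; real-time priorities
    //  generally require root or CAP_SYS_NICE.
    zmq_ctx_set (ctx, ZMQ_THREAD_SCHED_POLICY, SCHED_FIFO);
    zmq_ctx_set (ctx, ZMQ_THREAD_PRIORITY, 99);

    void *s = zmq_socket (ctx, ZMQ_REQ);

    //  Socket option: bitmask of I/O threads this socket's connections may use
    //  (lowest bit = I/O thread 0).
    uint64_t affinity = 1;
    zmq_setsockopt (s, ZMQ_AFFINITY, &affinity, sizeof affinity);

    if (zmq_connect (s, "ipc:///zmq.sock") != 0) {  // endpoint from the repro above
        printf ("error in zmq_connect: %s\n", zmq_strerror (errno));
        return 1;
    }

    //  ... same send/recv/sleep loop as in the modified remote_lat ...

    zmq_close (s);
    zmq_ctx_term (ctx);
    return 0;
}
```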

plied commented 2 months ago

None of the above seems to make a big difference. I tried the following:

ZMQ_THREAD_SCHED_POLICY=SCHED_FIFO
ZMQ_THREAD_PRIORITY=99
ZMQ_AFFINITY=1

ZMQ_THREAD_SCHED_POLICY does seem to improve the latency, but only minimally; it still does not get anywhere close to the expected latency.

hungpham2511 commented 2 months ago

Came across a similar issue which led me here. Could it be that the processes are put to sleep by the CPU scheduler and hence need to be re-scheduled for execution before the messages can be processed?

This can affect both the local and remote processes IMO.

One way to validate this hypothesis is to give the process high priority and use a real-time kernel.
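
For what it's worth, a quick way to test that hypothesis on Linux without installing a real-time kernel is to run both benchmark processes under SCHED_FIFO, either with `chrt -f 99 ./perf/remote_lat ...` or programmatically. A minimal sketch (hypothetical helper, not part of libzmq):

```cpp
#include <sched.h>
#include <stdio.h>

//  Put the calling process under SCHED_FIFO with the given priority so the
//  scheduler does not deprioritise it while it sleeps between messages.
//  Requires root or CAP_SYS_NICE.
static int make_realtime (int priority)
{
    struct sched_param param;
    param.sched_priority = priority;  // 1..99 for SCHED_FIFO on Linux
    return sched_setscheduler (0, SCHED_FIFO, &param);
}

int main ()
{
    if (make_realtime (99) != 0) {
        perror ("sched_setscheduler");
        return 1;
    }
    printf ("running under SCHED_FIFO\n");
    //  ... run the same send/recv/sleep loop as in the modified remote_lat ...
    return 0;
}
```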

plied commented 2 months ago

It could be something like that; however, I did achieve very good and consistent performance using Aeron on the same OS and kernel without tweaking any scheduler configs. My hunch is that something within ZeroMQ itself is misconfigured, but I can't find out what.