zeromq / libzmq

ZeroMQ core engine in C++, implements ZMTP/3.1
https://www.zeromq.org
Mozilla Public License 2.0
9.58k stars · 2.34k forks

IPC server memory unlimited grow #4189

Open dkonyshev opened 3 years ago

dkonyshev commented 3 years ago

Hello,

Could someone please explain what is wrong with my server code below? The problem is that the server keeps allocating memory every time a short-lived client connects to its IPC socket, but never seems to release that memory when the clients exit. In my real embedded-system app the server ends up devouring all the system memory and the system crashes.

I narrowed down the problem code to this server code:

#!/usr/bin/python3

import zmq, time

url = 'ipc:///tmp/my'

context = zmq.Context()
sock = context.socket(zmq.PUB)
sock.bind(url)

while True:
    time.sleep(1)

And the client code:

#!/usr/bin/python3

import zmq

url = 'ipc:///tmp/my'

context = zmq.Context()
sock = context.socket(zmq.SUB)
sock.connect(url)
sock.close()
context.destroy()

Then, I run the server in one terminal as follows to keep track of used memory:

python3 zmq-send.py & while :; do cat /proc/$!/statm; sleep 5; done

And the clients are launching in another terminal as follows:

while :; do python3 zmq-rcv.py; done
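The two scripts above can also be folded into one self-contained sketch that churns short-lived SUB sockets against a PUB socket in the same process while sampling the server's data segment. This is an illustrative sketch only (the `URL` path and helper names are my own, and an in-process repro may not exercise the exact same reconnect path as separate processes do); it assumes pyzmq is installed and a Linux /proc filesystem:

```python
import os
import time

try:
    import zmq  # pyzmq; guard so the sketch degrades gracefully without it
except ImportError:
    zmq = None

# Hypothetical unique endpoint so reruns don't collide with a stale socket file
URL = f"ipc:///tmp/my-repro-{os.getpid()}"

def data_pages() -> int:
    """Return the 6th /proc/self/statm field (data+stack segment, in pages)."""
    with open("/proc/self/statm") as f:
        return int(f.read().split()[5])

def churn_subscribers(ctx, n=200):
    """Connect and immediately tear down n SUB pipes, like the client loop."""
    for _ in range(n):
        s = ctx.socket(zmq.SUB)
        s.connect(URL)
        s.close(linger=0)

if zmq is not None and os.path.exists("/proc/self/statm"):
    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)
    pub.bind(URL)
    before = churn_and_report = data_pages()
    churn_subscribers(ctx)
    time.sleep(1)  # give the I/O thread time to process the disconnects
    print("data pages before/after:", before, data_pages())
    pub.close(linger=0)
    ctx.term()
```

If the leak reproduces in-process, the "after" number should be noticeably larger than "before" and keep growing on repeated churns.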

It can be seen in the server memory printouts that its used memory grows very fast but never shrinks.

Any help will be very much appreciated.

Regards, Dmitry

shishirpy commented 3 years ago

The server code does not work as posted. I get an error about a missing os import; if I add that, I get an error on the os.makedirs(dir, exist_ok=True) line in the server.

dkonyshev commented 3 years ago

Sorry about that. Just fixed the code in the comment.

shishirpy commented 3 years ago

I see the same lines printed again and again. Could you post the output of print(sock.getsockopt(zmq.SNDHWM)) placed before the while True loop in the server code?

dkonyshev commented 3 years ago

print(sock.getsockopt(zmq.SNDHWM)) yields 1000.

Here is console output demonstrating the server used memory (6th number in the statm output) growing over time:

$ python3 zmq-send.py & while :; do cat /proc/$!/statm; sleep 5; done
[2] 1335088
2451 211 187 660 0 171 0
1000
27515 4258 2608 660 0 6417 0
27515 4588 2608 660 0 7620 0
27515 4984 2608 660 0 8809 0
27515 5380 2608 660 0 10006 0
27515 5776 2608 660 0 11192 0
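For readers decoding the columns: per the proc(5) man page, a statm line has seven space-separated fields, all in pages: size, resident, shared, text, lib, data, dt. The growing 6th field is the data+stack segment. A small stdlib helper (names are my own) to label them:

```python
# Field names per proc(5): size resident shared text lib data dt (in pages)
STATM_FIELDS = ("size", "resident", "shared", "text", "lib", "data", "dt")

def parse_statm(line: str) -> dict:
    """Map one /proc/<pid>/statm line to named fields (values in pages)."""
    return dict(zip(STATM_FIELDS, (int(v) for v in line.split())))

# Example line taken from the output above:
sample = "27515 4258 2608 660 0 6417 0"
print(parse_statm(sample)["data"])  # prints 6417, the growing 6th field
```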

shishirpy commented 3 years ago

I see similar output when I reduce the sleep duration, but after a point it remains constant. I guess it just takes that long to grow to its true size. I am not sure why it grows so slowly for you; for me it reaches the limit within 1 sec.

(venv)$ python3 server.py & while :; do cat /proc/$!/statm; sleep 0.02; done
[15] 21437
5326 442 338 660 0 2197 0
6153 1811 821 660 0 2977 0
7979 2669 1185 660 0 3446 0
1000
28461 2796 1298 660 0 23928 0
28461 2796 1298 660 0 23928 0
28461 2796 1298 660 0 23928 0
28461 2796 1298 660 0 23928 0

dkonyshev commented 3 years ago

I left the test running over a night and the server used memory still keeps growing:

10005371 3028276 2592 660 0 9988941 0
10005371 3028672 2592 660 0 9990120 0
10005371 3029002 2592 660 0 9991316 0
10005371 3027685 2592 660 0 9992512 0
10005371 3028015 2592 660 0 9993708 0
10005371 3028411 2592 660 0 9994895 0

top says the server takes 73.8% of system memory at the moment.

My system is Ubuntu 20.04:

Current libzmq version is 4.3.2
Current pyzmq version is 18.1.1

shishirpy commented 3 years ago

Not sure what's going on. I am using the following versions:

shishirpy commented 3 years ago

I don't see any memory increase with the versions you mentioned on WSL.

dkonyshev commented 3 years ago

This is really confusing.

Besides Ubuntu 20.04, I observe the same behavior, with the server allocating memory indefinitely, on:

Keynib commented 3 years ago

I tried to reproduce the given problem and caught the memory growth too. I suspect the problem is in the epoll() loop of the zmq I/O thread. If you allocate a context with context = zmq.Context() and don't destroy it, valgrind shows a lot of leaks; this one is of interest:

==220861== 164,000 bytes in 10 blocks are still reachable in loss record 88 of 90
==220861==    at 0x4C31DFB: malloc (vg_replace_malloc.c:309)
==220861==    by 0x40A21A5: allocate_chunk (yqueue.hpp:189)
==220861==    by 0x40A21A5: yqueue_t (yqueue.hpp:68)
==220861==    by 0x40A21A5: zmq::pipe_t::hiccup() (ypipe.hpp:51)
==220861==    by 0x40ADE97: zmq::session_base_t::reconnect() (session_base.cpp:519)
==220861==    by 0x40ADFBF: zmq::session_base_t::engine_error(zmq::stream_engine_t::error_reason_t) (session_base.cpp:437)
==220861==    by 0x40BA003: zmq::stream_engine_t::error(zmq::stream_engine_t::error_reason_t) (stream_engine.cpp:986)
==220861==    by 0x40BC167: zmq::stream_engine_t::in_event() (stream_engine.cpp:321)
==220861==    by 0x409114B: zmq::epoll_t::loop() [clone .part.11] (epoll.cpp:198)
==220861==    by 0x40C250C: thread_routine (thread.cpp:182)
==220861==    by 0x504E6DA: start_thread (pthread_create.c:463)
==220861==    by 0x5EAD71E: clone (clone.S:95)

shishirpy commented 3 years ago

Do you see the same results if you do not run the zmq-rcv.py code?

giampaolo commented 2 years ago

> I tried to reproduce the given problem and caught the memory growth too. I suspect the problem is in the epoll() loop of the zmq I/O thread.

FWIW, this is the strace -p PID output when a new connection is accepted.

epoll_wait(8, [{EPOLLIN, {u32=872418144, u64=140356108684128}}], 256, -1) = 1
accept4(10, NULL, NULL, SOCK_CLOEXEC)   = 11
getpeername(11, {sa_family=AF_UNIX}, [128->2]) = 0
getsockname(11, {sa_family=AF_UNIX, sun_path="/tmp/my"}, [128->10]) = 0
getpeername(11, {sa_family=AF_UNIX}, [128->2]) = 0
getsockopt(11, SOL_SOCKET, SO_PEERCRED, {pid=95371, uid=1000, gid=1000}, [12]) = 0
fcntl(11, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(11, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
write(7, "\1\0\0\0\0\0\0\0", 8)         = 8
epoll_wait(8, [{EPOLLIN, {u32=30834208, u64=30834208}}], 256, -1) = 1
poll([{fd=7, events=POLLIN}], 1, 0)     = 1 ([{fd=7, revents=POLLIN}])
read(7, "\1\0\0\0\0\0\0\0", 8)          = 8
epoll_ctl(8, EPOLL_CTL_ADD, 11, {0, {u32=872421776, u64=140356108687760}}) = 0
epoll_ctl(8, EPOLL_CTL_MOD, 11, {EPOLLIN, {u32=872421776, u64=140356108687760}}) = 0
epoll_ctl(8, EPOLL_CTL_MOD, 11, {EPOLLIN|EPOLLOUT, {u32=872421776, u64=140356108687760}}) = 0
recvfrom(11, "\377\0\0\0\0\0\0\0\1\177", 12, 0, NULL, NULL) = 10
recvfrom(11, "", 2, 0, NULL, NULL)      = 0
epoll_ctl(8, EPOLL_CTL_DEL, 11, 0x7fa734001994) = 0
close(11)                               = 0
poll([{fd=7, events=POLLIN}], 1, 0)     = 0 (Timeout)
epoll_wait(8, 

jimklimov commented 2 years ago

Couldn't find a link quickly, but this sounds like another discussion from a year or two ago about, IIRC, the zero-copy mechanism used to speed up communications. Effectively, many network packets come into one big buffer, pointers into that buffer are handed out as zmq protocol message contents, and the big buffer is only released when no references to it remain alive.

In that other thread, people saw large memory consumption because (I think) they did not process and free the messages quickly.

The older non-zero-copy path can be enabled somehow; it is less stressful on memory, at the CPU cost of copying each message into its own buffer.
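On "can be enabled somehow": libzmq >= 4.3 exposes a draft context option, ZMQ_ZERO_COPY_RECV, which switches receives back to copying each message into its own buffer. This is a sketch only, heavily guarded, since the option is draft API and may be absent from a given pyzmq/libzmq build:

```python
# Sketch: disable the zero-copy receive path via the draft
# ZMQ_ZERO_COPY_RECV context option (libzmq >= 4.3). The option may be
# missing from a given build, hence all the guards below.
try:
    import zmq
    HAVE_ZMQ = True
except ImportError:
    HAVE_ZMQ = False

if HAVE_ZMQ:
    ctx = zmq.Context()
    if hasattr(zmq, "ZERO_COPY_RECV"):
        # 0 = copy each incoming message into its own buffer: more CPU,
        # but large receive buffers are not pinned by message references.
        ctx.set(zmq.ZERO_COPY_RECV, 0)
    ctx.term()
```

Whether this helps here is unproven; the valgrind trace above points at pipe reallocation on reconnect rather than at held message references, so treat this as one knob to try, not a confirmed fix.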
