zeromq / libzmq

ZeroMQ core engine in C++, implements ZMTP/3.1
https://www.zeromq.org
Mozilla Public License 2.0

zmq crashes when transferring large data with 1 MB messages using pub/sub pattern. #4640

Open KithurMohamedHallaj opened 7 months ago

KithurMohamedHallaj commented 7 months ago


Issue description

I have been multicasting a file of about 200 GB through the zmq pub/sub pattern, but the application crashes randomly during the recv call. Most of the time recv throws an exception, and when I hit an exception or reach the maximum number of retries after a timeout, I reconnect. During the reconnect I close the socket, which causes the program to crash. I have also tried shutting down the socket first, yet the issue is not resolved.

Environment

WinPE environment with the Windows 11 SDK

Minimal test code / Steps to reproduce the issue

What's the actual result? (include assertion message & call stack if applicable)

What's the expected result?

KithurMohamedHallaj commented 7 months ago

I have had this problem for a while and found that the recv function throws the exception "Not Enough Space".

axelriet commented 6 months ago

Two separate things:

1) It should not crash (there is a bug).
2) If you are using pub-sub over TCP, that is not the best way to multicast a file (as blocks are sent individually to every subscriber).

You should post more info, make a build with symbols/debug info, post the call stack with details so the issue can be tracked down and fixed.

Separately, given the information you supplied, you should consider actual multicasting (using NORM or PGM, or even UDP (RADIO/DISH)) - depending on your circumstances - to effectively multicast the data (send it once to the wire for any number of subscribers) versus sending it N times in 1:1 pipes, where N is the number of subscribers.

axelriet commented 6 months ago

After trying, I have to qualify the above: I tried PGM, and multicast rate limits come into play that may make it impractical for your case.

The good thing is that all recipients get the data at once. Yet depending on the number of recipients, it might be quicker to use PUB-SUB over TCP and send the data N times without rate throttling than to send it only once (to all recipients at once) with PGM, more slowly but without saturating the network. There is a point where PGM will start to win, as its transfer time does not strongly depend on the number of recipients.
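The trade-off described above can be put into rough numbers. The following is a minimal sketch, assuming the sender's link is the only bottleneck, that TCP fan-out sends one full copy per subscriber, and that multicast runs at a fixed rate limit. The function names and all figures are hypothetical and are not measurements from this thread.

```cpp
#include <cassert>
#include <cmath>

// Seconds to deliver `size_gb` gigabytes to `n` subscribers over unicast TCP,
// assuming the sender's link (link_gbps) is the only bottleneck and every
// subscriber receives its own full copy.
double tcp_fanout_seconds(double size_gb, int n, double link_gbps) {
    assert(n > 0 && link_gbps > 0.0);
    return size_gb * 8.0 * n / link_gbps;
}

// Seconds to deliver the same data once via multicast at a fixed rate limit.
double multicast_seconds(double size_gb, double rate_gbps) {
    assert(rate_gbps > 0.0);
    return size_gb * 8.0 / rate_gbps;
}

// Smallest subscriber count at which multicast is strictly faster.
// TCP time > multicast time  <=>  n > link_gbps / rate_gbps.
int break_even_subscribers(double link_gbps, double rate_gbps) {
    assert(link_gbps > 0.0 && rate_gbps > 0.0);
    return static_cast<int>(std::floor(link_gbps / rate_gbps)) + 1;
}
```

For example, with a hypothetical 1 Gbit/s link and a 0.25 Gbit/s multicast rate limit, the two approaches tie at 4 subscribers, and multicast wins from 5 subscribers onward under these assumptions.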

I didn’t try NORM but that protocol was designed for large file distribution to many clients so it might be more suited than round-robin TCP PUB-SUB or rate-limited PGM.

Besides that, I tried PUB-SUB over TCP (localhost) with a few recipients and up to 1 GB sent 16 times to each recipient, and did not see a crash.

KithurMohamedHallaj commented 6 months ago

Hi @axelriet, thanks for your response.

I have tried pgm (open-pgm) only and the server is in Java and the receiver is in C++.

The issue happens randomly. As a default recovery mechanism, when the receiver cannot receive packets (for whatever reason), I reconnect the socket after some retries, that is, after the recv() call has returned EAGAIN, say, 10 times with a timeout of 10 seconds. Even then, the crash only occurs after a few cycles of this recovery process. The rate limit, high-water mark, and receive buffer options are set on both client and server.
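The recovery scheme described here (count consecutive timeouts, reconnect after a threshold) can be sketched independently of the transport. This is not the reporter's code: `receive_loop`, `RecvResult`, and `RetryStats` are hypothetical names, and the two callbacks stand in for the real socket operations. With libzmq, a timeout would correspond to `zmq_recv()` returning -1 with `errno == EAGAIN` after the `ZMQ_RCVTIMEO` socket option expires.

```cpp
#include <cassert>
#include <functional>

// Outcome of one receive attempt. With libzmq, Timeout would map to
// zmq_recv() returning -1 with errno == EAGAIN (ZMQ_RCVTIMEO expired).
enum class RecvResult { Ok, Timeout, Error };

struct RetryStats {
    int messages = 0;    // messages received successfully
    int reconnects = 0;  // number of times the socket was recreated
};

// Hypothetical receive loop: after `max_timeouts` consecutive timeouts the
// socket is reconnected; the loop gives up after `max_reconnects` reconnects
// or on the first hard error.
RetryStats receive_loop(const std::function<RecvResult()>& try_recv,
                        const std::function<void()>& reconnect,
                        int max_timeouts, int max_reconnects) {
    assert(max_timeouts > 0 && max_reconnects >= 0);
    RetryStats stats;
    int consecutive_timeouts = 0;
    for (;;) {
        switch (try_recv()) {
        case RecvResult::Ok:
            ++stats.messages;
            consecutive_timeouts = 0;  // any success resets the counter
            break;
        case RecvResult::Timeout:
            if (++consecutive_timeouts < max_timeouts) break;
            if (stats.reconnects == max_reconnects) return stats;
            reconnect();  // close the old socket, then create and connect a new one
            ++stats.reconnects;
            consecutive_timeouts = 0;
            break;
        case RecvResult::Error:
            return stats;  // hard failure: leave the decision to the caller
        }
    }
}
```

Keeping the policy separate from the socket calls makes it possible to exercise the reconnect path with a scripted `try_recv`, which may help narrow down whether the crash depends on the close/reconnect sequence itself.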

I cannot predict the behavior: it happens when multicasting both on a single machine and across multiple machines, and the phase in which it crashes is not the same every time. The exception and the call stack are attached below. Kindly share your insights.

One thing I noticed is that when the program is about to crash, the whole OS becomes unresponsive. I try to open taskmgr to check the resources, but I can't.

Also, I haven't tried the NORM protocol yet. I will check.

Call Stack. Screenshot 2023-12-02 194100e

Message box: image1

axelriet commented 6 months ago

From the screenshot it looks like an assertion fails at line 464 after trying to allocate 1MB: https://github.com/zeromq/libzmq/blob/959a133520dfc80d29e83aa7ef762e1d0327f63b/src/ip.cpp#L464

I'm not sure why this large buffer is needed but, regardless, everything indicates a RAM exhaustion on the machine. Probably a rapid leak.

Since you are on Windows you could enable heap debugging in your process, specifically _CRTDBG_LEAK_CHECK_DF. See https://learn.microsoft.com/en-us/cpp/c-runtime-library/crt-debug-heap-details?view=msvc-170 for details. You call _CrtSetDbgFlag() at the very start of your program: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/crtsetdbgflag?view=msvc-170
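As a minimal sketch of the suggestion above, the helper below turns on the CRT leak check. `enable_crt_leak_check` and `kCrtLeakCheckAvailable` are hypothetical names. The debug heap is only active in MSVC debug builds (`_DEBUG` defined), so the helper is a no-op elsewhere, and it presumably only tracks allocations made through the same debug CRT, which would mean libzmq needs to be a debug build too.

```cpp
#include <cassert>

#if defined(_MSC_VER) && defined(_DEBUG)
#include <crtdbg.h>
constexpr bool kCrtLeakCheckAvailable = true;
#else
constexpr bool kCrtLeakCheckAvailable = false;
#endif

// Enables CRT debug-heap tracking and an automatic leak report at process
// exit. Returns true when the debug heap is available, false otherwise.
// Intended to be called once, at the very start of main().
bool enable_crt_leak_check() {
#if defined(_MSC_VER) && defined(_DEBUG)
    // Read the current flags without changing them.
    int flags = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);
    // Track allocations and dump any that are still live when the process exits.
    flags |= _CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF;
    _CrtSetDbgFlag(flags);
    assert((_CrtSetDbgFlag(_CRTDBG_REPORT_FLAG) & _CRTDBG_LEAK_CHECK_DF) != 0);
    return true;
#else
    return false;
#endif
}
```

The leak report is only written if the process exits normally, which is why a graceful exit before memory runs out matters.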

Make sure you exit gracefully before the machine runs completely out and you should be able to track down what leaked.

Another option is to use Intel Inspector (it's a free download that integrates with Visual Studio) to try and see what leaks and where the memory was allocated: https://www.intel.com/content/www/us/en/developer/tools/oneapi/inspector.html Get the standalone version. You may need to register to get access.

I have reservations about NORM. The protocol and the library seem fine, but I'm less sure about the integration in zmq. Let us know if it works for you.

KithurMohamedHallaj commented 6 months ago

@axelriet Thanks for your input. I will check with the tools if possible. Since I am running the application in a WinPE environment, I am not sure I can use them at all.

Also, I'm open to all available options. Kindly check whether NORM can be used with ZMQ, or whether there is any other way.

Kindly share the code you tested (multicast using PGM) if possible; it would help me correct anything I did wrong. (The problem arises only when transferring large data over PGM multicast on machines with less RAM, say 4 GB.)
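Since the thread mentions 1 MB messages, high-water mark and rate options, and machines with 4 GB of RAM, a back-of-the-envelope memory estimate may be useful. The sketch below is an upper-bound calculation, not a diagnosis: it assumes a queue actually fills to its high-water mark (libzmq's documented default for `ZMQ_SNDHWM`/`ZMQ_RCVHWM` is 1000 messages) and that the multicast recovery buffer is roughly `ZMQ_RATE` (kbit/s) times `ZMQ_RECOVERY_IVL` (ms). The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Worst-case bytes held in one socket queue if it fills to the high-water mark.
std::uint64_t queued_bytes(std::uint64_t hwm_messages, std::uint64_t message_bytes) {
    assert(message_bytes > 0);
    return hwm_messages * message_bytes;
}

// Approximate multicast recovery buffer size.
// kbit/s * ms = bits, so dividing by 8 gives bytes.
std::uint64_t recovery_buffer_bytes(std::uint64_t rate_kbps, std::uint64_t recovery_ivl_ms) {
    return rate_kbps * recovery_ivl_ms / 8;
}
```

With the default high-water mark of 1000 and 1 MiB messages, `queued_bytes(1000, 1 << 20)` is about 1 GB per queue. A 1 Gbit/s rate with a 60-second recovery interval gives `recovery_buffer_bytes(1000000, 60000)`, about 7.5 GB, which would not fit on a 4 GB machine.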