zeromq / libzmq

ZeroMQ core engine in C++, implements ZMTP/3.1
https://www.zeromq.org
Mozilla Public License 2.0
9.45k stars 2.34k forks source link

Large stacks in ZeroMQ threads lead to inflated crashpad minidumps on Windows #4682

Open zokrovi opened 2 months ago

zokrovi commented 2 months ago

Issue description

On Windows only, after crashing in an application that uses ZeroMQ 4.3.5 and crashpad, the resulting minidump file may contain several threads with excessively large stack frames (num_threads x 4MB). This leads to inflated minidump sizes which may in some circumstances prevent users from uploading their large crash reports to rate limited endpoints such as Sentry with clientside sentry-native.

The reason this occurs is because of ZeroMQ's use of _beginthreadex which does not use STACK_SIZE_PARAM_IS_A_RESERVATION. This causes the stack size reported to crashpad (and any other crash handler for that matter relying on VirtualQuery ) to be the full size and not representative of the used size.

Environment

Steps to reproduce the issue

  1. Run any application that uses many threads in ZeroMQ and also has crashpad as its crash handler.
  2. Crash the application
  3. Inspect the minidump and look for large stack frames 4MB in size.

Possible fix

My suggested fix is to pass the STACK_SIZE_PARAM_IS_A_RESERVATION flag to _beginthreadex (replacing the 0 here in the 5th argument). This will mean the requested 4MB of pages for the stack will not be committed by the OS immediately, but reserved as virtual memory.

https://github.com/zeromq/libzmq/blob/aa885c5a154256612108636b0fb22f44ae0e247a/src/thread.cpp#L54C1-L55C60

Here's an example of what this looks like in another project: Intel's TBB:

https://github.com/oneapi-src/oneTBB/blob/377e6c3b1719f8cc7b68f9d0939e652d7e3bf776/src/tbb/co_context.h#L210

In practice, this should not make much difference to performance anyway, as (I believe) even without this flag Windows does not commit to physical memory immediately until those pages are in use. This behaviour is very similar to how stack frames in pthreads work on Linux. However, in a near OOM scenario this will impact performance as those threads will not have a guarantee of physical pages from the OS. Also another downside from a robustness point of view: unlike UNIX system where you'd see a bus error, on Windows OOM may not lead to an exception if using STACK_SIZE_PARAM_IS_A_RESERVATION.

Therefore, contributors should consider whether ZeroMQ really needs the pages committed upfront (as is the case today) or if it's worth adding the STACK_SIZE_PARAM_IS_A_RESERVATION flag to help consumers that have crash reporting, and stack memory capture features etc.