wolfpld / tracy

Frame profiler
https://tracy.nereid.pl/

Client hangs with TRACY_NO_CALLSTACK without TRACY_NO_SYSTEM_TRACING #589

Open dmirys opened 11 months ago

dmirys commented 11 months ago

I'm trying to reduce the amount of data collected by Tracy. For that purpose I use the following set of flags:

My app hangs at exit in an infinite loop after sending the terminate command to the server. Meanwhile, the server has processed the terminate instruction and detects a non-zero m_pendingCallstackFrames. It looks like the server is supposed to get more data from the client in that case. I think the logic is wrong here: the client is waiting for data from the server while the server is waiting for data from the client, even though the client has already said to terminate.

With a debugger I found that the client sends QueueType::CallstackSample commands to the server during the Tracy session. Disabling system tracing with TRACY_NO_SYSTEM_TRACING solves the issue. Do I lose anything useful with this option? What is the correct way to solve the problem? I'm ready to test some ideas.

mo-tenstorrent commented 2 months ago

@dmirys Did you find a solution here? I am also running into a similar issue.

My app hangs at exit even with TRACY_NO_SYSTEM_TRACING.

dmirys commented 2 months ago

No, I'm still using the workaround. Here is the full set of flags that I use: PUBLIC TRACY_NO_CALLSTACK TRACY_NO_SYSTEM_TRACING TRACY_NO_CODE_TRANSFER

Sorry, you will have to debug it yourself if that doesn't help.
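
When a workaround like this does not seem to take effect, one thing worth ruling out is that the macros never reach the code being compiled. The following is a minimal sketch, not part of Tracy: the function name definedTracyFlags is hypothetical, and it only checks the three flags quoted in this thread. It reports which of them are visible to the translation unit it is compiled into.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not a Tracy API): lists which of the three flags
// quoted in this thread are defined for this translation unit.
std::vector<std::string> definedTracyFlags()
{
    std::vector<std::string> flags;
#ifdef TRACY_NO_CALLSTACK
    flags.push_back("TRACY_NO_CALLSTACK");
#endif
#ifdef TRACY_NO_SYSTEM_TRACING
    flags.push_back("TRACY_NO_SYSTEM_TRACING");
#endif
#ifdef TRACY_NO_CODE_TRANSFER
    flags.push_back("TRACY_NO_CODE_TRANSFER");
#endif
    return flags;
}
```

Printing the returned names once at startup shows what the application code sees. It says nothing about how the Tracy client library itself was built, so a mismatch between the two would still have to be checked in the build configuration.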

mo-tenstorrent commented 1 month ago

More info on this hang:

It is specifically on the join in the Thread destructor, called from ~Profiler.

https://github.com/wolfpld/tracy/blob/005d0929035bdc13f877da97c6631c0f2c98673e/public/client/TracyProfiler.cpp#L1564

Here is the backtrace of the hang:

#0  __pthread_clockjoin_ex (threadid=139643671152384, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>)
    at pthread_join_common.c:145
#1  0x00007f015a709348 in tracy::Profiler::~Profiler() () from /home/mmemarian/models/tt-metal/build/lib/libtracy.so.0.10.0
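
For readers unfamiliar with this failure mode, here is a generic sketch of a destructor that joins a worker thread. It is not Tracy's code: the Worker class, its members, and startAndStopWorker are all hypothetical. It only illustrates why a join like the one in frame #0 cannot return until the worker's loop has seen the shutdown request.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-in for an object that owns a background thread.
class Worker
{
public:
    explicit Worker(std::atomic<bool>& exited)
        : m_exited(exited)
        , m_thread([this] { Run(); })
    {
    }

    ~Worker()
    {
        // Request shutdown, then wait for the loop to finish. If Run() never
        // observed m_shutdown (for example because it was blocked waiting on
        // a peer), this join() would never return.
        m_shutdown.store(true, std::memory_order_release);
        m_thread.join();
    }

private:
    void Run()
    {
        while (!m_shutdown.load(std::memory_order_acquire))
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        m_exited.store(true, std::memory_order_release);
    }

    // Declared before m_thread so they are initialized before the thread starts.
    std::atomic<bool> m_shutdown{false};
    std::atomic<bool>& m_exited;
    std::thread m_thread;
};

// Creates and destroys a Worker; returns true if its loop exited before
// the destructor's join() returned.
bool startAndStopWorker()
{
    std::atomic<bool> exited{false};
    {
        Worker w(exited);
    }
    return exited.load(std::memory_order_acquire);
}
```

In this sketch the loop checks the flag every millisecond, so the join completes. A loop that blocks somewhere without rechecking the flag would reproduce the pattern in the backtrace.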

The server is also reporting:

Screenshot 2024-05-24 at 1 19 51 PM

Here is the cmake option list:

set_option(TRACY_ON_DEMAND "On-demand profiling" OFF)
set_option(TRACY_CALLSTACK "Enforce callstack collection for tracy regions" OFF)
set_option(TRACY_NO_CALLSTACK "Disable all callstack related functionality" ON)
set_option(TRACY_NO_CALLSTACK_INLINES "Disables the inline functions in callstacks" ON)
set_option(TRACY_ONLY_LOCALHOST "Only listen on the localhost interface" OFF)
set_option(TRACY_NO_BROADCAST "Disable client discovery by broadcast to local network" OFF)
set_option(TRACY_ONLY_IPV4 "Tracy will only accept connections on IPv4 addresses (disable IPv6)" OFF)
set_option(TRACY_NO_CODE_TRANSFER "Disable collection of source code" ON)
set_option(TRACY_NO_CONTEXT_SWITCH "Disable capture of context switches" ON)
set_option(TRACY_NO_EXIT "Client executable does not exit until all profile data is sent to server" OFF)
set_option(TRACY_NO_SAMPLING "Disable call stack sampling" ON)
set_option(TRACY_NO_VERIFY "Disable zone validation for C API" OFF)
set_option(TRACY_NO_VSYNC_CAPTURE "Disable capture of hardware Vsync events" ON)
set_option(TRACY_NO_FRAME_IMAGE  "Disable the frame image support and its thread" ON)
set_option(TRACY_NO_SYSTEM_TRACING  "Disable systrace sampling" ON)
set_option(TRACY_PATCHABLE_NOPSLEDS  "Enable nopsleds for efficient patching by system-level tools (e.g. rr)" OFF)
set_option(TRACY_DELAYED_INIT "Enable delayed initialization of the library (init on first call)" OFF)
set_option(TRACY_MANUAL_LIFETIME "Enable the manual lifetime management of the profile" OFF)
set_option(TRACY_FIBERS "Enable fibers support" OFF)
set_option(TRACY_NO_CRASH_HANDLER "Disable crash handling" OFF)
set_option(TRACY_TIMER_FALLBACK "Use lower resolution timers" OFF)
set_option(TRACY_LIBUNWIND_BACKTRACE "Use libunwind backtracing where supported" OFF)
set_option(TRACY_SYMBOL_OFFLINE_RESOLVE "Instead of full runtime symbol resolution, only resolve the image path and offset to enable offline symbol resolution" OFF)
set_option(TRACY_LIBBACKTRACE_ELF_DYNLOAD_SUPPORT "Enable libbacktrace to support dynamically loaded elfs in symbol resolution resolution after the first symbol resolve operation" OFF)

wolfpld commented 1 month ago

It is specifically on the join in Thread destructor call in ~Profiler.

https://github.com/wolfpld/tracy/blob/005d0929035bdc13f877da97c6631c0f2c98673e/public/client/TracyProfiler.cpp#L1564

What are other threads doing when this join is pending? Specifically, the LaunchWorker thread?

mo-tenstorrent commented 1 month ago

It is stuck on poll

Screenshot 2024-05-24 at 1 39 35 PM

mo-tenstorrent commented 1 month ago

I was wondering if there has been any progress here. Is there any more data I can provide, or tests I can run?

wolfpld commented 1 month ago

According to the man page,

If the value of timeout is 0, poll() shall return immediately.

The glibc code at frame #0 in your call stack is basically doing a syscall into the kernel, which shouldn't be affected by the program environment.

I don't know what's wrong here; it shouldn't be happening.
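
The man page behaviour quoted above can be checked in isolation on the affected machine. This is a minimal sketch using only POSIX calls (pipe, poll, write, close). The helper names pollReadable and pollZeroTimeoutBehaves are hypothetical, and the sketch does not reproduce how Tracy itself calls poll.

```cpp
#include <poll.h>
#include <unistd.h>

// Polls one descriptor for readability and returns poll()'s result:
// >0 if ready, 0 if the timeout expired, -1 on error.
int pollReadable(int fd, int timeoutMs)
{
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;
    return poll(&pfd, 1, timeoutMs);
}

// Polls an empty pipe with timeout 0, writes one byte, then polls again.
// Returns true if the first poll reported nothing ready and the second
// reported exactly one ready descriptor.
bool pollZeroTimeoutBehaves()
{
    int fds[2];
    if (pipe(fds) != 0)
    {
        return false;
    }
    const int emptyResult = pollReadable(fds[0], 0);
    const char byte = 'x';
    const bool wrote = write(fds[1], &byte, 1) == 1;
    const int readyResult = pollReadable(fds[0], 0);
    close(fds[0]);
    close(fds[1]);
    return emptyResult == 0 && wrote && readyResult == 1;
}
```

If the first call blocked on an empty pipe, the function would never return, so a true result is consistent with the documented behaviour for a zero timeout. It does not explain the hang reported in this thread.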