wolfpld / tracy

Frame profiler


Open dmirys opened 11 months ago

dmirys commented 11 months ago

I'm trying to reduce the amount of data collected by Tracy. For that purpose I use the following set of flags:

My app hangs at exit in an infinite loop after sending the terminate command to the server. Meanwhile, the server processes the terminate instruction and detects that m_pendingCallstackFrames is non-zero. It looks like the server is supposed to get more data from the client in that case. I think there is a logic error here: the client is waiting for data from the server, while the server is waiting for data from the client, even though the client has already said to terminate.

With a debugger I found that the client sends QueueType::CallstackSample commands to the server during the Tracy session. Disabling system tracing with TRACY_NO_SYSTEM_TRACING solves the issue. Do I lose anything useful with this option? What is the correct way to solve the problem? I'm ready to test some ideas.

mo-tenstorrent commented 2 months ago

@dmirys Did you find a solution here? I am also running into a similar issue.

I hang at exit even with TRACY_NO_SYSTEM_TRACING

dmirys commented 2 months ago

No, I'm still using the workaround. Here is the full set of flags that I use: PUBLIC TRACY_NO_CALLSTACK TRACY_NO_SYSTEM_TRACING TRACY_NO_CODE_TRANSFER

Sorry, you will have to debug it yourself if that doesn't help.

mo-tenstorrent commented 1 month ago

More info on this hang:

It is specifically on the join in the Thread destructor, called from ~Profiler.


Here is the backtrace of the hang:

#0  __pthread_clockjoin_ex (threadid=139643671152384, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>)
    at pthread_join_common.c:145
#1  0x00007f015a709348 in tracy::Profiler::~Profiler() () from /home/mmemarian/models/tt-metal/build/lib/libtracy.so.0.10.0
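For readers unfamiliar with this kind of hang, here is a minimal sketch of the general pattern: an object whose destructor joins a worker thread. It is not Tracy's code. `WorkerOwner`, `Run` and `DestroyOwnerAndReturn` are hypothetical names made up for illustration. The point is only that the join returns once the worker leaves its loop, so a worker that never leaves (for example because it is waiting on the other side of a connection) makes the destructor block indefinitely.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for an object that owns a worker thread.
// This is an illustration of the pattern, not Tracy's implementation.
class WorkerOwner {
public:
    WorkerOwner()
        : m_shutdown(false), m_iterations(0), m_thread([this] { Run(); }) {}

    ~WorkerOwner() {
        // Ask the worker to stop, then wait for it to finish.
        // If the worker never observes the flag, join() never returns.
        m_shutdown.store(true, std::memory_order_release);
        m_thread.join();
    }

    int Iterations() const { return m_iterations.load(); }

private:
    void Run() {
        // The worker only exits when it gets a chance to re-check the flag.
        while (!m_shutdown.load(std::memory_order_acquire)) {
            m_iterations.fetch_add(1);
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    // Declared before m_thread so they are initialized before the thread starts.
    std::atomic<bool> m_shutdown;
    std::atomic<int> m_iterations;
    std::thread m_thread;
};

// Creates and destroys an owner; returns true once the destructor's join() has returned.
bool DestroyOwnerAndReturn() {
    {
        WorkerOwner owner;
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
        assert(owner.Iterations() >= 0);
    }
    return true;
}
```

In this sketch the worker checks the flag every millisecond, so the destructor returns promptly. When debugging a real hang of this shape, the useful question is what the worker thread is blocked on at the time of the join.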

Server is also reporting:

[Image: Screenshot 2024-05-24 at 1 19 51 PM]

Here is the cmake option list:

set_option(TRACY_ON_DEMAND "On-demand profiling" OFF)
set_option(TRACY_CALLSTACK "Enforce callstack collection for tracy regions" OFF)
set_option(TRACY_NO_CALLSTACK "Disable all callstack related functionality" ON)
set_option(TRACY_NO_CALLSTACK_INLINES "Disables the inline functions in callstacks" ON)
set_option(TRACY_ONLY_LOCALHOST "Only listen on the localhost interface" OFF)
set_option(TRACY_NO_BROADCAST "Disable client discovery by broadcast to local network" OFF)
set_option(TRACY_ONLY_IPV4 "Tracy will only accept connections on IPv4 addresses (disable IPv6)" OFF)
set_option(TRACY_NO_CODE_TRANSFER "Disable collection of source code" ON)
set_option(TRACY_NO_CONTEXT_SWITCH "Disable capture of context switches" ON)
set_option(TRACY_NO_EXIT "Client executable does not exit until all profile data is sent to server" OFF)
set_option(TRACY_NO_SAMPLING "Disable call stack sampling" ON)
set_option(TRACY_NO_VERIFY "Disable zone validation for C API" OFF)
set_option(TRACY_NO_VSYNC_CAPTURE "Disable capture of hardware Vsync events" ON)
set_option(TRACY_NO_FRAME_IMAGE  "Disable the frame image support and its thread" ON)
set_option(TRACY_NO_SYSTEM_TRACING  "Disable systrace sampling" ON)
set_option(TRACY_PATCHABLE_NOPSLEDS  "Enable nopsleds for efficient patching by system-level tools (e.g. rr)" OFF)
set_option(TRACY_DELAYED_INIT "Enable delayed initialization of the library (init on first call)" OFF)
set_option(TRACY_MANUAL_LIFETIME "Enable the manual lifetime management of the profile" OFF)
set_option(TRACY_FIBERS "Enable fibers support" OFF)
set_option(TRACY_NO_CRASH_HANDLER "Disable crash handling" OFF)
set_option(TRACY_TIMER_FALLBACK "Use lower resolution timers" OFF)
set_option(TRACY_LIBUNWIND_BACKTRACE "Use libunwind backtracing where supported" OFF)
set_option(TRACY_SYMBOL_OFFLINE_RESOLVE "Instead of full runtime symbol resolution, only resolve the image path and offset to enable offline symbol resolution" OFF)
set_option(TRACY_LIBBACKTRACE_ELF_DYNLOAD_SUPPORT "Enable libbacktrace to support dynamically loaded elfs in symbol resolution resolution after the first symbol resolve operation" OFF)

wolfpld commented 1 month ago

It is specifically on the join in Thread destructor call in ~Profiler.


What are other threads doing when this join is pending? Specifically, the LaunchWorker thread?

mo-tenstorrent commented 1 month ago

It is stuck on poll

[Image: Screenshot 2024-05-24 at 1 39 35 PM]

mo-tenstorrent commented 1 month ago

Was wondering if there was any progress here? Is there any more data I can provide or tests to run?

wolfpld commented 1 month ago

According to the man page,

If the value of timeout is 0, poll() shall return immediately.
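The quoted behaviour can be checked in isolation with a small sketch. It assumes a POSIX system and a timeout of exactly 0, as in the sentence above; it says nothing about what arguments Tracy actually passes. `PollEmptyPipeWithZeroTimeout` is a hypothetical helper name, not part of any library.

```cpp
#include <cassert>
#include <poll.h>
#include <unistd.h>

// Polls the read end of an empty pipe with a timeout of 0 ms.
// Returns poll()'s result: 0 means no descriptor is ready, -1 means error.
int PollEmptyPipeWithZeroTimeout() {
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }

    struct pollfd pfd;
    pfd.fd = fds[0];      // read end; nothing has been written to the pipe
    pfd.events = POLLIN;  // ask whether data is available to read
    pfd.revents = 0;

    // With timeout == 0, poll() should not block even though no data is pending.
    const int ready = poll(&pfd, 1, 0);

    close(fds[0]);
    close(fds[1]);
    return ready;
}
```

Calling `PollEmptyPipeWithZeroTimeout()` should return 0 without blocking, since nothing is readable and the timeout is zero. A call that appears stuck in poll() therefore suggests either a non-zero (or negative) timeout or a loop that keeps re-entering poll().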

The glibc code at frame #0 in your call stack is basically doing a syscall into the kernel, which shouldn't be affected by the program environment.

I don't know what's wrong here, it shouldn't be happening.