Closed mo-tenstorrent closed 1 month ago
@tiago-rodrigues can you take a look?
Will try to take a look later today. @mo-tenstorrent, any more details on your repro case? Is there a public test case I could run? If not, would you be able to post an 'anonymized' backtrace of what the other threads in your process are doing at the time of the hang?
Our repo is public (https://github.com/tenstorrent/tt-metal) and our fork of Tracy is a submodule of it.
Our fork is slightly modified, but I have tried compiling with the latest wolfpld/tracy master
and still run into the same hang.
Building with ENABLE_TRACY=1 and running any of our unit tests can cause the hang. However, it might be too much work to get a test up and running on your side, so I have attached a backtrace dump of the hang.
You can see that Thread 33, TracySysWorker, is in nanosleep.
Unfortunately I wasn't able to reproduce the same type of hang with sys trace enabled in our codebase, which has TRACY_USE_IMAGE_CACHE enabled. I'll try to set up your repro, but it does look like it has a few dependencies that are going to be difficult to install on any system I have access to right now.
Looking at the call stacks you have there, it appears that threads 36 and 34 are both blocked in loader code, but I'm unsure why. I'm not 100% sure whether calling dladdr inside a dl_iterate_phdr callback might be problematic, so maybe I can try moving that out, but I'll need a hang repro first.
I assume you are unblocked by not defining TRACY_USE_IMAGE_CACHE?
Yeah we have disabled it for now.
I should say that this is intermittent. Roughly 20-30% of the starts run into it. I did 500+ runs without the image cache and saw no hangs.
If there are diffs you want me to try I can do that. It is fairly easy to reproduce or confirm a fix.
@mo-tenstorrent: would you mind testing your repro with this patch: https://github.com/wolfpld/tracy/pull/783/commits/28244ab9b85d6e4337855bd3894a06a6dbff3030 ? Unfortunately I still haven't managed a repro with the original code. If my theory is correct, the deadlock is caused by dladdr() attempting to acquire a lock in the loader while another one was being held by dl_iterate_phdr() at the same time as other shared object loading is happening.
7e8961d2f may be relevant.
The patch fixes the issue. More than 1000 attempts with no hangs, whereas it consistently failed within the first 20 attempts without the patch.
Thanks
Thanks for testing. I've tagged the PR for review.
Merged, closing.
The following define is causing a hang at process start.
https://github.com/wolfpld/tracy/blob/a1a7e63d93e3cad9497f9f37838956bbff2af3a8/public/client/TracyCallstack.cpp#L94
The process gets stuck in the infinite loop here:
https://github.com/wolfpld/tracy/blob/a1a7e63d93e3cad9497f9f37838956bbff2af3a8/public/client/TracySysTrace.cpp#L1151
I couldn't investigate further why the image cache prevents the sys trace worker loop from making progress.