wolfpld / tracy

Frame profiler
https://tracy.nereid.pl/
Other
8.61k stars 592 forks source link

`TRACY_USE_IMAGE_CACHE` Causes a hang right at process start #780

Closed mo-tenstorrent closed 1 month ago

mo-tenstorrent commented 2 months ago

The following define is causing a hang at process start.

https://github.com/wolfpld/tracy/blob/a1a7e63d93e3cad9497f9f37838956bbff2af3a8/public/client/TracyCallstack.cpp#L94

The process gets stuck in the infinite loop here:

https://github.com/wolfpld/tracy/blob/a1a7e63d93e3cad9497f9f37838956bbff2af3a8/public/client/TracySysTrace.cpp#L1151

Couldn't investigate further how image cache is not letting sys trace worker loop to not let go.

wolfpld commented 2 months ago

@tiago-rodrigues can you take a look?

tiago-rodrigues commented 2 months ago

Will try take a look later today. @mo-tenstorrent any more details on your the repro case? Is there a public test case I could run, if not would you be able to post an 'anonymized' backtrace of what the other threads in your process are doing at the time of the hang?

mo-tenstorrent commented 2 months ago

Our repo is public https://github.com/tenstorrent/tt-metal and our fork of tracy is a submodule to it.

Our fork is slightly modified but I have tried compiling with the latest on wolfpld/tracy/master and still run into the same hang.

Building with ENABLE_TRACY=1 and running any of our unit tests can cause the hang. However, it might be too much to get the test up and running there so I have attached a backtrace dump of the hang.

gdb.txt

You can see Thread 33, TracySysWorker is in nanosleep.

tiago-rodrigues commented 2 months ago

Unfortunately I wasn't able to repro the same type of hang when having sys trace enabled in our codebase which has TRACY_USE_IMAGE_CACHE enabled. I'll try to setup your repro, but it does look like it has a few dependencies there that are going to be difficult to install on any system I have access to right now.

Looking at the callstacks you have there it appears that thread 36 and 34 are block both in loader code, but I'm unsure why. I'm not 100% sure if calling dladdr inside a dl_iterate_phdr callback might be problematic, so maybe I can try moving that out, but I'll need a hang repro first.

I assume you are unblocked with by not defining TRACY_USE_IMAGE_CACHE ?

mo-tenstorrent commented 2 months ago

Yeah we have disabled it for now.

I should say that this is intermittent. Roughly 20-30% of the starts run into it. Did 500+ runs without image_cache and no hags.

If there are diffs you want me to try I can do that. It is fairly easy to reproduce or confirm a fix.

tiago-rodrigues commented 2 months ago

@mo-tenstorrent: would you mind testing your repro with this patch: https://github.com/wolfpld/tracy/pull/783/commits/28244ab9b85d6e4337855bd3894a06a6dbff3030 ? Unfortunately I still haven't managed a repro with the original code. If my theory is correct, the deadlock is caused by dladdr() attempting to acquire a lock in the loader while another one was being held by dl_iterate_phdr() at the same time as other shared object loading is happening.

wolfpld commented 2 months ago

If my theory is correct, the deadlock is caused by dladdr() attempting to acquire a lock in the loader while another one was being held by dl_iterate_phdr() at the same time as other shared object loading is happening.

7e8961d2f may be relevant.

mo-tenstorrent commented 2 months ago

The patch fixes the issue. More than 1000 attempts and no hangs, while consistently failing within the first 20 attempts without the patch.

Thanks

tiago-rodrigues commented 2 months ago

Thanks for testing. I've tagged the PR for review.

wolfpld commented 1 month ago

The patch fixes the issue.

Merged, closing.