wolfpld / tracy

Frame profiler
https://tracy.nereid.pl/
Other
8.67k stars 598 forks source link

Memory corruption in callstack recording #684

Closed YaLTeR closed 6 months ago

YaLTeR commented 6 months ago

Random strings end up in the callstack Image column:

image

Both when capturing with the UI and with the capture executable.

I tried UBSan, it shows this, not sure if related:

../subprojects/tracy/public/libbacktrace/dwarf.cpp:2337:20: runtime error: load of value 16646, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:2337:20: runtime error: load of value 16646, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:2397:19: runtime error: load of value 16646, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:2397:19: runtime error: load of value 16646, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:2399:15: runtime error: load of value 16646, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:3588:30: runtime error: load of value 16647, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:3588:30: runtime error: load of value 16647, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:3590:19: runtime error: load of value 16647, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:3592:19: runtime error: load of value 16647, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:3622:17: runtime error: load of value 16647, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:3622:17: runtime error: load of value 16647, which is not a valid value for type 'dwarf_tag'

I'm building as a shared library with tracy_on_demand. The same issue occurs with/without libunwind, the new dynload option, manual lifetime, offline.

wolfpld commented 6 months ago

Have you tried bisecting this issue?

YaLTeR commented 6 months ago

ab1ec3f01ce0be78d74449ca78db907c9b532c2c is the first bad commit

wolfpld commented 6 months ago

Thanks. @tiago-rodrigues can you take a look?

tiago-rodrigues commented 6 months ago

will do

tiago-rodrigues commented 6 months ago

Unfortunately I haven't been able to repro this locally with our own app or the tracy test one. Likely there is something going on with the image name string lifetime there, but it's not obvious me just looking at the code either. Would you happen to have a bit more details on the repro, or maybe there is a publicly available I could try to run locally?

@wolfpld: maybe the best would be best to make TRACY_USE_IMAGE_CACHE disabled by by default until we figure out what is the issue? @YaLTeR: I assume you don't get the issue if TRACY_USE_IMAGE_CACHE is disabled?

YaLTeR commented 6 months ago

I assume you don't get the issue if TRACY_USE_IMAGE_CACHE is disabled?

Yeah, if I comment out that define, it seems to work fine.

tiago-rodrigues commented 6 months ago

I've created this: https://github.com/wolfpld/tracy/pull/685 to add disable the option for now.

YaLTeR commented 6 months ago

Would you happen to have a bit more details on the repro, or maybe there is a publicly available I could try to run locally?

It's just mutter (GNOME's Wayland compositor) instrumented with Tracy. I run it, connect with Tracy, move the cursor around and wait a little for the data to flow in, then stop, and check the sampled call stacks. Most will contain those glitched strings.

Might be a little annoying to run, but:

  1. Clone and apply this MR: https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/3391
  2. Open meson.build, comment out this line:

    --- a/meson.build
    +++ b/meson.build
    @@ -461,7 +461,7 @@ if have_profiler
             'tracy_only_localhost=true',
             # Disable symbol resolving parts of Tracy, because they contend with
             # GJS spamming dlsym, causing massive slowdowns.
    -        'tracy_no_callstack=true',
    +        # 'tracy_no_callstack=true',
             # Build as a shared, rather than static, library, because otherwise
             # some binaries end up with two copies of Tracy, which is not supported.
             'default_library=shared',
  3. Install build dependencies, on Fedora you can do sudo dnf builddep mutter
  4. meson setup build -Dprofiler_tracy=true, ninja -C build, build/src/mutter --nested

    The ninja step might hang in the end, which seems to also be caused by this issue. In this case, just Ctrl-C and run it again.

wolfpld commented 6 months ago

I can reproduce this with my test game.

obraz

tiago-rodrigues commented 6 months ago

@YaLTeR. I only have access to Ubuntu systems right now, so I'm not sure I can build mutter easily there. The repro from @wolfpld might be simpler, I'm guessing it's not a public project? I still can't repro on my end after trying a few runs, but I may be just (un)lucky.

@wolfpld: could this be due to the fact that on image cache refresh, the old strings get deleted and we've returned that address on DecodeCallstackPtr() already, before it gets send back? Does it need a CopyStringFast() into CallstackEntryData::imageName? It wasn't being done before as it just returned dlinfo.dli_fname directly so I didn't do it either.

wolfpld commented 6 months ago

I disabled clearing the cache on refresh in 5decb91c, so that the refresh code only adds new entries to the cache. It should fix the problem.

YaLTeR commented 6 months ago

That appears to fix the problem, yeah.