wolfpld / tracy

Frame profiler
https://tracy.nereid.pl/

Flow events support #149

Open - avoroshilov opened this issue 3 years ago

avoroshilov commented 3 years ago

Hello!

The Chrome Trace document has this concept of "flow events", which allow connecting two different profiling zones, possibly on different threads. These connections could be useful for tying cross-thread events into one flow; in particular, they are good for visualizing fiber-based tasking (a fiber jumps between threads, and the pieces are associated together with this "arrow").
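For reference, flow events in the Chrome trace JSON look roughly like this (ids and timestamps made up): a flow start ("s"), optional steps ("t"), and an end ("f") share an id, and each record binds to the enclosing duration slice on its thread:

```json
[
  { "name": "task", "cat": "flow", "ph": "s", "id": 42, "pid": 1, "tid": 1, "ts": 1000 },
  { "name": "task", "cat": "flow", "ph": "t", "id": 42, "pid": 1, "tid": 2, "ts": 1500 },
  { "name": "task", "cat": "flow", "ph": "f", "id": 42, "pid": 1, "tid": 3, "ts": 2000, "bp": "e" }
]
```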

Supporting these will require a special API, as well as UX/UI changes to allow highlighting the whole event flow, visualizing separate parts of the flow, etc.

I wonder - what are your thoughts/concerns regarding supporting flow events (and, potentially - but not necessarily - upgrading the import-chrome tool to support them as well)?

wolfpld commented 3 years ago

For fibers I am imagining a separate thread-like track, which will present a continuous zone graph.

Implementing flow events would present several difficulties, among them:

- Instrumentation would be the responsibility of the user, which makes everything error prone.
- All events which are now collected in thread-local queues would need to be serialized, to enforce proper ordering of events (i.e. so that zones appear in order in global scope, not only thread scope).

avoroshilov commented 3 years ago

For fibers I am imagining a separate thread-like track, which will present a continuous zone graph.

I am guessing you're talking about something akin to what's described in this issue? https://github.com/wolfpld/tracy/issues/106

We already have that for another profiling backend - Chrome trace, which can then be converted to the Tracy binary format while preserving custom threads - and it becomes pretty unusable in complex scenarios: once there are hundreds of fibers spawned by the task manager, it is really hard to see the whole picture with all the scrolling and shifting around.

But I definitely understand that the UX/UI of a flow events implementation is quite a complex problem - so having custom thread-like sections would be a good fallback if flow events are not available.

benvanik commented 3 years ago

I'd just like to second the request for this: an extremely useful feature it can help enable is imperfectly nested zones, allowing for proper modeling of asynchronous execution. This is required to track things like network/RPC/task system requests, animation/state machines, and external events ("a user input led to this set of tasks running, which then triggered IO that then completed, etc").

To keep the UI simpler in WTF, I implemented these as their own zone graph equivalent that allowed for overlap, using a simple greedy packing algorithm to keep things dense. [screenshot] These were built by looking for the first and last zones of the particular time range and showing the endpoints. If you clicked on the time range zones in there, it would highlight all of the relevant zones in the other graphs (where a user input occurred vs where an animation began vs a frame of the animation, etc). There was no need for complex synchronization in the rendering of the other zones, no queues or reorder buffers to sort events, etc - I too didn't want to do what https://github.com/wolfpld/tracy/issues/149#issuecomment-748971727 mentions :)
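For the curious, the packing was along these lines (a from-memory sketch, not the actual WTF code; the names are mine):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct FlowZone
{
    int64_t start, end;  // zone timestamps
    int lane = -1;       // assigned display row
};

// Greedy lane packing: sort zones by start time, then drop each one into
// the first lane whose previous zone has already ended; open a new lane
// only when every existing lane overlaps. This keeps the track dense.
void PackLanes( std::vector<FlowZone>& zones )
{
    std::sort( zones.begin(), zones.end(),
        []( const FlowZone& a, const FlowZone& b ) { return a.start < b.start; } );

    std::vector<int64_t> laneEnd;  // end time of the last zone in each lane
    for( auto& z : zones )
    {
        bool placed = false;
        for( size_t i = 0; i < laneEnd.size(); i++ )
        {
            if( laneEnd[i] <= z.start )  // lane is free again
            {
                z.lane = (int)i;
                laneEnd[i] = z.end;
                placed = true;
                break;
            }
        }
        if( !placed )
        {
            z.lane = (int)laneEnd.size();
            laneEnd.push_back( z.end );
        }
    }
}
```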

I imagine this approach would fit pretty well into tracy, which already has ways to draw the zone graphs with independent styles. Of course, having nice little lines between the zone graphs to indicate flow would be cool, but the two approaches work really well together, and the lines are an optional improvement for clarity rather than a requirement to get the data.

We would also like to use this for GPU/TPU/accelerator tracking, where today tracy cannot represent out-of-order completion of events. For example, in a concurrent execution of zones A and B where B completes before A, B will need to be padded out to A's end in order to preserve tracy's perfectly nested requirement, leading to incorrect timings. We have to disable all concurrency in order to get the real timings, which then invalidates the timings because they no longer match normal execution ;(
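To illustrate the nesting constraint (a sketch against Tracy's C API from TracyC.h; the scope-bound C++ ZoneScoped macros cannot express overlap at all):

```cpp
#include "tracy/TracyC.h"

// B is submitted first and genuinely completes before A ends, but Tracy
// requires zone begin/end pairs to nest (LIFO). B, being the outer zone,
// cannot be closed until A is, so its recorded end is padded out to A's.
void submit_concurrent()
{
    TracyCZoneN( b, "B", true );  // outer zone, submitted first
    TracyCZoneN( a, "A", true );  // starts while B is still in flight
    // ... B actually completes here, before A ...
    TracyCZoneEnd( a );           // forced order: inner zone must close first
    TracyCZoneEnd( b );           // B's end is recorded late, skewing its timing
}
```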

dominikh commented 3 years ago

tl;dr this would be useful for Go runtime traces, too.

I'm interested in this feature in the context of the Go runtime tracer. Go uses an M:N scheduler, mapping many goroutines to fewer OS threads. I reckon that in a UI context, M:N scheduling and fibers are sufficiently similar to have the same requirements, such as wishing to view a fiber as if it were a thread of execution, while still being able to view the actual scheduling.

Go's runtime has instrumentation to log detailed traces of all scheduler events. This includes events such as "goroutine X has created goroutine Y", "goroutine X is blocking on a channel/lock/network operation" and "goroutine X has unblocked goroutine Y".

Go can convert these traces to the Chrome trace format, and Tracy can, for the most part, convert them to the Tracy format. What's missing is any notion of the causality of events. It's currently impossible to tell in Tracy that a goroutine has been scheduled because another goroutine has marked it as runnable, or that a goroutine has stopped executing because it has been preempted.

This use-case addresses two of your concerns with the more general feature of flow events in Tracy:

Instrumentation would be the responsibility of the user, which makes everything error prone.

The instrumentation is made by the runtime and can be expected to work correctly.

All events which are now collected in thread-local queues would need to be serialized, to enforce proper ordering of events (i.e. so that zones appear in order in global scope, not only thread scope).

The import-chrome workflow doesn't suffer from this issue.

Of course I can appreciate that you may not want to add a major feature to Tracy that is only usable offline by importing other trace formats, and not supported by actual Tracy instrumentation or sampling.

wolfpld commented 3 years ago

The investigation into this feature stopped when it became apparent that the fiber implementations I was provided with don't have clearly defined entry and leave points, which made me chase elusive synchronization problems in external code. I will probably make some synthetic task dispatcher with clear semantics for test purposes, and leave proper instrumentation in real use scenarios to the user. When that's done, we can start thinking about importing data from chrome traces.

kvark commented 2 years ago

Just a heads-up that this is a very much desired feature, almost a must-have for a good profiling tool. I hope your quest for a task dispatcher eventually succeeds.

mafemergency commented 2 years ago

i use tracy to profile a bytecode-interpreted language where every function is a coroutine, so either this feature or the approach quoted from #241:

I have instead imagined a solution where the fibers/coroutines/whatever are properly structured (with parent-child relations) in virtual thread lanes.

sounds very useful.

since any function can yield and resume at a later time, it's near impossible to tell when a function is being called fresh vs when it's resuming.

for example: [screenshot]

the function highlighted has the name /datum/controller/subsystem/mapping/proc/loadStation. it is called in source code only once (pseudocode: await loadStation()), but it yields 88 times and resumes on the next frame, meaning this single function call has 89 zones. being able to determine that these zones are all related (the same call) via arrows or whatnot would be useful.

I will probably make some synthetic task dispatcher with clear semantics, for test purposes, and leave proper instrumentation in real use scenarios to the user.

sounds great to me

simplyWiri commented 2 years ago

To add more support for the idea: I was envisioning something less specific to fibers, and less about linking two zones together, and more about correlating a set of zones by some common key.

Tracking the flow of data across a non-synchronous system to derive latency statistics across multiple sections of code is challenging with a traditional profiler (incl. tracy). An example of this could be a compiler: how long did it take to compile a particular method across lexing, parsing, ..., code generation? If this process is serial, it may be relatively simple to eyeball; however, in a more complex system - with, for example, multithreading, batching, or fibers - this becomes significantly harder to eyeball and to derive useful metrics from.


If we could associate named metadata with zones - e.g. for a method we could pass a uint64_t methodId to uniquely identify it - we could then effectively write 'queries' which enable the correlation of multiple zones across multiple "systems" (zones, in tracy's lingo) that are all associated with the same bit of data.
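As a sketch of the tagging half using what Tracy already ships - ZoneValue attaches a uint64_t to the current zone - with the query/correlation layer on top being the hypothetical part:

```cpp
#include <cstdint>
#include "tracy/Tracy.hpp"

// Each pipeline stage opens its own zone and tags it with the id of the
// method it is processing. Today Tracy only displays the value; joining
// zones by it would be the new, query-like functionality.
void Lex( uint64_t methodId )
{
    ZoneScopedN( "Lex" );
    ZoneValue( methodId );
    // ... lexing work for this method ...
}

void Parse( uint64_t methodId )
{
    ZoneScopedN( "Parse" );
    ZoneValue( methodId );
    // ... parsing the same method, possibly on another thread ...
}
```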

This could also help visualize interleaved frame hierarchies.

                |   Frame "0"   |   Frame "1"   |   Frame "2"   | 
----------------+---------------+---------------+---------------+
T0 - Rendering: |  frame n - 2  |  frame n - 1  |  frame n      |
----------------+---------------+---------------+---------------+
T1 - Physics:   |  frame n - 1  |  frame n      |  frame n + 1  | 
----------------+---------------+---------------+---------------+
T2 - Logic:     |  frame n      |  frame n + 1  |  frame n + 2  |
-----------------------------------------------------------------

If you could tie the frame to a particular set of zones, you could better trace the latency per frame. However, this use case applies even better to systems where the order in which work gets done is less rigid.


Unfortunately, I think that to use this as I envision it would effectively require database-like storage of this data & its associations for efficient access, akin to chrome-trace / perfetto (to allow for query-like joins, etc).