wolfpld / tracy

Frame profiler
https://tracy.nereid.pl/
Other
10.07k stars 667 forks source link

Allow samples of inlined functions to be counted for at the callsite source code. #476

Closed mcourteaux closed 2 years ago

mcourteaux commented 2 years ago

For example, in my project that makes quite some use of Eigen, I get these results when using sampled profiling: image The last three lines actually do quite some work. There is a 4-vector difference, there is a matrix4*vector4 multiplication, there is a vector4-squareing and a vector-4 horizontal add happening. All of this is provided by inlined functions from the Eigen library. All of the instructions corresponding to it are right there in the assembly window: image I selected the appropriate range of instructions (in red percentages), and you can see that it is good for 15.7% of the samples of this function. That is a lot! Yet it is not visible in the source view, because instead (as you can see from this screenshot too), it is attributed to the lines of source code that define the inline function (in this case .../SSE/PacketMath.h).

The good news is that Tracy does know where these instructions actually belong to. As the manual says, middle mouse click on the percentage to open the entry call stack. This shows you which code has called the inlined function: image

So, I propose to have a toggle that lets you include all the samples of the instructions that originate from inlined functions, to be also attributed to the source code of the call site.


One annoying thing I notice here, is that not all instructions find their way all the way back to the line where the inlined function is called, for some reason... You might know better why, but here is an example: image The arrow points to the instruction for which the call stack never shows any line of mine. The block marked with the red thick line right below it come from the squaredNorm() call, which is visible in the call stack. Here is the callstack window for the instruction with the arrow: image I have trouble reading this window. I think that every line int his "call stack" is actually a difference instance of some inlined code, because if I right-click on the corresponding Location in the Location column the Assembly view jumps up and down through the assembly of the function. In general very confusing to me.

mcourteaux commented 2 years ago

Okay, regarding the problem that I mentioned after the separator in the OP of the instruction not being linked back to the original line of code, I investigated it more, and the debug symbols do correctly contain this information. I focused on the same instruction, and identified it's address using gdb with the disassemble command. Using the address of that instruction, I run addr2line (apparently a pretty standard linux tool, didn't know it until now):

addr2line --exe reconstructor -if 0x00000000000ccb30

Which gives the output:

_Z57reconstruct_lf4d_cpu_alpha_compositing_block_optimized_v2St10shared_ptrIN5model8MVNModelILm4ELm3EEEEPKN8renderer12RenderParamsEiiSt8functionIFv23reconstruction_progressEEi
/usr/lib/gcc/x86_64-linux-gnu/9/include/xmmintrin.h:186
_ZN5Eigen8internal4paddIDv4_fEET_RKS3_S5_
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/arch/SSE/PacketMath.h:290
_ZN5Eigen8internal5pmaddIDv4_fEET_RKS3_S5_S5_
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/GenericPacketMath.h:1052
_ZN5Eigen8internal24etor_product_packet_implILi0ELi4ENS0_9evaluatorINS_6MatrixIfLi4ELi4ELi0ELi4ELi4EEEEENS2_INS3_IfLi4ELi1ELi0ELi4ELi1EEEEEDv4_fLi16EE3runEllRKS5_RKS7_lRS8_
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/ProductEvaluators.h:691
_ZNK5Eigen8internal17product_evaluatorINS_7ProductINS_6MatrixIfLi4ELi4ELi0ELi4ELi4EEENS3_IfLi4ELi1ELi0ELi4ELi1EEELi1EEELi3ENS_10DenseShapeES7_ffE6packetILi16EDv4_fEEKT0_ll
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/ProductEvaluators.h:631
_ZN5Eigen8internal31generic_dense_assignment_kernelINS0_9evaluatorINS_6MatrixIfLi4ELi1ELi0ELi4ELi1EEEEENS2_INS_7ProductINS3_IfLi4ELi4ELi0ELi4ELi4EEES4_Li1EEEEENS0_9assign_opIffEELi0EE12assignPacketILi16ELi16EDv4_fEEvll
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:676
_ZN5Eigen8internal31generic_dense_assignment_kernelINS0_9evaluatorINS_6MatrixIfLi4ELi1ELi0ELi4ELi1EEEEENS2_INS_7ProductINS3_IfLi4ELi4ELi0ELi4ELi4EEES4_Li1EEEEENS0_9assign_opIffEELi0EE24assignPacketByOuterInnerILi16ELi16EDv4_fEEvll
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:690
_ZN5Eigen8internal47copy_using_evaluator_innervec_CompleteUnrollingINS0_31generic_dense_assignment_kernelINS0_9evaluatorINS_6MatrixIfLi4ELi1ELi0ELi4ELi1EEEEENS3_INS_7ProductINS4_IfLi4ELi4ELi0ELi4ELi4EEES5_Li1EEEEENS0_9assign_opIffEELi0EEELi0ELi4EE3runERSD_
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:281
_ZN5Eigen8internal21dense_assignment_loopINS0_31generic_dense_assignment_kernelINS0_9evaluatorINS_6MatrixIfLi4ELi1ELi0ELi4ELi1EEEEENS3_INS_7ProductINS4_IfLi4ELi4ELi0ELi4ELi4EEES5_Li1EEEEENS0_9assign_opIffEELi0EEELi2ELi2EE3runERSD_
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:490
_ZN5Eigen8internal26call_dense_assignment_loopINS_6MatrixIfLi4ELi1ELi0ELi4ELi1EEENS_7ProductINS2_IfLi4ELi4ELi0ELi4ELi4EEES3_Li1EEENS0_9assign_opIffEEEEvRT_RKT0_RKT1_
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:786
_ZN5Eigen8internal10AssignmentINS_6MatrixIfLi4ELi1ELi0ELi4ELi1EEENS_7ProductINS2_IfLi4ELi4ELi0ELi4ELi4EEES3_Li1EEENS0_9assign_opIffEENS0_11Dense2DenseEvE3runERS3_RKS6_RKS8_
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:955
_ZN5Eigen8internal24call_assignment_no_aliasINS_6MatrixIfLi4ELi1ELi0ELi4ELi1EEENS_7ProductINS2_IfLi4ELi4ELi0ELi4ELi4EEES3_Li1EEENS0_9assign_opIffEEEEvRT_RKT0_RKT1_
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:891
_ZN5Eigen8internal20generic_product_implINS_6MatrixIfLi4ELi4ELi0ELi4ELi4EEENS2_IfLi4ELi1ELi0ELi4ELi1EEENS_10DenseShapeES5_Li3EE6evalToIS4_EEvRT_RKS3_RKS4_
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/ProductEvaluators.h:404
_ZN5Eigen8internal10AssignmentINS_6MatrixIfLi4ELi1ELi0ELi4ELi1EEENS_7ProductINS2_IfLi4ELi4ELi0ELi4ELi4EEES3_Li0EEENS0_9assign_opIffEENS0_11Dense2DenseEvE3runERS3_RKS6_RKS8_
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/ProductEvaluators.h:150
_ZN5Eigen8internal24call_assignment_no_aliasINS_6MatrixIfLi4ELi1ELi0ELi4ELi1EEENS_7ProductINS2_IfLi4ELi4ELi0ELi4ELi4EEES3_Li0EEENS0_9assign_opIffEEEEvRT_RKT0_RKT1_
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:891
_ZN5Eigen15PlainObjectBaseINS_6MatrixIfLi4ELi1ELi0ELi4ELi1EEEE12_set_noaliasINS_7ProductINS1_IfLi4ELi4ELi0ELi4ELi4EEES2_Li0EEEEERS2_RKNS_9DenseBaseIT_EE
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/PlainObjectBase.h:795
_ZN5Eigen15PlainObjectBaseINS_6MatrixIfLi4ELi1ELi0ELi4ELi1EEEEC4INS_7ProductINS1_IfLi4ELi4ELi0ELi4ELi4EEES2_Li0EEEEERKNS_9DenseBaseIT_EE
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/PlainObjectBase.h:594
_ZN5Eigen6MatrixIfLi4ELi1ELi0ELi4ELi1EEC4INS_7ProductINS0_IfLi4ELi4ELi0ELi4ELi4EEES1_Li0EEEEERKNS_9EigenBaseIT_EE
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/Matrix.h:413
_Z57reconstruct_lf4d_cpu_alpha_compositing_block_optimized_v2St10shared_ptrIN5model8MVNModelILm4ELm3EEEEPKN8renderer12RenderParamsEiiSt8functionIFv23reconstruction_progressEEi
/home/martijn/w/SMoE/renderer/build_rel/../src/cpu_reconstructors/modular_cpu_reconstructor_lf4d.hpp:1016

And there you have it! The last line is the line that triggers this waterfall of includes, which is the line of my code! For ease of reading, filtering out the lines indicating the source code, we get:

❯ addr2line --exe reconstructor -if 0x00000000000ccb30 | grep :
/usr/lib/gcc/x86_64-linux-gnu/9/include/xmmintrin.h:186
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/arch/SSE/PacketMath.h:290
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/GenericPacketMath.h:1052
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/ProductEvaluators.h:691
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/ProductEvaluators.h:631
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:676
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:690
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:281
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:490
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:786
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:955
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:891
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/ProductEvaluators.h:404
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/ProductEvaluators.h:150
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/AssignEvaluator.h:891
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/PlainObjectBase.h:795
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/PlainObjectBase.h:594
/home/martijn/w/SMoE/renderer/build_rel/../lib/eigen/Eigen/src/Core/Matrix.h:413
/home/martijn/w/SMoE/renderer/build_rel/../src/cpu_reconstructors/modular_cpu_reconstructor_lf4d.hpp:1016

This is an exact match with the call stack window of Tracy, but just truncated until 14 (which is a little weird, but seems to be 16 - 2 (one for the "Location" and one for the "Entry point")) inlines deep. Notice how Tracy handles the whole concept of looking at the inlines in a weird way: it treats it as a symbol, which is at a certain location, with an entry point that is all a little messed up (to see what I mean compare the top 3 lines of the addr2line output with the information displayed in the Tracy window in the Symbol, Location, and Entry point fields, and the first "frame" of the call stack). Additionally, Tracy indicating frame 0 for the last one might be a result of the truncation, or the cause. Because that zeroth frame is clearly also inlined.

wolfpld commented 2 years ago

I selected the appropriate range of instructions (in red percentages), and you can see that it is good for 15.7% of the samples of this function. That is a lot! Yet it is not visible in the source view, because instead (as you can see from this screenshot too), it is attributed to the lines of source code that define the inline function (in this case .../SSE/PacketMath.h).

Symbols can be built from multiple functions, which may come from multiple source files. You can change the source file which is displayed with the selector atop the source view. Or just right-click on the source filename in the assembly line. You will find your missing statistics there.

So, I propose to have a toggle that lets you include all the samples of the instructions that originate from inlined functions, to be also attributed to the source code of the call site.

I'm not sure how that would be supposed to work. You can have code that spans across multiple files even without any inlining, e.g.:

[21:03 wolf@oberon:~]% cat f1.c
#include <stdio.h>
int main()
{
#include "f2.c"
}
[21:03 wolf@oberon:~]% cat f2.c
printf( "This is stupid!\n" );
[21:03 wolf@oberon:~]% gcc f1.c
[21:04 wolf@oberon:~]% ./a.out
This is stupid!

This is an exact match with the call stack window of Tracy, but just truncated until 14 (which is a little weird, but seems to be 16 - 2 (one for the "Location" and one for the "Entry point")) inlines deep.

There is indeed an artificial limit on the inline stack depth that you have hit.

Notice how Tracy handles the whole concept of looking at the inlines in a weird way: it treats it as a symbol

The name is wrong in that window. It is more correctly referenced to as "function" in the popup when you hover over the source filename on the asm line.

wolfpld commented 2 years ago

There is indeed an artificial limit on the inline stack depth that you have hit.

Fixed in 7552341f.

The name is wrong in that window.

Fixed in 5874ac11f.

mcourteaux commented 2 years ago

I'm not sure how that would be supposed to work. You can have code that spans across multiple files even without any inlining, e.g.:

I think there is a fair difference between an inlined function call, and some preprocessor abuse.

Or just right-click on the source filename in the assembly line. You will find your missing statistics there.

I know they are there, but the whole it is extremely cumbersome to find those inlined functions (as you can see, Eigen did a great job spreading the implementation of a simple mulps over several files and inlined functions). The fact that the cost of those individual instructions, that originate from an inlined function are not shown right where you call the inlined function (in the Tracy source view) is not very helpful. I'd be much better IMO to have an overview of where the time actually goes in the function. Just hiding the fact that these lines take up any time at all is just misleading and causing a lot of going back and forth between the file (or files!) that implement the inlined function.

So, as Tracy actually knows through those call stacks (which you just increased the limit of), where an instruction actually originates from from the deepest not-inlined line of source code, it would be helpful to attribute that optionally to that line. Alternatively, I think it would also be useful to just let those recorded samples attribute ALL the inlined frames up the callstack. Now you get useful overview of where the time actually goes when you are analyzing/optimizing both a regular function and an inlined function.

wolfpld commented 2 years ago

I think there is a fair difference between an inlined function call, and some preprocessor abuse.

Well, both are valid uses of the language. You can't just decide to not support something because you don't like it.

Alternatively, I think it would also be useful to just let those recorded samples attribute ALL the inlined frames up the callstack. Now you get useful overview of where the time actually goes when you are analyzing/optimizing both a regular function and an inlined function.

Implemented in ff36c900.

I know they are there, but the whole it is extremely cumbersome to find those inlined functions

I'm afraid this will be still an issue for you, even with the above fix. The solution should probably be a popup menu opened in the context of the source line, which would direct you to the appropriate inline functions called from the line.

mcourteaux commented 2 years ago

Oh! Really cool! Thank you for considering my idea. :smile: I'll check it out!

mcourteaux commented 2 years ago

Dang it's nice! This is good! Thank you!

I'd mention in the manual that you should not have the checkbox next to the function image dropdown selected for this to work.

Suggestion: Regarding the assembly-instruction-highlighting behavior on hovering a source line, it would be cool if the "Propagate inlines" is enabled, that the corresponding assembly lines coming from the hierarchy of inlines also light up.


Appreciation screenshot:

Without (misleading statistics, IMO, once you compare it to the next screenshot): image

With (interesting, as you can see now the time the operator()() (doing address computation) contributes to the total, std::sqrt() contributing to the total): image

I think the hardware impact metrics (like cache miss) are not propagated up. That I think is useful too to see, right? If a whole hierarchy of inlines is causing a lot of cache misses, you won't see it in the overview.

wolfpld commented 2 years ago

I'd mention in the manual that you should not have the checkbox next to the function image dropdown selected for this to work.

Possibly, there are a lot of interactions between the options.

Suggestion: Regarding the assembly-instruction-highlighting behavior on hovering a source line, it would be cool if the "Propagate inlines" is enabled, that the corresponding assembly lines coming from the hierarchy of inlines also light up.

This is a good solution when you are looking at the problem from the perspective of small, inlined math functions, like you have. However, in some other cases, for example when a small "entry" function performs dispatch into several instances of big, inlined workload functions, this behavior would be hardly usable. In fact, this is something that I have actively designed against, like you can see in the following video, where VTune does exactly what you suggest (and offers no other option):

https://youtu.be/_hU7vw00MZ4?t=699

Hence the alternative solution I have proposed.

I think the hardware impact metrics (like cache miss) are not propagated up. That I think is useful too to see, right? If a whole hierarchy of inlines is causing a lot of cache misses, you won't see it in the overview.

Hardware samples do not have a call stack (only the instruction address), so it is not possible to figure out from where the instruction was called. Or maybe it is, in local context?

mcourteaux commented 2 years ago

In fact, this is something that I have actively designed against, like you can see in the following video, where VTune does exactly what you suggest (and offers no other option): https://youtu.be/_hU7vw00MZ4?t=699

Aha, yes I see what you mean. But that doesn't make any of it less valid. (In the video,) It simply is the correct way to associate that single line calling the inlined function to all those assembly instructions (because there is a huge hierachy of inlined functions behind). But consider this, when you zoom in into the source code of that inlined function (by selecting it from the function dropdown), you can now hover over every individual line of the inlined function: those will map on way less lines of assembly.

Of course, I think this behavior makes sense to enable only if the "Propagate inlines" is enabled. I hope this explanation was clear.

Much like you now propagate the sample counts up the callstack, you could propagate the highlighted assembly instructions down from the hovered line. I think the key difference with Tracy and VTune here is that you have the dropdown to select a specific inlined function, to reduce the scope of what you are analyzing. That is extremely powerful and makes what I propose possible. VTune probably can't do this.

wolfpld commented 2 years ago

Hardware samples do not have a call stack (only the instruction address), so it is not possible to figure out from where the instruction was called. Or maybe it is, in local context?

It actually is doable. And it simplifies things a bit.

It simply is the correct way to associate that single line calling the inlined function to all those assembly instructions (because there is a huge hierachy of inlined functions behind).

Part of the simplification process makes it too cumbersome to track source line -> assembly associations, as was done before. A dynamic mapping will be used instead, which will also allow to change how these associations are presented, for example when the propagation option changes.

wolfpld commented 2 years ago

The inline propagation now also applies to both the hardware samples and the assembly line associations to the source lines.

mcourteaux commented 1 year ago

I'd like to thank you for implementing this. This is so extremely useful to me. Thanks a lot!

wolfpld commented 1 year ago

It truly is. Thanks for pushing me to do this.