Inlining and code bloat

First of all, thank you very much for Tracy! I have only seen it in action very briefly, and it is indeed an amazing project!

I have one question about inlining in C++ API. Right now ZoneScopedN adds quite a lot of code.

Here is what I see in MSVC2017 x64 Release for zone start:

!   call    ?GetProfiler@tracy@@YAAEAVProfiler@1@XZ ; tracy::GetProfiler
!   movzx   edx, BYTE PTR [rax+296]
!   test    dl, dl
!   je  SHORT $LN16@FindViewLi
    mov BYTE PTR ___tracy_scoped_zone$[rsp], 1
    call    ?GetProfiler@tracy@@YAAEAVProfiler@1@XZ ; tracy::GetProfiler
    mov rcx, QWORD PTR [rax+304]
    mov QWORD PTR ___tracy_scoped_zone$[rsp+8], rcx
    call    ?GetToken@tracy@@YAPEAUExplicitProducer@?$ConcurrentQueue@UQueueItem@tracy@@UConcurrentQueueDefaultTraits@moodycamel@2@@moodycamel@1@XZ ; tracy::GetToken
    mov rsi, rax
    mov r14, QWORD PTR [rax+40]
    movzx   edi, r14w
    test    rdi, rdi
    jne SHORT $LN65@FindViewLi
    mov rdx, r14
    mov rcx, rax
    call    ?enqueue_begin_alloc@ExplicitProducer@?$ConcurrentQueue@UQueueItem@tracy@@UConcurrentQueueDefaultTraits@moodycamel@2@@moodycamel@tracy@@QEAAX_K@Z ; tracy::moodycamel::ConcurrentQueue<tracy::QueueItem,tracy::moodycamel::ConcurrentQueueDefaultTraits>::ExplicitProducer::enqueue_begin_alloc
$LN65@FindViewLi:
    shl rdi, 5
    add rdi, QWORD PTR [rsi+72]
    mov BYTE PTR [rdi], 14
    rdtsc
    shl rdx, 32                 ; 00000020H
    or  rax, rdx
    mov QWORD PTR [rdi+1], rax
    lea rax, OFFSET FLAT:?__tracy_source_location937@?1??FindViewLightsAndEntities@idRenderWorldLocal@@QEAAXXZ@4USourceLocationData@tracy@@B
    mov QWORD PTR [rdi+9], rax
    lea rax, QWORD PTR [r14+1]
    mov QWORD PTR [rsi+40], rax
    movzx   r14d, BYTE PTR ___tracy_scoped_zone$[rsp]
    jmp SHORT $LN109@FindViewLi
$LN16@FindViewLi:
!   xor r14b, r14b
!   mov BYTE PTR ___tracy_scoped_zone$[rsp], r14b
$LN109@FindViewLi:

and for zone end:

$LN155@FindViewLi:
!   test    r14b, r14b
!   je  SHORT $LN298@FindViewLi
    call    ?GetProfiler@tracy@@YAAEAVProfiler@1@XZ ; tracy::GetProfiler
    mov rcx, QWORD PTR [rax+304]
    cmp rcx, QWORD PTR ___tracy_scoped_zone$[rsp+8]
    jne SHORT $LN298@FindViewLi
    call    ?GetToken@tracy@@YAPEAUExplicitProducer@?$ConcurrentQueue@UQueueItem@tracy@@UConcurrentQueueDefaultTraits@moodycamel@2@@moodycamel@1@XZ ; tracy::GetToken
    mov rdi, rax
    mov rsi, QWORD PTR [rax+40]
    movzx   ebx, si
    test    rbx, rbx
    jne SHORT $LN205@FindViewLi
    mov rdx, rsi
    mov rcx, rax
    call    ?enqueue_begin_alloc@ExplicitProducer@?$ConcurrentQueue@UQueueItem@tracy@@UConcurrentQueueDefaultTraits@moodycamel@2@@moodycamel@tracy@@QEAAX_K@Z ; tracy::moodycamel::ConcurrentQueue<tracy::QueueItem,tracy::moodycamel::ConcurrentQueueDefaultTraits>::ExplicitProducer::enqueue_begin_alloc
$LN205@FindViewLi:
    shl rbx, 5
    add rbx, QWORD PTR [rdi+72]
    mov BYTE PTR [rbx], 16
    rdtsc
    shl rdx, 32                 ; 00000020H
    or  rax, rdx
    mov QWORD PTR [rbx+1], rax
    lea rax, QWORD PTR [rsi+1]
    mov QWORD PTR [rdi+40], rax
$LN298@FindViewLi:

It is worth noting that we have TRACY_ON_DEMAND enabled, since we want to have profiling in the final Release build (so that we could ask a player with performance issues to record and send us a trace).

Looks like quite a lot of code... Wouldn't it be better from the point of code bloat to only check whether profiling is enabled in inline function, and call a non-inlineable function if it is? For instance, then optimizer won't have to decide between favoring Tracy or application code during optimization.

I think we can wrap C API or C++ implementation classes into custom wrapper to get such behavior with some overhead. Just wonder about reasons in the first place.

P.S. We are thinking about integrating Tracy into TheDarkMod game.

For reference, here are the annotated versions of zone begin and end code, compiled with MSVC 2019, and on-demand mode enabled.

Begin, 133 bytes: obraz

End, 97 bytes: obraz

When zones are not inlined, the compiler has to follow the calling convention, both when the functions are invoked:

obraz obraz

And by adding prologue and epilogue to function bodies (highlighted):

obraz obraz

This has a measurable cost. Moreover, when zones are inlined, the compiler has more freedom in deciding where the variables are being held. There was some effort put into making common variables possible to be shared between zone begin and end code. This is not an option when zone begin and end code is in separate symbols, in which case everything has to be loaded from memory, as there is no common environment.

For instance, then optimizer won't have to decide between favoring Tracy or application code during optimization.

I have found out that optimizers are very bad at deciding what is profitable to inline and in most cases it's worthwhile to manually mark force inlining of the hot code (assuming you have profiled the execution beforehand). For example, in etcpak I had two functions:

CompressBlock, which is rather largish, as it has to do some complex processing on a single 4x4 block of pixels.
CompressImage, a rather small dispatch function which extracts blocks from image and calls CompressBlock.

Now, the compiler saw that there is a large function and a small function and it decided that the small function should be inlined (because inlining large functions bloats the code, etc). This is what was happening as a result:

The CompressImage function was reading the image contents from the memory into SIMD registers.
To follow the call conventions, the SIMD registers were stored on stack.
The CompressBlock function prologue had to save contents of the SIMD registers, because they have to be preserved for the caller.
Then it could load the stack contents to SIMD registers.
In the function epilogue the original contents of the SIMD registers had to be restored.

Force inlining CompressBlock resulted in complete elimination of stack usage, because the contents of SIMD registers, as loaded by CompressImage, could be directly used by CompressBlock and the compiler was able to fully comprehend where everything is used and what doesn't need to be preserved for possible future use. Just this resulted in a 2x speed increase.

Even if the compiler happens to make a right guess in your case, you're betting that the same guess will be made by all future versions, or that otherwise unrelated factors (external libraries, etc) will never have an impact on inlining.

I think we can wrap C API or C++ implementation classes into custom wrapper to get such behavior with some overhead.

This should be faily easy to do.

we want to have profiling in the final Release build (so that we could ask a player with performance issues to record and send us a trace).

There are couple of commercial games which do this. One of them is Natural Selection 2.

We are thinking about integrating Tracy into TheDarkMod game.

With darkmod you have some very low hanging fruits to consider. Using Zstd instead of zip for pak compression. Compiling maps to a binary format, so that megabytes of floating point values don't have to be parsed. Other than that darkmod is rather "boring" to profile:

obraz

(This is a 3:23 run of A New Job.)

This has a measurable cost. Moreover, when zones are inlined, the compiler has more freedom in deciding where the variables are being held. There was some effort put into making common variables possible to be shared between zone begin and end code. This is not an option when zone begin and end code is in separate symbols, in which case everything has to be loaded from memory, as there is no common environment.

Ok, it is a good point indeed. The phase-start and phase-end code needs registers, a few volatile registers are not enough, so compiler has to save/load on stack.

I have found out that optimizers are very bad at deciding what is profitable to inline and in most cases it's worthwhile to manually mark force inlining of the hot code (assuming you have profiled the execution beforehand). For example, in etcpak I had two functions:

I definitely agree that compilers are not perfect in inlining. They don't know design intent and only have brief static analysis. Better control inlining manually it if matters =)

For instance, then optimizer won't have to decide between favoring Tracy or application code during optimization.

My point was that when Tracy code and application code are within one function, they start fighting for resources. If e.g. there is shortage in registers, then optimizer will have to decide whether it should save some value from tracy code in a register or give one more register to better optimize the application code. If Tracy code is in separate function, then compiler will not try to steal resources from application in order to help Tracy.

I think we can wrap C API or C++ implementation classes into custom wrapper to get such behavior with some overhead. This should be faily easy to do.

In fact, inlineable API can be turned into non-inlineable by making simple wrappers (if overhead is not very important), but not vice versa. I guess that's another strong point why inlineable API is better.

we want to have profiling in the final Release build (so that we could ask a player with performance issues to record and send us a trace). There are couple of commercial games which do this. One of them is Natural Selection 2.

We realized that we almost have to provide some very minimal wrappers in order to prevent Tracy from listening on socket on players' machines. So if we ever decide that we have so much instrumenting that code bloat is a problem, we will just change the wrappers.

With darkmod you have some very low hanging fruits to consider. Using Zstd instead of zip for pak compression. Compiling maps to a binary format, so that megabytes of floating point values don't have to be parsed. Other than that darkmod is rather "boring" to profile:

Hey, you are fast!

I suppose you did sampling profiling here? Right now Visual Studio covers this point quite well. The main reason for using Tracy is manually-instrumented profiling of game frame, with timeline and GPU profiling with OpenGL timer queries. I'm pretty sure we will try other features in future too.

I'm afraid we are already too tied on zip format and minizip code to consider migration to Zstd. And floating point parsing is not the biggest problem of level loading yet.

My point was that when Tracy code and application code are within one function, they start fighting for resources.

Yes, that was clear. Profiling always has a cost. When you are instrumenting your code with zones, in most cases this cost is negligible, as you can see by looking at sampling data. It becomes significant only in case of small functions, profiling which is an overkill.

obraz

(Note that the profiling cost presented here in the assembly is not attributed in the visible source code, as it's from a different source file.)

Given that zone begin/end events are typically encompassing the whole function, I don't think register pressure due to profiler usage would be a problem. You basically start with a clean state for the profiler to perform what it has to do (the registers will most likely have to be preserverd anyway, for later parts of the code), and then the actual function code is free to use the registers as it sees fit. The only things that may need to be additionally done is moving the function parameters passed in registers to other registers, but this is performed by renaming registers and has virtually zero cost (maybe you'd need a cycle to fetch the instructions).

I suppose you did sampling profiling here? Right now Visual Studio covers this point quite well.

Yes. I find the MSVC profiler very limiting.

I'm afraid we are already too tied on zip format and minizip code to consider migration to Zstd.

This could be an option to use, not necessarily a requirement.

And floating point parsing is not the biggest problem of level loading yet.

Ah yes, I see. I don't suppose the image format can change in midst of loading an image? ;)

obraz

It also seems you are compressing the textures in the driver (the long, child-less unknown zone after the image is loaded)? Don't do that, drivers are not optimized to do this fast.

For the record: we have integrated Tracy into TheDarkMod.

wolfpld / tracy

Inlining and code bloat #226