ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

zig cc creating significantly slower binaries than GCC and clang in certain edge cases #16704

Open xdBronch opened 11 months ago

xdBronch commented 11 months ago

Zig Version

0.12.0-dev.3+9c05810be

Steps to Reproduce and Observed Behavior

Clone https://github.com/karpathy/llama2.c, download the model with wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin, then run make runfast followed by ./run stories15M.bin. The Makefile uses GCC by default, but the compiler can be overridden there. Here are some data points from my computer:

Compiler/flags          Tokens/s
GCC                     ~305
GCC -march=native       ~440
Clang                   ~370
Clang -march=native     ~450
Zig CC                  ~82
Zig CC -march=native    ~82

All of these are passed -Ofast by the runfast target. If you instead use make run, which uses -O3, GCC and Clang drop to roughly the same level as zig cc, ~85 tokens/s. This suggests that zig cc is not honoring -Ofast and is instead falling back to a lower optimization level.

I asked about this in the Zig Discord and was told that for zig cc "-O3 is translated into ReleaseFast, which is O2". If that is the case, I believe it's reasonable when compiling Zig code directly, if there's a good reason for it. But zig cc is often advertised as a better way to compile C, a drop-in replacement. I understand that most programs will not benefit this much from -Ofast, and some may even slow down, but to be an actual replacement it needs to at least match the competition.

Expected Behavior

zig cc should produce code as fast as other compilers do with the same flags.

andrewrk commented 11 months ago

Workaround: use the more powerful zig build-exe CLI, which lets you pass -O3 directly with:

  -cflags [flags] --        Set extra flags for the next positional C source files
xdBronch commented 11 months ago

Not sure if I'm doing something wrong, but zig build-exe run.c -cflags -Ofast -lm -march=native -- -lc runs at 10 tokens/s; adding -OReleaseFast after the cflags brings it back up to 82.

andrewrk commented 11 months ago

There's an environment variable to print the clang command: ZIG_VERBOSE_CC=1

Edit: your positional argument comes before the flags rather than after them. From the help text:

Set extra flags for the next positional C source files

xdBronch commented 11 months ago

Ah yeah, that's my bad. Putting the C file after the flags runs at 40 tokens/s, and adding -OReleaseFast brings it to ~415. This is obviously a massive improvement, but still a bit behind. I'm also a little confused about why I need both C and Zig optimization flags to get good results.

andrewrk commented 11 months ago

You should start by comparing the clang command lines, and make sure the clang versions are the same.

xdBronch commented 11 months ago

I found that this can be worked around fairly easily (if a bit hackily) by having Zig append the optimization level to clang's argv when it is -Ofast.

Maybe, instead of a workaround like this, Zig itself could gain flags that match the behavior of -Ofast? We have @setFloatMode on the Zig side, but there's nothing to enable it globally. I think adding something like that would both fix this problem and speed up existing Zig code, which sounds like a win-win to me. Please let me know if there's a reason this can't be an option.

icls1337 commented 11 months ago

try -Xclang -Ofast