20%+ regression in ReleaseFast performance since 0.11.0

scheibo commented 11 months ago

Zig Version

0.12.0-dev.876+aaf46187a

Steps to Reproduce and Observed Behavior

I noticed a regression on my project's benchmark and bisected it via nightlies - zig-macos-aarch64-0.12.0-dev.866+3a47bc715 produces code that runs ~10% faster than zig-macos-aarch64-0.12.0-dev.876+aaf46187a:

$ git clone https://github.com/pkmn/engine.git
<snip>
$ cd engine
$ git reset --hard 12e7292c9df3ad73c9681205fb3acfee1536c425
$ zig version
0.12.0-dev.866+3a47bc715
$ zig build benchmark --  1 1000/10000 281483566841860
122294722,1232901,1471351722788935846

# change zig versions
$ zig version
0.12.0-dev.876+aaf46187a
$ zig build benchmark --  1 1000/10000 281483566841860
137394552,1232901,1471351722788935846

The benchmark tool uses std.time.Timer to perform its own internal timing which it prints out as the first number there (second two numbers are for confirming the benchmarks are computing the same results).

Alternatively, since the regression is large enough you can literally just use time (though this is measuring a different thing than the internal benchmark timer, but thats kind of unimportant from Zig's POV):

$ zig version
0.12.0-dev.866+3a47bc715
$ zig build --verbose benchmark --  1 1000/10000 281483566841860
<grab actual binary from output to run>
$ time ./zig-cache/o/7cb294da313404bb8f51f4c304d47793/benchmark 1 1000/10000 281483566841860
124231503,1232901,1471351722788935846

real    0m0.178s
user    0m0.168s
sys 0m0.005s

# change zig versions
$ zig version
0.12.0-dev.876+aaf46187a
$ zig build --verbose benchmark --  1 1000/10000 281483566841860
<grab actual binary from output to run>
$ time ./zig-cache/o/30082df782561e9cee72e116248d663a/benchmark 1 1000/10000 281483566841860
136831126,1232901,1471351722788935846

real    0m0.198s
user    0m0.182s
sys 0m0.006s

Expected Behavior

There not to be a regression :)

From https://github.com/ziglang/zig/compare/3a47bc715...aaf46187a I'm guessing #17391 is a likely suspect?

xxxbxxx commented 11 months ago

mmm interresting.

looks like reverting the change does indeed restore the perf.

--- a/src/codegen/llvm.zig
+++ b/src/codegen/llvm.zig
@@ -10528,7 +10528,8 @@ pub const FuncGen = struct {
             if (isByRef(elem_ty, mod)) {
                 return self.loadByRef(ptr, elem_ty, ptr_alignment, access_kind);
             }
-            return self.loadTruncate(access_kind, elem_ty, ptr, ptr_alignment);
+            //return self.loadTruncate(access_kind, elem_ty, ptr, ptr_alignment);
+            return self.wip.load(access_kind, try o.lowerType(elem_ty), ptr, ptr_alignment, "");
         }

         const containing_int_ty = try o.builder.intType(@intCast(info.packed_offset.host_size * 8));

a quick look with perf report shows the main difference comes from gen1.data.Battle(common.rng.PRNG(1)).choice that uses a u4 loop index.

            var slot: u4 = 2;
            while (slot <= 6) : (slot += 1) {
                const id = side.order[slot - 1];
                if (id == 0 or side.pokemon[id - 1].hp == 0) continue;
                out[n] = .{ .type = .Switch, .data = slot };
                n += 1;
            }

The loop (and others in the function) was previously unrolled by llvm, and no longer is.

xxxbxxx commented 11 months ago

I've tried to poke at it a little bit, but no idea how to fix this.

Short of doing some kind of range propagation pass or something, I'm not quite sure how it is possible distinguish between "this is a nice clean local variable" and "this may contain some uninitialized left over bits" (as in #14200)

(but then, I know nothing about llvm or zig internals, so maybe there's a way...)

Of course, as a workaround, changing the loop counter to u8 (and adding a few @intCast()) removes the truncations on the loop index and allows llvm heuristics to work and unroll/simplify the code.

scheibo commented 11 months ago

Thanks for digging in, @xxxbxxx! Its nice to know that I can work around this with @intCast and different loop counter types throughout the code, though I'm a little worried in general about the performance cliff/footgun so I don't know what Zig wants to do here

scheibo commented 10 months ago

Of course, as a workaround, changing the loop counter to u8 (and adding a few @intCast()) removes the truncations on the loop index and allows llvm heuristics to work and unroll/simplify the code.

Changing all non-power-of-2 loop counters to u8 or usize and then adding @intCast()s solved half of the problem (i.e. this is now a 5% regression instead of a 10% regression), but im still stuck with a sizable regression.

I'm also confused as to why #17391 is problematic if its supposed to only be extending a change that didn't cause any regression to "optional and unions", and my project does not use unions and has no non-byte-sized optionals.

xxxbxxx commented 10 months ago

I'm also confused as to why #17391 is problematic if its supposed to only be extending a change that didn't cause any regression to "optional and unions", and my project does not use unions and has no non-byte-sized optionals.

that's a misunderstanding: the "change that didn't cause any regression" is indeed the change triggering the performance issue...

Inqnuam commented 4 months ago

Some notable regressions here 👇 ~24% regression when comparing zig 11 vs 12/13 with ReleaseFast ~33% regression when comparing zig 11 vs 12/13 with ReleaseSmall

macOS: 13.6.6
CPU: Intel Core i9-9900K

zigV build-lib --gc-sections -fsingle-threaded -dead_strip -dead_strip_dylibs -mcpu=native -OReleaseMODE -dynamic -lc  src/lib.zig -fallow-shlib-undefined -femit-bin=dist/lib-V-MODE

ZIG 11 Fast x 20,166 ops/sec ±0.35% (96 runs sampled)
ZIG 11 Small x 14,645 ops/sec ±0.39% (95 runs sampled)
ZIG 12 Fast x 16,136 ops/sec ±0.36% (97 runs sampled)
ZIG 12 Small x 10,988 ops/sec ±0.33% (94 runs sampled)
ZIG 13 Fast x 16,354 ops/sec ±0.33% (96 runs sampled)
ZIG 13 Small x 11,420 ops/sec ±0.37% (97 runs sampled)

🚀 Fastest: ZIG 11 Fast
🐌 Slowest: ZIG 12 Small

Would love to have zig 13 Fast's size with zig 11 Fast's speed 😄

scheibo commented 1 month ago

I just spent the day reverting my project to 0.11.0 - as noticed by @Inqnuam I'm now actually seeing 20%+ regression, not just 10% (and @xxxbxxx's suggestion above) no longer seems to move the needle much at all.

My project builds with the Zig compiler at HEAD and all the way back to 0.11.0. I don't know how long that will possible in the wake of breaking language changes (it seems like only breaking standard library and build system changes have occurred since 0.11.0), but currently I feel it would serve as a uniquely suitable testbed for someone interesting in attempting to improve compiler performance or the performance of compiled output. I would be very surprised if other projects haven't also experienced at least some sort of slowdown since 0.11.0, I just imagine such a slowdown is harder to attribute directly to Zig in the same way my project is able to due to the difficulties of supporting multiple Zig versions and how much project's code and feature set would usually change over time.

ziglang / zig