Open scheibo opened 11 months ago
Mmm, interesting. Looks like reverting the change does indeed restore the perf.
--- a/src/codegen/llvm.zig
+++ b/src/codegen/llvm.zig
@@ -10528,7 +10528,8 @@ pub const FuncGen = struct {
                 if (isByRef(elem_ty, mod)) {
                     return self.loadByRef(ptr, elem_ty, ptr_alignment, access_kind);
                 }
-                return self.loadTruncate(access_kind, elem_ty, ptr, ptr_alignment);
+                //return self.loadTruncate(access_kind, elem_ty, ptr, ptr_alignment);
+                return self.wip.load(access_kind, try o.lowerType(elem_ty), ptr, ptr_alignment, "");
             }
             const containing_int_ty = try o.builder.intType(@intCast(info.packed_offset.host_size * 8));
A quick look with perf report shows the main difference comes from gen1.data.Battle(common.rng.PRNG(1)).choice, which uses a u4 loop index.
var slot: u4 = 2;
while (slot <= 6) : (slot += 1) {
    const id = side.order[slot - 1];
    if (id == 0 or side.pokemon[id - 1].hp == 0) continue;
    out[n] = .{ .type = .Switch, .data = slot };
    n += 1;
}
The loop (and others in the function) was previously unrolled by LLVM, and no longer is.
I've tried to poke at it a little bit, but have no idea how to fix this.
Short of doing some kind of range-propagation pass or something, I'm not quite sure how it is possible to distinguish between "this is a nice clean local variable" and "this may contain some leftover uninitialized bits" (as in #14200).
(But then, I know nothing about LLVM or Zig internals, so maybe there's a way...)
Of course, as a workaround, changing the loop counter to u8 (and adding a few @intCast()s) removes the truncations on the loop index and allows LLVM's heuristics to work and unroll/simplify the code.
Thanks for digging in, @xxxbxxx! It's nice to know that I can work around this with @intCast and different loop counter types throughout the code, though I'm a little worried in general about the performance cliff/footgun, so I don't know what Zig wants to do here.
> Of course, as a workaround, changing the loop counter to u8 (and adding a few @intCast()s) removes the truncations on the loop index and allows llvm heuristics to work and unroll/simplify the code.
Changing all non-power-of-2 loop counters to u8 or usize and then adding @intCast()s solved half of the problem (i.e. this is now a 5% regression instead of a 10% regression), but I'm still stuck with a sizable regression.
I'm also confused as to why #17391 is problematic if it's supposed to only be extending a change that didn't cause any regression to "optionals and unions", and my project does not use unions and has no non-byte-sized optionals.
> I'm also confused as to why #17391 is problematic if it's supposed to only be extending a change that didn't cause any regression to "optionals and unions", and my project does not use unions and has no non-byte-sized optionals.
That's a misunderstanding: the "change that didn't cause any regression" is indeed the change triggering the performance issue...
Some notable regressions here 👇
- ~24% regression when comparing zig 11 vs 12/13 with ReleaseFast
- ~33% regression when comparing zig 11 vs 12/13 with ReleaseSmall
macOS: 13.6.6
CPU: Intel Core i9-9900K
zigV build-lib --gc-sections -fsingle-threaded -dead_strip -dead_strip_dylibs -mcpu=native -OReleaseMODE -dynamic -lc src/lib.zig -fallow-shlib-undefined -femit-bin=dist/lib-V-MODE
ZIG 11 Fast x 20,166 ops/sec ±0.35% (96 runs sampled)
ZIG 11 Small x 14,645 ops/sec ±0.39% (95 runs sampled)
ZIG 12 Fast x 16,136 ops/sec ±0.36% (97 runs sampled)
ZIG 12 Small x 10,988 ops/sec ±0.33% (94 runs sampled)
ZIG 13 Fast x 16,354 ops/sec ±0.33% (96 runs sampled)
ZIG 13 Small x 11,420 ops/sec ±0.37% (97 runs sampled)
🚀 Fastest: ZIG 11 Fast
🐌 Slowest: ZIG 12 Small
Would love to have zig 13 Fast's size with zig 11 Fast's speed 😄
I just spent the day reverting my project to 0.11.0. As noticed by @Inqnuam, I'm now actually seeing a 20%+ regression, not just 10%, and @xxxbxxx's suggestion above no longer seems to move the needle much at all.
My project builds with the Zig compiler at HEAD and all the way back to 0.11.0. I don't know how long that will be possible in the wake of breaking language changes (it seems like only breaking standard library and build system changes have occurred since 0.11.0), but currently I feel it would serve as a uniquely suitable testbed for someone interested in attempting to improve compiler performance or the performance of compiled output. I would be very surprised if other projects haven't also experienced at least some sort of slowdown since 0.11.0; I just imagine such a slowdown is harder to attribute directly to Zig in the way my project is able to, given the difficulty of supporting multiple Zig versions and how much a project's code and feature set would usually change over time.
Zig Version
0.12.0-dev.876+aaf46187a
Steps to Reproduce and Observed Behavior
I noticed a regression on my project's benchmark and bisected it via nightlies:
zig-macos-aarch64-0.12.0-dev.866+3a47bc715
produces code that runs ~10% faster than zig-macos-aarch64-0.12.0-dev.876+aaf46187a.
The benchmark tool uses std.time.Timer to perform its own internal timing, which it prints out as the first number there (the second two numbers are for confirming the benchmarks are computing the same results).
Alternatively, since the regression is large enough, you can literally just use time (though this is measuring a different thing than the internal benchmark timer, but that's kind of unimportant from Zig's POV).
Expected Behavior
There not to be a regression :)
From https://github.com/ziglang/zig/compare/3a47bc715...aaf46187a I'm guessing #17391 is a likely suspect?