ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

Global write optimized-out even though side effect is possible #20452

Open cdemirer opened 2 months ago

cdemirer commented 2 months ago

Zig Version

0.12.0 (also newer versions)

Steps to Reproduce and Observed Behavior

I'm not 100% confident with atomics and memory orderings, but this might be a bug.

const std = @import("std");

var FLAG = std.atomic.Value(bool).init(false);
var SHARED_VALUE: usize = 0;

pub fn worker_fn() void {
    while (true) {
        if (!FLAG.load(.seq_cst)) continue;
        std.debug.print("Shared value is: {}\n", .{SHARED_VALUE});
        FLAG.store(false, .seq_cst);
    }
}

pub fn main() !void {
    _ = try std.Thread.spawn(.{}, worker_fn, .{});

    SHARED_VALUE = 1;
    magic();
    SHARED_VALUE = 2;
    magic();
}

pub fn magic() void {
    FLAG.store(true, .seq_cst);
    while (FLAG.load(.seq_cst)) {}
}

When compiled with -O ReleaseSmall, it outputs:

Shared value is: 0
Shared value is: 2

If the magic() function is inlined (either manually or via the inline keyword), the problem disappears.


I also have this modified version that triggers this behavior in -O ReleaseFast as well:

Click to expand

```zig
const std = @import("std");

const MAGIC_NUM_1 = 6; // Try 5 :)
const MAGIC_NUM_2 = 7; // Try 6 :)

var FLAG = std.atomic.Value(bool).init(false);
var SHARED_VALUE: usize = 0;

pub fn worker_fn() void {
    while (true) {
        if (!FLAG.load(.seq_cst)) continue;
        std.debug.print("Shared value is: {}\n", .{SHARED_VALUE});
        FLAG.store(false, .seq_cst);
    }
}

pub fn main() !void {
    _ = try std.Thread.spawn(.{}, worker_fn, .{});

    SHARED_VALUE = 1; // Optimized-out in ReleaseFast, if the magic values are just right.
    magic();
    SHARED_VALUE = 2;
    magic();
}

pub fn magic() void {
    for (0..MAGIC_NUM_1) |_| {
        if (!FLAG.load(.seq_cst)) {
            FLAG.store(true, .seq_cst);
        }
    }
    for (0..MAGIC_NUM_2) |_| {
        while (FLAG.load(.seq_cst)) {}
    }
}
```

Output:

```
Shared value is: 0
Shared value is: 2
```

Expected Behavior

Maybe

Shared value is: 1
Shared value is: 2

but I'm not sure.

rohlem commented 2 months ago

From my understanding (based on what I remember of the C/C++ memory model, which Zig's will probably be based on, see https://github.com/ziglang/zig/issues/6396), compilers are always allowed to hoist and elide reads and writes that carry no ordering constraints - that's why you use atomic accesses with memory orderings in the first place. EDIT2: I misremembered. From cppreference:

All memory writes (including non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory.

Given this, I'm not sure how to justify the behavior of the given program: the read of SHARED_VALUE in worker_fn is sequenced after the worker's load of FLAG (which observed true) and before its store of false to FLAG, which means it should observe the value 1 that the main thread stored to SHARED_VALUE before it stored true to FLAG.

Therefore this truly looks like a bug to me (going by the C/C++ memory model; even if Zig's model isn't decided yet I assume we'd really want translated code to behave the same).

mlugg commented 2 months ago

I agree that this certainly looks like a bug. In this case, acquire/release semantics should be sufficient for the expected behavior to be exhibited, so sequential consistency should certainly get you there.

I wonder whether we're adding some incorrect LLVM function attributes or something.

presentfactory commented 2 months ago

I was discussing this issue with them in the Zig Matrix channel, and my guess is that the semantics of the atomic loads/stores are not being propagated properly to the function that contains them, causing those functions not to be optimized correctly for some reason. I am not sure how this works in compilers, but if you do a release atomic store inside a function, that function itself also needs to act as a release-like memory barrier wherever it is called.

Right now it seems that in ReleaseSmall the compiler keeps the function as-is for smaller code size, and some optimization then breaks these semantics. In ReleaseFast, or when magic is marked inline, or when its contents are manually copied to the call sites, it gets inlined, and I imagine the issue goes away because the compiler is no longer dealing with an opaque function but with the annotated loads and stores themselves. And of course this doesn't happen in Debug, presumably because no optimizations run at all.

This can be seen in the assembly for the ReleaseSmall version:

.LBB1_44:
        ; SHARED_VALUE = 1 implicitly here from it being initialized to 0.
        call    example.magic
        ; SHARED_VALUE = 2, optimization from the compiler treats this like a boolean flag to know if it should print 2 or not.
        mov     byte ptr [rip + example.SHARED_VALUE], 1
        call    example.magic

Vs the ReleaseFast:

.LBB1_88:
        ; SHARED_VALUE = 1
        mov     qword ptr [rip + example.SHARED_VALUE], 1
        ; Inlined Magic call 1
        mov     al, 1
        xchg    byte ptr [rip + example.FLAG.0], al
.LBB1_89:
        movzx   eax, byte ptr [rip + example.FLAG.0]
        test    al, 1
        jne     .LBB1_89
        ; SHARED_VALUE = 2
        mov     qword ptr [rip + example.SHARED_VALUE], 2
        ; Inlined Magic call 2
        mov     al, 1
        xchg    byte ptr [rip + example.FLAG.0], al
.LBB1_91:
        movzx   eax, byte ptr [rip + example.FLAG.0]
        test    al, 1
        jne     .LBB1_91

I initially thought this was the compiler re-ordering things, but if I change the values from 1 and 2 to something else, the ReleaseSmall version keeps the exact same code, which shows that the mov of 1 into SHARED_VALUE isn't actually the assignment of 1 to it; the 1 is being used for some other purpose. The compiler seems to be treating SHARED_VALUE more as a boolean flag to decide whether it should print 0 or the other value (even though 0 is not one of the values it should be printing). For instance, when switching the SHARED_VALUE writes from 1 and 2 to 33 and 378, I get this code in the worker function:

        mov     r15b, byte ptr [rip + example.SHARED_VALUE]
        ; Omitted stuff...
        test    r15b, 1 ; Test the shared value like a flag
        mov     eax, 0
        mov     ecx, 378
        cmovne  rax, rcx ; Use either 0 or 378, clearly this is where the problem is emerging

So maybe the optimizer simply isn't respecting the atomic orderings and is being a bit too clever with this printing optimization. Either way it feels inline-related nonetheless, given that this optimization is no longer done once inlining is forced in ReleaseSmall.

presentfactory commented 2 months ago

Something else interesting: this happens in both clang and gcc if you write the same program in C (thanks to Inverted Boo for writing it), though the optimization flags that trigger it vary (I think LTO needs to be enabled for this weird printing optimization, but it happens with -O2 in clang and -O1 in GCC, for instance): https://godbolt.org/z/Er69sbcPs

Given that both LLVM and GCC do this, I feel like it might imply this isn't a bug but rather some sort of intended behavior, but I don't know enough about the rules of atomic memory orderings to say.

rohlem commented 2 months ago

@presentfactory Interesting data point. Certainly if clang and gcc do the same, we're not necessarily doing anything wrong interfacing with LLVM here.

Replacing SHARED_VALUE with an atomic_int and loading it with memory_order_relaxed seems to result in the expected output, 1 followed by 2. If the only difference between a relaxed access and a non-atomic access is atomicity, maybe the non-atomic read is still somehow a data race, hence UB? Though if reading the variable were UB, that would make the statement that "non-atomic writes become visible side-effects" effectively useless.

cdemirer commented 2 months ago

Is it possible that the atomic memory orderings are always treated as local to function? Could the compiler be doing this (rightfully)?

    magic();
    SHARED_VALUE = 1; // Now it's legal to optimize-out.
    SHARED_VALUE = 2;
    magic();

since it doesn't have to consider atomic operations inside magic() as having side-effects, and the ordering only applies locally?

rohlem commented 2 months ago

@cdemirer While that does describe the symptom here, it doesn't really match the wording I've found.

My interpretation to poke holes in:

The quote above uses [the *happens-before* relation](https://en.cppreference.com/w/cpp/atomic/memory_order#Happens-before), which is defined as either *sequenced-before* or *inter-thread happens-before*. [*Sequenced-before* is defined via evaluation order](https://en.cppreference.com/w/cpp/language/eval_order#Rules), where for our example the important parts seem to be:

- Function call statements are [*expression statements*](https://en.cppreference.com/w/cpp/language/statements#Expression_statements), as are assignments (they return the assigned value in C, the assigned-to lvalue in C++ iirc).
- An expression outside of any other [*full-expression* is itself a *full-expression*](https://en.cppreference.com/w/cpp/language/expressions#Full-expressions). Since no expression spans from one statement to the next in this context, the function call *expression statement*'s expression should indeed be a *full-expression* in this context.
- ["Each value computation and side effect of a *full-expression* is *sequenced-before* each value computation and side effect of the next full-expression."](https://en.cppreference.com/w/cpp/language/eval_order#Rules)

That means the function calls of `magic` and assignments to `SHARED_VALUE` are fully *sequenced-before* each other, and the atomic ordering guarantees should cover their interactions.

@kprotty sorry to ping for something so simple, but since multiple people have now looked at it, do you see some glaring mistake in the code example? Do the non-atomic accesses simply always trigger UB via data race here?

jacobly0 commented 2 months ago

caused by:

Trying to eliminate MemoryDefs killed by 16 = MemoryDef(15) (  store i64 1, ptr @repro.SHARED_VALUE, align 8)
  trying to get dominating access
   visiting 15 = MemoryDef(9) (  call void @llvm.lifetime.end.p0(i64 24, ptr nonnull %1), !noalias !0)
   visiting 9 = MemoryDef(8) (  %41 = tail call i64 @clone(ptr nonnull readonly align 1 @Thread.LinuxThreadImpl.spawn__anon_3473.Instance.entryFn, i64 %39, i64 8195840, i64 %40, ptr nonnull align 4 %37, i64 %33, ptr nonnull align 4 %36), !noalias !3)
  finished walk
Trying to eliminate MemoryDefs killed by 18 = MemoryDef(17) (  store i64 2, ptr @repro.SHARED_VALUE, align 8)
  trying to get dominating access
   visiting 17 = MemoryDef(16) (  tail call fastcc void @repro.magic())
   visiting 16 = MemoryDef(15) (  store i64 1, ptr @repro.SHARED_VALUE, align 8)
  Checking for reads of 16 = MemoryDef(15) (  store i64 1, ptr @repro.SHARED_VALUE, align 8)
   17 = MemoryDef(16) (  tail call fastcc void @repro.magic())
   18 = MemoryDef(17) (  store i64 2, ptr @repro.SHARED_VALUE, align 8)
    ... skipping killing def/dom access
 Checking if we can kill 16 = MemoryDef(15) (  store i64 1, ptr @repro.SHARED_VALUE, align 8)
DSE: Remove Dead Store:
  DEAD:   store i64 1, ptr @repro.SHARED_VALUE, align 8
  KILLER:   store i64 2, ptr @repro.SHARED_VALUE, align 8
  trying to get dominating access
   visiting 15 = MemoryDef(9) (  call void @llvm.lifetime.end.p0(i64 24, ptr nonnull %1), !noalias !0)
   visiting 9 = MemoryDef(8) (  %41 = tail call i64 @clone(ptr nonnull readonly align 1 @Thread.LinuxThreadImpl.spawn__anon_3473.Instance.entryFn, i64 %39, i64 8195840, i64 %40, ptr nonnull align 4 %37, i64 %33, ptr nonnull align 4 %36), !noalias !3)
  finished walk

upstream: llvm/llvm-project#97252

rohlem commented 2 months ago

@jacobly0 Then that means the above-posted https://godbolt.org/z/Er69sbcPs having the same output on gcc is a similar bug in gcc? (Or is it still unclear whether this is a miscompilation or a misunderstanding of the standard by us users?)