ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

introduce labeled continue syntax inside a switch expression #8220

Open andrewrk opened 3 years ago

andrewrk commented 3 years ago

Background

The goto keyword was removed in #630. This remains the right call, because all control flow in Zig can be expressed in a better way: continue replaces a backwards goto, and break replaces a forwards goto.

However, in C, there is another concept, called "computed goto". This is described in #5950 and briefly discussed in #2162. This concept is not currently possible in Zig. It is possible to model the desired semantics with existing control flow features quite simply, but it is not possible to obtain the desired machine code, even in optimized builds.
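For readers unfamiliar with the C feature: computed goto is a GNU C extension (supported by GCC and Clang) in which `&&label` yields a label's address and `goto *expr;` jumps to it indirectly. A minimal sketch, with an invented `pick` function:

```c
/* Computed goto is a GNU C extension (GCC and Clang): `&&label` yields
   the address of a label as a `void *`, and `goto *expr;` performs an
   indirect jump to it. */
static int pick(int which) {
    void *target = which ? &&yes : &&no;
    goto *target;   /* destination chosen at runtime */
yes:
    return 1;
no:
    return 0;
}
```

The interesting property is the indirect jump itself: the branch target is a runtime value, which is what makes threaded interpreter dispatch expressible.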

Problem Statement

For example (godbolt link):

const Inst = extern struct {
    tag: Tag,

    const Tag = extern enum {
        add,
        addwrap,
        alloc,
        alloc_mut,
        alloc_inferred,
        alloc_inferred_mut,
        anyframe_type,
        array_cat,
        array_mul,
        array_type,
        array_type_sentinel,
        indexable_ptr_len,
    };
};

export fn entry(inst_list_ptr: [*]const Inst, inst_list_len: usize, map: [*]u32) void {
    const inst_list = inst_list_ptr[0..inst_list_len];
    for (inst_list) |inst, i| {
        map[i] = switch (inst.tag) {
            .add => analyze_add(inst),
            .addwrap => analyze_addwrap(inst),
            .alloc => analyze_alloc(inst),
            .alloc_mut => analyze_alloc_mut(inst),
            .alloc_inferred => analyze_alloc_inferred(inst),
            .alloc_inferred_mut => analyze_alloc_inferred_mut(inst),
            .anyframe_type => analyze_anyframe_type(inst),
            .array_cat => analyze_array_cat(inst),
            .array_mul => analyze_array_mul(inst),
            .array_type => analyze_array_type(inst),
            .array_type_sentinel => analyze_array_type_sentinel(inst),
            .indexable_ptr_len => analyze_indexable_ptr_len(inst),
        };
    }
}

extern fn analyze_add(inst: Inst) u32;
extern fn analyze_addwrap(inst: Inst) u32;
extern fn analyze_alloc(inst: Inst) u32;
extern fn analyze_alloc_mut(inst: Inst) u32;
extern fn analyze_alloc_inferred(inst: Inst) u32;
extern fn analyze_alloc_inferred_mut(inst: Inst) u32;
extern fn analyze_anyframe_type(inst: Inst) u32;
extern fn analyze_array_cat(inst: Inst) u32;
extern fn analyze_array_mul(inst: Inst) u32;
extern fn analyze_array_type(inst: Inst) u32;
extern fn analyze_array_type_sentinel(inst: Inst) u32;
extern fn analyze_indexable_ptr_len(inst: Inst) u32;

In the generated machine code, each prong ends up jumping back to the loop condition, before getting re-dispatched to the next prong:

.LBB0_3:
        xor     edi, edi
        call    analyze_add
        jmp     .LBB0_15
.LBB0_4:
        mov     edi, 1
        call    analyze_addwrap
        jmp     .LBB0_15

The reason this machine code is not what we desire is described in this paper, in the sections "Direct Threading" and "The Context Problem":

Mispredicted branches pose a serious challenge to modern processors because they threaten to starve the processor of instructions. The problem is that before the destination of the branch is known the execution of the pipeline may run dry. To perform at full speed, modern CPUs need to keep their pipelines full by correctly predicting branch targets.

This problem is even worse for direct call threading and switch dispatch. For these techniques there is only one dispatch branch and so all dispatches share the same BTB entry. Direct call threading will mispredict all dispatches except when the same virtual instruction body is dispatched multiple times consecutively.

They explain it in a nice, intuitive way here:

Another perspective is that the destination of the indirect dispatch branch is unpredictable because its destination is not correlated with the hardware pc. Instead, its destination is correlated to the vPC. We refer to this lack of correlation between the hardware pc and vPC as the context problem. We choose the term context following its use in context sensitive inlining [Grove & Chambers 2002] because in both cases the context of shared code (in their case methods, in our case virtual instruction bodies) is important to consider.

So the problem statement here is that we want to be able to write Zig code that outputs machine code matching this Direct Threading pattern. In one sense, it is an optimization problem, since we can model the same semantics with other language constructs and other machine code. But in another sense, it is more fundamental than an optimization problem, because Zig is a language that wants to generate optimal machine code, meaning it should be possible to write Zig code that generates machine code equivalent to, or better than, what you could write by hand.

In summary, we want to be able to express Zig code where each switch prong jumps directly to the next prong, instead of all switch prongs sharing the same indirect jump, in order to benefit the branch predictor.

Research Dump

Can LLVM Do the Optimization?

In this example (godbolt link), I changed the loop to while (true) and manually inlined the continue expression into each switch prong, followed by a continue. It does not get much simpler than this; we are practically begging LLVM to do the optimization.

const Inst = extern struct {
    tag: Tag,

    const Tag = extern enum {
        add,
        addwrap,
        alloc,
        alloc_mut,
        alloc_inferred,
        alloc_inferred_mut,
        anyframe_type,
        array_cat,
        array_mul,
    };
};

export fn entry(inst_list_ptr: [*]const Inst, inst_list_len: usize, map: [*]u32) void {
    const inst_list = inst_list_ptr[0..inst_list_len];
    var i: usize = 0;
    while (true) {
        const inst = inst_list[i];
        switch (inst.tag) {
            .add => {
                map[i] = analyze_add(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .addwrap => {
                map[i] = analyze_addwrap(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .alloc => {
                map[i] = analyze_alloc(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .alloc_mut => {
                map[i] = analyze_alloc_mut(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .alloc_inferred => {
                map[i] = analyze_alloc_inferred(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .alloc_inferred_mut => {
                map[i] = analyze_alloc_inferred_mut(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .anyframe_type => {
                map[i] = analyze_anyframe_type(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .array_cat => {
                map[i] = analyze_array_cat(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .array_mul => {
                map[i] = analyze_array_mul(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
        }
        break;
    }
}

extern fn analyze_add(inst: Inst) u32;
extern fn analyze_addwrap(inst: Inst) u32;
extern fn analyze_alloc(inst: Inst) u32;
extern fn analyze_alloc_mut(inst: Inst) u32;
extern fn analyze_alloc_inferred(inst: Inst) u32;
extern fn analyze_alloc_inferred_mut(inst: Inst) u32;
extern fn analyze_anyframe_type(inst: Inst) u32;
extern fn analyze_array_cat(inst: Inst) u32;
extern fn analyze_array_mul(inst: Inst) u32;

Snippet of assembly:

.LBB0_3:
        mov     dword ptr [r14 + 4*rbx], eax
        inc     rbx
        cmp     rbx, r15
        jae     .LBB0_4
.LBB0_1:
        mov     eax, dword ptr [r12 + 4*rbx]
        jmp     qword ptr [8*rax + .LJTI0_0]
.LBB0_2:
        xor     edi, edi
        call    analyze_add
        jmp     .LBB0_3
.LBB0_6:
        mov     edi, 2
        call    analyze_alloc
        jmp     .LBB0_3
.LBB0_5:
        mov     edi, 1
        call    analyze_addwrap
        jmp     .LBB0_3
.LBB0_7:
        mov     edi, 3
        call    analyze_alloc_mut
        jmp     .LBB0_3

Here, LLVM actually figured out the continue expression was duplicated N times, and un-inlined it, putting the code back how it was! So crafty.

EDIT: New Discovery

It does not get much simpler than this

Wrong!

After typing up this whole proposal, I realized that I had not tried that optimization using an "end" tag in the above code. Here is the case, modified (godbolt link):

const Inst = extern struct {
    tag: Tag,

    const Tag = extern enum {
        end,
        add,
        addwrap,
        alloc,
        alloc_mut,
        alloc_inferred,
        alloc_inferred_mut,
        anyframe_type,
        array_cat,
        array_mul,
    };
};

export fn entry(inst_list: [*]const Inst, map: [*]u32) void {
    var i: usize = 0;
    while (true) {
        const inst = inst_list[i];
        switch (inst.tag) {
            .end => return,
            .add => {
                map[i] = analyze_add(inst);
                i += 1;
                continue;
            },
            .addwrap => {
                map[i] = analyze_addwrap(inst);
                i += 1;
                continue;
            },
            .alloc => {
                map[i] = analyze_alloc(inst);
                i += 1;
                continue;
            },
            .alloc_mut => {
                map[i] = analyze_alloc_mut(inst);
                i += 1;
                continue;
            },
            .alloc_inferred => {
                map[i] = analyze_alloc_inferred(inst);
                i += 1;
                continue;
            },
            .alloc_inferred_mut => {
                map[i] = analyze_alloc_inferred_mut(inst);
                i += 1;
                continue;
            },
            .anyframe_type => {
                map[i] = analyze_anyframe_type(inst);
                i += 1;
                continue;
            },
            .array_cat => {
                map[i] = analyze_array_cat(inst);
                i += 1;
                continue;
            },
            .array_mul => {
                map[i] = analyze_array_mul(inst);
                i += 1;
                continue;
            },
        }
        break;
    }
}

extern fn analyze_add(inst: Inst) u32;
extern fn analyze_addwrap(inst: Inst) u32;
extern fn analyze_alloc(inst: Inst) u32;
extern fn analyze_alloc_mut(inst: Inst) u32;
extern fn analyze_alloc_inferred(inst: Inst) u32;
extern fn analyze_alloc_inferred_mut(inst: Inst) u32;
extern fn analyze_anyframe_type(inst: Inst) u32;
extern fn analyze_array_cat(inst: Inst) u32;
extern fn analyze_array_mul(inst: Inst) u32;

Two example prongs from the machine code:

.LBB0_2:
        mov     edi, 1
        call    analyze_add
        mov     dword ptr [r14 + rbx], eax
        add     rbx, 4
        mov     eax, dword ptr [r15 + rbx]
        jmp     qword ptr [8*rax + .LJTI0_0]
.LBB0_3:
        mov     edi, 2
        call    analyze_addwrap
        mov     dword ptr [r14 + rbx], eax
        add     rbx, 4
        mov     eax, dword ptr [r15 + rbx]
        jmp     qword ptr [8*rax + .LJTI0_0]

It's perfect! This is exactly what we wanted.

This undermines the entire proposal. I will still post it, but this new discovery makes it seem unnecessary: in fact, what we are observing here is #2162 already implemented and working inside LLVM.

Real Actual Use Case

Here's one in the self-hosted compiler: https://github.com/ziglang/zig/blob/e9a038c33bbf171695b08540536f307b9e418173/src/zir_sema.zig#L30 This switch is inside a loop over ZIR instructions. In optimized builds, we noticed a non-trivial amount of time spent in the overhead of this dispatch when analyzing a recursive comptime fibonacci function call.

This pattern also exists in pretty much every stage of the pipeline.

Other Possible Solution: Tail Calls

Tail calls solve this problem. Each switch prong would `return foo()` (a tail call), and foo(), at the end of its business, would make an inlined call to a function that performs the switch and then tail-calls the next prong.

This is reasonable in the sense that it is doable right now; however, there are some problems:

Proposal

I propose to add continue :label expression syntax, and the ability to label switch expressions. Here is an example:

const Inst = extern struct {
    tag: Tag,

    const Tag = extern enum {
        end,
        add,
        addwrap,
        alloc,
        alloc_mut,
        alloc_inferred,
        alloc_inferred_mut,
        anyframe_type,
        array_cat,
        array_mul,
    };
};

export fn entry(inst_list: [*]const Inst, map: [*]u32) void {
    var i: usize = 0;
    sw: switch (inst_list[i].tag) {
        .end => return,
        .add => {
            map[i] = analyze_add(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .addwrap => {
            map[i] = analyze_addwrap(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .alloc => {
            map[i] = analyze_alloc(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .alloc_mut => {
            map[i] = analyze_alloc_mut(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .alloc_inferred => {
            map[i] = analyze_alloc_inferred(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .alloc_inferred_mut => {
            map[i] = analyze_alloc_inferred_mut(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .anyframe_type => {
            map[i] = analyze_anyframe_type(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .array_cat => {
            map[i] = analyze_array_cat(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .array_mul => {
            map[i] = analyze_array_mul(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
    }
}

extern fn analyze_add(inst: Inst) u32;
extern fn analyze_addwrap(inst: Inst) u32;
extern fn analyze_alloc(inst: Inst) u32;
extern fn analyze_alloc_mut(inst: Inst) u32;
extern fn analyze_alloc_inferred(inst: Inst) u32;
extern fn analyze_alloc_inferred_mut(inst: Inst) u32;
extern fn analyze_anyframe_type(inst: Inst) u32;
extern fn analyze_array_cat(inst: Inst) u32;
extern fn analyze_array_mul(inst: Inst) u32;

The new labeled continue syntax makes it unambiguous at a glance that the jump targets a switch expression, because this is the only form in which continue accepts an operand. More details:

How to Lower this to LLVM

Note: I wrote this section before the EDIT: New Discovery section.

One idea I had was to put the switchbr instruction inside each prong. I did some LLVM IR surgery to try out this idea (godbolt link):

SwitchProngAdd:                                     ; preds = %WhileBody
  %9 = load i64, i64* %i, align 8
  %10 = load i32*, i32** %map, align 8
  %11 = getelementptr inbounds i32, i32* %10, i64 %9
  %12 = bitcast %Inst* %inst to i32*
  %13 = load i32, i32* %12, align 4
  %14 = call i32 @analyze_add(i32 %13)
  store i32 %14, i32* %11, align 4
  %15 = load i64, i64* %i, align 8
  %16 = add nuw i64 %15, 1
  store i64 %16, i64* %i, align 8
  %17 = load i64, i64* %i, align 8
  %18 = load %Inst*, %Inst** %inst_list, align 8
  %19 = getelementptr inbounds %Inst, %Inst* %18, i64 %17
  %20 = getelementptr inbounds %Inst, %Inst* %19, i32 0, i32 0
  %a20 = load i32, i32* %20, align 4
  switch i32 %a20, label %SwitchElse18 [
    i32 0, label %SwitchProngEnd
    i32 1, label %SwitchProngAdd
    i32 2, label %SwitchProngAddWrap
    i32 3, label %SwitchProngAlloc
    i32 4, label %SwitchProngAllocMut
    i32 5, label %SwitchProngAllocInferred
    i32 6, label %SwitchProngAllocInferredMut
    i32 7, label %SwitchProngAnyframeType
    i32 8, label %SwitchProngArrayCat
    i32 9, label %SwitchProngArrayMul
  ]

SwitchProngAddWrap:                                     ; preds = %WhileBody
  %21 = load i64, i64* %i, align 8
  %22 = load i32*, i32** %map, align 8
  %23 = getelementptr inbounds i32, i32* %22, i64 %21
  %24 = bitcast %Inst* %inst to i32*
  %25 = load i32, i32* %24, align 4
  %26 = call i32 @analyze_addwrap(i32 %25)
  store i32 %26, i32* %23, align 4
  %27 = load i64, i64* %i, align 8
  %28 = add nuw i64 %27, 1
  store i64 %28, i64* %i, align 8
  %29 = load i64, i64* %i, align 8
  %30 = load %Inst*, %Inst** %inst_list, align 8
  %31 = getelementptr inbounds %Inst, %Inst* %30, i64 %29
  %32 = getelementptr inbounds %Inst, %Inst* %31, i32 0, i32 0
  %a32 = load i32, i32* %32, align 4
  switch i32 %a32, label %SwitchElse18 [
    i32 0, label %SwitchProngEnd
    i32 1, label %SwitchProngAdd
    i32 2, label %SwitchProngAddWrap
    i32 3, label %SwitchProngAlloc
    i32 4, label %SwitchProngAllocMut
    i32 5, label %SwitchProngAllocInferred
    i32 6, label %SwitchProngAllocInferredMut
    i32 7, label %SwitchProngAnyframeType
    i32 8, label %SwitchProngArrayCat
    i32 9, label %SwitchProngArrayMul
  ]

The machine code for the prongs looks like this:

<snip>
.LBB0_8:                                # %SwitchProngAnyframeType
        mov     edi, r12d
        call    analyze_anyframe_type
        mov     dword ptr [r14 + 4*rbx], eax
        mov     eax, dword ptr [r15 + 4*rbx + 4]
        add     rbx, 1
        jmp     qword ptr [8*rax + .LJTI0_7]
.LBB0_9:                                # %SwitchProngArrayCat
        mov     edi, r12d
        call    analyze_array_cat
        mov     dword ptr [r14 + 4*rbx], eax
        mov     eax, dword ptr [r15 + 4*rbx + 4]
        add     rbx, 1
        jmp     qword ptr [8*rax + .LJTI0_8]
.LBB0_10:                               # %SwitchProngArrayMul
        mov     edi, r12d
        call    analyze_array_mul
        mov     dword ptr [r14 + 4*rbx], eax
        mov     eax, dword ptr [r15 + 4*rbx + 4]
        add     rbx, 1
        jmp     qword ptr [8*rax + .LJTI0_9]
<snip>

Pretty nice. This is exactly what we want: each prong ends with an indirect jump directly to the next prong. The problem is that even though one jump table could be shared by all nine prongs, LLVM emits nine copies of it:

.LJTI0_0:
        .quad   .LBB0_1
        .quad   .LBB0_2
        .quad   .LBB0_3
        .quad   .LBB0_4
        .quad   .LBB0_5
        .quad   .LBB0_6
        .quad   .LBB0_7
        .quad   .LBB0_8
        .quad   .LBB0_9
        .quad   .LBB0_10
.LJTI0_1:
        .quad   .LBB0_1
        .quad   .LBB0_2
        .quad   .LBB0_3
        .quad   .LBB0_4
        .quad   .LBB0_5
        .quad   .LBB0_6
        .quad   .LBB0_7
        .quad   .LBB0_8
        .quad   .LBB0_9
        .quad   .LBB0_10
<snip>

The duplicated jump tables are problematic, because in reality there could reasonably be about 150-200 instruction tags, which makes each jump table 1.2-1.6 KiB at 8 bytes per entry. One copy is fine; for example, my L1 cache size is 256 KiB. But I wouldn't want to multiply that jump table by 200! That would be around 300 KiB just for the jump tables alone. That would wreak havoc on the cache.

Unless this improves upstream, the best strategy to lower this language feature will be for Zig to manually create the jump table itself instead of relying on LLVM to do it, using LLVM's ability to take the address of basic blocks and put them into an array. This will essentially generate the same code that you would get in Clang if you used computed goto in the traditional way.
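The "traditional way" mentioned here keeps exactly one table. A hedged C sketch of that pattern (the tiny three-opcode instruction set and all names are invented for illustration): one static table of label addresses is defined once, and every handler ends with its own indirect jump through it, so the dispatch branches are replicated but the table is not.

```c
/* GNU C computed goto with a single shared jump table: each handler
   ends with its own indirect jump, but all jumps index one table. */
enum { PUSH1, DOUBLE, HALT };

static int run(const unsigned char *pc) {
    /* The table is defined once and shared by every dispatch site. */
    static void *table[] = { &&op_push1, &&op_double, &&op_halt };
    int acc = 0;
    goto *table[*pc++];      /* initial dispatch */
op_push1:
    acc += 1;
    goto *table[*pc++];      /* jump straight to the next handler */
op_double:
    acc *= 2;
    goto *table[*pc++];
op_halt:
    return acc;
}
```

Running the program {PUSH1, PUSH1, DOUBLE, HALT} increments twice and doubles once, so `run` returns 4; each opcode body carries its own dispatch branch, which is the Direct Threading shape described earlier.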

How to Lower this in Self-Hosted Backends

We have lots of options here. It would be quite straightforward, since we have full control over AIR, as well as the backend code generation.

OK But Is The Perf Actually Good?

I don't know. Realistically, in order to benchmark this and find out whether the machine code performs better, we have to implement it first.

andrewrk commented 3 years ago

EDIT: New Discovery

And here it is in idiomatic zig form, no duplicated code or anything:

const Inst = extern struct {
    tag: Tag,

    const Tag = extern enum {
        end,
        add,
        addwrap,
        alloc,
        alloc_mut,
        alloc_inferred,
        alloc_inferred_mut,
        anyframe_type,
        array_cat,
        array_mul,
    };
};

export fn entry(inst_list: [*]const Inst, map: [*]u32) void {
    var i: usize = 0;
    while (true) : (i += 1) {
        const inst = inst_list[i];
        map[i] = switch (inst.tag) {
            .end => return,
            .add => analyze_add(inst),
            .addwrap => analyze_addwrap(inst),
            .alloc => analyze_alloc(inst),
            .alloc_mut => analyze_alloc_mut(inst),
            .alloc_inferred => analyze_alloc_inferred(inst),
            .alloc_inferred_mut => analyze_alloc_inferred_mut(inst),
            .anyframe_type => analyze_anyframe_type(inst),
            .array_cat => analyze_array_cat(inst),
            .array_mul => analyze_array_mul(inst),
        };
    }
}

extern fn analyze_add(inst: Inst) u32;
extern fn analyze_addwrap(inst: Inst) u32;
extern fn analyze_alloc(inst: Inst) u32;
extern fn analyze_alloc_mut(inst: Inst) u32;
extern fn analyze_alloc_inferred(inst: Inst) u32;
extern fn analyze_alloc_inferred_mut(inst: Inst) u32;
extern fn analyze_anyframe_type(inst: Inst) u32;
extern fn analyze_array_cat(inst: Inst) u32;
extern fn analyze_array_mul(inst: Inst) u32;

It generates the ideal machine code.

sighoya commented 3 years ago

@andrewrk

Did you consider indirectbr?

Techcable commented 3 years ago

Interestingly enough, C# has this feature with its goto case statement.

Although the use cases are limited to certain situations, in those cases it can make a big difference. Interpreters and VMs are definitely a target audience for Zig, and low level code has plenty of other state machines where these sorts of dispatch tables might need this optimization.

As someone expressed in #2162, sometimes computed goto can be slower than a regular switch statement in a loop. It would be odd to have performance pitfalls around these sorts of switch statements. I think Zig usually likes to make these performance choices explicit.

On the other hand, this is generally considered a must-have optimization for fast interpreter implementations. In CPython it gives about a 15-20% speedup.

@sighoya

Did you consider indirectbr?

That is exactly how clang implements it :) In fact, this feature is the reason indirectbr was added to LLVM in the first place.

andrewrk commented 3 years ago

So I want to use this in the semantic analysis of the Zig self-hosted compiler. Here is:

LLVM does not figure out how to generate the machine code that I want, and I suspect it would be a perf improvement. My plan is to implement this in order to benchmark it, and then we can make a decision, armed with some data.

andrewrk commented 3 years ago

Check out https://dino-lang.github.io/learn/dino.pdf, section 3.1 "Byte Code Dispatch"

Mouvedia commented 3 years ago

As prior art, I found PowerShell, which uses break (instead of continue, as in this proposal) to jump to the switch's label.

haberman commented 3 years ago

An argument for tail calls

Based on my experience with https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html, I would strongly recommend offering guaranteed tail calls as one tool available to interpreter writers (I'm not arguing against other options).

Getting the replicated decode/dispatch (as described above) is important, but it's only one part of the story. We also have to worry about the code quality for the operation itself -- the code that actually does the work.

Optimizers struggle to generate good code when the entire interpreter is in one big function. In particular, they struggle to keep the most important state in registers. Mike Pall talked about this problem here: http://lua-users.org/lists/lua-l/2011-02/msg00742.html

If each operation is in a separate function, with the most important state propagated in arguments (passed in registers on modern architectures), and each operation tail calling the next, we can get much better code out of the compiler. With this pattern, there is no switch() at all, just a call into a table of function pointers.
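A hedged C sketch of this pattern (handler names and the three-opcode instruction set are invented): each handler is a separate function, the interpreter state travels in arguments, and every handler ends with a call through a shared function-pointer table. Clang's `__attribute__((musttail))` can force these calls to compile as jumps; it is omitted here so the sketch stays portable.

```c
typedef int op_fn(const unsigned char *pc, int acc);

/* Forward declarations so the handlers can appear in the table. */
static op_fn op_inc, op_double, op_halt;
static op_fn *const table[] = { op_inc, op_double, op_halt };

enum { INC, DOUBLE, HALT };  /* opcode values index into `table` */

static int op_inc(const unsigned char *pc, int acc) {
    return table[*pc](pc + 1, acc + 1);   /* tail call to the next handler */
}
static int op_double(const unsigned char *pc, int acc) {
    return table[*pc](pc + 1, acc * 2);
}
static int op_halt(const unsigned char *pc, int acc) {
    (void)pc;
    return acc;                           /* end of the dispatch chain */
}

static int run(const unsigned char *program) {
    return table[program[0]](program + 1, 0);
}
```

Note there is no switch anywhere: dispatch is the indexed call itself, and the hot state (pc, acc) stays in argument registers across handlers.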

Addressing open questions on Tail Calls

To answer the questions above:

As far as I understand, tail calls don't work on some architectures. (what are these? does anybody know?)

For LLVM, musttail currently appears to be totally unsupported on powerpc (32 and 64). On ARM32 it runs into problems in some cases. There is a bit more information about this here: https://gcc.gnu.org/pipermail/gcc/2021-April/235903.html

I don't know if these problems are fundamental or if they just need proper bug fixes.

I'm also concerned about trying to debug when doing dispatch with tail calls.

If you mean stack traces, I don't think tail calls leave any less of a stack trace than a traditional switch()-based dispatch. In both cases the stack tells you where you are, but not where you've been.

It forces you to organize your logic into functions. That's another jump that maybe you did not want in your hot path.

I think efficiency argues for tail calls, not against them. The tail call pattern does not create any unnecessary jumps. I'll elaborate a bit on this point.

Tail calls generate code comparable to hand-written assembly

Take this example from the article of implementing the "ADDVN" operation from LuaJIT, which aims to match this hand-written assembly code.

The C compiler output almost exactly matches the hand-written assembly. It does not contain any jumps except the jump to the next operation, which is inherently necessary.

When you are implementing byte code dispatch, you have a fundamental choice of whether to use replicated dispatch or shared dispatch. There are tradeoffs here, and the LuaJIT source discusses these tradeoffs a bit.

Tail calls can support both options very naturally. All we have to do is flip the dispatch() function between noinline and always_inline: here is an example. For both patterns, the resulting code is comparable to hand-written assembly: it does not contain any extraneous or unnecessary jumps.

Caveats

I should mention one significant downside to this pattern: callee-save registers are not available to these functions without spilling them to the stack first. This means, in effect, that if we want the fastest code we can only use about half the registers. We also pay a steep price when making any non-tail call.

Both of these problems can be solved if you also offer some specialized calling conventions (these are mostly supported in LLVM already). I can go into this more if you are interested. With the right calling conventions, I believe this pattern can generate code that even the most talented assembly language programmers would have a hard time beating.

ifreund commented 3 years ago

@haberman note that zig already allows forcing tail calls with the @call() builtin: https://ziglang.org/documentation/master/#call

haberman commented 3 years ago

@ifreund that is great news, thanks for the info. :)

andrewrk commented 2 years ago

I'm inclined to accept this proposal. Regardless of performance improvements, this provides a useful, intuitive tool for control flow. Another observation is that when the operand of continue :sw operand is comptime-known, this would be lowered as a direct jump to another switch prong, effectively providing the desired control flow requested in #5950.

This would even allow someone to implement Duff's device in Zig:

fn send(to: *volatile u32, from_start: [*]u32, count: u32) void {
    var from = from_start;
    var n = (count + 7) / 8;
    sw: switch (count % 8) {
        0 => {
            to.* = from[0];
            from += 1;
            continue :sw 7;
        },
        7 => {
            to.* = from[0];
            from += 1;
            continue :sw 6;
        },
        6 => {
            to.* = from[0];
            from += 1;
            continue :sw 5;
        },
        5 => {
            to.* = from[0];
            from += 1;
            continue :sw 4;
        },
        4 => {
            to.* = from[0];
            from += 1;
            continue :sw 3;
        },
        3 => {
            to.* = from[0];
            from += 1;
            continue :sw 2;
        },
        2 => {
            to.* = from[0];
            from += 1;
            continue :sw 1;
        },
        1 => {
            to.* = from[0];
            from += 1;
            n -= 1;
            if (n > 0) continue :sw 0;
        },
    }
}

Reference example in C:

send(to, from, count)
register short *to, *from;
register count;
{
    register n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}
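The K&R-style reference above writes to a fixed device register (`to` is never incremented). A self-checkable modern rendering of the same structure (function name invented; assumes count > 0), copying into an ordinary buffer:

```c
#include <stddef.h>

/* Duff's device in standard C. Unlike the original, `to` also
   advances, so the result is an ordinary block copy that can be
   verified. Assumes count > 0. */
static void duff_copy(int *to, const int *from, size_t count) {
    size_t n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

The switch enters the unrolled do-while at `count % 8` copies from the end of the first pass, exactly as the labeled-continue Zig version above enters at the prong selected by `count % 8`.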

With #7224 it could even be shortened to this:

fn send(to: *volatile u32, from_start: [*]u32, count: u32) void {
    var from = from_start;
    var n = (count + 7) / 8;
    sw: switch (count % 8) {
        inline 0...7 => |remainder| {
            to.* = from[0];
            from += 1;
            if (remainder == 1) {
                n -= 1;
                if (n <= 0) break :sw;
            }
            continue :sw @as(u3, remainder) -% 1;
        },
    }
}

What's interesting about this is that it lowers to the same machine code as the C version, even in debug mode.

Techcable commented 2 years ago

I am working on this.

I already have modified parsing, AST, and formatting to support this (both stage1 & stage2).

Example interpreter: the following parses successfully: https://gist.github.com/Techcable/dbcd7f7c3a708aa71d86031a1105db9c In particular, the core eval loop has no *syntax* errors:

```zig
fn eval(ctx: *InsnCtx) Value {
    // decode first oparg
    ctx.oparg = ctx.ip[0].oparg;
    while (true) {
        evalLoop: switch (ctx.ip[0].opcode) {
            .LOAD_IMM => {
                load_imm(ctx).verify_continue();
                continue :evalLoop ctx.ip[0].opcode;
            },
            .POP => {
                pop(ctx).verify_continue();
                continue :evalLoop ctx.ip[0].opcode;
            },
            .ADD, .SUB, .MUL => {
                arithop(ctx).verify_continue();
                continue :evalLoop ctx.ip[0].opcode;
            },
            .PRINT => {
                print(ctx).verify_continue();
                continue :evalLoop ctx.ip[0].opcode;
            },
            .RETURN => {
                const done = @"return"(ctx);
                if (done.return_value) |value| {
                    return value;
                } else {
                    unreachable;
                }
            },
        }
    }
}
```

Of course running any of this through semantic analysis (stage1) gives an error "label not found: 'evalLoop'"

But this is progress!

hollmmax commented 2 years ago

I think I understand the motivation behind this change, but I'd like to raise a concern I have with it. With fall-through disallowed, switch-case is a linear control flow construct: a switch is entered, one of its prongs gets executed, the switch is left. Allowing jumping to arbitrary prongs from anywhere within the switch increases its control flow capabilities significantly.

The switch is no longer a simple construct. If I see a labeled switch, I can't be sure it's not a loop until I read all its prongs. A switch inside a loop is much clearer about its control flow. I would be happier if instead the original motivation were implemented as a guaranteed optimization, or were forcible through some attributes.

Duff's device would IMO be better served with explicit fall-through, although the order of cases would then become semantically relevant. Maybe I should just set up my highlighter to show label: switch with continue :label _ inside in a different color...

marler8997 commented 10 months ago

Interesting stuff. One idea I had while reading was we could require a switch that can loop like this to use the while keyword beforehand, i.e.

sw: while switch (val) {
    ...
}

I think this makes it a little easier for the uninitiated to intuit what's going on as it pays homage to the original version, a switch inside a while loop.

ghost commented 10 months ago

Interesting stuff. One idea I had while reading was we could require a switch that can loop like this to use the while keyword beforehand, i.e.

sw: while switch (val) {
    ...
}

I think this makes it a little easier for the uninitiated to intuit what's going on as it pays homage to the original version, a switch inside a while loop.

I hard approve of this, for multiple reasons. Firstly, it could allow the continue expression to be syntactically elided, inferring it as a recomputation of the condition/variant, just like a regular loop. (I notice that every continue expression in the motivating example is the same, and I can't help but think: since the author doesn't need to think about the expression that much per prong, surely it would be easy to slip up and forget to add it in a prong. The mistake would be caught quickly but it would be annoying having to repeat code that the compiler could in theory easily insert itself.)

Secondly, it would then also make syntactic sense to introduce for switch, allowing the motivating example to be written like this:

export fn entry(inst_list: [*]const Inst, map: [*]u32) void {
    const inst_slice = std.mem.sliceTo(inst_list, .{ .tag = .end });

    for switch (inst_slice, map) |inst, *m| (inst.tag) {
        .add => m.* = analyze_add(inst),
        .addwrap => m.* = analyze_addwrap(inst),
        .alloc => m.* = analyze_alloc(inst),
        .alloc_mut => m.* = analyze_alloc_mut(inst),
        .alloc_inferred => m.* = analyze_alloc_inferred(inst),
        .alloc_inferred_mut => m.* = analyze_alloc_inferred_mut(inst),
        .anyframe_type => m.* = analyze_anyframe_type(inst),
        .array_cat => m.* = analyze_array_cat(inst),
        .array_mul => m.* = analyze_array_mul(inst),
    }
}

How about that, eh? No manual increment, no repeated continue, no bounds checks inside the loop, no double-jump on end (minor but still), the code is a third the size, and it's still plainly obvious what the machine code looks like. I like this idea a lot.

Edit for explication: In general, I think we should go with <loop> switch, and include all of the loop clauses and switch clauses in the "header". So where the original proposal would have sw: switch (val), I would recommend while switch (true) (val) // label optional. This also allows to generalise to cases where the variant comes from an iterator (while switch (it.next()) |el| (el.tag)), or where slicing as above is infeasible so we must manually increment an index (while switch (true) (it[i].tag) : (i += 1)), or, god forbid, all at once (while switch (mesg.next()) |c| (c.glyph) : (seed = rand(seed))).

AnataBakka commented 6 months ago

first comment here :)

Even if the compiler always correctly optimized while (true) switch (state) into a computed goto for each state, I find that a lot of the benefit here comes from the compiler translating (not optimizing) continue :sw State_x into a direct goto to the required state. A state machine can encode any type of flowchart, and is the final tool in the arsenal for creating loops (after normal while / for, and labeled while, which are less wordy), since nothing matches its flexibility: "any loop is a state". Additionally, a non-direct continue :sw (state) can also be clearly translated into a computed goto.

This construct would essentially be redefining tail calls without the predefined inputs and outputs (which may actually feel like a loss in some cases), with the gain that it makes it clearer than tail calls that we are dealing with a single state in a loop (rather than with a fully abstracted function), and it should in theory be less clunky to use in local scope than functions. And since "tail calls" = goto, if "while switch" = "tail calls" then "while switch" = goto: the same power as goto, but unlike goto, which usually feels randomly applied, the state construct should make it clearer where the control flow can go.

Everyone can argue about the pitfalls of goto, but I don't think anyone can argue against an efficient state machine (unless you've never had to use one).

The clearest and most efficient way of doing it in C is to do something like this:

while(1)
{
  StateMachine_Init:
    [set up and determine the first state to go to, either with computed goto or just goto]
    break;
  State_x:
    [stateActions]
    [determine the next state to goto]
    break;
  State_y:
    [...]
    break;
}

Which works well enough, but having a while switch construct would further improve on that by not having to manually insert breaks. As others have mentioned, I call it while switch to clearly identify it as a loop, since switch is usually not a loop — or is the label enough to identify it as a loop?

Can it be abused? Yes: just like with tail calls, you can literally create as many states / functions as there are statements and then jump around randomly, but that doesn't mean you should do that with functions or any other construct.

Also, this hasn't been directly specified before, but with a labeled switch I would assume continue-ing to parent switches, but not vice versa, is allowed, just like how labeled continues on nested loops currently work, which would allow for more stylistic freedom. You could in theory, in some cases, merge nested switch cases into a single parent switch, but nesting and reducing the scope would always lower complexity.

And as a final concept, if we already know the initial state we want to start with, using while switch (State_x) should let us avoid the initial computed goto and directly start with that state. Any state can then still decide whether to use a computed goto or not.

If we want to avoid the functional way of recursive tail calls and get maximum performance and flexibility in an imperative way, then we need while switch continue with block scope.

As Andrew showed above, by mirroring what a function tail call can do, you would also be able to merge states with the inline keyword:

main()
{
  enum { init, next }
  int minimumDistance;
  int i;

  Loop: while switch (.init) {
    inline else => |state| {
      if (state == .init) {
        i = 0;
      }
      else if (state == .next) {
        i++;
      }

      if (i < n) {
        int distance = calculateDistance(i);
        if (state == .init || distance < minimumDistance) {
          minimumDistance = distance;
        }
        continue :Loop .next;
      }
      else {
        if (state == .next) {
          //we finished without error
          print("Success!");
        }
      }
    }
  }
}

which should be inlined to this

main()
{
  enum { init, next }
  int minimumDistance;
  int i;

  Loop: while switch (.init) {
    .init => {
      i = 0;
      if (i < n) {
        int distance = calculateDistance(i);
        minimumDistance = distance;
        continue :Loop .next;
      }
      else {
        //exit
      }
    },
    .next => {
     i++;
     if (i < n) {
        int distance = calculateDistance(i);
        if (distance < minimumDistance) {
          minimumDistance = distance;
        }
        continue :Loop .next;
      }
      else {
        //we finished without error
        print("Success!");
      }
    }
  }
}

This feels easy, intuitive and efficient at first, since it helps avoid placing redundant checks / duplicated statements in each state. However, after further experimentation, I realized you may fall into the pitfall of always unnecessarily trying to merge states into an inlined version when it's quicker to just write them out completely separately. So I guess use inline with caution.

Disclaimer: I say this as someone who just knows some C and has only heard bits and pieces of Zig.

EDIT:

(This is getting a bit long, but I blame the code examples.) I noticed a concept similar to the original problem statement that I think would be interesting to think about (hope it's not too off-topic; I'd rather quickly type it here than not at all) (slight pseudocode):

enum tokenStatus {ACCEPT, REJECT, ERROR, [etc...]};

enum tokenStatus verifyToken(Token_t token) {
  if ([...]) {
    return ACCEPT;
  }
  else if ([...]) {
    return REJECT;
  }
  else if ([...]) {
    return [etc...]
  }
  else {
    return ERROR;
  }
}

main()
{
 for (int i = 0; i < maxN; i++) {
 Token_t token = getToken(i);
 switch (verifyToken(token)) {
   ACCEPT => [...],
   REJECT => [...],
   [...] => [...],
   ERROR => [...]
 }
 }
}

Assume verifyToken can be used by different libraries, each deciding its own action for each token status.

My assumption is that the above is resolved at run time: once verifyToken returns, the value goes into the computed goto generated by the switch. This computed goto runs in only one location for different tokens — and not only does it not have to be in one location, it doesn't even need to be a computed goto. What if verifyToken were able to move directly to the required action? There is a world where a compiler could maybe optimize this; however, again, I prefer when things are simply translated rather than optimized. I.e. (rough version):

enum tokenStatus {ACCEPT, REJECT, ERROR, [etc...]};

void tokenAction(Token_t token, comptime enum tokenStatus status) {
 if (status == ACCEPT) {
   [...]
 }
 else if (status == REJECT) {
   [...]
 }
 else if (status == [...]) {
   [...]
 }
 else if (status == ERROR) {
   [...]
 }
}

/*function input should use varargs, but this is fine for now*/
void verifyToken(Token_t token, void (*sequencedAction)(Token_t, enum tokenStatus)) {
  if ([...]) {
    return (*sequencedAction)(token, ACCEPT);
  }
  else if ([...]) {
    return (*sequencedAction)(token, REJECT);
  }
  else if ([...]) {
    return (*sequencedAction)(token, [etc...]);
  }
  else {
    return (*sequencedAction)(token, ERROR);
  }
}

main()
{
 for (int i = 0; i < maxN; i++) {
 Token_t token = getToken(i);
 verifyToken(token, tokenAction);
 }
}

Now the code should in theory just be translated into goto statements, which I find better. However, that solution is the functional way, and as before, we'd much rather have something quicker to write. So what if we had this?

enum tokenStatus {ACCEPT, REJECT, ERROR, [etc...]};

void verifyToken(Token_t token) {
  if ([...]) {
    return next(ACCEPT);
  }
  else if ([...]) {
    return next(REJECT);
  }
  else if ([...]) {
    return next([etc...]);
  }
  else {
    return next(ERROR);
  }
}

main()
{
 for (int i = 0; i < maxN; i++) {
 Token_t token = getToken(i);
 verifyToken(token) => |state| {
    if (state == ACCEPT) {
      [...]
    }
    else if (state == REJECT) {
      [...]
    }
    else if (state == [...]) {
      [...]
    }
    else if (state == ERROR) {
      [...]
    }
 }
 }
}

"Next" should cause the compiler to move to the block following the function call, which I don't find a very wild concept, and I believe it greatly follows the principles that make imperative code quick and easy to write. This also avoids having to manually transfer over block variables.

Finally, I noticed that this would also enable a capability that a lot of people would like.

void safeCreate(object[...]) {
  try create(object[...]) catch return next (-1);
  next (0);
  object.deInit(); 
  /*create() is an incomplete function, in that it *requires* a follow up closure action,
  unlike other functions that can return without any worries. So we should always automatically apply the closure. */
}

main()
{
  safeCreate(object[...]) => |state| {
    if (state == 0) {
      /*object successfully created, so use it here. It will be automatically destroyed when the block returns.
      If object is required out of scope, then you must initialize it with minimum memory in the full scope
      where it's used, using the same automatic destruction, and then reallocate to the necessary memory
      in local scope without the automatic destruction*/
    }
    else {
      /*object creation failed*/
    }
  }
}
Ev1lT3rm1nal commented 6 months ago

@andrewrk is there a way to optimize my small interpreter to generate asm like computed gotos? https://github.com/Ev1lT3rm1nal/bf-interpreter-fast

andrewrk commented 6 months ago

As part of our Donor Bounty Program, I have confirmed a bounty for this issue with an anonymous donor.

The anonymous donor has offered 4,000 USD to be donated to Zig Software Foundation if the criteria can be met by April 30, 2024.

It doesn’t matter who implements it, or whether multiple people work together on implementing it, or how much help is needed from the core Zig team. As long as it gets done properly and the donor is satisfied with the results, the bounty will be awarded.

Cloudef commented 5 months ago

I guess this proposal won't allow you to jump from one switch statement to another? I'm writing a Ragel-compatible state machine compiler, and for the default code generation I'm thinking of going with a bunch of switch statements where the cases themselves are the event transitions; when transitioning to a sub-machine with different transitions, it would jump to another switch statement. There is no forwards / backwards jumping here, but any transition can jump to any machine. This essentially requires a local goto for the most efficient (or perhaps trivial?) code generation.

Roughly something like this in C:

#define N_EVENTS 's'
#define EVENT_W_STATE(c, a, b) (((a) * N_EVENTS * N_EVENTS) + (b) * N_EVENTS + (c))
#define EVENT(a, b) EVENT_W_STATE(a, a, b)

int main() {
    const char syote[] = "oispa kaljaa";
    uint32_t state = 'o';  // starting state
    const char *s = syote - 1;
oispa_machine:
    for (s = s + 1; *s; ++s) {
        const uint32_t event = EVENT_W_STATE(state, *s, s[1]);
        switch (event) {
            case EVENT('o', 'i'):
                state = s[1];
                break;

            case EVENT('i', 's'):
                state = s[1];
                break;

            case EVENT('s', 'p'):
                state = s[1];
                break;

            case EVENT('p', 'a'):
                state = s[1];
                goto kalja_machine;

            default:
                errx(EXIT_FAILURE, "nojoo, oot jossaki iha oudossa siirtymässä");
                break;
        }
    }
    assert("unreachable" && false);
kalja_machine:
    for (s = s + 1; *s; ++s) {
        const uint32_t event = EVENT_W_STATE(state, *s, s[1]);
        switch (event) {
            case EVENT('a', ' '):
                state = s[1];
                break;

            case EVENT(' ', 'k'):
                state = s[1];
                break;

            case EVENT('k', 'a'):
                state = s[1];
                break;

            case EVENT('a', 'l'):
                state = s[1];
                break;

            case EVENT('l', 'j'):
                state = s[1];
                break;

            case EVENT('j', 'a'):
                state = s[1];
                break;

            case EVENT('a', 'a'):
                state = s[1];
                break;

            case EVENT('a', 0):
                goto last_state;

            default:
                errx(EXIT_FAILURE, "nojoo, oot jossaki iha oudossa siirtymässä");
                break;
        }
    }
    assert("unreachable" && false);
last_state:
    errx(EXIT_SUCCESS, "vittu jes, nyt kaljalle");
}
AnataBakka commented 5 months ago

@Cloudef Yeah, your case looks like a nested labeled switch, so assuming the proposal works like nested labeled loops, it should work automatically unless intentionally disallowed.

Cloudef commented 5 months ago

True. I guess you could have an outer switch with the machines as cases for the same effect, then continue to those and finally break out on the final state; as long as the codegen is essentially a jmp, it's fine. I could try generating some Zig code based on this proposal and see how it looks. One thing I'm also interested in looking into is parallelization using SIMD and other techniques: Ragel allows mixing scanners with FSMs, and the scanner could effectively do the longest matching in parallel. There are also interesting articles on essentially vectorizing switch, like here: http://0x80.pl/notesen/2019-02-03-simd-switch-implementation.html

rohlem commented 4 months ago

Is anybody experimenting with / actively investing efforts into implementing this currently? I haven't worked on the compiler at all yet (was hoping I could wait out the LLVM requirement becoming optional), but as there's only 5 weeks left until the bounty runs out, now might be the time for somebody to step up. (~10% chance of success if I'm the one doing it.)

andrewrk commented 4 months ago

was hoping I could wait out the LLVM requirement becoming optional

Not sure where you got this idea from. Even in the most distant of future plans, Zig still has an LLVM backend that outputs an LLVM module as bitcode.

rohlem commented 4 months ago

was hoping I could wait out the LLVM requirement becoming optional

Not sure where you got this idea from. Even in the most distant of future plans, Zig still has an LLVM backend that outputs an LLVM module as bitcode.

@andrewrk Sorry, to clarify I wasn't talking about the LLVM parts of this proposal becoming optional, but rather the LLVM dependency whilst working on the compiler itself, aka being able to use the x86_64-backend for faster iteration times. (Though while working on / verifying the LLVM parts of this proposal, it would probably still be necessary anyway.)

andrewrk commented 4 months ago

Ah, I see. For some tasks it's already feasible; however, at this point in time the debugging experience is still lacking.

EzequielRamis commented 4 months ago

I'm working on it.

N00byEdge commented 4 months ago

How does the codegen of this interact with defer? If I have a defer in my current case block and 5 different switch-continues in the same switch case, does each path generate its own defer block, or will the last jump be an indirect/runtime jump?

rohlem commented 4 months ago

@N00byEdge Not sure how current codegen does this (you can already have multiple continue/break/ tail calls within the same parent block), but if we can add a PC return address for after the defer-generated block, this block can be shared and return to the caller's direct jump.

@EzequielRamis How's your progress? How confident are you in completing it before the deadline? Since the bounty is only awarded to the ZSF, feel free to upload your branch publicly to GitHub for others (like me) to cooperate where that may be helpful.

EzequielRamis commented 4 months ago

@rohlem I don't think it will be completed before the deadline, but I'm not worried. I was planning to upload what I've done and the remaining todos today; I think the draft will be up in a couple of hours.

N00byEdge commented 4 months ago

@rohlem

@N00byEdge Not sure how current codegen does this (you can already have multiple continue/break/ tail calls within the same parent block), but if we can add a PC return address for after the defer-generated block, this block can be shared and return to the caller's direct jump.

Ah right, that all checks out. I just get a little weirded out when I see language features that can't map nicely to machine code. But I guess if that already exists, then it's not too terrible.

expikr commented 4 months ago

I'm having a hard time wrapping my head around the concepts involved.

Is my understanding correct that the issue arises from wanting each prong to contain all of the operations to be done in each step rather than being shared across a common step?

Would this generate the desired machine code?

while(switch(inst_list[i].tag){
    .end => false,
    .add => sw: {
        map[i] = analyze_add(inst_list[i]);
        i += 1;
        break :sw true;
    },
    .addwrap => sw: {
        map[i] = analyze_addwrap(inst_list[i]);
        i += 1;
        break :sw true;
    },
    ...
}){}

EDIT: godbolt link

source

```zig
const Inst = extern struct {
    tag: enum(u8) {
        end,
        add,
        addwrap,
        alloc,
        alloc_mut,
        alloc_inferred,
        alloc_inferred_mut,
        anyframe_type,
        array_cat,
        array_mul,
    },
};

export fn entry(inst_list: [*]const Inst, map: [*]u32) void {
    var i: usize = 0;
    while (switch (inst_list[i].tag) {
        .end => false,
        .add => sw: {
            map[i] = analyze_add(inst_list[i]);
            i += 1;
            break :sw true;
        },
        .addwrap => sw: {
            map[i] = analyze_addwrap(inst_list[i]);
            i += 1;
            break :sw true;
        },
        .alloc => sw: {
            map[i] = analyze_alloc(inst_list[i]);
            i += 1;
            break :sw true;
        },
        .alloc_mut => sw: {
            map[i] = analyze_alloc_mut(inst_list[i]);
            i += 1;
            break :sw true;
        },
        .alloc_inferred => sw: {
            map[i] = analyze_alloc_inferred(inst_list[i]);
            i += 1;
            break :sw true;
        },
        .alloc_inferred_mut => sw: {
            map[i] = analyze_alloc_inferred_mut(inst_list[i]);
            i += 1;
            break :sw true;
        },
        .anyframe_type => sw: {
            map[i] = analyze_anyframe_type(inst_list[i]);
            i += 1;
            break :sw true;
        },
        .array_cat => sw: {
            map[i] = analyze_array_cat(inst_list[i]);
            i += 1;
            break :sw true;
        },
        .array_mul => sw: {
            map[i] = analyze_array_mul(inst_list[i]);
            i += 1;
            break :sw true;
        },
    }) {}
}

extern fn analyze_add(inst: Inst) u32;
extern fn analyze_addwrap(inst: Inst) u32;
extern fn analyze_alloc(inst: Inst) u32;
extern fn analyze_alloc_mut(inst: Inst) u32;
extern fn analyze_alloc_inferred(inst: Inst) u32;
extern fn analyze_alloc_inferred_mut(inst: Inst) u32;
extern fn analyze_anyframe_type(inst: Inst) u32;
extern fn analyze_array_cat(inst: Inst) u32;
extern fn analyze_array_mul(inst: Inst) u32;
```
assembly

```asm
# Compilation provided by Compiler Explorer at https://godbolt.org/
entry:
        push    r15
        push    r14
        push    rbx
        mov     rbx, rsi
        mov     r14, rdi
        xor     r15d, r15d
        movzx   eax, byte ptr [rdi + r15]
        jmp     qword ptr [8*rax + .LJTI0_0]
.LBB0_1:
        mov     edi, 1
        call    analyze_add@PLT
        mov     dword ptr [rbx + 4*r15], eax
        inc     r15
        movzx   eax, byte ptr [r14 + r15]
        jmp     qword ptr [8*rax + .LJTI0_0]
.LBB0_2:
        mov     edi, 2
        call    analyze_addwrap@PLT
        mov     dword ptr [rbx + 4*r15], eax
        inc     r15
        movzx   eax, byte ptr [r14 + r15]
        jmp     qword ptr [8*rax + .LJTI0_0]
.LBB0_3:
        mov     edi, 3
        call    analyze_alloc@PLT
        mov     dword ptr [rbx + 4*r15], eax
        inc     r15
        movzx   eax, byte ptr [r14 + r15]
        jmp     qword ptr [8*rax + .LJTI0_0]
.LBB0_4:
        mov     edi, 4
        call    analyze_alloc_mut@PLT
        mov     dword ptr [rbx + 4*r15], eax
        inc     r15
        movzx   eax, byte ptr [r14 + r15]
        jmp     qword ptr [8*rax + .LJTI0_0]
.LBB0_5:
        mov     edi, 5
        call    analyze_alloc_inferred@PLT
        mov     dword ptr [rbx + 4*r15], eax
        inc     r15
        movzx   eax, byte ptr [r14 + r15]
        jmp     qword ptr [8*rax + .LJTI0_0]
.LBB0_6:
        mov     edi, 6
        call    analyze_alloc_inferred_mut@PLT
        mov     dword ptr [rbx + 4*r15], eax
        inc     r15
        movzx   eax, byte ptr [r14 + r15]
        jmp     qword ptr [8*rax + .LJTI0_0]
.LBB0_7:
        mov     edi, 7
        call    analyze_anyframe_type@PLT
        mov     dword ptr [rbx + 4*r15], eax
        inc     r15
        movzx   eax, byte ptr [r14 + r15]
        jmp     qword ptr [8*rax + .LJTI0_0]
.LBB0_8:
        mov     edi, 8
        call    analyze_array_cat@PLT
        mov     dword ptr [rbx + 4*r15], eax
        inc     r15
        movzx   eax, byte ptr [r14 + r15]
        jmp     qword ptr [8*rax + .LJTI0_0]
.LBB0_9:
        mov     edi, 9
        call    analyze_array_mul@PLT
        mov     dword ptr [rbx + 4*r15], eax
        inc     r15
        movzx   eax, byte ptr [r14 + r15]
        jmp     qword ptr [8*rax + .LJTI0_0]
.LBB0_10:
        pop     rbx
        pop     r14
        pop     r15
        ret
.LJTI0_0:
        .quad   .LBB0_10
        .quad   .LBB0_1
        .quad   .LBB0_2
        .quad   .LBB0_3
        .quad   .LBB0_4
        .quad   .LBB0_5
        .quad   .LBB0_6
        .quad   .LBB0_7
        .quad   .LBB0_8
        .quad   .LBB0_9
```
SimonMeskens commented 1 month ago

EDIT: LLVM keeps removing the optimization when I add the bounds check, so this is not actually as successful as I hoped

I think I discovered another way to do it. Not sure if this helps anyone, but it's kinda cool. godbolt link (0.7.1, like the other examples) · godbolt link (0.12.0)

const Inst = extern struct {
    tag: Tag,

    const Tag = enum(c_int) {
        add,
        addwrap,
        alloc,
        alloc_mut,
        alloc_inferred,
        alloc_inferred_mut,
        anyframe_type,
        array_cat,
        array_mul,
        array_type,
        array_type_sentinel,
        indexable_ptr_len,
    };
};

export fn entry(inst_list_ptr: [*]const Inst, inst_list_len: usize, map: [*]u32) void {
    const inst_list = inst_list_ptr[0..inst_list_len];
    var i: usize = 0;
    while(true) : (i += 1) {
        const inst = inst_list[i];
        map[i] = switch (inst.tag) {
            .add => analyze_add(inst),
            .addwrap => analyze_addwrap(inst),
            .alloc => analyze_alloc(inst),
            .alloc_mut => analyze_alloc_mut(inst),
            .alloc_inferred => analyze_alloc_inferred(inst),
            .alloc_inferred_mut => analyze_alloc_inferred_mut(inst),
            .anyframe_type => analyze_anyframe_type(inst),
            .array_cat => analyze_array_cat(inst),
            .array_mul => analyze_array_mul(inst),
            .array_type => analyze_array_type(inst),
            .array_type_sentinel => analyze_array_type_sentinel(inst),
            .indexable_ptr_len => analyze_indexable_ptr_len(inst),
        };
    }
}

extern fn analyze_add(inst: Inst) u32;
extern fn analyze_addwrap(inst: Inst) u32;
extern fn analyze_alloc(inst: Inst) u32;
extern fn analyze_alloc_mut(inst: Inst) u32;
extern fn analyze_alloc_inferred(inst: Inst) u32;
extern fn analyze_alloc_inferred_mut(inst: Inst) u32;
extern fn analyze_anyframe_type(inst: Inst) u32;
extern fn analyze_array_cat(inst: Inst) u32;
extern fn analyze_array_mul(inst: Inst) u32;
extern fn analyze_array_type(inst: Inst) u32;
extern fn analyze_array_type_sentinel(inst: Inst) u32;
extern fn analyze_indexable_ptr_len(inst: Inst) u32;