ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License
34.86k stars 2.55k forks source link

Switch Prongs Defined as Comptime-Known Arrays #21507

Open mnemnion opened 1 month ago

mnemnion commented 1 month ago

This is an idea which comes directly out of a (modest) pain point in code I'm working on, as opposed to dreaming up language features on a nice-to-have basis. I've searched for prior discussion of this area without finding any, so I hope this clears the bar of tracking the priorities of real-world Zig users.

Problem Statement

The logic of runerip is driven by state-transition tables, and is structured so that there are inner functions which take the tables as anytype, so that differences in array width don't have to be erased by coercion to a slice. This gives code which is as efficient as the versions which use the state transition tables directly as globals, and lets me have several versions which differ only in the states.

The issue is that there are different collections of states: one is missing a state, another has an extra state. So certain switch statements have to contain all the states, even though I know that some variations have one or two which can never be called.

I don't think it's feasible to count on the optimizer examining the full specialized state table to eliminate possibilities, and my measurements indicate that it hasn't done so (although I haven't inspected the assembly output). There basically has to be a level of complexity where that isn't feasible, and finite state machines can have large numbers of states with very large transition tables: I don't know if there's a threshold at which it would try, but I appear to be over it if there is.

Branches which have to be checked for but will never occur are bad for efficiency, because they confuse branch prediction, and decrease the data locality of the instructions. Dead weight, basically. In my specific case, the difference is very modest, around 1% of the benchmark. But it's durable and visible, I ran it many times to confirm this. This switch is small, as is the number of spurious states involved: for bigger switches, and more states, the effect would be worse.

I'd like to add the implementations in runerip to std.unicode, once I'm done refining and testing it, so optimal performance here would be beneficial to everyone.

This specific problem domain isn't unique to what I'm working on, it's something which could come up a lot, state machines being one of the most general and efficient algorithms in existence. One of the strengths of FSMs is having a logic core which can be varied with different transition tables, and fully optimizing that is guaranteed to run into this problem eventually. Now that we have labeled switch continue, Zig is pushing best-in-class for state machines, it's a worthy area to consider improving further.

After drafting this, but before posting it, @mitchellh published a blog post which is directly related. The solution presented isn't applicable to my use case, but the affordance offered here would also serve as an eloquent solution to his, dramatically simplifying the resulting switch and making it much easier to understand, debug, and change.

There's no barrier to having a different behavior for a category, a comptime if statement will provide that and guarantee that the branch is during compilation, not runtime. It's providing a different category for a behavior which is not possible, and this would complete the dual.

The Idea

It would be nice if I could do this (this part is legal right now, naturally):

const two_byte_state = comptime if (state_machine == .tx_dfa)
    .{2, 12}
else
    .{2};

Then have this in the switch:

switch (st) {
    1, 5, 7 => ...,
    two_byte_state => ...,
}

The rule would be that an array of coercible-to-T, with comptime-known values, can coerce to a prong of a switch on T which matches any value in the array. In this case, we have a [2]comptime_int (or a [1]), and comptime_int coerces to u8 (in my case), and the values are known, so this would be allowed. I believe that comptime-known values would be a hard requirement; the Zig switch statement isn't pattern-matching, or a Lispy cond, and this proposal does not suggesting change that nature in any way.

An identifier used as a switch prong must already be a comptime known value, so this array coercion is directly comparable to existing practice. I don't think there's any context possible in which ambiguity could arise.

In addition to the benefit for comptime specialization, this would be good for code quality. A program which reuses switch categories could define them once, and benefit from a single source of truth. It gives meaningful names to prongs, which helps readability.

I claim that the rule is easy to understand, and philosophically compatible in the sense that many transformations which apply to comptime-known values only are already on offer in the language (including identifier-named switch prongs). The syntax is compatible: .{.fee, .fie, .foe} becomes the prong .fee, .fie, .foe =>.

There may of course be issues with this proposal which I wasn't able to ascertain, but this is the best counterargument I can think of:

Right now, if we see an identifier in a switch prong, we know that it refers to one value. This change would mean that it could also be several values. I think this is good on balance, however. A switch prong is a category, not a value, and this makes it more likely that a switch with entirely named categories can be written: identifiers in switch prongs can only represent values which trivially coerce to a literal switch prong. I'll even make the case that an identifier for a collection of values is more useful than it is for a single value, since determining the meaning of a collection is more challenging, and benefits from being grouped.

It passes the "missed that part of the manual" test: a reader who sees a variable in a switch statement, and looks it up to discover that it's an array of comptime-known values, would figure out what was going on without help.

It's possible that this might tempt users to use

const digit = .{'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'};

So they can have a digit => switch prong. This would only be a problem if the switch optimizer can't recognize a range inside a collection and optimize for it, and a quick experiment with Godbolt shows that it has no problem doing that. I wanted to mention this anyway, for completeness' sake. Also because, if the optimizer does consistently handle this, code like the tokenizer could have an invalid_control => switch, instead of repetition of 0x01...0x09, 0x0b...0x0c, 0x0e...0x1f, 0x7f. Whether that would be considered an improvement is a separate question: it's easier to read with ranges, as opposed to enumerating all the values in an array, but also shows up five times in the next state machine. I don't expect Zig will ever reverse existing practice and allow vertical tab and form feed in source code, but in that counterfactual, it's a change which is better made one time.

mvzr is switch-heavy code with large categories, some repeated in more than one switch. I would definitely use this feature to organize that code if it were available.

To refer back to @mitchellh's blog post, all of the comptime engineering could be replaced with defining sets of the enums as arrays, and reusing those sets as switch prongs. Granted that it solves only the first of the three reasons he didn't use an exhaustive list:

It's verbose. I have to list out all the terminal-specific actions even though I don't care about them. The blog post example only shows a half dozen actions but in reality there are many, many more.

I would suggest that the issue with the verbosity is that it's in the wrong place, the logic of the switch doesn't need a long list of all enum tags which don't happen to apply in a useful way. Defining that list at container level and then merely mentioning it would clarify the logic considerably, it might be that going the full (and very cool) route of comptime-subsetting might still be chosen for the resulting type safety, but perhaps merely defining a range would suffice. If you read the post (it's good, you should) you'll see several opportunities to apply this affordance to the solution he chose as well. For example, the scopes could be defined as arrays, and then reused anywhere they apply.

This would allow (only) two of the three forms of switch prongs to be defined and reused using an identifier. However, the ranged integers in #3806 could allow the third type to be an identifier as well, so this proposal can be thought of as a move in the direction of a consistent semantics for identifiers in switch prongs: all three patterns can be represented as identifiers, and even complex combinations can be several comma-separated identifiers. Sometimes this will clarify the logic of a switch, sometimes it would obscure it, but I hope we can trust users to choose the better option for their needs.

Another way to solve the issue would be some non-scope-introducing comptime conditional, which could wrap prongs of a switch statement. I've wanted something like that in other places as well, but it's a huge and far-reaching change, and I think it's less elegant for this case (although admittedly more powerful and general). Adding a new coercion is a much smaller proposal than adding new control flow which would deeply impact the grammar, and add a whole concept of a non-scoped block.

Existing Solutions

The available workaround is to gate the entire switch on the value of the DFA: since it's comptime-known, only one of the switches would be compiled per specialization, so it does achieve the outcome. This is unsatisfactory, however: it duplicates a lot of logic, and doesn't have the property that the shared prongs have one source of truth, meaning it has to be done carefully, and any bugs have to be corrected in several places. For a switch which performs assignment, this would get fairly ugly fairly quick, and it multiplies by the amount of customization needed. For large state machines which are basically a big switch statement, "unmaintainable disaster area" would be a fair reflection of the result of trying to do things this way. Better at that point to just write separate functions and keep them in sync.

If the modified states were contiguous, the test would most likely be just as efficient, but they aren't, and rewriting them to be isn't feasible. In some other program this might be adequate.

In some cases (haha) one could squeak by, through defining the states as enums, and making the variation an else branch. That isn't possible for runerip, because there are three cases, one adds to a branch and another removes an option from a different branch. This technique would also make one state into an else when it actually isn't, which is bad policy, and in my case, the states are used to index the lookup tables, with arithmetic, they aren't naturally expressed as enums.

I would say these are all variations on doing something inessential to the (often very complex) logic of the switch, and none of them is a general and adequate substitute for the proposal.

So that's the pitch: arrays of comptime-known values can coerce to switch prongs, if the element type can coerce to the switch type. I look forward to your feedback.

mlugg commented 1 month ago

Haven't read the proposal at all, but just from the title and example: I have wanted this feature recently, however it should have a different syntax in the switch case, for (a) clarity and (b) result types for switch items. Something like switch (x) { 0, 1 => foo, [a] => bar } where a is an array or slice or something (probably not actually that, that looks weird, just trying to clarify what I'm saying)

mnemnion commented 1 month ago

I don't quite see the bit about result types, I think the syntax is compatible there:

switch(a) {
   array_prong => |val| {_ = val;},
}

But I may be missing what you mean.

The clarity part, that I could see, it's the main thing I identified as maybe a problem. Possible:

const single = 5;
const multiple = .{6, 7, 8};
switch(a) {
    single => {},
   .{multiple} => |val| {_ = val;},
   // when integer ranges are an option, idk, this?
   ...range => |val|  {_ = val;},
}

Zig doesn't have a "spread" syntax, so .{multiple..} seems like a lot of syntactical novelty for a what it does, but maybe that would track as more familiar, IDK. I guess unbounded for loop ranges are pretty close to that.

I'm not fully sold on not knowing whether an identifier is one value or several is going to be a problem, but if we get this with some disambiguating syntax, that would be fine by me.

mlugg commented 1 month ago

I don't quite see the bit about result types

I'm not talking about captures. Result types are a component of RLS, which causes e.g. const x: S = .{ .x = 1 }; to work without trying to coerce an anonymous struct. They're also the mechanism behind decl literals, which is why they're desirable for switch items; it'd be nice to do e.g. switch (endian) { .native => ..., .foreign => ... } with native and foreign being decls on the Endian type. However, result types are fairly simple: you can't have a set of possible result types. So, for this to work, the type of the values left of => (the switch "items") must always be of a consistent type. For this proposal to work, then, we'd need a syntax differentiator telling the compiler to not provide this result type.

mnemnion commented 1 month ago

That does make more sense, thanks. It sounds like "perform this special coercion, but only in a switch prong" might not be an easy thing to just add as a special case.

The part about decl literals suggests that we might be talking about different features, in which case yes, there would absolutely need to be a syntax way of distinguishing between them.

rohlem commented 1 month ago

This seems a bit similar to https://github.com/ziglang/zig/issues/2473 , which proposes an error set prong to match any of the contained errors. Technically this proposal would allow the same functionality with an [N]anyerror prong (if I understand it correctly). EDIT: Also I believe a very similar idea was thought up in comment https://github.com/ziglang/zig/issues/15556#issuecomment-1532496771 , but I don't think there's been a separate proposal issue for it yet.

mlugg commented 1 month ago

Now that I have a bit more time, let me write up my use case.

I hit this issue recently while reworking CallingConvention (#21209). The specific details of that aren't relevant, but the general pattern of my use case is as follows.

I have a tagged union, where there are many tags, but the payloads are one of a small set of possible types. For instance:

const MyUnion = union(enum(u8)) {
    a: Foo,
    b: Foo,
    c: Foo,
    d: Bar,
    e: Bar,
    f,
    g,

    pub const Foo = struct { x: u8 };
    pub const Bar = struct { x: u32 };
};

I want to write logic which takes a MyUnion and serializes it. (In my actual use case, I'm converting to a DOD-friendly internal representation.) So, I could write something like this:

fn serialize(u: MyUnion, w: anytype) !void {
    try w.writeByte(@intFromEnum(u)); // tag
    switch (u) {
        .a, .b, .c => |foo| {
            try w.writeByte(foo.x);
        }.
        .d, .e => |bar| {
            try w.writeInt(.little, bar.x);
        },
        .f, .g => {},
    }
}

The thing is, I don't want to have to write out all of the tags here. I have several switches of this form, and in reality, there are around 100 tags. So, I'd really like to be able to write this (with the fictional [many_tags] syntax, which, to be clear, I don't think is a good syntax choice):

fn serialize(u: MyUnion, w: anytype) !void {
    try w.writeByte(@intFromEnum(u)); // tag
    switch (u) {
        [MyUnion.foo_tags] => |foo| {
            try w.writeByte(foo.x);
        }.
        [MyUnion.bar_tags] => |bar| {
            try w.writeInt(.little, bar.x);
        },
        [MyUnion.void_tags] => {},
    }
}

Unfortunately, there's no good way to do this today. I could do something kinda close by unpacking the payload into a separate type with an initial switch:

fn serialize(u: MyUnion, w: anytype) !void {
    try w.writeByte(@intFromEnum(u)); // tag
    const Payload = union(enum) {
        foo: Foo,
        bar: Bar,
        void,
    };
    const payload: Payload = switch (u) {
        inline else => |payload| switch (@TypeOf(payload)) {
            MyUnion.Foo => .{ .foo = payload },
            MyUnion.Bar => .{ .bar = payload },
            void => .void,
            else => comptime unreachable,
        },
    };
    switch (payload) {
        .foo => |foo| {
            try w.writeByte(foo.x);
        }.
        .bar => |bar| {
            try w.writeInt(.little, bar.x);
        },
        .void => {},
    }
}

...but this still may give suboptimal codegen, and -- more to the point -- it's incredibly clunky. In practice, I've ended up just doing an inline else on the entire thing:

fn serialize(u: MyUnion, w: anytype) !void {
    try w.writeByte(@intFromEnum(u)); // tag
    switch (u) {
        inline else => |payload| switch (@TypeOf(payload)) {
            MyUnion.Foo => try w.writeByte(payload.x),
            MyUnion.Bar => try w.writeInt(.little, payload.x),
            void => {},
            else => comptime unreachable,
        },
    }
}

Now, the optimizer here could detect that a bunch of cases have identical bodies (e.g. a, b, c), and deduplicate them. But that's a quite fragile optimization; I'm not sure if LLVM will do it in my case, and even if it does, relying on that feels a little dangerous.

So, this proposal would allow me to more clearly communicate my intent to the compiler to generate optimal code. It also results in a much more legible implementation than either workaround offered here.

jayrod246 commented 1 month ago

If I follow the proposal correctly, @mlugg's code could look something like:

const MyUnion = union(enum(u8)) {
    a: Foo,
    b: Foo,
    c: Foo,
    d: Bar,
    e: Bar,
    f,
    g,

    pub const Foo = struct { x: u8 };
    pub const Bar = struct { x: u32 };

    // simple arrays of `enum literal`
    pub const foo_tags = .{ .a, .b, .c };
    pub const bar_tags = .{ .d, .e };
    pub const void_tags = .{ .f, .g };
};

fn serialize(u: MyUnion, w: anytype) !void {
    try w.writeByte(@intFromEnum(u)); // tag
    switch (u) {
        // literals coerce to union's tag type
        .foo_tags => |foo| {
            try w.writeByte(foo.x);
        },
        .bar_tags => |bar| {
            try w.writeInt(.little, bar.x);
        },
        .void_tags => {},
    }
}
mnemnion commented 1 month ago

@jayrod246 That's the idea, although as written up top it would be this:

switch (u) {
    // literals coerce to union's tag type
    foo_tags => |foo| {
        try w.writeByte(foo.x);
    },
    bar_tags => |bar| {
        try w.writeInt(.little, bar.x);
    },
    void_tags => {},
}

They're identifiers with enum tags, not enums themselves. @mlugg elided the definition of foo_tags, but from context, decl literal, seems like the same thing.

I still don't 100% understand the result type problem with this:

The rule would be that an array of coercible-to-T, with comptime-known values, can coerce to a prong of a switch on T which matches any value in the array.

I know enough to recognize that no other coercion works this way, but there's also no other place in the language where it really makes sense to do. It might well be the case that a special-case rule for this type in that location is not feasible, I'm operating on a fairly surface level here, like "this type is not compatible with the result type, the result location is in a switch prong, the type is an array, it's comptime, and the elements coerce to the result type <special magic happens>".

Would something like this work?

switch (u) {
    // literals coerce to union's tag type
    @spread(foo_tags) => |foo| {
        try w.writeByte(foo.x);
    },
    @spread(bar_tags) => |bar| {
        try w.writeInt(.little, bar.x);
    },
    @spread(void_tags) => {},
}

I almost went with @splat since we already have it, but I thought about what the documentation would look like, and it would be describing one builtin which does two almost entirely separate things. @unpack maybe, that's the name of the analogous function in Lua.

The issue with foo_tags.. is pretty obvious, foo_tags..,.baz is waaay too much like foo...baz. Maybe with required parentheses? (foo_tags..)?

I don't hate this, but it would be nice to be able to just use an identifier directly. It's funny because a preprocessor macro could just drop the list in place directly, I'm realizing I've been thinking of this almost as a source-level transformation: drop the interior of the array literal as a string right where the identifier is located, and see if it compiles.

Obviously we want it to look pretty, and consistent with the rest of the language, that matters. But what it looks like matters much less to me than being able to do it.

nektro commented 1 month ago

going off of your example, if it was a builtin then as a reader i would expect it to be a function that returns a single instance of type @TypeOf(u). what about inline foo_tags? this may not work if inline switch cases currently accept any comptime-known value rather than only literals. or ...foo_tags? i also do also like mlugg's initial example of [foo_tags]

mnemnion commented 1 month ago

That's why I mentioned @splat, which takes a scalar and produces a vector. This operation is not the same one, but also seems like a reasonable result for a magic builtin to provide.

Builtins cover a lot of ground, but are mostly things which need the compiler's cooperation to work. I don't think this is an unreasonable place to use a builtin, and it has the advantage of not adding syntax to the language.

Best of the ideas? Eh, probably not.

I do like that [a] has brackets in it, but they're kind of in the wrong place? Something like this, perhaps:

    foo_tags[..] => |cap| { _ = cap; },
    // Better as this, I think:
    foo_tags[...] => |cap| { _ = cap; },

I'd say that does a good job of indicating what's going on here. This one's growing on me. The first one looks too much like a slice with no bounds, so someone might want to try foo_tags[3..6], and that's just another array.

On the other hand, this would be perfectly normal Zig: foo_tags[3..6][...]. Slicing an array with comptime-known bounds gives another array, and then [...] spreads it as switch matches. Using [...] is good because ... only shows up in switch prongs, the semiotics are clearer.

I am coming around to the view that we do need syntax here. Even if it's practical to make the coercion happen directly, there's still the problem where we go from knowing that an identifier is one choice in a prong, to the possibility of it being several.

Which I'm starting to see as a real problem, because when it composes, you have this issue:

   foo_tags, baz_tag => |val| {
           // ???
           _ = val;
   }
   // vs:
   foo_tags[...], baz_tag => |val| { _ = val;}

A prong using a lot of identifiers would not be easy to read if you can't tell which are collections and which are scalar values, so yeah, I think this needs syntax on user interface grounds, as well as to cooperate with the compiler as currently implemented.

Last point in favor of [...] is that in the event that integer ranges happen, it's reasonable to reuse this syntax for those as well. The difference between a scalar and a collection is bigger than the difference between a discrete collection and a collection expressed as a range.

Also, this is kind of its own thing, but a good companion to this feature would be making this work:

    const alpha_num: [63]comptime_int = .{'a' ... 'z', 'A' ... 'Z', '0' ... '9', '_'};

It would help with refactoring, and there are other contexts where it would be useful as well. This can just be a comment on this issue, I don't see an advantage in breaking it off as a proposal just yet.