ziglang / zig


Grammar change - require semicolons at the end of every statement #8856

Open yohannd1 opened 3 years ago

yohannd1 commented 3 years ago

Note: I'm making some assumptions about the language, especially about the grammar. I've read part of the grammar, but I'm not confident I understood it well, so if I've made a mistake, please correct me.

There's a thing in Zig that has been bothering me for a while: depending on the situation, a semicolon is either required or forbidden. As an example:

// Since this first one is a comptime statement, it takes an expression and
// needs a semicolon on the end.
comptime doSomeThing();

// For this one, though, since it's a block statement, there's no need
comptime {
  ...
} // no semicolon here..? And, if I add it, parsing error

// This simple if statement also is a block statement
if (expr) {
    std.debug.print("Hello there!", .{});
} // no semicolon here, and if I add it, parsing error

// A combination of the two: the if statement is an expression and
// therefore needs an ending semicolon
comptime if (expr) {
    @compileError("Hello there!");
}; // a semicolon here! If I remove it, parsing error

(There are other situations where this might show up, but I think this one I showed is one of the most interesting.)

I think this might cause quite a lot of confusion for beginners and, beyond that, it might still be a little annoying for people who are already used to Zig.

My solution to this is to require semicolons at the end of every statement. With this change, what we currently have as:

if (expr) {
    doSomething();
}

if (expr) doSomething();

switch (some_other_expr) {
    else => 10,
}

{
    const value = 10;
    var x: i32 = 9;
}

comptime {
    _ = 9;
}

defer {
    defer std.debug.print("H\n", .{});

    std.debug.print("End of scope! Yay!\n", .{});
}

Would become:

if (expr) {
    doSomething();
};

if (expr) doSomething();

switch (some_other_expr) {
    else => 10,
};

{
    const value = 10;
    var x: i32 = 9;
};

comptime {
    _ = 9;
};

defer {
    defer std.debug.print("H\n", .{});

    std.debug.print("End of scope! Yay!\n", .{});
};

The grammar definition would also benefit from this, since currently the SEMICOLON token is handled in several different ways to accommodate how each kind of statement works.

# Currently
Statement
    <- KEYWORD_comptime? VarDecl
     / KEYWORD_comptime BlockExprStatement
     / KEYWORD_nosuspend BlockExprStatement
     / KEYWORD_suspend (SEMICOLON / BlockExprStatement)
     / KEYWORD_defer BlockExprStatement
     / KEYWORD_errdefer BlockExprStatement
     / IfStatement
     / LabeledStatement
     / SwitchExpr
     / AssignExpr SEMICOLON

# After the change (not fully accurate representation)
Statement
    <- (KEYWORD_comptime? VarDecl
        / KEYWORD_comptime BlockExpr
        / KEYWORD_nosuspend BlockExpr
        / KEYWORD_suspend BlockExpr?
        / KEYWORD_defer BlockExpr
        / KEYWORD_errdefer BlockExpr
        / IfExpr
        / LabeledExpr
        / SwitchExpr
        / AssignExpr) SEMICOLON

Wrapping up, here's a list of pros and cons I could think of with the implementation of this proposal as-is.

Pros

Cons


Note: I've also thought about dropping the semicolon requirement for statements that end with braces, but I think that's harder to understand, the grammar gets more complex, and the example below would be ambiguous:

const MyType = struct {
    val: i32,
} // no semicolon needed! yay!

// (using new function-as-expression syntax from #1717)
const main = fn () void {} // this would also be valid, but like this,
                           // on just one line, it feels very off
                           // compared to normal statements

const x = blk: {
    break :blk 10;
} // the parser might need to do some trickery to understand this is not yet
  // the end of the expression
+
10;

amrojjeh commented 3 years ago

Should functions have semicolons at the end also?

fn main() void {
};

Edit: I've not kept up on #1717, so I'm not sure if the old way of writing functions is going away. I apologize for my ignorance.

yohannd1 commented 3 years ago

Yeah, #1717 would change things, but maybe the old way will still be available? I'm not sure. If it is kept, yeah, it would have a semicolon at the end, since it's a statement that defines a function (as far as I know).

nektro commented 3 years ago

semicolons are already required at the end of statements. the situations you described that do not require one are not statements at all, they're expressions/blocks.

example

//expression
// conditional block in this case
if (a) { b(); } else { c(); }

//statement (semicolon required)
// assignment is the statement
// the value is the result of the conditional expression
const x = if (a) b else c;

ifreund commented 3 years ago

> semicolons are already required at the end of statements. the situations you described that do not require one are not statements at all, they're expressions/blocks.

They are statements according to the grammar; let's all use the terminology from the grammar to avoid confusion.

Statement
    <- KEYWORD_comptime? VarDecl
     / KEYWORD_comptime BlockExprStatement
     / KEYWORD_nosuspend BlockExprStatement
     / KEYWORD_suspend (SEMICOLON / BlockExprStatement)
     / KEYWORD_defer BlockExprStatement
     / KEYWORD_errdefer BlockExprStatement
     / IfStatement
     / LabeledStatement
     / SwitchExpr
     / AssignExpr SEMICOLON

yohannd1 commented 3 years ago

After taking a look at the grammar again, I noticed there is not only the IfStatement rule, but also the IfExpr rule, and they seem to be used in different contexts (I suppose IfStatement can't be used as an expression, but IfExpr can). If I understood what @nektro meant, I suppose it's technically right, but I think the grammar could be tweaked in that regard. I think with IfStatement removed, IfExpr can be used in its place just like SwitchExpr, which is one of the sub-rules in Statement.

ifreund commented 3 years ago

To clarify more, there are three major levels which the grammar defines: top/decl level, block/statement level, and expression level. The first deals with unordered declarations inside a container type (note: files are container types as they are implicitly structs). The second deals with ordered statements in blocks/function bodies. This is where it is proposed to always require a semicolon. The final expression level appears in certain places within the other two.

Here is some example code to hopefully illustrate this better than words can manage:

// Top/Decl level. the following line is a declaration
const x = 42;

// This comptime block is also a declaration, we don't require a semicolon here
comptime {
    // Block/Statement level. the following line is a statement:
    const x = 42;
    // this line is also a statement, as is the `foo();` inside the block of the if:
    if (true) { foo(); }
    // Here we use an if expression inside a function call statement:
    bar(if (true) 1 else 2);
}

zeramorphic commented 3 years ago

Another alternative is a Go-style automatic semicolon rule, which would make more semicolons required but fewer of them visible. To a certain extent it perhaps doesn't really matter what the final decision is, as long as it's consistent and the compiler makes a reasonable effort to tell users when they've got it wrong.

const x = 42  // compiler infers semicolon since `42` can end a declaration

comptime {
    const x = 42  // compiler infers semicolon since `42` can end a statement
    if (true) { foo() }   // compiler infers semicolon after `)` and after closing brace, which mismatches today's grammar
    bar(if (true) 1 else 2)  // compiler infers semicolon since `)` can end a statement
}  // compiler infers semicolon after closing brace, which mismatches today's grammar
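The core of a Go-style rule fits in a few lines. The following is a minimal sketch in Zig, not how any real compiler implements it: `newlineTerminates` is a hypothetical helper that looks only at the last non-blank byte of a line, whereas Go's actual rule works on tokens (an identifier, a literal, one of a few keywords such as `return` and `break`, or a closing `)`, `]`, `}`). Trailing comments are not stripped, and keywords are not distinguished from identifiers.

```zig
const std = @import("std");

/// Hypothetical, byte-level approximation of a Go-style rule: a newline acts
/// as a terminator when the last non-blank byte on the line could end a
/// statement. A real lexer would decide this on tokens, not bytes.
fn newlineTerminates(line: []const u8) bool {
    var end: usize = line.len;
    // Skip trailing spaces and tabs.
    while (end > 0 and (line[end - 1] == ' ' or line[end - 1] == '\t')) end -= 1;
    if (end == 0) return false; // blank line: nothing to terminate
    return switch (line[end - 1]) {
        // Identifier/literal bytes and closing brackets can end a statement.
        'a'...'z', 'A'...'Z', '0'...'9', '_', '"', '\'', ')', ']', '}' => true,
        // Operators, commas, and opening brackets mean "more to come".
        else => false,
    };
}

test "byte-level newline rule (sketch)" {
    try std.testing.expect(newlineTerminates("const x = 42"));
    try std.testing.expect(newlineTerminates("bar(if (true) 1 else 2)"));
    try std.testing.expect(newlineTerminates("}"));
    try std.testing.expect(!newlineTerminates("const y = 1 +"));
    try std.testing.expect(!newlineTerminates("foo("));
}
```

Saved to a file, the block can be checked with `zig test`. Note that a line ending in `return` also counts as terminated under this rule.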

yohannd1 commented 3 years ago

Well, as for automatic semicolons, I think this has been discussed before (https://github.com/ziglang/zig/issues/7938, https://github.com/ziglang/zig/issues/3188, https://github.com/ziglang/zig/issues/483). It seems most people don't want semicolons (I personally don't either), but I agree with you on prioritizing consistency.

lemaitre commented 3 years ago

I prefer semicolons to newlines for ending statements. (Newlines are blank characters, just like spaces.)

vladimir-kraus commented 2 years ago

From a language purist's perspective this sounds justified. But from a practical perspective, it would terribly hurt adoption of Zig by people coming from C, C++, and many other languages. Having to write a semicolon after each block just feels so obtrusive...

vladimir-kraus commented 2 years ago

> newlines are blank characters, just like spaces

Yes, but these blank characters have a very important meaning. Spaces can separate two code tokens: thanks to the space, const x has a very different meaning than constx. And just as spaces separate tokens, newlines can separate statements. Therefore I do not see any benefit in forcing semicolons after every statement, IMO.

Hejsil commented 2 years ago

Seems to me that the actual issue is that some things that look like they shouldn't require semicolons actually do, like:

defer if (true) {};

I think it is reasonable to be confused by this, so it seems like this is the actual issue that should be solved. Playing around a bit, we can change the grammar to this to make this code compile:

Statement
    <- KEYWORD_comptime? VarDecl
     / KEYWORD_comptime Statement
     / KEYWORD_nosuspend Statement
     / KEYWORD_suspend Statement
     / KEYWORD_defer Statement
     / KEYWORD_errdefer Payload? Statement
     / IfStatement
     / LabeledStatement
     / SwitchExpr
     / AssignExpr SEMICOLON

Ofc, this is not a full solution and this needs to be checked for edge cases.

mruncreative commented 2 years ago

I'll argue for the opposite:

  1. The semicolon is only easy to type on American keyboards (same with @).
  2. Redundancy for safety has already been tried in Ada. (Have a look at it.) Ada didn't become popular and it didn't help safety. An uneven number of brackets is just as easy to notice as "end if" or "end NAMEOFPROCEDURE". The same applies to semicolons.
  3. Many languages have optional semicolons or require none at all: Swift, JavaScript, Lua, and Tcl, for example.
  4. Usually it can be determined whether something is to follow or not. If something is to follow, just don't treat the newline as a semicolon (unclosed brackets, trailing operators, etc. can help). It's more natural; we think the same way. This totally works without semicolons:
    var
    a
    :
    i32
    =
    foo(
    a
    ,
    b
    ,
    c
    ) +
    8
  5. If it isn't possible to adjust the grammar in a way that requires neither semicolons nor newlines to keep statements apart, you can still have them for writing multiple statements on one line.
  6. You can make them optional to satisfy semicolon purists.

buzmeg commented 1 year ago

As someone who got used to Erlang's use of punctuation, I can live with most choices. I think I probably slightly prefer explicit semicolons after everything. Just don't make them "optional". There is no reason to repeat the disaster that optional semicolons wrought in Javascript.

However, requiring the semicolon would seem to make parsing recovery after syntax errors a lot easier for tooling. That's probably more newbie-friendly than the actual fact of having more semicolons than C/C++.

After all, people got used to Rust where the existence or not of a semicolon changes what your return value is. That is far more intrusive than requiring semicolons everywhere.

himazawa commented 1 year ago

As already stated, semicolons are not super easy to type on a non-ANSI layout.

I really like zig, but having used languages without semicolons for the last ~7 years, going back to semicolons gives me PTSD of the C ages.

Honestly with Go, absence of semicolons works pretty well, and I still can't find a real argument against removing them other than "code clarity" but that shouldn't depend on the semicolon TBH.

I have seen a lot of issues regarding this and strong opinions from both sides, but not a single official answer from the team on which is the intended direction. I understand that making a decision is not simple, but having a clear direction would be nice for people who want to approach the language.

mnemnion commented 2 months ago

> I think I probably slightly prefer explicit semicolons after everything. Just don't make them "optional". There is no reason to repeat the disaster that optional semicolons wrought in Javascript.

As with many things Javascript, the implementation of "automatic semicolon insertion" is uniquely bad. Lua and Julia are two languages which do what I would prefer to see Zig do also: a newline terminates an expression, a semicolon is allowed, and to write on one line what would otherwise require a newline, you must use a semicolon. There's no ambiguity, because semicolons aren't "inserted" via a hard-to-understand rule. They're allowed where a newline is required, and are required if the newline isn't present.

This does mean that some expressions have to be written with parentheses which would otherwise not be needed. Another option is backslash-escaping a newline to make a multi-line expression behave like a single line. I've found the tradeoff to be well worth it.

987Nabil commented 1 month ago

I have been using Scala for ~10 years (and some Kotlin too, which has the same behaviour). It has had optional semicolons, basically like what @mnemnion described, since I started using it. I never had a problem of ambiguity. One thing that can be unintuitive was

1 + //compiles, since the compiler finds the rhs in the next line
1

1
+ 1 // does not compile

In Scala 3 this is gone, since we have meaningful whitespace:

1
 + 1 // has a leading space, so it is part of the expression of the line above.

I'm not saying Zig should adopt meaningful whitespace, but optional semicolons are not a practical issue.

mnemnion commented 1 month ago

This is a bit of a meander about syntax, specifically if expressions. My premise is that if any part of the language would suffer from optional semicolons, it would be if expressions.

Background: when writing C, I have an ironclad rule to always put braces around the prongs of if statements. It's just too easy to add another statement and silently take an unconditional action; there have been some famous bugs from this, such as 'goto fail'.

The problem in a nutshell:

// starts like this:
if (someCondition())
    whenTrue();

// becomes this
if (someCondition())
    whenTrue();
    oops(); // executed unconditionally.

However, this is also valid Zig. The reason I don't follow that rule with Zig code is actually cultural: like ~everyone else, I use zig fmt, it triggers on save, and the indentation would tell me I did something wrong. It's still a somewhat dangerous construct; my personal style is that an if-only branch with a single expression should go on a single line, so I use the style above only when there's an else:

if (someCondition())
    whenTrue()
else
    whenFalse();

Here we get a compile error if something is added to the true prong, but this would be true in C also. We've pushed the problem to the false prong, and once again, it's formatting which comes to the rescue and makes this tolerably safe.

So my conclusion is that the absence of semicolons wouldn't harm this construct at all. The example would just be this:

// still caught by formatting, if anything
if (someCondition())
    whenTrue()
    oops()

Or this:

if (someCondition())
    whenTrue()
    oops() // still a compile error
else
    whenFalse()

So nothing has actually changed.

Other cases where semicolons appear to be semantically meaningful don't, on close examination, appear to matter at all.

Like this one:

// bare switch:
switch (anEnum)  {
    .fee => sayFee(),
    .fie => sayFie(),
    .foe => sayFoe(),
} // <- Semicolon not allowed

// Assigning switch
const said = switch(anEnum) {
    .fee => "fee",
    .fie => "fie",
    .foe => "foe",
}; // <- Semicolon is mandatory

If semicolons were optional, the first one would still be an illegal place to put a semicolon, and the second one would allow it but not require it. That would be neither harder to parse, nor more difficult to read and understand.

In other words, semicolons and their lack enforce a distinction between expressions and statements, but I haven't found a place where they uniquely determine that difference. That is, a place where adding a semicolon changes a statement to an expression without any further changes, and results in a valid program with a different meaning.

TL;DR, I've convinced myself that semicolons could be made completely optional in Zig, except when multiple expressions are placed on the same line. It would be a 100% backward compatible change, requiring no alteration to any existing programs, and it wouldn't harm understanding of semicolon-free programs, or result in any new dangerous ambiguities. If there's a counterexample, I'm eager to hear it.

As a note, I've worked professionally with and on Parsing Expression Grammars for years, this is a change I could plausibly make myself. From reading the grammar, it should be literally as cheap as making the semicolon rule accept a newline as well as a semicolon.

Making it all function is likely to be enough work that I would want some reasonable buy-in on the idea before proceeding, but I'd like the core team to consider this as an offer, rather than a request or, heaven forfend, a demand.

mnemnion commented 1 month ago

> I never had a problem of ambiguity. One thing that can be unintuitive was
>
> 1 + //compiles, since the compiler finds the rhs in the next line
> 1
>
> 1
> + 1 // does not compile

Well drat, I take it back about optional semicolons not invalidating any currently well-formed programs.

But I think that in practice this is no problem at all, because of, once again, zig fmt. I just checked, and, no surprise, it puts that sort of spread-out math statement onto one line.

A breaking change is a breaking change, though. I'm fairly sure that Zig doesn't enforce a rule that code must be run through zig fmt, and without that, there are conceivable programs which would not compile with optional semicolons.

As a practical matter, the number of actual programs to which this would apply could be as few as "none". If there's an example of a possible program which would compile, but would change meaning, that could be a show-stopper.

This would still compile:

const result = 1 + // semicolon illegal here, just whitespace
   1;

This would not:

const result = 1  // valid semicolon, inserted
 + 1; // invalid fragment, compiler error

This would call for a careful analysis to see if there are cases which wouldn't fall into one or the other category.
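The two cases above can be told apart mechanically. Below is a rough sketch, assuming a rule where a newline or `;` ends a statement unless the line so far ends in an operator or sits inside open parentheses. `countStatements` and `continuesLine` are hypothetical names, and comments and string literals are not handled.

```zig
const std = @import("std");

// Hypothetical helper: bytes that, at the end of a line, signal "more to come".
fn continuesLine(last: u8) bool {
    return switch (last) {
        '+', '-', '*', '/', '=' => true,
        else => false,
    };
}

/// Counts statements when both `;` and a newline may act as terminators.
fn countStatements(src: []const u8) usize {
    var count: usize = 0;
    var depth: usize = 0; // number of unclosed `(`
    var last: u8 = 0; // last non-blank byte of the pending statement, 0 if none
    for (src) |c| {
        switch (c) {
            ' ', '\t', '\r' => {},
            ';' => {
                if (last != 0) count += 1;
                last = 0;
            },
            '\n' => {
                // The newline terminates only a complete-looking statement.
                if (last != 0 and depth == 0 and !continuesLine(last)) {
                    count += 1;
                    last = 0;
                }
            },
            '(' => {
                depth += 1;
                last = c;
            },
            ')' => {
                if (depth > 0) depth -= 1;
                last = c;
            },
            else => {
                last = c;
            },
        }
    }
    if (last != 0) count += 1; // unterminated trailing statement
    return count;
}

test "trailing operator continues, leading operator does not" {
    try std.testing.expectEqual(@as(usize, 1), countStatements("const result = 1 +\n   1;"));
    try std.testing.expectEqual(@as(usize, 2), countStatements("const result = 1\n + 1;"));
}
```

A count of 2 for the second input means the rule split it in two; the second piece, `+ 1`, is the fragment a parser would reject.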

mnemnion commented 1 month ago

I'll save some time (would have been better to read the linked issues thoroughly before I went off like this):

// when does this invoke some_function()?
  if (some_long_condition()) return
      some_function();

  // same
  for (some_slice) |x| {
      if (some_long_condition()) break
           some_function();
  }

These are genuine ambiguities. An optional-semicolon grammar would count the newline after the return and break as a terminator, which would be wrong, and the program would compile.
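For reference, the current behaviour can be checked with a small test. This is a sketch with hypothetical names (`run`, `someFunction`, `calls`): today the newline after `return` is ordinary whitespace, so the call on the next line is the operand of `return` and only runs when the condition holds.

```zig
const std = @import("std");

// Hypothetical call counter, used only to observe when someFunction() runs.
var calls: u32 = 0;

fn someFunction() void {
    calls += 1;
}

// Under the current grammar this parses as `if (cond) return someFunction();`.
fn run(cond: bool) void {
    if (cond) return
        someFunction();
}

test "stranded return under the current grammar" {
    calls = 0;
    run(false); // condition is false: nothing is called
    try std.testing.expectEqual(@as(u32, 0), calls);
    run(true); // condition is true: someFunction() runs as the return operand
    try std.testing.expectEqual(@as(u32, 1), calls);
}
```

Under a newline-terminates rule, the same source would presumably return early when `cond` is true and call `someFunction()` otherwise, so `calls` would already be 1 after `run(false)`.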

Only mitigation here is good ol' zig fmt, which turns those statements into this:

    if (some_long_condition()) return some_function();

    // same
    for (some_slice) |x| {
        if (some_long_condition()) break some_function();
    }

But this weakens the proposal considerably. If you add enough characters to some_long_condition, it won't reformat to a single line, either.

Detectable? Yes, this is detectable, using a before-and-after parse, and I would venture that all ambiguities would be detectable that way.
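A full before-and-after parse needs two parsers, but the specific `return`/`break` case can be approximated line by line. The following is a heuristic sketch, not the before-and-after comparison itself: `strandsOperand` and its helpers are hypothetical, and labeled breaks such as `break :blk` are not covered.

```zig
const std = @import("std");

fn isIdentChar(c: u8) bool {
    return switch (c) {
        'a'...'z', 'A'...'Z', '0'...'9', '_' => true,
        else => false,
    };
}

// True if `line`, ignoring trailing blanks, ends in `keyword` as a whole word.
fn endsWithKeyword(line: []const u8, keyword: []const u8) bool {
    var end: usize = line.len;
    while (end > 0 and (line[end - 1] == ' ' or line[end - 1] == '\t')) end -= 1;
    const t = line[0..end];
    if (!std.mem.endsWith(u8, t, keyword)) return false;
    const before = t.len - keyword.len;
    // Reject identifiers such as `my_break` that merely end in the keyword.
    return before == 0 or !isIdentChar(t[before - 1]);
}

fn isBlank(line: []const u8) bool {
    for (line) |c| {
        if (c != ' ' and c != '\t') return false;
    }
    return true;
}

/// Flags a line ending in `return` or `break` whose operand sits on the next
/// line, i.e. source whose meaning would depend on how newlines are treated.
fn strandsOperand(line: []const u8, next_line: []const u8) bool {
    if (isBlank(next_line)) return false;
    return endsWithKeyword(line, "return") or endsWithKeyword(line, "break");
}

test "flag stranded return and break" {
    try std.testing.expect(strandsOperand("  if (some_long_condition()) return", "      some_function();"));
    try std.testing.expect(strandsOperand("      if (some_long_condition()) break", "           some_function();"));
    try std.testing.expect(!strandsOperand("    if (some_long_condition()) return some_function();", ""));
}
```

A linter built on this idea would only need to warn on flagged lines; whether that is enough to make the change acceptable is a separate question.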

Bad style? Absolutely, no one should be stranding a break or return statement that way, it's awful style.

Worth it? Maybe. But breaking change? Well and truly would be.