ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License
34.87k stars 2.55k forks source link

Syntax flaw: Block statements and terminating semicolon #1677

Closed Hejsil closed 6 years ago

Hejsil commented 6 years ago

Currently, block expressions (for, if, while) are only terminated with ;, if the ending block of that expression is not a block. This makes the grammar very hard (maybe even impossible) to make context-free. My best bet at formalizing a grammar for this have been something along these lines:

Statement 
    : IfStatement
    | Block
    ...
    | Expr Semicolon

IfStatement
    : "if" GroupedExpr option(Payload) Statement
    | "if" GroupedExpr option(Payload) Expr "else" option(Payload) Statement

IfExpr
    : "if" GroupedExpr option(Payload) Expr option("else" option(Payload) Expr)

Sadly, this grammar requires N token lookahead at best, and at worst it is ambiguous because a Block is also an Expr.

Hejsil commented 6 years ago

Consider this example:

test "" {
    if (true)
        if (true)
            if (true)
                if (true)
                    if (true)
                        if (true)
                            return;
}
kyle-github commented 6 years ago

Personally I prefer to enforce braces around block bodies. Perl does this and skips a lot of pain. C-like languages generally allow you to skip the braces if there is only one statement in the body.

As Apple found out, it is not always a good idea to leave out the braces.

Hejsil commented 6 years ago

I think the solution to this is related to #760 #114 (this comment) and maybe even #1676.

We can change the syntax of the language so that syntactic constructs that will never pass semantic analysis will give a parse error. Consider this piece of code:

test "" {
    1 + 1; // error: expression value is ignored
}

No matter what the LHS and RHS are for this operator, this code will never pass semantic analysis, because + always returns an expression that can shouldn't be ignored. We can, therefore, split expressions into categories, where only some are valid as statements. This is the same solution as this. With a system like this, we can disallow the if expressions at the statement level and the ambiguity is solved.

@kyle-github We can then consider enforcing blocks, but it is not necessary to solve this issue (i think).

thejoshwolfe commented 6 years ago

I'm not an expert on formal language theory, so I don't understand the problem in this issue. Does the syntax flaw manifest in any way other than purely theoretical?

Hejsil commented 6 years ago

@thejoshwolfe Well, if we wanna have a formal grammar, it should be unambiguous, so that differently implemented parses don't vary in behavior. Otherwise, what is the point of having a grammar?

We could ask the same question for #760. The stage1 compiler takes a choice, and always follows that choice, so in implementation, there is no ambiguity. We still consider it a problem though, because ambiguities make code harder to reason about and read (even if the parser is consistent).

Hejsil commented 6 years ago

Also, it makes using parser generators to parse the Zig language trivial, which helps with specifying and testing the grammar itself. If we can have a grammar that can actually parse code by generating a parser, then, if some compilers vary in what syntax they parse, then you can point to the grammar and its generates parser and say "well, the parser that does the same as this generated one is correct" (This will help when new syntax have to be added to stage1 and stage2).

Hejsil commented 6 years ago

I've also heard, that different C++ compiler disagrees on what syntax is valid. I'd be nice if we could avoid such a mess. :)

Hejsil commented 6 years ago

3rdly! Having a grammar we can generate from allows from quick prototyping of syntax. We still have #114, and that is a big change. We should probably ensure we get it right, before rewriting a 2000 line parser :)

thejoshwolfe commented 6 years ago

Here are some excerpts from the grammar at the bottom of the langref docs:

Block = option(Symbol ":") "{" many(Statement) "}"

Statement = LocalVarDecl ";"
          | Defer(Block)
          | Defer(Expression) ";"
          | BlockExpression(Block)
          | Expression ";"
          | ";"

Defer(body) = ("defer" | "errdefer") body

BlockExpression(body) = Block
                      | IfExpression(body)
                      | IfErrorExpression(body)
                      | TestExpression(body)
                      | WhileExpression(body)
                      | ForExpression(body)
                      | SwitchExpression
                      | CompTimeExpression(body)
                      | SuspendExpression(body)

IfExpression(body) = "if" "(" Expression ")" body option("else" BlockExpression(body))

I'm not totally clear on what a context-free grammar is, but I'm assuming that this parametric (body) pattern doesn't qualify for that.

I don't think this is ambiguous. You're supposed to match the patterns in highest-to-lowest precedent as listed. So if Defer(Block) matches, then don't bother trying to match Defer(Expression) ";". Is this not good enough?

thejoshwolfe commented 6 years ago

We don't need the rules in #114 to be encoded in the grammar. Those rules can be enforced after parsing by examining the AST. I'm more optimistic about that approach generating more helpful error messages anyway. I understand that that will result in sloppy implementations failing to reject invalid Zig programs, but I think that's a minor concern compared to implementations failing to parse correct Zig programs.

thejoshwolfe commented 6 years ago

It sounds like you may be suggesting that an if at statement level should always require { braces } around the body/bodies. I count 112 violations of this already in the std lib:

$ grep -RP '^ *if .*[^{]$' std/zig/ | head
std/zig/parse.zig:        if (token_ptr.id == Token.Id.Eof) break;
std/zig/parse.zig:        if (shebang_tok_ptr.id != Token.Id.ShebangLine) break :shebang;
std/zig/parse.zig:        if (next_tok.id != Token.Id.LineComment) break;
std/zig/parse.zig:                    if (eatToken(&tok_it, &tree, Token.Id.Comma) == null)
std/zig/parse.zig:        if (next_tok.id != Token.Id.LineComment) return result;
std/zig/parse.zig:        if (prev_tok.id == Token.Id.LineComment) continue;
std/zig/ast.zig:            if (self.shebang) |shebang| return shebang;
std/zig/ast.zig:                if (i < 1) return type_node;
std/zig/ast.zig:                if (i < 1) return align_node;
std/zig/ast.zig:                if (i < 1) return init_node;

There are 0 violations of such a rule for for and while loops, but if is an important case to take seriously.

If we required braces on statement-level if, it would turn the 1-line if (something) return; idiom into a 3-line:

if (something) {
    return;
}

This isn't necessarily out of the question, but I don't want to go with this option just yet.

There are a few other cases where braces imply semicolons, but these all have 1-token lookahead, so I don't think anyone's too concerned about these:

comptime foo();
comptime {
    foo();
}

defer foo();
defer {
    foo();
}

suspend; // not sure about this one.
suspend {
    resume @handle();
}

if is the important case to think about.

tgschultz commented 6 years ago

There are 0 violations of such a rule for for and while loops,

That's not true, there are a couple. Here's one from std.mem.

pub fn set(comptime T: type, dest: []T, value: T) void {
    for (dest) |*d|
        d.* = value;
}
Hejsil commented 6 years ago

@thejoshwolfe

I'm not totally clear on what a context-free grammar is, but I'm assuming that this parametric (body) pattern doesn't qualify for that.

Correct. I think we should stick to BNF for representing our grammar. No EBNF or other variations, as they are less supported in parser generators (and all variations are supersets, so if we keep it simple, we save everyone a lot of pain).

I'm currently using bison + flex to validate my grammar work, and bison can report when there are conflicts (places where bison had to take a choice between two actions). These conflicts tend to be ambiguities, and bison reported one for the block expressions.

This issue is my theory as to why bison had problems. I'm not 100% sure my theory is correct, as bison reports conflicts in the state machine and not in the grammar (which makes it hard to figure out the exact problem). Bison generates a LARL(1) parser so I thought this was a lookahead issue.

I'll look more into this later, when I can work with the grammar again. #1676 is also causing conflicts and those might be interfering with other rules.

There are 0 violations of such a rule for for and while loops, but if is an important case to take seriously.

If we required braces on statement-level if, it would turn the 1-line if (something) return; idiom into a 3-line:

I don't think we have to give up the shorthand versions of the block expressions to resolve this issue.

Also, we have the dangeling else problem :)

Statement = IfStatement | Block | Expr ;
IfStatement = if ( Expr ) Block | if ( Expr ) Expr else IfStatement
Expr = 1 | 1 + Expr | IfExpr | Block
IfExpr = if ( Expr ) Expr | if ( Expr ) Expr else Expr
Block = { Statement }

./cfg-checker grammar.cfg
Found a sentential form with two different derivations:

  if ( Expr ) if ( Expr ) Expr else Expr ;

Derivation 1:

  0: Statement
  1: Expr ;
  2: IfExpr ;
  3: if ( Expr ) Expr else Expr ;
  4: if ( Expr ) IfExpr else Expr ;
  5: if ( Expr ) if ( Expr ) Expr else Expr ;

Derivation 2:

  0: Statement
  1: Expr ;
  2: IfExpr ;
  3: if ( Expr ) Expr ;
  4: if ( Expr ) IfExpr ;
  5: if ( Expr ) if ( Expr ) Expr else Expr ;
Hejsil commented 6 years ago

I don't think this is ambiguous. You're supposed to match the patterns in highest-to-lowest precedent as listed. So if Defer(Block) matches, then don't bother trying to match Defer(Expression) ";". Is this not good enough?

Grammars are ambiguous as long as there is one input that has two ways of expanding the grammar. Some parsing techniques do not have the concept of "highest-to-lowest", so they will parse grammars differently from parsers that do if the grammar is ambiguous (LARL parsers expand the grammar into states, where a state can handle N rules at the same time.)

EDIT: The Defer(Block) | Defer(Expression) ";" is not ambiguous, but it does not encode the rule correctly. Both of these examples are valid in our grammar, but stage1 rejects the second defer:

defer {}
defer {}; // error: invalid token: ';'
Hejsil commented 6 years ago

More fun grammar that requires N token lookahead:

async<if (true) A else B> fn()void {}
async<comptime A> fn()void {}
thejoshwolfe commented 6 years ago

@Hejsil is your work on the flex/bison parser online anywhere? That sounds like it might be a good starting point for overhauling the grammar specification.

Hejsil commented 6 years ago

@thejoshwolfe I've pushed my work to here. It can parse most of Zigs std (except async calls and a few statements). Currently, I'm working on identifying what syntactic constructs bison emits conflict warnings for, and seeing if I can restructure the grammar to get rid of them.

Edit: The grammar I'm working on reverts the changes made in #1628 and encodes the rules for #1047 (which makes the grammar simpler)

ghost commented 6 years ago
test "" {
    if (true)
        if (true)
            if (true)
                if (true)
                    if (true)
                        if (true)
                            return;
}

I don't see how this requires N token lookahead. I think lookahead refers to reading extra tokens to decide what to do with the already read tokens, for example you must read more before you reduce "a + b". Lookahead will always be required in some cases, so I don't see what's so bad about lookahead.

When you've seen if (expr) { } (in the general case, I know nothing about Zig) then you can immediately reduce this to say an IfStmt. It does not have to look at anything after '}'. The same is true for ';' in the code above. No other token needs to be looked at to collapse the whole stack of if stmts. No token after the semicolon could possibly alter how this is parsed.

Hejsil commented 6 years ago

@UniqueID1 N token lookahead (lookahead that scales with input) can be a problem for some parsing stratigies.

Anyways, the real problem in this issue is this, and how to specify Zigs syntax rules in grammar:

EDIT: The Defer(Block) | Defer(Expression) ";" is not ambiguous, but it does not encode the rule correctly. Both of these examples are valid in our grammar, but stage1 rejects the second defer:

defer {}
defer {}; // error: invalid token: ';'

I have a solution to this, but it required changing when ifs can be expressions.

Hejsil commented 6 years ago

Fixed: #1685