Closed Hejsil closed 6 years ago
Consider this example:
test "" {
if (true)
if (true)
if (true)
if (true)
if (true)
if (true)
return;
}
Personally I prefer to enforce braces around block bodies. Perl does this and skips a lot of pain. C-like languages generally allow you to skip the braces if there is only one statement in the body.
As Apple found out, it is not always a good idea to leave out the braces.
I think the solution to this is related to #760 #114 (this comment) and maybe even #1676.
We can change the syntax of the language so that syntactic constructs that will never pass semantic analysis will give a parse error. Consider this piece of code:
test "" {
1 + 1; // error: expression value is ignored
}
No matter what the LHS and RHS are for this operator, this code will never pass semantic analysis, because +
always returns an expression that can shouldn't be ignored. We can, therefore, split expressions into categories, where only some are valid as statements. This is the same solution as this. With a system like this, we can disallow the if
expressions at the statement level and the ambiguity is solved.
@kyle-github We can then consider enforcing blocks, but it is not necessary to solve this issue (i think).
I'm not an expert on formal language theory, so I don't understand the problem in this issue. Does the syntax flaw manifest in any way other than purely theoretical?
@thejoshwolfe Well, if we wanna have a formal grammar, it should be unambiguous, so that differently implemented parses don't vary in behavior. Otherwise, what is the point of having a grammar?
We could ask the same question for #760. The stage1 compiler takes a choice, and always follows that choice, so in implementation, there is no ambiguity. We still consider it a problem though, because ambiguities make code harder to reason about and read (even if the parser is consistent).
Also, it makes using parser generators to parse the Zig language trivial, which helps with specifying and testing the grammar itself. If we can have a grammar that can actually parse code by generating a parser, then, if some compilers vary in what syntax they parse, then you can point to the grammar and its generates parser and say "well, the parser that does the same as this generated one is correct" (This will help when new syntax have to be added to stage1 and stage2).
I've also heard, that different C++ compiler disagrees on what syntax is valid. I'd be nice if we could avoid such a mess. :)
3rdly! Having a grammar we can generate from allows from quick prototyping of syntax. We still have #114, and that is a big change. We should probably ensure we get it right, before rewriting a 2000 line parser :)
Here are some excerpts from the grammar at the bottom of the langref docs:
Block = option(Symbol ":") "{" many(Statement) "}"
Statement = LocalVarDecl ";"
| Defer(Block)
| Defer(Expression) ";"
| BlockExpression(Block)
| Expression ";"
| ";"
Defer(body) = ("defer" | "errdefer") body
BlockExpression(body) = Block
| IfExpression(body)
| IfErrorExpression(body)
| TestExpression(body)
| WhileExpression(body)
| ForExpression(body)
| SwitchExpression
| CompTimeExpression(body)
| SuspendExpression(body)
IfExpression(body) = "if" "(" Expression ")" body option("else" BlockExpression(body))
I'm not totally clear on what a context-free grammar is, but I'm assuming that this parametric (body)
pattern doesn't qualify for that.
I don't think this is ambiguous. You're supposed to match the patterns in highest-to-lowest precedent as listed. So if Defer(Block)
matches, then don't bother trying to match Defer(Expression) ";"
. Is this not good enough?
We don't need the rules in #114 to be encoded in the grammar. Those rules can be enforced after parsing by examining the AST. I'm more optimistic about that approach generating more helpful error messages anyway. I understand that that will result in sloppy implementations failing to reject invalid Zig programs, but I think that's a minor concern compared to implementations failing to parse correct Zig programs.
It sounds like you may be suggesting that an if
at statement level should always require {
braces }
around the body/bodies. I count 112 violations of this already in the std lib:
$ grep -RP '^ *if .*[^{]$' std/zig/ | head
std/zig/parse.zig: if (token_ptr.id == Token.Id.Eof) break;
std/zig/parse.zig: if (shebang_tok_ptr.id != Token.Id.ShebangLine) break :shebang;
std/zig/parse.zig: if (next_tok.id != Token.Id.LineComment) break;
std/zig/parse.zig: if (eatToken(&tok_it, &tree, Token.Id.Comma) == null)
std/zig/parse.zig: if (next_tok.id != Token.Id.LineComment) return result;
std/zig/parse.zig: if (prev_tok.id == Token.Id.LineComment) continue;
std/zig/ast.zig: if (self.shebang) |shebang| return shebang;
std/zig/ast.zig: if (i < 1) return type_node;
std/zig/ast.zig: if (i < 1) return align_node;
std/zig/ast.zig: if (i < 1) return init_node;
There are 0 violations of such a rule for for
and while
loops, but if
is an important case to take seriously.
If we required braces on statement-level if
, it would turn the 1-line if (something) return;
idiom into a 3-line:
if (something) {
return;
}
This isn't necessarily out of the question, but I don't want to go with this option just yet.
There are a few other cases where braces imply semicolons, but these all have 1-token lookahead, so I don't think anyone's too concerned about these:
comptime foo();
comptime {
foo();
}
defer foo();
defer {
foo();
}
suspend; // not sure about this one.
suspend {
resume @handle();
}
if
is the important case to think about.
There are 0 violations of such a rule for
for
andwhile
loops,
That's not true, there are a couple. Here's one from std.mem
.
pub fn set(comptime T: type, dest: []T, value: T) void {
for (dest) |*d|
d.* = value;
}
@thejoshwolfe
I'm not totally clear on what a context-free grammar is, but I'm assuming that this parametric (body) pattern doesn't qualify for that.
Correct. I think we should stick to BNF for representing our grammar. No EBNF or other variations, as they are less supported in parser generators (and all variations are supersets, so if we keep it simple, we save everyone a lot of pain).
I'm currently using bison + flex to validate my grammar work, and bison can report when there are conflicts (places where bison had to take a choice between two actions). These conflicts tend to be ambiguities, and bison reported one for the block expressions.
This issue is my theory as to why bison had problems. I'm not 100% sure my theory is correct, as bison reports conflicts in the state machine and not in the grammar (which makes it hard to figure out the exact problem). Bison generates a LARL(1) parser so I thought this was a lookahead issue.
I'll look more into this later, when I can work with the grammar again. #1676 is also causing conflicts and those might be interfering with other rules.
There are 0 violations of such a rule for for and while loops, but if is an important case to take seriously.
If we required braces on statement-level if, it would turn the 1-line if (something) return; idiom into a 3-line:
I don't think we have to give up the shorthand versions of the block expressions to resolve this issue.
Also, we have the dangeling else
problem :)
Statement = IfStatement | Block | Expr ;
IfStatement = if ( Expr ) Block | if ( Expr ) Expr else IfStatement
Expr = 1 | 1 + Expr | IfExpr | Block
IfExpr = if ( Expr ) Expr | if ( Expr ) Expr else Expr
Block = { Statement }
./cfg-checker grammar.cfg
Found a sentential form with two different derivations:
if ( Expr ) if ( Expr ) Expr else Expr ;
Derivation 1:
0: Statement
1: Expr ;
2: IfExpr ;
3: if ( Expr ) Expr else Expr ;
4: if ( Expr ) IfExpr else Expr ;
5: if ( Expr ) if ( Expr ) Expr else Expr ;
Derivation 2:
0: Statement
1: Expr ;
2: IfExpr ;
3: if ( Expr ) Expr ;
4: if ( Expr ) IfExpr ;
5: if ( Expr ) if ( Expr ) Expr else Expr ;
I don't think this is ambiguous. You're supposed to match the patterns in highest-to-lowest precedent as listed. So if Defer(Block) matches, then don't bother trying to match Defer(Expression) ";". Is this not good enough?
Grammars are ambiguous as long as there is one input that has two ways of expanding the grammar. Some parsing techniques do not have the concept of "highest-to-lowest", so they will parse grammars differently from parsers that do if the grammar is ambiguous (LARL parsers expand the grammar into states, where a state can handle N
rules at the same time.)
EDIT: The Defer(Block) | Defer(Expression) ";"
is not ambiguous, but it does not encode the rule correctly. Both of these examples are valid in our grammar, but stage1 rejects the second defer:
defer {}
defer {}; // error: invalid token: ';'
More fun grammar that requires N
token lookahead:
async<if (true) A else B> fn()void {}
async<comptime A> fn()void {}
@Hejsil is your work on the flex/bison parser online anywhere? That sounds like it might be a good starting point for overhauling the grammar specification.
@thejoshwolfe I've pushed my work to here. It can parse most of Zigs std (except async calls and a few statements). Currently, I'm working on identifying what syntactic constructs bison emits conflict warnings for, and seeing if I can restructure the grammar to get rid of them.
Edit: The grammar I'm working on reverts the changes made in #1628 and encodes the rules for #1047 (which makes the grammar simpler)
test "" {
if (true)
if (true)
if (true)
if (true)
if (true)
if (true)
return;
}
I don't see how this requires N token lookahead. I think lookahead refers to reading extra tokens to decide what to do with the already read tokens, for example you must read more before you reduce "a + b". Lookahead will always be required in some cases, so I don't see what's so bad about lookahead.
When you've seen if (expr) { }
(in the general case, I know nothing about Zig) then you can immediately reduce this to say an IfStmt. It does not have to look at anything after '}'. The same is true for ';' in the code above. No other token needs to be looked at to collapse the whole stack of if stmts. No token after the semicolon could possibly alter how this is parsed.
@UniqueID1
N
token lookahead (lookahead that scales with input) can be a problem for some parsing stratigies.
Anyways, the real problem in this issue is this, and how to specify Zigs syntax rules in grammar:
EDIT: The
Defer(Block) | Defer(Expression) ";"
is not ambiguous, but it does not encode the rule correctly. Both of these examples are valid in our grammar, but stage1 rejects the second defer:defer {} defer {}; // error: invalid token: ';'
I have a solution to this, but it required changing when if
s can be expressions.
Fixed: #1685
Currently, block expressions (
for
,if
,while
) are only terminated with;
, if the ending block of that expression is not a block. This makes the grammar very hard (maybe even impossible) to make context-free. My best bet at formalizing a grammar for this have been something along these lines:Sadly, this grammar requires
N
token lookahead at best, and at worst it is ambiguous because aBlock
is also anExpr
.