Grammar ambiguity - Githubissues

ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.

https://ziglang.org

MIT License

34.81k stars 2.54k forks source link

Grammar ambiguity #2954

Closed andersfr closed 5 years ago

andersfr commented 5 years ago

In PtrTypeStart we have the following syntax: KEYWORD_align LPAREN Expr (COLON INTEGER COLON INTEGER)? RPAREN

Or as simplified example: align(Expr:1:32)

This conflicts with the block expression syntax: Identifier: {...}

The issue arises because Identifier is a valid Expr production and thus it becomes impossible to deduce the meaning of the colon token without consuming additional input. In more theoretical terms this is an instance of a shift-reduce conflict. The simplest solution is to introduce a :: token and use it instead of the leading colon. It looks almost identical. KEYWORD_align LPAREN Expr (COLONCOLON INTEGER COLON INTEGER)? RPAREN

Or as simplified example: align(Expr::1:32)

hryx commented 5 years ago

To be precise, a Parsing Expression Grammars (PEG) is by definition unambiguous -- the first match is the correct match.

A simple solution is to enclose the identifier in parentheses:

align((Expr):1:32)

And just for fun, to demonstrate the grammar precedence (not an honest suggestion):

align(blk: {
    break :blk Expr;
}:1:32)

andersfr commented 5 years ago

The problem is not that it cannot eventually be disambiguated. It is that this issue breaks the possibility of making a context-free grammar, e.g. using LALR(1) which I happen to be using. Your conclusion on first match is also incorrect. Upon seeing Identifier: the first match is to assume it is a block label that must be followed by a block, because we are currently parsing an expression. The parser then either throws an error upon seeing an IntegerLiteral or must resort to backtracking to resolve the conflict. The Zig compiler indeed breaks on this problem as explained:

pub fn main() void {
    const identifier: usize = 8;
    const test1 = *align(label: { break :label 8; }:1:32) u8;
    const test2 = *align(identifier:1:32) u8;
}

issue.zig:4:37: error: invalid token: '1' const test2 = *align(identifier:1:32) u8;

hryx commented 5 years ago

using LALR(1) which I happen to be using

If your goal is to write a ~CFG~ LL/LR/LALR for Zig, that may not be possible. I'm not an expert, but from Wikipedia:

[...] but the unlimited lookahead capability of the grammar formalism is then lost. Therefore, not all languages that can be expressed using parsing expression grammars can be parsed by LL or LR parsers.

Regarding the error from your example, that is the expected result with today's grammar. identifier:1:32 will not work. Of course this won't help if you're writing a parser, but as a user, enclosing identifier in parentheses makes the intention possible.

Your conclusion on first match is also incorrect.

What I meant was to describe the rule-matching behavior of a PEG in general.

Note that syntax ambiguities are precisely what inspired moving to the new grammar type. Hope that helps or clears some things up.

EDIT: I mistakenly said that a CFG might not be possible to write, when I should have said LR

andersfr commented 5 years ago

A Parsing Expression Grammar (PEG) will indeed be able to parse identifier:1:32 because it has unlimited lookahead and this is only ambiguous with lookahead less than 2. It is not an instance of prioritization between rules (they are mutually exclusive).

Increasing lookahead or context dependence slows down compilers. Either by having excessive branching in the parser logic or postponing the problem to semantic analysis.

It is trivial to fix the bug in the Zig compiler (after ALIGN LPAREN check for the edge case before branching into the ast_parse_expression). But let's rather fix the syntax instead of forcing the compiler into submission with manual edge case handling.

hryx commented 5 years ago

From what I can tell, this is a proposal and not a bug. Maybe I'm conflating the spec and its implementation. But I should disengage from the conversation since I seem to be out of my depth. Apologies for convoluting the issue.

andersfr commented 5 years ago

No apologies needed. It gave me chance to discover the zig-spec PEG implementation and give it a test run. I will close the issue and resubmit as proposal for changing the align syntax.