yhirose / cpp-peglib

A single file C++ header-only PEG (Parsing Expression Grammars) library
MIT License
863 stars 112 forks source link

Nondeterministic parsing failure in grammar #299

Open computablee opened 2 months ago

computablee commented 2 months ago

Hello! I'm encountering some bizarre nondeterministic parsing errors when using cpp-peglib. I cannot reproduce the issue on another system, but I have a "minimal reproducer" which demonstrates the issue on my machine (I'll give machine details further down the issue).

Here's a small grammar which demonstrates the behavior on my machine:

Statement <- PrintStatement / PrintfStatement / PrintlnStatement / Expression

Expression <- Infix_Expression(Atom, Infix_Operator) / FunctionCall
Atom       <- WriteableExpression / Prefix_Operator Atom / '(' Expression ')' / Number

FunctionCall  <- VariableReference '(' ExpressionList ')'
ExpressionList <- Expression (',' Expression)*

PrintStatement <- 'print' '(' ExpressionList ')'
PrintfStatement <- 'printf' '(' ExpressionList ')'
PrintlnStatement <- 'println' '(' ExpressionList ')'

WriteableExpression <- VariableReference / PointerDereference
VariableReference   <- Identifier
PointerDereference  <- '*' WriteableExpression

Identifier <- < !Keyword Name >
Keyword    <- ( 'true' | 'false' ) !Name

Name   <- < [A-Za-z_][A-Za-z0-9_]* >
Number <- < [0-9]+ >

Prefix_Operator <- < [!-] >
Infix_Operator  <- < ( '>=' | '<=' | '&&' | '||' | '==' | '!=' | '-' | '+' | '/' | '>' | '<' | '*' ) >

%whitespace <- [ \t]*
%word       <- [A-Za-z]+

Infix_Expression(A, O) <- A (O A)* {
  precedence
    L && ||
    L >= <= > < == !=
    L + -
    L * /
}

(Side note: I know this grammar is extremely flawed, it's not meant to actually be a good or useful grammar!)

On my machine, the longer the input is, the higher probability of an unexpected parsing fail. It is nondeterministic, so:

./parse "a + a + a + a" # gives a valid parse tree all of the time
./parse "a + a + a + a + a" # gives a valid parse tree about 20% of the time, but 80% of the time fails with "1:17: syntax error, unexpected 'a', expecting <Infix_Operator>."
./parse "a + a + a + a + a + a" # always fails

The grammar was tested in the online sandbox and worked as intended, without any of the failures. Additionally, the grammar was tested on a different machine (with the same version of cpp-peglib and it also worked as intended.

Relevant information:

cpp-peglib version: up-to-date with master as of May 7, 2024 compiler version: GNU g++ 11.4.0-1ubuntu1~22.04 environment: Ubuntu 22.04 podman container

Compiled with g++ -std=c++17 -lpthread source.cxx -o parse

source.cxx doesn't do anything special-- it simply calls enable_ast, enable_packrat_parsing, and ast_to_s. The packrat parsing does not seem to affect the bug.

mingodad commented 2 months ago

Have you tested it with valgrind ?

computablee commented 2 months ago

I just did-- getting several Invalid read errors.

yhirose commented 1 month ago

@computablee sorry for the late reply. Could you show me the content of source.cxx?