yhirose / cpp-peglib

A single file C++ header-only PEG (Parsing Expression Grammars) library
MIT License
880 stars 112 forks source link

Symbols remain defined after failure and backtrack #231

Closed ChrisHixon closed 2 years ago

ChrisHixon commented 2 years ago

This is a contrived example to test how symbols behave when set during a rule match that fails, causing backtrack. I would expect the failure and backtrack to undo the symbol being set.

I dealt with similar issues when I added back-references to chpeg. I ended up creating an 'undo action' system that allows actions to be run upon failure/backtrack to undo what has been done. Perhaps this is something to consider for cpp-peglib. My initial back-reference implementation did something like what cpp-peglib appears to do: push a back reference stack and copy results back upon success. This seems inefficient and has to occur before/after every operation that can possibly set a reference.

Grammar:

S            <- (DeclBT / Decl / Ref)*
DeclBT       <- 'decl' symbol 'backtrack' # match fails, so symbol should not be set
Decl         <- 'decl' symbol
Ref          <- 'ref' is_symbol
Name         <- < [a-zA-Z]+ >
%whitespace  <- [ \t\r\n]*

# 'var_table' is a table name.
symbol       <- Name { declare_symbol var_table } # Declare symbol instruction
is_symbol    <- Name { check_symbol var_table }   # Check symbol instruction

Input:

decl foo
ref foo

Result:

1:6 'foo' already exists.

Expected: no failure

yhirose commented 2 years ago

@ChrisHixon, thanks for reporting the bug! I'll look into it.

mingodad commented 2 years ago

Somehow relate this project has a good description of desired features in a parser https://github.com/Lercher/CocoR that it somehow implements (worth short reading).

yhirose commented 2 years ago

I fixed the reported problem as well as S <- DeclBT* Decl Ref. However, there remain the following bugs:

1) It doesn't work on packrat mode because the packrat table doesn't record declared symbols. (I can add it, but it will double the packrat memory usage since we need to make another cache variable like cache_values for semantic values.)

2) There is no way to initialize symbol tables before parsing starts. (The feature is not usable to support the C/C++ typedef feature because there is no way to register builtin types like long to the symbol table before parsing starts.)

3) It cannot support 'nested scopes' as usual lexical scope languages support as bellow:

int x = 0; // The global scope
void foo() { // A function scope
   int x = 1;
   { // A block scope
      int x = 2;
   }
}

I am now skeptical of supporting this builtin symbol table feature. I think it can be usable only for simple cases. If we want to use it for more realistic situations, it may not be helpful seriously. Also it's not good to use unnecessary additional memory for the rarely used feature.

IMHO, this kind of thing should be done during semantic checking instead of syntactic checking, so that the parser implementation can stay simple and efficient, also the semantic check process with AST anyway can do much better job for symbol checking in a much more efficient way. So I'll be planning to remove this incomplete feature unless I come up with a good reasonable solution to fix the remaining problems easily...

yhirose commented 2 years ago

Closed by #233