strange bug when using packrat parsing

mqnc commented 3 years ago

It's been a while since I've last used peglib! The error reporting is amazing!

I want to write a program that parses python and I think I have encountered a bug when using packrat parsing.

Without packrat parsing everything works fine. With packrat parsing, after the code is parsed and LineBreak? CodeSegment from the first rule are matched, it tries to match another CodeSegment but for some reason from the middle of the input. My parser prints debug output to show this.

I am sorry I didn't reduce the grammar to narrow down the error but maybe you know what's going on right away and I can spare the effort.

Here is my parser:

#include "peglib.h"

#include <string>
#include <sstream>
#include <iostream>
#include <fstream>

using namespace peg;
using namespace std;

std::string readFile(std::string name){
    ifstream inputFile;
    inputFile.open(name);
    std::stringstream buffer("");
    buffer << inputFile.rdbuf();
    std::string content = buffer.str();
    inputFile.close();
    return content;
}

std::string filter(std::string s){
    std::stringstream result("");
    for(const auto &c:s){
        if('\t' == c){result << "→";}
        else if('\n' == c){result << "↲";}
        else if(' ' == c){result << "␣";}
        else{result << c;}
    }
    return result.str();
}

int main(void) {

    parser parser;

    parser.log = [](size_t line, size_t col, const string& msg) {
        cerr << line << ":" << col << ": " << msg << "\n";
    };

    std::string grammar = readFile("python.peg");
    auto ok = parser.load_grammar(grammar);

    if(!ok){
        cout << "could not create parser\n";
        return 1;
    }

    std::string source = readFile("test.py");

    parser.enable_packrat_parsing(); // comment to make it work

    const auto rules = parser.get_rule_names();
    int indent = 0;
    for(auto& rule:rules){

        parser[rule.c_str()].enter = [rule, &indent](const char* s, size_t n, any& dt) {
            for(int i = 0; i < indent; i++){std::cout << "|  ";}
            std::cout << rule << " [?] " << filter(std::string(s)) << "\n";
            indent++;
        };

        parser[rule.c_str()] = [rule](const SemanticValues &vs) {
            if(vs.tokens.size() > 0){
                return vs.token_to_string();
            }
            else{
                auto result = std::stringstream("");
                for(const auto &v:vs){
                    result << any_cast<std::string>(v);
                }
                return result.str();
            }
        };

        parser[rule.c_str()].leave = [rule, &indent](const char* s, size_t n, size_t matchlen, any& value, any& dt) {
            indent--;
            for(int i = 0; i < indent; i++){std::cout << "|  ";}
            std::cout << rule << (success(matchlen)? " [v] ":" [X] ") << filter(std::string(s)) << '\n';
        };
    }

    std::string output;
    std::cout << "parsing..." << "\n";
    ok = parser.parse(source, output);

    if(ok){
        cout << "parsing successful\n";
    }
    else{
        cout << "parsing not successful\n";
    }

    ofstream outputFile;
    outputFile.open("test.ff.py");
    outputFile << output;
    outputFile.close();

    return 0;
}

Here is my grammar python.peg:

Program <- $Indent<''> LineBreak? CodeSegment*
Wrong <- .*

Token <- (Misc / String / Group / !ColonBreak !LineBreak !Whitespace .)
Code <- Token (Whitespace / Token)*
ContinuingCode <- (Token / LineBreak)+
CodeLine <- <Code (LineBreak / EndOfFile)> # covers inline compound
Compound <- $(CompoundHeader MeasureMoreIndent CodeSegment IndentedCodeSegment*) CompoundEnd?
CompoundHeader <- <Code ColonBreak>
CompoundEnd <- SpaceTab* CompoundEndMarker

CodeSegment <- CodeLine / Compound
IndentedCodeSegment <- $Indent CodeSegment
MeasureMoreIndent <- $Indent<$Indent AdditionalIndent>
AdditionalIndent <-SpaceTab+

Group <- Parenthesized / Bracketed / Braced
Parenthesized <- '(' ContinuingCode? ')'
Bracketed <- '[' ContinuingCode? ']'
Braced <- '{' ContinuingCode? '}'

String <- TripleQuoteString / HexaQuoteString / SingleQuoteString / DoubleQuoteString
SingleQuoteString <- "'" (!"'" !'\n' StringCharacter)* "'"
DoubleQuoteString <- '"' (!'"' !'\n' StringCharacter)* '"'
TripleQuoteString <- "'''" (!"'''" StringCharacter)* "'''"
HexaQuoteString <- '"""' (!'"""' StringCharacter)* '"""'
StringCharacter <- ('\\' .) / .

SpaceTab <- ' ' / '\t'
Continuation <- '\\\n'
LineBreak <- (Whitespace? Comment? '\n')+
ColonBreak <- ':' LineBreak
Whitespace <- (SpaceTab / Continuation)+
Comment <- !CompoundEndMarker '#' (!'\n' .)*
CompoundEndMarker <- '#.\n'
EndOfFile <- !.

Misc <- [-!$%&*+,./0-9;<=>?@A-Z^_`a-z|~]+

here is my input test.py:

and here is the debug output:

parsing...
Program [?] 1:↲→2:↲→→3↲→4:↲→→5↲
|  LineBreak [?] 1:↲→2:↲→→3↲→4:↲→→5↲
|  |  ...
|  LineBreak [X] 1:↲→2:↲→→3↲→4:↲→→5↲
|  CodeSegment [?] 1:↲→2:↲→→3↲→4:↲→→5↲
|  |  ...
|  CodeSegment [v] 1:↲→2:↲→→3↲→4:↲→→5↲
|  CodeSegment [?] →4:↲→→5↲
|  |  ...
|  CodeSegment [X] →4:↲→→5↲
Program [v] 1:↲→2:↲→→3↲→4:↲→→5↲
4:2: syntax error, unexpected '4'.
parsing not successful

yhirose commented 3 years ago

@mqnc, sorry for the late reply. It seems like there is a problem in the packrat memoize logic. It's bit hard to track down the problem... Is it possible to make the smallest possible grammar to reproduce the problem? Thanks!

mqnc commented 3 years ago

I managed to reduce the grammar:

Program <- $Indent<''> '\n'? CodeSegment*

CodeLine <- [a-z0-9]+ '\n'
Block <- $(Header MeasureMoreIndent CodeSegment IndentedCodeSegment*)
Header <- [a-z0-9]+ (':' '\n')

CodeSegment <- CodeLine / Block
IndentedCodeSegment <- $Indent CodeSegment
MeasureMoreIndent <- $Indent<$Indent '\t'+>

Whitespace <- '\t'+

which produces the following output:

parsing...
no packrat:
Program [?] 1:↲→2:↲→→3↲→4:↲→→5↲
|  CodeSegment [?] 1:↲→2:↲→→3↲→4:↲→→5↲
|  |  CodeLine [?] 1:↲→2:↲→→3↲→4:↲→→5↲
|  |  CodeLine [✕] 1:↲→2:↲→→3↲→4:↲→→5↲
|  |  Block [?] 1:↲→2:↲→→3↲→4:↲→→5↲
|  |  |  Header [?] 1:↲→2:↲→→3↲→4:↲→→5↲
|  |  |  Header [✓] 1:↲→2:↲→→3↲→4:↲→→5↲
|  |  |  MeasureMoreIndent [?] →2:↲→→3↲→4:↲→→5↲
|  |  |  MeasureMoreIndent [✓] →2:↲→→3↲→4:↲→→5↲
|  |  |  CodeSegment [?] 2:↲→→3↲→4:↲→→5↲
|  |  |  |  CodeLine [?] 2:↲→→3↲→4:↲→→5↲
|  |  |  |  CodeLine [✕] 2:↲→→3↲→4:↲→→5↲
|  |  |  |  Block [?] 2:↲→→3↲→4:↲→→5↲
|  |  |  |  |  Header [?] 2:↲→→3↲→4:↲→→5↲
|  |  |  |  |  Header [✓] 2:↲→→3↲→4:↲→→5↲
|  |  |  |  |  MeasureMoreIndent [?] →→3↲→4:↲→→5↲
|  |  |  |  |  MeasureMoreIndent [✓] →→3↲→4:↲→→5↲
|  |  |  |  |  CodeSegment [?] 3↲→4:↲→→5↲
|  |  |  |  |  |  CodeLine [?] 3↲→4:↲→→5↲
|  |  |  |  |  |  CodeLine [✓] 3↲→4:↲→→5↲
|  |  |  |  |  CodeSegment [✓] 3↲→4:↲→→5↲
|  |  |  |  |  IndentedCodeSegment [?] →4:↲→→5↲
|  |  |  |  |  IndentedCodeSegment [✕] →4:↲→→5↲
|  |  |  |  Block [✓] 2:↲→→3↲→4:↲→→5↲
|  |  |  CodeSegment [✓] 2:↲→→3↲→4:↲→→5↲
|  |  |  IndentedCodeSegment [?] →4:↲→→5↲
|  |  |  |  CodeSegment [?] 4:↲→→5↲
|  |  |  |  |  CodeLine [?] 4:↲→→5↲
|  |  |  |  |  CodeLine [✕] 4:↲→→5↲
|  |  |  |  |  Block [?] 4:↲→→5↲
|  |  |  |  |  |  Header [?] 4:↲→→5↲
|  |  |  |  |  |  Header [✓] 4:↲→→5↲
|  |  |  |  |  |  MeasureMoreIndent [?] →→5↲
|  |  |  |  |  |  MeasureMoreIndent [✓] →→5↲
|  |  |  |  |  |  CodeSegment [?] 5↲
|  |  |  |  |  |  |  CodeLine [?] 5↲
|  |  |  |  |  |  |  CodeLine [✓] 5↲
|  |  |  |  |  |  CodeSegment [✓] 5↲
|  |  |  |  |  Block [✓] 4:↲→→5↲
|  |  |  |  CodeSegment [✓] 4:↲→→5↲
|  |  |  IndentedCodeSegment [✓] →4:↲→→5↲
|  |  Block [✓] 1:↲→2:↲→→3↲→4:↲→→5↲
|  CodeSegment [✓] 1:↲→2:↲→→3↲→4:↲→→5↲
Program [✓] 1:↲→2:↲→→3↲→4:↲→→5↲

packrat:
Program [?] 1:↲→2:↲→→3↲→4:↲→→5↲
|  CodeSegment [?] 1:↲→2:↲→→3↲→4:↲→→5↲
|  |  CodeLine [?] 1:↲→2:↲→→3↲→4:↲→→5↲
|  |  CodeLine [✕] 1:↲→2:↲→→3↲→4:↲→→5↲
|  |  Block [?] 1:↲→2:↲→→3↲→4:↲→→5↲
|  |  |  Header [?] 1:↲→2:↲→→3↲→4:↲→→5↲
|  |  |  Header [✓] 1:↲→2:↲→→3↲→4:↲→→5↲
|  |  |  MeasureMoreIndent [?] →2:↲→→3↲→4:↲→→5↲
|  |  |  MeasureMoreIndent [✓] →2:↲→→3↲→4:↲→→5↲
|  |  |  CodeSegment [?] 2:↲→→3↲→4:↲→→5↲
|  |  |  |  CodeLine [?] 2:↲→→3↲→4:↲→→5↲
|  |  |  |  CodeLine [✕] 2:↲→→3↲→4:↲→→5↲
|  |  |  |  Block [?] 2:↲→→3↲→4:↲→→5↲
|  |  |  |  |  Header [?] 2:↲→→3↲→4:↲→→5↲
|  |  |  |  |  Header [✓] 2:↲→→3↲→4:↲→→5↲
|  |  |  |  |  MeasureMoreIndent [?] →→3↲→4:↲→→5↲
|  |  |  |  |  MeasureMoreIndent [✓] →→3↲→4:↲→→5↲
|  |  |  |  |  CodeSegment [?] 3↲→4:↲→→5↲
|  |  |  |  |  |  CodeLine [?] 3↲→4:↲→→5↲
|  |  |  |  |  |  CodeLine [✓] 3↲→4:↲→→5↲
|  |  |  |  |  CodeSegment [✓] 3↲→4:↲→→5↲
|  |  |  |  |  IndentedCodeSegment [?] →4:↲→→5↲
|  |  |  |  |  IndentedCodeSegment [✕] →4:↲→→5↲
|  |  |  |  Block [✓] 2:↲→→3↲→4:↲→→5↲
|  |  |  CodeSegment [✓] 2:↲→→3↲→4:↲→→5↲
|  |  Block [✓] 1:↲→2:↲→→3↲→4:↲→→5↲
|  CodeSegment [✓] 1:↲→2:↲→→3↲→4:↲→→5↲
|  CodeSegment [?] →4:↲→→5↲ <-------------------------- puzzling
|  |  CodeLine [?] →4:↲→→5↲
|  |  CodeLine [✕] →4:↲→→5↲
|  |  Block [?] →4:↲→→5↲
|  |  |  Header [?] →4:↲→→5↲
|  |  |  Header [✕] →4:↲→→5↲
|  |  Block [✕] →4:↲→→5↲
|  CodeSegment [✕] →4:↲→→5↲
Program [✓] 1:↲→2:↲→→3↲→4:↲→→5↲ <-------------------------- also puzzling
4:1: syntax error, unexpected '\t', expecting <Header>, <CodeLine>.

I think the only unorthodox thing I do is MeasureMoreIndent <- $Indent<$Indent '\t'+> which is supposed to capture the indentation of the outer block together with further indentation and store that in the same named capture.

My suspicion is that it is in general problematic to use named captures together with packrat parsing because it introduces state to the parser which has to be considered in memoization. Consider this:

Start <- (Branch1 / Branch2)
Branch1 <- $Capture<'A'> 'B' Captured
Branch2 <- 'A' $Capture<'B'> Captured
Captured <- $Capture

Running this grammar on "ABB" works without packrat parsing but with packrat parsing, in Branch2 when it reaches the second B it falsely remembers 'I already know that this is not Captured'.

It produces the following output, as expected:

parsing...
no packrat:
Start [?] ABB
|  Branch1 [?] ABB
|  |  Captured [?] B
|  |  Captured [✕] B
|  Branch1 [✕] ABB
|  Branch2 [?] ABB
|  |  Captured [?] B
|  |  Captured [✓] B
|  Branch2 [✓] ABB
Start [✓] ABB

packrat:
Start [?] ABB
|  Branch1 [?] ABB
|  |  Captured [?] B
|  |  Captured [✕] B
|  Branch1 [✕] ABB
|  Branch2 [?] ABB
|  Branch2 [✕] ABB
Start [✕] ABB
1:3: syntax error, unexpected 'B'.

However, this doesn't explain the two situations that I marked with puzzling in the first output... Why does it suddenly restart parsing at →4:↲→→5↲ after accepting that the complete text is a CodeSegment? Why does it end with Program [✓] but still report failure?

mqnc commented 3 years ago

I don't know how to combine named captures together with packrat parsing. Either you need to consider the state of all captures in

auto idx = def_count * static_cast<size_t>(col) + def_id;
if (cache_registered[idx]) {

or you need to deactivate packrat parsing whenever a back reference is made (including in the complete rule stack that led to the reference). Those are the options I see for now... Or you just disallow packrat parsing completely whenever there are named captures in the grammar.

yhirose commented 3 years ago

@mqnc, yes, it's very hard to make the packrat parsing work with back reference operators. In your last example, we can make it work with a slight modification like this.

Start <- (Branch1 / Branch2)
Branch1 <- $Capture<'A'> 'B' $Capture
Branch2 <- 'A' $Capture<'B'> $Capture

In other words, it's safe to use back references as long as they are in the same definition rule as the corresponding capture operators. But as soon as the back references go into another rules, then the packrat can no longer handle the situation.

Or you just disallow packrat parsing completely whenever there are named captures in the grammar.

This is the easiest way to handle this issue, even if we miss some of grammars that can actually work with packrat mode like the above grammar. But I am thinking to take this safer approach at this point. I'll ponder over this issue more to see if I can take a more sophisticated approach with decent amount of change.

Thanks for your excellent research!

yhirose / cpp-peglib

strange bug when using packrat parsing #147