Closed: mqnc closed this issue 3 years ago
@mqnc, sorry for the late reply. It seems there is a problem in the packrat memoization logic, but it's a bit hard to track down. Could you make the smallest possible grammar that reproduces the problem? Thanks!
I managed to reduce the grammar:
Program <- $Indent<''> '\n'? CodeSegment*
CodeLine <- [a-z0-9]+ '\n'
Block <- $(Header MeasureMoreIndent CodeSegment IndentedCodeSegment*)
Header <- [a-z0-9]+ (':' '\n')
CodeSegment <- CodeLine / Block
IndentedCodeSegment <- $Indent CodeSegment
MeasureMoreIndent <- $Indent<$Indent '\t'+>
Whitespace <- '\t'+
which produces the following output:
parsing...
no packrat:
Program [?] 1:↲→2:↲→→3↲→4:↲→→5↲
| CodeSegment [?] 1:↲→2:↲→→3↲→4:↲→→5↲
| | CodeLine [?] 1:↲→2:↲→→3↲→4:↲→→5↲
| | CodeLine [✕] 1:↲→2:↲→→3↲→4:↲→→5↲
| | Block [?] 1:↲→2:↲→→3↲→4:↲→→5↲
| | | Header [?] 1:↲→2:↲→→3↲→4:↲→→5↲
| | | Header [✓] 1:↲→2:↲→→3↲→4:↲→→5↲
| | | MeasureMoreIndent [?] →2:↲→→3↲→4:↲→→5↲
| | | MeasureMoreIndent [✓] →2:↲→→3↲→4:↲→→5↲
| | | CodeSegment [?] 2:↲→→3↲→4:↲→→5↲
| | | | CodeLine [?] 2:↲→→3↲→4:↲→→5↲
| | | | CodeLine [✕] 2:↲→→3↲→4:↲→→5↲
| | | | Block [?] 2:↲→→3↲→4:↲→→5↲
| | | | | Header [?] 2:↲→→3↲→4:↲→→5↲
| | | | | Header [✓] 2:↲→→3↲→4:↲→→5↲
| | | | | MeasureMoreIndent [?] →→3↲→4:↲→→5↲
| | | | | MeasureMoreIndent [✓] →→3↲→4:↲→→5↲
| | | | | CodeSegment [?] 3↲→4:↲→→5↲
| | | | | | CodeLine [?] 3↲→4:↲→→5↲
| | | | | | CodeLine [✓] 3↲→4:↲→→5↲
| | | | | CodeSegment [✓] 3↲→4:↲→→5↲
| | | | | IndentedCodeSegment [?] →4:↲→→5↲
| | | | | IndentedCodeSegment [✕] →4:↲→→5↲
| | | | Block [✓] 2:↲→→3↲→4:↲→→5↲
| | | CodeSegment [✓] 2:↲→→3↲→4:↲→→5↲
| | | IndentedCodeSegment [?] →4:↲→→5↲
| | | | CodeSegment [?] 4:↲→→5↲
| | | | | CodeLine [?] 4:↲→→5↲
| | | | | CodeLine [✕] 4:↲→→5↲
| | | | | Block [?] 4:↲→→5↲
| | | | | | Header [?] 4:↲→→5↲
| | | | | | Header [✓] 4:↲→→5↲
| | | | | | MeasureMoreIndent [?] →→5↲
| | | | | | MeasureMoreIndent [✓] →→5↲
| | | | | | CodeSegment [?] 5↲
| | | | | | | CodeLine [?] 5↲
| | | | | | | CodeLine [✓] 5↲
| | | | | | CodeSegment [✓] 5↲
| | | | | Block [✓] 4:↲→→5↲
| | | | CodeSegment [✓] 4:↲→→5↲
| | | IndentedCodeSegment [✓] →4:↲→→5↲
| | Block [✓] 1:↲→2:↲→→3↲→4:↲→→5↲
| CodeSegment [✓] 1:↲→2:↲→→3↲→4:↲→→5↲
Program [✓] 1:↲→2:↲→→3↲→4:↲→→5↲
packrat:
Program [?] 1:↲→2:↲→→3↲→4:↲→→5↲
| CodeSegment [?] 1:↲→2:↲→→3↲→4:↲→→5↲
| | CodeLine [?] 1:↲→2:↲→→3↲→4:↲→→5↲
| | CodeLine [✕] 1:↲→2:↲→→3↲→4:↲→→5↲
| | Block [?] 1:↲→2:↲→→3↲→4:↲→→5↲
| | | Header [?] 1:↲→2:↲→→3↲→4:↲→→5↲
| | | Header [✓] 1:↲→2:↲→→3↲→4:↲→→5↲
| | | MeasureMoreIndent [?] →2:↲→→3↲→4:↲→→5↲
| | | MeasureMoreIndent [✓] →2:↲→→3↲→4:↲→→5↲
| | | CodeSegment [?] 2:↲→→3↲→4:↲→→5↲
| | | | CodeLine [?] 2:↲→→3↲→4:↲→→5↲
| | | | CodeLine [✕] 2:↲→→3↲→4:↲→→5↲
| | | | Block [?] 2:↲→→3↲→4:↲→→5↲
| | | | | Header [?] 2:↲→→3↲→4:↲→→5↲
| | | | | Header [✓] 2:↲→→3↲→4:↲→→5↲
| | | | | MeasureMoreIndent [?] →→3↲→4:↲→→5↲
| | | | | MeasureMoreIndent [✓] →→3↲→4:↲→→5↲
| | | | | CodeSegment [?] 3↲→4:↲→→5↲
| | | | | | CodeLine [?] 3↲→4:↲→→5↲
| | | | | | CodeLine [✓] 3↲→4:↲→→5↲
| | | | | CodeSegment [✓] 3↲→4:↲→→5↲
| | | | | IndentedCodeSegment [?] →4:↲→→5↲
| | | | | IndentedCodeSegment [✕] →4:↲→→5↲
| | | | Block [✓] 2:↲→→3↲→4:↲→→5↲
| | | CodeSegment [✓] 2:↲→→3↲→4:↲→→5↲
| | Block [✓] 1:↲→2:↲→→3↲→4:↲→→5↲
| CodeSegment [✓] 1:↲→2:↲→→3↲→4:↲→→5↲
| CodeSegment [?] →4:↲→→5↲ <-------------------------- puzzling
| | CodeLine [?] →4:↲→→5↲
| | CodeLine [✕] →4:↲→→5↲
| | Block [?] →4:↲→→5↲
| | | Header [?] →4:↲→→5↲
| | | Header [✕] →4:↲→→5↲
| | Block [✕] →4:↲→→5↲
| CodeSegment [✕] →4:↲→→5↲
Program [✓] 1:↲→2:↲→→3↲→4:↲→→5↲ <-------------------------- also puzzling
4:1: syntax error, unexpected '\t', expecting <Header>, <CodeLine>.
I think the only unorthodox thing I do is MeasureMoreIndent <- $Indent<$Indent '\t'+>
which is supposed to capture the indentation of the outer block together with further indentation and store that in the same named capture.
My suspicion is that it is in general problematic to use named captures together with packrat parsing because it introduces state to the parser which has to be considered in memoization. Consider this:
Start <- (Branch1 / Branch2)
Branch1 <- $Capture<'A'> 'B' Captured
Branch2 <- 'A' $Capture<'B'> Captured
Captured <- $Capture
Running this grammar on "ABB" works without packrat parsing, but with packrat parsing, when Branch2 reaches the second B, it falsely remembers 'I already know that this is not Captured'.
It produces the following output, as expected:
parsing...
no packrat:
Start [?] ABB
| Branch1 [?] ABB
| | Captured [?] B
| | Captured [✕] B
| Branch1 [✕] ABB
| Branch2 [?] ABB
| | Captured [?] B
| | Captured [✓] B
| Branch2 [✓] ABB
Start [✓] ABB
packrat:
Start [?] ABB
| Branch1 [?] ABB
| | Captured [?] B
| | Captured [✕] B
| Branch1 [✕] ABB
| Branch2 [?] ABB
| Branch2 [✕] ABB
Start [✕] ABB
1:3: syntax error, unexpected 'B'.
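The stale cache hit can be reproduced outside the library. The following is a minimal sketch, not peglib code: a hand-written recognizer for the four-rule grammar, where ToyParser, toy_parse and the position-only memo table are hypothetical names chosen for illustration. It assumes the memo for Captured is keyed on the input position alone, which is the behaviour suspected here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>

// Toy recognizer for the four-rule grammar above. NOT peglib code;
// every name here is hypothetical.
struct ToyParser {
  std::string input;
  bool packrat;
  std::string capture;  // current value of $Capture
  // Memo for the rule `Captured`, keyed on position only (the flaw).
  std::map<std::size_t, std::optional<std::size_t>> memo;

  bool literal(std::size_t pos, char c) const {
    return pos < input.size() && input[pos] == c;
  }

  // Captured <- $Capture
  // Returns the end position on success, nullopt on failure.
  std::optional<std::size_t> captured(std::size_t pos) {
    assert(pos <= input.size());
    if (packrat) {
      auto it = memo.find(pos);
      if (it != memo.end()) return it->second;  // ignores `capture`
    }
    std::optional<std::size_t> end;
    if (input.compare(pos, capture.size(), capture) == 0) {
      end = pos + capture.size();
    }
    if (packrat) memo[pos] = end;
    return end;
  }

  // Branch1 <- $Capture<'A'> 'B' Captured
  bool branch1() {
    if (!literal(0, 'A')) return false;
    capture = "A";
    return literal(1, 'B') && captured(2) == input.size();
  }

  // Branch2 <- 'A' $Capture<'B'> Captured
  bool branch2() {
    if (!literal(0, 'A') || !literal(1, 'B')) return false;
    capture = "B";
    return captured(2) == input.size();
  }
};

// Start <- (Branch1 / Branch2), required to consume the whole input.
bool toy_parse(const std::string& text, bool use_packrat) {
  ToyParser p{text, use_packrat, "", {}};
  return p.branch1() || p.branch2();
}
```

Calling toy_parse("ABB", false) succeeds through Branch2, while toy_parse("ABB", true) fails: Branch1 stores a failure for Captured at position 2 while the capture is "A", and Branch2 reuses that entry even though the capture is now "B".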
However, this doesn't explain the two situations that I marked with puzzling in the first output... Why does it suddenly restart parsing at →4:↲→→5↲ after accepting that the complete text is a CodeSegment? Why does it end with Program [✓] but still report failure?
I don't know how to combine named captures together with packrat parsing. Either you need to consider the state of all captures in
auto idx = def_count * static_cast<size_t>(col) + def_id;
if (cache_registered[idx]) {
or you need to deactivate packrat parsing whenever a back reference is made (including in the complete rule stack that led to the reference). Those are the options I see for now... Or you just disallow packrat parsing completely whenever there are named captures in the grammar.
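The first option, making the capture state part of the cache key, could look roughly like the sketch below. It is an illustration under assumptions, not the library's cache: CaptureAwareMemo and match_captured are hypothetical names, and only a single capture value is tracked, whereas a real implementation would need a snapshot of every capture visible at that point.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical memo table for one back-reference rule.
// Key: (input position, capture value visible at that moment).
class CaptureAwareMemo {
public:
  using Key = std::pair<std::size_t, std::string>;
  using Result = std::optional<std::size_t>;  // end position, nullopt = failure

  // Returns a pointer to the cached result, or nullptr on a cache miss.
  const Result* find(std::size_t pos, const std::string& capture) const {
    auto it = table_.find(Key{pos, capture});
    return it == table_.end() ? nullptr : &it->second;
  }

  void store(std::size_t pos, const std::string& capture, Result result) {
    table_[Key{pos, capture}] = result;
  }

private:
  std::map<Key, Result> table_;
};

// Captured <- $Capture, memoized with the capture-aware key.
std::optional<std::size_t> match_captured(const std::string& input,
                                          std::size_t pos,
                                          const std::string& capture,
                                          CaptureAwareMemo& memo) {
  assert(pos <= input.size());
  if (const auto* hit = memo.find(pos, capture)) return *hit;
  std::optional<std::size_t> result;
  if (input.compare(pos, capture.size(), capture) == 0) {
    result = pos + capture.size();
  }
  memo.store(pos, capture, result);
  return result;
}
```

With this key, a failure recorded at position 2 under capture "A" no longer shadows the lookup at position 2 under capture "B". The cost is a larger key and fewer cache hits, so the flat index computed from col and def_id would no longer be sufficient on its own.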
@mqnc, yes, it's very hard to make packrat parsing work with back reference operators. Your last example can be made to work with a slight modification:
Start <- (Branch1 / Branch2)
Branch1 <- $Capture<'A'> 'B' $Capture
Branch2 <- 'A' $Capture<'B'> $Capture
In other words, it's safe to use back references as long as they are in the same rule definition as the corresponding capture operators. But as soon as a back reference moves into another rule, packrat parsing can no longer handle the situation.
"Or you just disallow packrat parsing completely whenever there are named captures in the grammar."
This is the easiest way to handle the issue, even though it rules out some grammars that actually work in packrat mode, like the one above. I'm inclined to take this safer approach for now. I'll think about this more to see whether a more sophisticated approach is possible with a reasonable amount of change.
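As a rough illustration of the conservative approach, the sketch below decides whether packrat mode may be used by scanning the grammar text for the $name< capture syntax used in this thread. Both function names are hypothetical, and the textual scan is an assumption made to keep the example self-contained: a real implementation would more likely inspect the parsed grammar, and this heuristic would also fire on a matching sequence inside a string literal.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the grammar text contains a named
// capture of the form $name<...>, as used in the grammars above.
bool has_named_capture(const std::string& grammar) {
  static const std::regex capture_open(R"(\$[A-Za-z_][A-Za-z0-9_]*<)");
  return std::regex_search(grammar, capture_open);
}

// Conservative policy: allow packrat memoization only when the
// grammar has no named captures at all.
bool packrat_allowed(const std::string& grammar) {
  return !has_named_capture(grammar);
}
```

Under this policy the modified Branch1/Branch2 grammar above would also lose packrat mode, which is the trade-off being accepted here.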
Thanks for your excellent research!
It's been a while since I've last used peglib! The error reporting is amazing!
I want to write a program that parses Python, and I think I have encountered a bug when using packrat parsing.
Without packrat parsing everything works fine. With packrat parsing, after the code is parsed and LineBreak? CodeSegment from the first rule are matched, it tries to match another CodeSegment, but for some reason from the middle of the input. My parser prints debug output to show this. I am sorry I didn't reduce the grammar to narrow down the error, but maybe you know what's going on right away and I can spare the effort.
Here is my parser:
Here is my grammar python.peg:
here is my input test.py:
and here is the debug output: