zevv / npeg

PEGs for Nim, another take
MIT License
329 stars 22 forks source link

Failure to consume in particular case #25

Closed JohnAD closed 4 years ago

JohnAD commented 4 years ago

I'm working on a SQL parser that converts a "WHEN" expression into reverse-polish notation. I'm coming across a problem where the system is ... possibly .. failing to consume the end-point numbers.

I must admit I'm new to PEG in general. So, if I'm making a newbie mistake, I do apologize.

Rather than go into great detail, I wrote a tiny program that demonstrates it:

import npeg
import strutils

let parser = peg("equation", p: seq[string]):

  equation <- allExpr

  plusMinus <- '+' | '-'

  number <- (integer)

  integer <- >+Digit:
    p.add "literal:" & $1

  whitespace <- +(Blank | Space | Cntrl)

  addExpr <- number * ?whitespace * >plusMinus * ?whitespace * (addExpr | number):
    p.add "op:" & $1

  allExpr <- (addExpr | number)

let testStr = "1 + 2 - 3 + 4 - 5"

var answer: seq[string] = @[]

let successFlag = parser.match(testStr, answer).ok

echo "test = [", testStr, "]"

echo "success = ", $successFlag

echo "answer = "
for item in answer:
  echo "  ", $item

The result I'm getting in answer is:

answer = 
  literal:1
  literal:2
  literal:3
  literal:4
  literal:5
  literal:5
  op:-
  op:+
  op:-
  op:+

but what I'd expect is:

answer = 
  literal:1
  literal:2
  literal:3
  literal:4
  literal:5
  op:-
  op:+
  op:-
  op:+

So, the section that is parsing a number is running twice on the same "5" text. Am I missing something or is there a bug?

zevv commented 4 years ago
                     ╭──────»───────╮  ╭╶╶╶╶╶╶╶╶╶╶╶╶╶╮  ╭──────»───────╮                  
addExpr o──[number]─»┴─[whitespace]─┴»───[plusMinus]───»┴─[whitespace]─┴»─┬─[addExpr]─┬──o
                                       ╰╶╶╶╶╶╶╶╶╶╶╶╶╶╯                    ╰─[number]──╯   

This is the main rule of your grammar (generated with -d:npegGraph). Here's what is happening:

The problem you see is with the part "if that fails". This will cause NPeg to backtrack, basically saying "Hey, that didn't work, lets try something else". If you would let NPeg collect the captures for you, it would have thrown the last one out at that point, so all would be ok. The issue is that for code block captures (copied from the manual) "the Nim code gets executed during parsing, even if the match is part of a pattern that fails and is later backtracked.".

So: at the last bit of the line, NPeg tries to find another addExpr, it matches your number and calls your code block. Then it goes "Hey, that didn't work, backtracks to go search a number instead, and again calls your code block.

Partly, this is a limitation of NPeg, and given the other issues that are currently open this is something for which I'd need to find a solution. The problem is not unique for NPeg, I know other Peg implementations also show the same behaviour, notably LPeg, which was the inspiration for NPegs implementation.

In idiomatic PEG, I'd rewrite your grammar like this: first match a number, and then try to match zero or more times ( plusMinus followed by a number)

let parser = peg("equation", p: seq[string]):

  equation <- allExpr

  plusMinus <- >('+' | '-'):
    p.add "op:" & $1    

  number <- (integer)

  integer <- >+Digit:    
    echo "got int ", $1  
    p.add "literal:" & $1               

  whitespace <- +(Blank | Space | Cntrl)                                

  addExpr <- ?whitespace * >plusMinus * ?whitespace * number

  allExpr <- number * *addExpr       

This is what the graph now looks like;

          ╭──────»───────╮  ╭╶╶╶╶╶╶╶╶╶╶╶╶╶╮  ╭──────»───────╮             
addExpr o─┴─[whitespace]─┴»───[plusMinus]───»┴─[whitespace]─┴»─[number]──o
                            ╰╶╶╶╶╶╶╶╶╶╶╶╶╶╯                               

                     ╭──────»──────╮  
allExpr o──[number]─»┴┬─[addExpr]─┬┴─o
                      ╰─────«─────╯   

Non related, but just a comment: if your whitespace is always optional, move the ? to the whitespace rule itself so you don't have to put that before every whitespace. Explicit handling of whitespace is always a bit of a pain in the behind in PEGs, as there simpy is no lexer doing that. I usually use a very short name like s for that so it's a bit less intrusive.

zevv commented 4 years ago

Hey @JohnAD, anything I can do for you with this, or shall I close the issue?

JohnAD commented 4 years ago

Feel free to close the issue. Your explanation was very useful. Thank you!

I've decided to use npeg strictly for tokenizing the SQL statements; which is very useful. I'll then handle with "WHEN" clauses with hand-built code for interpretation.

zevv commented 4 years ago

I've decided to use npeg strictly for tokenizing the SQL statements; which is very useful. I'll then handle with "WHEN" clauses with hand-built code for interpretation.

Sounds like a good plan, just pick the appropriate tools for the job, and npeg often isnt :)