zevv / npeg

PEGs for Nim, another take
MIT License

Prevent code blocks captures from executing #45

Closed Patitotective closed 1 year ago

Patitotective commented 2 years ago

I want to make a pattern that uses another pattern but doesn't execute that pattern's code block capture.

import npeg

# This parser parses words and no words
# Words are one or more alpha characters
# And no words are words with a dash before them
const parser = peg("nodes", data: seq[string]):
  nodes <- *(node * (' ' | !1))
  node <- noWord | word
  noWord <- '-' * word:
    data.add($0)
  word <- +Alpha:
    data.add($0)

var data: seq[string]
assert parser.match("a -b", data).ok
assert data == @["a", "b", "-b"] # I want it to be @["a", "-b"]

There could be, perhaps, an operator that allowed it:

%P # matches P without executing its code block capture


Patitotective commented 2 years ago

Actually this is a much better example of my problem:

import npeg

# This parser parses words and no words
# Words are one or more alpha characters
# And no words are words with a dash after them
const parser = peg("nodes", data: seq[string]):
  nodes <- *(node * (' ' | !1))
  node <- noWord | word
  noWord <- word * dash
  word <- +Alpha:
    data.add($0)
  dash <- '-':
    data.add($0)

var data: seq[string]
assert parser.match("a b-", data).ok
assert data == @["a", "a", "b", "-"]

I want the word's capture block to be executed from noWord only if noWord matches successfully.

zevv commented 2 years ago

Hm, I see what you're trying to do, but I'm not sure it is a good idea to solve it with yet another operator. As you have seen in the manual, code block captures are a bit of a PITA because they always execute once their rule matches, even if they are part of a pattern that is later backtracked.

One solution would be to be explicit about word and noWord having the trailing dash or not by using the ! operator, like so:

  body <- +Alpha
  word <- body * !'-'
  noWord <- body * '-'

body is a simple pattern matching a string of alpha characters; word matches if and only if body is not followed by a -, while noWord matches only if body is followed by a -.

You can incorporate this into your example like this:

import npeg

const parser = peg("nodes", data: seq[string]):
  nodes <- *(node * (' ' | !1))
  node <- noWord | word
  body <- +Alpha
  # The code blocks only run once the lookahead or the dash has been checked
  word <- >body * !'-':
    data.add("word " & $1)
  noWord <- >body * '-':
    data.add("noword " & $1)

var data: seq[string]
assert parser.match("a b- c", data).ok
assert data == @["word a", "noword b", "word c"]

Does this solve your problem?

Patitotective commented 2 years ago

My actual peg is a little more complicated: it is meant to be a lexer, and it adds tokens to a stack whenever it finds them. So my word pattern there has more patterns inside that would match (and execute their code block captures) before I can check that it is a noWord (!'-').

I'll try to explain my actual use case: I'm trying to implement the KDL document language in Nim. Its syntax is pretty straightforward; it goes like this:

# node val key=val val1 key1=val1 val2 # Properties and arguments
node "Hello" "name"="zevv" 1 true age=20

Therefore in my lexerPeg I have (prop | value), because properties and values can be interspersed, separated by spaces, all over the node. The issue is that prop has the strOrIdent pattern inside, which matches an identifier (without quotes) or a string, and the code block capture of the string is called before it checks whether there is a '=' after it or not. Adding the tokens to the stack when prop matches, instead of when its sub-patterns match, makes it a little harder, because I would need to parse value again (remember a property is a key=val) to know which kind it is. I could perhaps parse the values and store them in lexer.currentValueToken, which is then used by other patterns (like prop or node).

Patitotective commented 2 years ago

This could be an example of my more complex peg:

import npeg

const parser = peg("nodes", data: seq[string]):
  nodes <- *(node * (' ' | !1))
  node <- extraWord | word
  extraWord <- word * extra * dash
  word <- +Alpha:
    data.add($0)
  dash <- '-':
    data.add($0)
  extra <- number | dot
  number <- +Digit:
    data.add($0 & "(int)")
  dot <- '.':
    data.add($0 & "(dot)")

var data: seq[string]
assert parser.match("a b1- c.", data).ok
assert data == @["a", "a", "b", "1(int)", "-", "c", ".(dot)", "c"]

zevv commented 2 years ago

Hmm, your real grammar is quite big already, it's going to cost me some time to properly get into that, so I'll just be ignorant and look at your smaller examples for now...

I think that the general idea with code block captures is that you should run these as late as possible - that is, when you are sure you have a proper match that will not backtrack. In the meantime you can collect everything you need in regular captures in the nested rules, and access these in your code block using $1 .. $9. Alternatively you could pass around some local state to store the things you need later yourself. So don't run a code block capture for your word, just make it a normal string capture so it will be available later when you have decided it is either a prop or a value; then in your prop and value rules, it will be available for you in one of the $ variables.

Can you give the assert() of what you would like as the result for your last example's match?
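
A minimal sketch of that advice for a KDL-like node line, assuming a much simplified grammar. The rule names (name, prop, arg, ident, str, number, value) and the strings added to data are made up for illustration and are not taken from the real lexer. The point is only that ident and str carry no code block, so nothing is recorded until prop has seen the '=' or arg has matched a whole value:

```nim
import npeg

# Hypothetical, simplified KDL-like grammar: not the real lexer.
const lexer = peg("node", data: seq[string]):
  node <- name * *(+' ' * (prop | arg)) * !1
  # Plain sub-patterns: no code blocks, so backtracking over them is harmless
  ident <- Alpha * *Alnum
  str <- '"' * *(1 - '"') * '"'
  number <- +Digit
  value <- str | number | ident
  name <- >ident:
    data.add("name " & $1)
  # Runs only after key, '=' and value have all matched
  prop <- >(ident | str) * '=' * >value:
    data.add("prop " & $1 & "=" & $2)
  # Tried only after prop has failed, so the value is recorded once
  arg <- >value:
    data.add("arg " & $1)

var data: seq[string]
assert lexer.match("""node "Hello" "name"="zevv" 1 true age=20""", data).ok
assert data == @["name node", "arg \"Hello\"", "prop \"name\"=\"zevv\"",
                 "arg 1", "arg true", "prop age=20"]
```

Because prop is tried before arg, a string such as "Hello" is first matched as a possible key, dropped without side effects when no '=' follows, and then matched again as an argument.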

Patitotective commented 2 years ago

Thanks for the suggestions. I will try what you're saying; I also think adding tokens independently without validating them is not good. I expect it to be assert data == @["a", "b", "1(int)", "-", "c"] and then fail because . shouldn't match any pattern.
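
One way to get that result for the last example, as a sketch of the "run code blocks late" approach. The rule names numberWord, dotWord and plainWord are made up here, and extraWord is split in two so each code block knows which kind of extra it captured. The inner rules have no code blocks; each alternative of node records its tokens only after it has matched completely:

```nim
import npeg

const parser = peg("nodes", data: seq[string]):
  nodes <- *(node * (' ' | !1))
  node <- numberWord | dotWord | plainWord
  # Inner rules only match; they record nothing on their own
  word <- +Alpha
  number <- +Digit
  dot <- '.'
  dash <- '-'
  # Each code block runs once the whole alternative has matched
  numberWord <- >word * >number * >dash:
    data.add($1)
    data.add($2 & "(int)")
    data.add($3)
  dotWord <- >word * >dot * >dash:
    data.add($1)
    data.add($2 & "(dot)")
    data.add($3)
  plainWord <- >word:
    data.add($1)

var data: seq[string]
assert parser.match("a b1- c.", data).ok
assert data == @["a", "b", "1(int)", "-", "c"]
```

The match still reports ok because nodes is a zero-or-more repetition that simply stops before the final "c."; the "c" is recorded anyway because plainWord's code block runs as soon as plainWord matches.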

zevv commented 1 year ago

Closing this for inactivity, feel free to reopen if appropriate.