yhirose / cpp-peglib

A single file C++ header-only PEG (Parsing Expression Grammars) library
MIT License

Inter-token and inter-rule "noise" #248

Closed: kfsone closed this issue 1 year ago

kfsone commented 1 year ago

Could you make a very subtle change to how %whitespace is handled, so that instead of looking for it like a regular token, you only test it against input that no token has matched (outside <>)?

Problem: Most parser generators either (1) ignore whitespace entirely, so you can't easily require it, or (2) make you clutter your definition with every place whitespace is allowed.

(2) is really awful when you have a grammar with several levels of whitespace: space/tab between terminals, space/tab/comment/newline between "statements".

Given:

IfThen
  <- IfCondition Newline Action Newline

IfCondition
  <- 'if' Spacing+ Condition Spacing*

Action
  <- 'let' Spacing+ Identifier Spacing* Equals Spacing* Value Spacing*

Spacing
  <- [ \t]

Newline
  <- Spacing* (LineComment / '\r'? '\n' / ! .)

This grammar is complicated by 'LineComment', because we probably don't want to allow:

if  // why would you put a comment here?
(  // ok, inside the parens is maybe different
  x = // really you're going to linebreak and comment here?
  1  // well this is less awful
)  // please don't put lots of comments between the if and the one-line action
// the fools
let // you probably don't want a comment here
x//or here
=//or here, because it looks like =/
// stop with the comments
1 // no!

(The grammar I wrote has a problem, because Action's trailing "Spacing*" and IfThen's "Newline" will conflict.)

Peglib's own .peg grammar is hard to read because so much of it is devoted to whitespace allowances.

Making this change to %whitespace would also allow this grammar to work: you're not looking for %whitespace, you're just not reporting an error when you do encounter it outside <>s.

[image: grammar screenshot, not preserved] (I should probably have used Space+ for efficiency)

It also matches "hello \t\t\t 1", but it would require at least one space/tab between 'hello' and '1'.
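
To make the proposed rule concrete, here is a minimal, self-contained C++ sketch of "only skip noise when the expected token fails to match". It is not cpp-peglib's implementation and uses none of its API. Every name in it (Matcher, literal, one_of, digits, skip_noise, match_sequence) is hypothetical, and the noise rule is assumed to be spaces and tabs only.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// A matcher returns how many characters it consumes at the start of `s`,
// or kNoMatch when it does not match there. (Hypothetical helper types.)
using Matcher = std::function<std::size_t(std::string_view)>;
constexpr std::size_t kNoMatch = std::string_view::npos;

// Matches the exact text, e.g. 'hello'.
Matcher literal(std::string text) {
  return [text = std::move(text)](std::string_view s) {
    return s.substr(0, text.size()) == std::string_view(text) ? text.size()
                                                              : kNoMatch;
  };
}

// Matches exactly one character from `chars`, e.g. an explicit [ \t].
Matcher one_of(std::string chars) {
  return [chars = std::move(chars)](std::string_view s) {
    const bool ok = !s.empty() && chars.find(s.front()) != std::string::npos;
    return ok ? std::size_t{1} : kNoMatch;
  };
}

// Matches one or more decimal digits.
Matcher digits() {
  return [](std::string_view s) {
    std::size_t n = 0;
    while (n < s.size() && std::isdigit(static_cast<unsigned char>(s[n]))) ++n;
    return n > 0 ? n : kNoMatch;
  };
}

// Assumed noise rule: zero or more spaces/tabs. Returns the count skipped.
std::size_t skip_noise(std::string_view s) {
  std::size_t n = 0;
  while (n < s.size() && (s[n] == ' ' || s[n] == '\t')) ++n;
  return n;
}

// Try each token where we stand. Only if that fails, skip noise and retry
// once, so an explicit spacing token still gets first claim on the input.
bool match_sequence(const std::vector<Matcher>& tokens, std::string_view input) {
  std::size_t pos = 0;
  for (const auto& token : tokens) {
    std::size_t n = token(input.substr(pos));
    if (n == kNoMatch) {
      pos += skip_noise(input.substr(pos));
      n = token(input.substr(pos));
      if (n == kNoMatch) return false;
    }
    pos += n;
  }
  pos += skip_noise(input.substr(pos));  // tolerate trailing noise
  return pos == input.size();
}
```

With the token list {literal("hello"), one_of(" \t"), digits()}, match_sequence accepts "hello 1" and "hello \t\t\t 1" but rejects "hello1": the explicit space/tab token consumes one character, and the remaining run is only treated as noise once digits() fails to match at that position.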

This makes writing a "newline" rule much simpler, e.g. for pseudo-Go, where we have to allow nestable one-line block comments as well as whitespace. We no longer have to worry about Spacing in the grammar, but in scenarios where we require an explicit space, we'd still be able to check for it.

Spacing <- ( [ \t] / InLineBlockComment )

%whitespace <- Spacing+
# a /* block comment */ with no newline in it.
InLineBlockComment <- '/*' ([^/*\r\n] / InLineBlockComment / '*'+ [^*/\r\n] / '/'+ [^*/\r\n])* '*/'

Grammar <- EndOfLine* (Statement NL)* Statement? EOI

~NL <- (';' EndOfLine* / EndOfLine+ / EOI)
EndOfLine <- LineComment? '\r'? '\n'
LineComment <- '//' [^\r\n]*

IfStatement <- 'if' Condition '{' NL Statements NL '}'

Statements <- ((Statement NL)* Statement)?
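
For comparison, the intent of InLineBlockComment (a nestable /* ... */ comment that must not contain a newline) can be sketched as a small hand-written C++ recognizer. This is an illustration of the intended behaviour under my own assumptions, not a translation of the rule above and not part of cpp-peglib. The names inline_block_comment_length and kNoComment are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

constexpr std::size_t kNoComment = std::string_view::npos;

// Returns the length of a nestable /* ... */ comment at the start of `s`.
// Returns kNoComment if `s` does not start with a comment, if the comment
// is unterminated, or if it contains '\r' or '\n' before it closes.
std::size_t inline_block_comment_length(std::string_view s) {
  if (s.substr(0, 2) != "/*") return kNoComment;
  std::size_t pos = 2;
  int depth = 1;  // number of currently open "/*"
  while (pos < s.size()) {
    const char c = s[pos];
    if (c == '\r' || c == '\n') return kNoComment;  // must stay on one line
    const std::string_view two = s.substr(pos, 2);
    if (two == "/*") {
      ++depth;  // nested opener
      pos += 2;
    } else if (two == "*/") {
      --depth;
      pos += 2;
      if (depth == 0) return pos;  // outermost comment closed
    } else {
      ++pos;  // ordinary comment character
    }
  }
  return kNoComment;  // ran out of input while still inside a comment
}
```

A depth counter stands in for the recursive reference to InLineBlockComment. A noise rule built on it would treat the returned length as skippable input, while a comment that spans a newline is left for the statement-level rules to deal with.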

yhirose commented 1 year ago

I think we can't live without 'noise' in PEG. Other PEG libraries provide a similar whitespace-handling feature, but I don't think any of them provides a perfect solution. So I am not pursuing this, but a pull request is always welcome!