yhirose / cpp-peglib

A single file C++ header-only PEG (Parsing Expression Grammars) library
MIT License
880 stars, 112 forks

Optional suffix with no %whitespace #250

Closed: kfsone closed this issue 1 year ago

kfsone commented 1 year ago

Trying to capture command-line arguments as "switch and value" without allowing whitespace between the two parts.

The following grammar "works", but unfortunately allows spaces between CMDSWITCH and CMDVALUE.

Grammar
    <-  ( _ / Statement )* EOI

Statement
    <-  Default

Default
    <-  Asset ↑ CmdlineArg+ ↑ Newline+
Asset
    <-  < NAME >

CmdlineArg
    <-  CMDSWITCH ( '=' (STRING / CMDVALUE)? )?

~Newline
    <- < COMMENT? [\r\n]+ >

%whitespace
    <- [ \t]*

%word
    <- [a-zA-Z][a-zA-Z0-9_.-]*

~_
    <- (COMMENT? EOL)+

~EOL
    <- '\r'? '\n'

~EOI
    <- ! .      # literally: not anything

~COMMENT
    <- < '//' [^\n]* >

NAME
    <- < [a-zA-Z_] [a-zA-Z0-9_]* >

STRING
    <- < '"' ( '\\' !('\r'? '\n') / '""' / [^"\\\r\n] )* '"' >

CMDSWITCH
    <- < '-' '-'? [a-zA-Z0-9_] [a-zA-Z0-9_.-]* >

CMDVALUE
    <- < ( '\\' [^\r\n] / [^ \t\r\n\\] )+ >

Wrapping CmdlineArg in '<' ... '>' stops the whitespace from being accepted, but it then produces only a single value instead of separate switch and value parts.

Is there some way to both prevent whitespace AND still retain the split values?

yhirose commented 1 year ago

Could you make the grammar as small as possible, so that I can understand what you are mentioning? Thanks!

kfsone commented 1 year ago

Grammar <- ( NAME CmdlineArg EOL )* EOI

%whitespace <- [ \t]*
%word <- [a-zA-Z][a-zA-Z0-9_.-]*
~EOL <- ('\r'? '\n')+
~EOI <- ! .

CmdlineArg <- CMDSWITCH ( '=' (STRING / CMDVALUE)? )?

NAME <- < [a-zA-Z_] [a-zA-Z0-9_]* >

STRING <- < '"' ( '\\' !('\r'? '\n') / '""' / [^"\\\r\n] )* '"' >

CMDSWITCH <- < '-' '-'? [a-zA-Z0-9_] [a-zA-Z0-9_.-]* >

CMDVALUE <- < ( '\\' [^\r\n] / [^ \t\r\n\\] )+ >

Inputs that it correctly accepts:

thing1 --switch
thing2 --switch=value

but it also incorrectly allows

thing3 --switch = something

The last input currently parses successfully, which is incorrect.

kfsone commented 1 year ago

One workaround is to write a token.immediate to do this match, but because it can't use named patterns or named terminals, it duplicates grammar segments. I really don't want to embed the string pattern multiple times, since STRING is used frequently in the real grammar.

yhirose commented 1 year ago

As long as you use %whitespace, there is no way to disallow spaces between tokens. Sorry...
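
A possible alternative that follows from this answer is to drop %whitespace and state explicitly where whitespace may appear. The fragment below is an untested sketch of the reduced grammar from this thread rewritten that way. The `_` rule and its placement are assumptions made for illustration, not something proposed by the maintainer, and %word is omitted for brevity.

```
# No %whitespace: nothing is skipped implicitly between tokens.
Grammar     <- ( _ NAME _ CmdlineArg EOL )* EOI

# No `_` around '=', so "--switch = something" is not matched as one argument.
CmdlineArg  <- CMDSWITCH ( '=' (STRING / CMDVALUE)? )?

# Explicit optional whitespace, placed only where it is allowed.
~_          <- [ \t]*

# Trailing spaces before a line break are still tolerated.
~EOL        <- ( _ '\r'? '\n' )+
~EOI        <- ! .

NAME        <- < [a-zA-Z_] [a-zA-Z0-9_]* >
STRING      <- < '"' ( '\\' !('\r'? '\n') / '""' / [^"\\\r\n] )* '"' >
CMDSWITCH   <- < '-' '-'? [a-zA-Z0-9_] [a-zA-Z0-9_.-]* >
CMDVALUE    <- < ( '\\' [^\r\n] / [^ \t\r\n\\] )+ >
```

With this layout, `thing2 --switch=value` should still yield CMDSWITCH and CMDVALUE as separate values, while `thing3 --switch = something` should fail because EOL cannot match at the `=`. The cost is that a larger grammar needs `_` written out at every position where whitespace is permitted.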