yhirose / cpp-peglib

A single file C++ header-only PEG (Parsing Expression Grammars) library
MIT License

Missing mention to "rep" in operators list in README #193

Closed mingodad closed 2 years ago

mingodad commented 2 years ago

I'm still trying to extract the peglib grammar (why isn't it already available?) and found that the rep operator it uses is not listed in the README like the others.

mingodad commented 2 years ago

While testing my extracted grammar I noticed that when a grammar ends with a comment that has no trailing newline, the parser rejects it. See culebra.peg, or try the grammar below from the README on the playground (note that there is no newline after the last line). Removing the EndOfLine from the Comment rule fixes the problem and doesn't seem to have negative side effects (see below).

KEYWORD   <- 'keyword'
KEYWORDI  <- 'case_insensitive_keyword'
WORD      <-  < [a-zA-Z0-9] [a-zA-Z0-9-_]* >    # token boundary operator is used.
IDNET     <-  < IDENT_START_CHAR IDENT_CHAR* >  # token boundary operator is used.

Output:

4:83 syntax error

Actual hardcoded grammar:

    g["Comment"] <=
        seq(chr('#'), zom(seq(npd(g["EndOfLine"]), dot())), g["EndOfLine"]);

Fixed to handle comments not ending in newline:

    g["Comment"] <=
        seq(chr('#'), zom(seq(npd(g["EndOfLine"]), dot())));
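
To illustrate why the trailing EndOfLine matters, here is a minimal hand-written sketch of the two Comment variants above. It does not use peglib; the helper names match_eol and match_comment are made up for this example, and the require_eol flag simply switches between the original and the fixed rule.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// EndOfLine <- "\r\n" / '\n' / '\r'
// Returns the number of characters matched at position i, or 0 if none.
std::size_t match_eol(const std::string& s, std::size_t i) {
    if (i >= s.size()) return 0;
    if (s[i] == '\r') return (i + 1 < s.size() && s[i + 1] == '\n') ? 2 : 1;
    if (s[i] == '\n') return 1;
    return 0;
}

// require_eol == true:  Comment <- '#' (!EndOfLine .)* EndOfLine   (original)
// require_eol == false: Comment <- '#' (!EndOfLine .)*             (fixed)
// Returns the number of characters consumed, or 0 on failure.
std::size_t match_comment(const std::string& s, std::size_t i, bool require_eol) {
    if (i >= s.size() || s[i] != '#') return 0;
    std::size_t p = i + 1;
    while (p < s.size() && match_eol(s, p) == 0) ++p;  // (!EndOfLine .)*
    if (!require_eol) return p - i;
    std::size_t n = match_eol(s, p);                   // mandatory EndOfLine
    return n ? (p + n - i) : 0;                        // fails at end of input
}
```

With the input "# c" (no newline), match_comment(..., true) fails because there is no EndOfLine left to consume at the end of the input, while match_comment(..., false) consumes the whole comment. In the fixed variant the newline is left for the Space rule, which already accepts EndOfLine.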

mingodad commented 2 years ago

Note: I edited this message to include the latest fully working, manually extracted grammar and the EBNF.

Here is the latest extracted grammar. It has trouble parsing Sum ← List(Product, SumOpe), List(I, D) ← I (D I)* and IdentStart <- !"↑" !"⇑" ([a-zA-Z_%] / [\u0080-\uFFFF]); any help fixing it is appreciated.

# Setup PEG syntax parser
Grammar <-  Spacing  Definition+  EndOfFile

Definition <-
    Ignore  IdentCont  Parameters  LEFTARROW Expression  Instruction?
    / Ignore  Identifier  LEFTARROW  Expression Instruction?

Expression <-  Sequence  (SLASH  Sequence)*

Sequence <-  (CUT /  Prefix)*

Prefix <-  (AND /  NOT)?  SuffixWithLabel

SuffixWithLabel <- Suffix  (LABEL  Identifier)?

Suffix <-  Primary  Loop?

Loop <-  QUESTION /  STAR /  PLUS /  Repetition

Primary <-
    Ignore  IdentCont  Arguments !LEFTARROW
    / Ignore  Identifier !(Parameters?  LEFTARROW)
    / OPEN  Expression  CLOSE
    / BeginTok  Expression  EndTok
    / BeginCapScope  Expression  EndCapScope
    / BeginCap  Expression  EndCap
    / BackRef
    / LiteralI
    / Dictionary
    / Literal
    / NegatedClass
    / Class
    /  DOT

Identifier <-  IdentCont  Spacing

IdentCont <- IdentStart  IdentRest*

IdentStart <-  !"↑"  !"⇑" ([a-zA-Z_%] /  [\u0080-\uFFFF])

IdentRest <-  IdentStart /  [0-9]

Dictionary <-  LiteralD  (PIPE  LiteralD)+

lit_ope <-
    [']  <(![']  Char)*> [']  Spacing
    / ["]  <(!["]  Char)*> ["]  Spacing

Literal <-  lit_ope

LiteralD <-  lit_ope

LiteralI <-
    [']  <(![']  Char)*>  "'i" Spacing
    / ["]  <(!["]  Char)*>  '"i' Spacing

# NOTE: The original Brian Ford's paper uses 'zom' instead of 'oom'.
Class <-  '['  !'^' <(!']'  Range)+>  ']' Spacing
NegatedClass <-  "[^" <(!']'  Range)+>  ']' Spacing

Range <-  (Char  '-'  Char) /  Char

Char <-
    '\\'  [nrt'\"[\]\\^]
    / '\\'  [0-3]  [0-7]  [0-7]
    / '\\'  [0-7]  [0-7]?
    / "\\x"  [0-9a-fA-F]  [0-9a-fA-F]?
    / "\\u" (((('0' [0-9a-fA-F]) / "10") [0-9a-fA-F]{4,4}) / [0-9a-fA-F]{4,5})
    / !'\\'   .

Repetition <- BeginBlacket  RepetitionRange  EndBlacket

RepetitionRange <-
    Number  COMMA  Number
    / Number  COMMA
    /  Number
    / COMMA  Number

Number <-  [0-9]+  Spacing

LEFTARROW <-  ("<-" / "←")  Spacing

~SLASH <-  '/'  Spacing
~PIPE <-  '|'  Spacing
AND <-  '&'  Spacing
NOT <-  '!'  Spacing
QUESTION <- '?'  Spacing
STAR <-  '*'  Spacing
PLUS <-  '+'  Spacing
~OPEN <-  '('  Spacing
~CLOSE <- ')'  Spacing
DOT <-  '.'  Spacing

CUT <-  "↑"  Spacing
~LABEL <-  ('^' /  "⇑")  Spacing

~Spacing <-  (Space /  Comment)*
Comment <- '#'  (!EndOfLine   . )*
Space <-  ' ' /  '\t' /  EndOfLine
EndOfLine <-  "\r\n" /  '\n' /  '\r'
EndOfFile <-  ! .

~BeginTok <-  '<'  Spacing
~EndTok <-  '>'  Spacing

~BeginCapScope <-  '$'  '('  Spacing
~EndCapScope <-  ')'  Spacing

BeginCap <-  '$'  <IdentCont>  '<'  Spacing
~EndCap <-  '>'  Spacing

BackRef <-  '$'  <IdentCont>  Spacing

IGNORE <-  '~'

Ignore <-  IGNORE?
Parameters <-  OPEN  Identifier (COMMA  Identifier)*  CLOSE
Arguments <-  OPEN  Expression (COMMA  Expression)*  CLOSE
~COMMA <-  ','  Spacing

# Instruction grammars
Instruction <-
    BeginBlacket (InstructionItem  (InstructionItemSeparator InstructionItem)*)? EndBlacket
InstructionItem <- PrecedenceClimbing /  ErrorMessage /  NoAstOpt
~InstructionItemSeparator <-  ';'  Spacing

~SpacesZom <-  Space*
~SpacesOom <-  Space+
~BeginBlacket <-  '{'  Spacing
~EndBlacket <-  '}'  Spacing

# PrecedenceClimbing instruction
PrecedenceClimbing <- "precedence"  SpacesOom  PrecedenceInfo (SpacesOom  PrecedenceInfo)*  SpacesZom
PrecedenceInfo <- PrecedenceAssoc (~SpacesOom  PrecedenceOpe)+
PrecedenceOpe <-
    ['] <(!(Space /  ['])  Char)*> [']
    / ["] <(!(Space /  ["])  Char)*> ["]
    / <(!(PrecedenceAssoc /  Space /  '}')  . )+>
PrecedenceAssoc <-  [LR]

# Error message instruction
ErrorMessage <- "message"  SpacesOom  LiteralD  SpacesZom

# No Ast node optimization instruction
NoAstOpt <-  "no_ast_opt"  SpacesZom

And here it is converted to EBNF, viewable at https://www.bottlecaps.de/rr/ui:

//# Setup PEG syntax parser
Grammar::=  Spacing  Definition+  EndOfFile

Definition::=
    Ignore  IdentCont  Parameters  LEFTARROW Expression  Instruction?
    | Ignore  Identifier  LEFTARROW  Expression Instruction?

Expression::=  Sequence  (SLASH  Sequence)*

Sequence::=  (CUT |  Prefix)*

Prefix::=  (AND |  NOT)?  SuffixWithLabel

SuffixWithLabel::= Suffix  (LABEL  Identifier)?

Suffix::=  Primary  Loop?

Loop::=  QUESTION |  STAR |  PLUS |  Repetition

Primary::=
    Ignore  IdentCont  Arguments _NOT_ LEFTARROW
    | Ignore  Identifier _NOT_ (Parameters?  LEFTARROW)
    | OPEN  Expression  CLOSE
    | BeginTok  Expression  EndTok
    | BeginCapScope  Expression  EndCapScope
    | BeginCap  Expression  EndCap
    |  BackRef
    | LiteralI
    |  Dictionary
    |  Literal
    |  NegatedClass
    | Class
    |  DOT

Identifier::=  IdentCont  Spacing

IdentCont::= IdentStart  IdentRest*

IdentStart::=  _NOT_ "↑"  _NOT_ "⇑" ([a-zA-Z_%] |  [\u0080-\uFFFF])

IdentRest::=  IdentStart |  [0-9]

Dictionary::=  LiteralD  (PIPE  LiteralD)+

lit_ope::=
    ['] _TKOPEN_ (_NOT_ [']  Char)* _TKCLOSE_ [']  Spacing
    | ["] _TKOPEN_ (_NOT_ ["]  Char)* _TKCLOSE_ ["]  Spacing

Literal::=  lit_ope

LiteralD::=  lit_ope

LiteralI::=
    ['] _TKOPEN_ (_NOT_ [']  Char)* _TKCLOSE_  "'i" Spacing
    | ["] _TKOPEN_ (_NOT_ ["]  Char)* _TKCLOSE_  '"i' Spacing

//# NOTE: The original Brian Ford's paper uses 'zom' instead of 'oom'.
Class::=  '['  _NOT_ '^' _TKOPEN_ ( _NOT_ ']'  Range)+ _TKCLOSE_  ']' Spacing
NegatedClass::=  "[^" _TKOPEN_ ( _NOT_ ']'  Range)+ _TKCLOSE_  ']' Spacing

Range::=  (Char  '-'  Char) |  Char

Char::=
    '\\'  [nrt'\"#x5B\#x5d\\^]
    | '\\'  [0-3]  [0-7]  [0-7]
    | '\\'  [0-7]  [0-7]?
    | "\\x"  [0-9a-fA-F]  [0-9a-fA-F]?
    | "\\u" (((('0' [0-9a-fA-F]) | "10") [0-9a-fA-F]'{4,4}') | [0-9a-fA-F]'{4,5}')
    | _NOT_ '\\'   .

Repetition::= BeginBlacket  RepetitionRange  EndBlacket

RepetitionRange::=
    Number  COMMA  Number
    | Number  COMMA
    |  Number
    | COMMA  Number

Number::=  [0-9]+  Spacing

LEFTARROW::=  ("<-" | "←")  Spacing

/*~*/SLASH::=  '/'  Spacing
/*~*/PIPE::=  '|'  Spacing
AND::=  '&'  Spacing
NOT::=  '!'  Spacing
QUESTION::= '?'  Spacing
STAR::=  '*'  Spacing
PLUS::=  '+'  Spacing
/*~*/OPEN::=  '('  Spacing
/*~*/CLOSE::= ')'  Spacing
DOT::=  '.'  Spacing

CUT::=  "↑"  Spacing
/*~*/LABEL::=  ('^' |  "⇑")  Spacing

/*~*/Spacing::=  (Space |  Comment)*
Comment::= '#'  (_NOT_ EndOfLine   . )*
Space::=  ' ' |  '\t' |  EndOfLine
EndOfLine::=  "\r\n" |  '\n' |  '\r'
EndOfFile::=  _NOT_  .

/*~*/BeginTok::=  '<'  Spacing
/*~*/EndTok::=  '>'  Spacing

/*~*/BeginCapScope::=  '$'  '('  Spacing
/*~*/EndCapScope::=  ')'  Spacing

BeginCap::=  '$' _TKOPEN_ IdentCont _TKCLOSE_  '<'  Spacing
/*~*/EndCap::=  '>'  Spacing

BackRef::=  '$' _TKOPEN_ IdentCont _TKCLOSE_  Spacing

IGNORE::=  '~'

Ignore::=  IGNORE?
Parameters::=  OPEN  Identifier (COMMA  Identifier)*  CLOSE
Arguments::=  OPEN  Expression (COMMA  Expression)*  CLOSE
/*~*/COMMA::=  ','  Spacing

//# Instruction grammars
Instruction::=
    BeginBlacket (InstructionItem  (InstructionItemSeparator InstructionItem)*)? EndBlacket
InstructionItem::= PrecedenceClimbing |  ErrorMessage |  NoAstOpt
/*~*/InstructionItemSeparator::=  ';'  Spacing

/*~*/SpacesZom::=  Space*
/*~*/SpacesOom::=  Space+
/*~*/BeginBlacket::=  '{'  Spacing
/*~*/EndBlacket::=  '}'  Spacing

//# PrecedenceClimbing instruction
PrecedenceClimbing::= "precedence"  SpacesOom  PrecedenceInfo (SpacesOom  PrecedenceInfo)*  SpacesZom
PrecedenceInfo::= PrecedenceAssoc (/*~*/SpacesOom  PrecedenceOpe)+
PrecedenceOpe::=
    ['] _TKOPEN_ (_NOT_ (Space |  ['])  Char)* _TKCLOSE_ [']
    | ["] _TKOPEN_ (_NOT_ (Space |  ["])  Char)* _TKCLOSE_ ["]
    | _TKOPEN_ (_NOT_ (PrecedenceAssoc |  Space |  '}')  . )+ _TKCLOSE_
PrecedenceAssoc::=  [LR]

//# Error message instruction
ErrorMessage::= "message"  SpacesOom  LiteralD  SpacesZom

//# No Ast node optimization instruction
NoAstOpt::=  "no_ast_opt"  SpacesZom

// Tokens added for EBNF
_NOT_ ::= '!'
_TKOPEN_ ::= '<'
_TKCLOSE_ ::= '>'

mingodad commented 2 years ago

One of the problematic rules is this one (it is hardcoded):

"\\u" ('0'  [0-9a-fA-F] /  "10") [0-9a-fA-F]{4,4} / [0-9a-fA-F]{4,5}

When replaced by:

"\\u" [0-9a-fA-F]{4,5}

With that change it can parse IdentStart <- [\u0080-\uFFFF]. Another problem, which was my own fault, was COMMA <- ' ' Spacing: one of my search-and-replace passes wiped out the ,.

I fixed my previous post and updated it with the working grammar and EBNF, but I'm still puzzled by this expression: "\\u" ('0' [0-9a-fA-F] / "10") [0-9a-fA-F]{4,4} / [0-9a-fA-F]{4,5}

mingodad commented 2 years ago

Looking carefully once more, I found that the mistake was again mine, made when manually converting this expression (shown here in its corrected form): "\\u" (((('0' [0-9a-fA-F]) / "10") [0-9a-fA-F]{4,4}) / [0-9a-fA-F]{4,5}).
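
One plausible reading of the mistake, sketched below: in PEG, sequence binds tighter than ordered choice, so without the outer parentheses the expression groups as ("\\u" ... {4,4}) / [0-9a-fA-F]{4,5}, and the second alternative loses its "\\u" prefix. The code is a small hand-written matcher, not peglib; the names Pos, lit, hex_rep, plane, grouped and flat are invented for this illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Position just after a successful match, or nullopt on failure.
using Pos = std::optional<std::size_t>;

// Match the literal t at position i.
Pos lit(const std::string& s, std::size_t i, const std::string& t) {
    return s.compare(i, t.size(), t) == 0 ? Pos(i + t.size()) : std::nullopt;
}

// [0-9a-fA-F]{min,max}, greedy.
Pos hex_rep(const std::string& s, std::size_t i, std::size_t min, std::size_t max) {
    std::size_t n = 0;
    while (n < max && i + n < s.size() &&
           std::isxdigit(static_cast<unsigned char>(s[i + n]))) {
        ++n;
    }
    return n >= min ? Pos(i + n) : std::nullopt;
}

// ('0' [0-9a-fA-F] / "10")
Pos plane(const std::string& s, std::size_t i) {
    if (auto p = lit(s, i, "0")) {
        if (auto q = hex_rep(s, *p, 1, 1)) return q;
    }
    return lit(s, i, "10");
}

// Corrected grouping: "\\u" ((plane hex{4,4}) / hex{4,5})
Pos grouped(const std::string& s) {
    auto p = lit(s, 0, "\\u");
    if (!p) return std::nullopt;
    if (auto q = plane(s, *p)) {
        if (auto r = hex_rep(s, *q, 4, 4)) return r;
    }
    return hex_rep(s, *p, 4, 5);  // second alternative still follows "\\u"
}

// Grouping without the outer parentheses: ("\\u" plane hex{4,4}) / hex{4,5}
Pos flat(const std::string& s) {
    if (auto p = lit(s, 0, "\\u")) {
        if (auto q = plane(s, *p)) {
            if (auto r = hex_rep(s, *q, 4, 4)) return r;
        }
    }
    return hex_rep(s, 0, 4, 5);   // second alternative starts without "\\u"
}
```

Under these assumptions, grouped accepts the six characters of \u0080 while flat rejects them, since its fallback alternative expects bare hex digits at the start of the input. That would be consistent with [\u0080-\uFFFF] failing to parse before the parentheses were restored.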

So this manual conversion of the hardcoded grammar in peglib.h turned up two problems: a Comment without a newline at the end of the grammar is rejected, and the rep operator is missing from the README.

Thanks for all the help!

yhirose commented 2 years ago

Thanks for the report. I just added rep to the operator table in the README.