Closed mingodad closed 2 years ago
While testing with my extracted grammar I noticed that when a grammar ends with a comment that has no trailing newline, the parser rejects it. See culebra.peg, or the grammar shown below from the README on the playground (notice there is no newline after the last line). Removing EndOfLine from the Comment rule fixes the problem and doesn't seem to have negative side effects (see below).
KEYWORD <- 'keyword'
KEYWORDI <- 'case_insensitive_keyword'
WORD <- < [a-zA-Z0-9] [a-zA-Z0-9-_]* > # token boundary operator is used.
IDNET <- < IDENT_START_CHAR IDENT_CHAR* > # token boundary operator is used.
Output:
4:83 syntax error
Current hardcoded grammar:
g["Comment"] <=
seq(chr('#'), zom(seq(npd(g["EndOfLine"]), dot())), g["EndOfLine"]);
Fixed to handle comments not ending in newline:
g["Comment"] <=
seq(chr('#'), zom(seq(npd(g["EndOfLine"]), dot())));
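The difference between the two rules can be seen in a small stand-alone sketch. This is not peglib code: `end_of_line` and `match_comment` are hypothetical hand-written helpers that mimic the two PEG rules, assuming EndOfLine is "\r\n" / '\n' / '\r' as in the extracted grammar. Since Spacing is (Space / Comment)* and Space already includes EndOfLine, the newline left unconsumed by the relaxed rule would presumably still be consumed by the next Space, which may be why the change shows no side effects.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: length of an EndOfLine ("\r\n" / '\n' / '\r') at pos, or 0.
static std::size_t end_of_line(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return 0;
    if (s[pos] == '\r') return (pos + 1 < s.size() && s[pos + 1] == '\n') ? 2 : 1;
    return s[pos] == '\n' ? 1 : 0;
}

// require_eol == true : Comment <- '#' (!EndOfLine .)* EndOfLine
// require_eol == false: Comment <- '#' (!EndOfLine .)*
// Returns the number of characters consumed, or -1 on failure.
static long match_comment(const std::string& s, std::size_t pos, bool require_eol) {
    if (pos >= s.size() || s[pos] != '#') return -1;
    std::size_t i = pos + 1;
    while (i < s.size() && end_of_line(s, i) == 0) ++i;  // (!EndOfLine .)*
    if (require_eol) {
        const std::size_t n = end_of_line(s, i);
        if (n == 0) return -1;  // reached end of input without a newline
        i += n;
    }
    return static_cast<long>(i - pos);
}

// Usage: a trailing comment without a newline only matches the relaxed rule.
static void check_comment_examples() {
    assert(match_comment("# last line", 0, true) == -1);
    assert(match_comment("# last line", 0, false) == 11);
}
```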
Note: I edited this message with the latest fully working manually extracted grammar and the EBNF.
Here is the latest extracted grammar. It has trouble parsing Sum ← List(Product, SumOpe), List(I, D) ← I (D I)* and IdentStart <- !"↑" !"⇑" ([a-zA-Z_%] / [\u0080-\uFFFF]); any help fixing it is appreciated.
# Setup PEG syntax parser
Grammar <- Spacing Definition+ EndOfFile
Definition <-
Ignore IdentCont Parameters LEFTARROW Expression Instruction?
/ Ignore Identifier LEFTARROW Expression Instruction?
Expression <- Sequence (SLASH Sequence)*
Sequence <- (CUT / Prefix)*
Prefix <- (AND / NOT)? SuffixWithLabel
SuffixWithLabel <- Suffix (LABEL Identifier)?
Suffix <- Primary Loop?
Loop <- QUESTION / STAR / PLUS / Repetition
Primary <-
Ignore IdentCont Arguments !LEFTARROW
/ Ignore Identifier !(Parameters? LEFTARROW)
/ OPEN Expression CLOSE
/ BeginTok Expression EndTok
/ BeginCapScope Expression EndCapScope
/ BeginCap Expression EndCap
/ BackRef
/ LiteralI
/ Dictionary
/ Literal
/ NegatedClass
/ Class
/ DOT
Identifier <- IdentCont Spacing
IdentCont <- IdentStart IdentRest*
IdentStart <- !"↑" !"⇑" ([a-zA-Z_%] / [\u0080-\uFFFF])
IdentRest <- IdentStart / [0-9]
Dictionary <- LiteralD (PIPE LiteralD)+
lit_ope <-
['] <(!['] Char)*> ['] Spacing
/ ["] <(!["] Char)*> ["] Spacing
Literal <- lit_ope
LiteralD <- lit_ope
LiteralI <-
['] <(!['] Char)*> "'i" Spacing
/ ["] <(!["] Char)*> '"i' Spacing
# NOTE: The original Brian Ford's paper uses 'zom' instead of 'oom'.
Class <- '[' !'^' <(!']' Range)+> ']' Spacing
NegatedClass <- "[^" <(!']' Range)+> ']' Spacing
Range <- (Char '-' Char) / Char
Char <-
'\\' [nrt'\"[\]\\^]
/ '\\' [0-3] [0-7] [0-7]
/ '\\' [0-7] [0-7]?
/ "\\x" [0-9a-fA-F] [0-9a-fA-F]?
/ "\\u" (((('0' [0-9a-fA-F]) / "10") [0-9a-fA-F]{4,4}) / [0-9a-fA-F]{4,5})
/ !'\\' .
Repetition <- BeginBlacket RepetitionRange EndBlacket
RepetitionRange <-
Number COMMA Number
/ Number COMMA
/ Number
/ COMMA Number
Number <- [0-9]+ Spacing
LEFTARROW <- ("<-" / "←") Spacing
~SLASH <- '/' Spacing
~PIPE <- '|' Spacing
AND <- '&' Spacing
NOT <- '!' Spacing
QUESTION <- '?' Spacing
STAR <- '*' Spacing
PLUS <- '+' Spacing
~OPEN <- '(' Spacing
~CLOSE <- ')' Spacing
DOT <- '.' Spacing
CUT <- "↑" Spacing
~LABEL <- ('^' / "⇑") Spacing
~Spacing <- (Space / Comment)*
Comment <- '#' (!EndOfLine . )*
Space <- ' ' / '\t' / EndOfLine
EndOfLine <- "\r\n" / '\n' / '\r'
EndOfFile <- ! .
~BeginTok <- '<' Spacing
~EndTok <- '>' Spacing
~BeginCapScope <- '$' '(' Spacing
~EndCapScope <- ')' Spacing
BeginCap <- '$' <IdentCont> '<' Spacing
~EndCap <- '>' Spacing
BackRef <- '$' <IdentCont> Spacing
IGNORE <- '~'
Ignore <- IGNORE?
Parameters <- OPEN Identifier (COMMA Identifier)* CLOSE
Arguments <- OPEN Expression (COMMA Expression)* CLOSE
~COMMA <- ',' Spacing
# Instruction grammars
Instruction <-
BeginBlacket (InstructionItem (InstructionItemSeparator InstructionItem)*)? EndBlacket
InstructionItem <- PrecedenceClimbing / ErrorMessage / NoAstOpt
~InstructionItemSeparator <- ';' Spacing
~SpacesZom <- Space*
~SpacesOom <- Space+
~BeginBlacket <- '{' Spacing
~EndBlacket <- '}' Spacing
# PrecedenceClimbing instruction
PrecedenceClimbing <- "precedence" SpacesOom PrecedenceInfo (SpacesOom PrecedenceInfo)* SpacesZom
PrecedenceInfo <- PrecedenceAssoc (~SpacesOom PrecedenceOpe)+
PrecedenceOpe <-
['] <(!(Space / [']) Char)*> [']
/ ["] <(!(Space / ["]) Char)*> ["]
/ <(!(PrecedenceAssoc / Space / '}') . )+>
PrecedenceAssoc <- [LR]
# Error message instruction
ErrorMessage <- "message" SpacesOom LiteralD SpacesZom
# No AST node optimization instruction
NoAstOpt <- "no_ast_opt" SpacesZom
And here it is converted to EBNF, to be viewed at https://www.bottlecaps.de/rr/ui:
//# Setup PEG syntax parser
Grammar::= Spacing Definition+ EndOfFile
Definition::=
Ignore IdentCont Parameters LEFTARROW Expression Instruction?
| Ignore Identifier LEFTARROW Expression Instruction?
Expression::= Sequence (SLASH Sequence)*
Sequence::= (CUT | Prefix)*
Prefix::= (AND | NOT)? SuffixWithLabel
SuffixWithLabel::= Suffix (LABEL Identifier)?
Suffix::= Primary Loop?
Loop::= QUESTION | STAR | PLUS | Repetition
Primary::=
Ignore IdentCont Arguments _NOT_ LEFTARROW
| Ignore Identifier _NOT_ (Parameters? LEFTARROW)
| OPEN Expression CLOSE
| BeginTok Expression EndTok
| BeginCapScope Expression EndCapScope
| BeginCap Expression EndCap
| BackRef
| LiteralI
| Dictionary
| Literal
| NegatedClass
| Class
| DOT
Identifier::= IdentCont Spacing
IdentCont::= IdentStart IdentRest*
IdentStart::= _NOT_ "↑" _NOT_ "⇑" ([a-zA-Z_%] | [\u0080-\uFFFF])
IdentRest::= IdentStart | [0-9]
Dictionary::= LiteralD (PIPE LiteralD)+
lit_ope::=
['] _TKOPEN_ (_NOT_ ['] Char)* _TKCLOSE_ ['] Spacing
| ["] _TKOPEN_ (_NOT_ ["] Char)* _TKCLOSE_ ["] Spacing
Literal::= lit_ope
LiteralD::= lit_ope
LiteralI::=
['] _TKOPEN_ (_NOT_ ['] Char)* _TKCLOSE_ "'i" Spacing
| ["] _TKOPEN_ (_NOT_ ["] Char)* _TKCLOSE_ '"i' Spacing
//# NOTE: The original Brian Ford's paper uses 'zom' instead of 'oom'.
Class::= '[' _NOT_ '^' _TKOPEN_ ( _NOT_ ']' Range)+ _TKCLOSE_ ']' Spacing
NegatedClass::= "[^" _TKOPEN_ ( _NOT_ ']' Range)+ _TKCLOSE_ ']' Spacing
Range::= (Char '-' Char) | Char
Char::=
'\\' [nrt'\"#x5B\#x5d\\^]
| '\\' [0-3] [0-7] [0-7]
| '\\' [0-7] [0-7]?
| "\\x" [0-9a-fA-F] [0-9a-fA-F]?
| "\\u" (((('0' [0-9a-fA-F]) | "10") [0-9a-fA-F]'{4,4}') | [0-9a-fA-F]'{4,5}')
| _NOT_ '\\' .
Repetition::= BeginBlacket RepetitionRange EndBlacket
RepetitionRange::=
Number COMMA Number
| Number COMMA
| Number
| COMMA Number
Number::= [0-9]+ Spacing
LEFTARROW::= ("<-" | "←") Spacing
/*~*/SLASH::= '/' Spacing
/*~*/PIPE::= '|' Spacing
AND::= '&' Spacing
NOT::= '!' Spacing
QUESTION::= '?' Spacing
STAR::= '*' Spacing
PLUS::= '+' Spacing
/*~*/OPEN::= '(' Spacing
/*~*/CLOSE::= ')' Spacing
DOT::= '.' Spacing
CUT::= "↑" Spacing
/*~*/LABEL::= ('^' | "⇑") Spacing
/*~*/Spacing::= (Space | Comment)*
Comment::= '#' (_NOT_ EndOfLine . )*
Space::= ' ' | '\t' | EndOfLine
EndOfLine::= "\r\n" | '\n' | '\r'
EndOfFile::= _NOT_ .
/*~*/BeginTok::= '<' Spacing
/*~*/EndTok::= '>' Spacing
/*~*/BeginCapScope::= '$' '(' Spacing
/*~*/EndCapScope::= ')' Spacing
BeginCap::= '$' _TKOPEN_ IdentCont _TKCLOSE_ '<' Spacing
/*~*/EndCap::= '>' Spacing
BackRef::= '$' _TKOPEN_ IdentCont _TKCLOSE_ Spacing
IGNORE::= '~'
Ignore::= IGNORE?
Parameters::= OPEN Identifier (COMMA Identifier)* CLOSE
Arguments::= OPEN Expression (COMMA Expression)* CLOSE
/*~*/COMMA::= ',' Spacing
//# Instruction grammars
Instruction::=
BeginBlacket (InstructionItem (InstructionItemSeparator InstructionItem)*)? EndBlacket
InstructionItem::= PrecedenceClimbing | ErrorMessage | NoAstOpt
/*~*/InstructionItemSeparator::= ';' Spacing
/*~*/SpacesZom::= Space*
/*~*/SpacesOom::= Space+
/*~*/BeginBlacket::= '{' Spacing
/*~*/EndBlacket::= '}' Spacing
//# PrecedenceClimbing instruction
PrecedenceClimbing::= "precedence" SpacesOom PrecedenceInfo (SpacesOom PrecedenceInfo)* SpacesZom
PrecedenceInfo::= PrecedenceAssoc (/*~*/SpacesOom PrecedenceOpe)+
PrecedenceOpe::=
['] _TKOPEN_ (_NOT_ (Space | [']) Char)* _TKCLOSE_ [']
| ["] _TKOPEN_ (_NOT_ (Space | ["]) Char)* _TKCLOSE_ ["]
| _TKOPEN_ (_NOT_ (PrecedenceAssoc | Space | '}') . )+ _TKCLOSE_
PrecedenceAssoc::= [LR]
//# Error message instruction
ErrorMessage::= "message" SpacesOom LiteralD SpacesZom
//# No AST node optimization instruction
NoAstOpt::= "no_ast_opt" SpacesZom
//Tokens added for EBNF
_NOT_ ::= '!'
_TKOPEN_ ::= '<'
_TKCLOSE_ ::= '>'
One of the problematic rules is this one (it is hardcoded):
"\\u" ('0' [0-9a-fA-F] / "10") [0-9a-fA-F]{4,4} / [0-9a-fA-F]{4,5}
When replaced by:
"\\u" [0-9a-fA-F]{4,5}
Then it succeeds in parsing IdentStart <- [\u0080-\uFFFF]. Another problem I found, which was my own fault, was COMMA <- ' ' Spacing: in one of my search-and-replace passes I had wiped out the , character.
I fixed that and updated my previous post with the working grammar and EBNF, but I was still puzzled by this expression: "\\u" ('0' [0-9a-fA-F] / "10") [0-9a-fA-F]{4,4} / [0-9a-fA-F]{4,5}
Looking carefully again, I found another mistake of mine in the manual conversion of this expression. It is shown here correctly: "\\u" (((('0' [0-9a-fA-F]) / "10") [0-9a-fA-F]{4,4}) / [0-9a-fA-F]{4,5})
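Why the grouping matters can be sketched with hand-written matchers. In PEG, sequence binds tighter than ordered choice, so without the outer parentheses the second alternative [0-9a-fA-F]{4,5} no longer follows the "\\u" prefix. The code below is an illustration only, not peglib code; `lit`, `hex_rep`, `six_digit_form`, `match_grouped` and `match_ungrouped` are hypothetical names, and the assumption is that the parse failure came from this precedence difference.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Every matcher returns the position after the match, or npos on failure.
static const std::size_t npos = std::string::npos;

// Literal string t at pos.
static std::size_t lit(const std::string& s, std::size_t pos, const std::string& t) {
    if (pos > s.size()) return npos;
    return s.compare(pos, t.size(), t) == 0 ? pos + t.size() : npos;
}

// [0-9a-fA-F]{min,max}: greedy bounded repetition.
static std::size_t hex_rep(const std::string& s, std::size_t pos,
                           std::size_t min, std::size_t max) {
    std::size_t n = 0;
    while (n < max && pos + n < s.size() &&
           std::isxdigit(static_cast<unsigned char>(s[pos + n]))) ++n;
    return n >= min ? pos + n : npos;
}

// ('0' [0-9a-fA-F] / "10") [0-9a-fA-F]{4,4}
static std::size_t six_digit_form(const std::string& s, std::size_t pos) {
    std::size_t p = lit(s, pos, "0");
    if (p != npos) p = hex_rep(s, p, 1, 1);
    if (p == npos) p = lit(s, pos, "10");  // second alternative of the inner choice
    return p == npos ? npos : hex_rep(s, p, 4, 4);
}

// Grouped: "\\u" ((('0' hex / "10") hex{4,4}) / hex{4,5})
static std::size_t match_grouped(const std::string& s) {
    const std::size_t p = lit(s, 0, "\\u");
    if (p == npos) return npos;
    const std::size_t q = six_digit_form(s, p);
    return q != npos ? q : hex_rep(s, p, 4, 5);
}

// Ungrouped: ("\\u" ('0' hex / "10") hex{4,4}) / hex{4,5}
static std::size_t match_ungrouped(const std::string& s) {
    std::size_t p = lit(s, 0, "\\u");
    if (p != npos) p = six_digit_form(s, p);
    return p != npos ? p : hex_rep(s, 0, 4, 5);  // retried WITHOUT the "\\u" prefix
}

// Usage: a four-digit escape only matches the grouped form.
static void check_escape_examples() {
    assert(match_grouped("\\u0080") == 6);
    assert(match_ungrouped("\\u0080") == npos);
}
```

Six-digit escapes such as \u10FFFF match either way, because they are handled by the first alternative; only the four- and five-digit forms depend on the outer grouping.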
So this manual conversion of the hardcoded grammar in peglib.h turned up two problems: the Comment rule failing when there is no newline at the end of the grammar, and the rep operator missing from the README.
Thanks for all the help!
Thanks for the report. I just added rep to the operator table in the README.
I'm still trying to extract the peglib grammar (why isn't it already available?) and found that the rep operator used in it is not listed in the README like the others.
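For readers unfamiliar with the repetition syntax, here is a rough sketch of how the four RepetitionRange forms from the extracted grammar (Number COMMA Number, Number COMMA, Number, COMMA Number) could map to a (min, max) pair. `parse_range` is a hypothetical helper, not part of peglib, and the meanings assumed here ({n} as exactly n, {n,} as n or more, {,m} as at most m) are a guess based on the usual convention, so check them against the README operator table.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <utility>

// Hypothetical helper: turns the text between '{' and '}' into (min, max).
// Assumes well-formed input such as "4,5", "4,", "4" or ",5".
static std::pair<std::size_t, std::size_t> parse_range(const std::string& body) {
    const std::size_t unbounded = std::numeric_limits<std::size_t>::max();
    const std::size_t comma = body.find(',');
    if (comma == std::string::npos) {
        const std::size_t n = std::stoul(body);  // Number: assumed to mean exactly n
        return {n, n};
    }
    const std::string lo = body.substr(0, comma);
    const std::string hi = body.substr(comma + 1);
    const std::size_t min = lo.empty() ? 0 : std::stoul(lo);          // COMMA Number
    const std::size_t max = hi.empty() ? unbounded : std::stoul(hi);  // Number COMMA
    return {min, max};
}

// Usage: the {4,5} from the \u escape rule becomes min 4, max 5.
static void check_range_examples() {
    const std::pair<std::size_t, std::size_t> r = parse_range("4,5");
    assert(r.first == 4 && r.second == 5);
}
```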