yhirose / cpp-peglib

A single file C++ header-only PEG (Parsing Expression Grammars) library
MIT License
880 stars 112 forks source link

AST keys/tags/names how to ? #229

Closed mingodad closed 2 years ago

mingodad commented 2 years ago

I'm posting this here because it's a related question: With the bison/yacc grammar shown bellow how could I get an AST like this:

+ input
  + grammar
    + rules_or_grammar_declaration/0[rules]
      - id_colon/0[ID] (identifier_word)
      + COLON
      + rhses
        - rhs/0[ID] (Identifier)
        + rhs/1
    + rules_or_grammar_declaration/0[rules]
      - id_colon/0[ID] (identifier)
      + COLON
      + rhses
        - rhs/0[ID] (identifier_word)
        - prec (SHIFT_THERE)    ####!!!! here is what I want
  - YYEOF ()

Actual AST output:

+ input
  + grammar
    + rules_or_grammar_declaration/0[rules]
      - id_colon/0[ID] (identifier_word)
      + COLON
      + rhses
        - rhs/0[ID] (Identifier)
        + rhs/1
    + rules_or_grammar_declaration/0[rules]
      - id_colon/0[ID] (identifier)
      + COLON
      + rhses
        - rhs/0[ID] (identifier_word)
        - rhs/4[ID] (SHIFT_THERE)    ####!!!!! here is what I'm getting
  - YYEOF ()

input:

%%
identifier_word:                    Identifier                                                  { $$ = $1; }
identifier:                         identifier_word                     %prec SHIFT_THERE

bison/yacc grammar:

# From bison 3.8.2 src/parse-gram.y and src/scan-gram.l
# ../../cpp-peglib0/build/lint/peglint --ast --profile bison.peglib ../../bison-dad/src/parse-gram.y
# ../../cpp-peglib0/build/lint/peglint --ast --profile bison.peglib ../../../A_databases/postgresql-13.3/src/backend/parser/gram.y

input <-
    sp prologue_declaration* "%%" sp grammar epilogue? YYEOF

prologue_declaration <-
    grammar_declaration
    / PROLOGUE
    / ("%<flag>" / "%locations") sp
    / "%define" sp variable value?
    / "%header" sp string_opt?
    / "%error-verbose" sp
    / "%expect"[-_]"rr" sp INT_LITERAL
    / "%expect" sp INT_LITERAL
    / "%file"[-_]"prefix" sp eqopt STRING
    / "%glr"[-_]"parser" sp
    / "%initial"[-_]"action" sp braces
    / "%language" sp STRING
    / "%name"[-_]"prefix" sp ("=" sp)? STRING
    / "%nondeterministic-parser"
    / "%no"[-_]"lines" sp
    / "%output" sp STRING
    / ("%param" / "%parse"[-_]"param" / "%lex"[-_]"param") sp params
    / "%pure"[-_]"parser" sp
    / "%require" sp STRING
    / "%skeleton" sp STRING
    / "%token"[-_]"table" sp
    / "%verbose" sp
    / "%yacc" sp
    / error sp SEMICOLON
    / SEMICOLON

params <-
    braces+

grammar_declaration <-
    symbol_declaration
    / "%start" sp  symbol+
    / code_props_type braces generic_symlist_item*
    / "%default"[-_]"prec" sp
    / "%no"[-_]"default"[-_]"prec" sp
    / "%code" sp ID? braces
    / "%union" sp union_name? braces

code_props_type <-
    "%destructor" sp
    / "%printer" sp

union_name <-
    ID

symbol_declaration <-
    "%nterm" sp nterm_decls
    / "%token" sp token_decls
    / "%term" sp token_decls
    / "%type" sp symbol_decls
    / precedence_declarator token_decls_for_prec

precedence_declarator <-
    "%left" sp
    / "%right" sp
    / "%nonassoc" sp
    / "%precedence" sp

string_opt <-
    STRING

generic_symlist_item <-
    symbol
    / tag

tag_opt <-
    tag

tag <-
    "<" ( "*" / (!">" .)*) ">" sp

nterm_decls <-
    token_decls

token_decls <-
    (tag? token_decl+)+

token_decl <-
    id int_opt? alias?

int_opt <-
    INT_LITERAL sp

alias <-
    string_as_id
    / TSTRING

token_decls_for_prec <-
    (tag? token_decl_for_prec+)+

token_decl_for_prec <-
    id int_opt?
    / string_as_id

symbol_decls <-
    (tag? symbol+)+

grammar <-
    rules_or_grammar_declaration*

rules_or_grammar_declaration <-
    rules
    / grammar_declaration SEMICOLON
    / error SEMICOLON

rules <-
    id_colon named_ref_opt? COLON rhses? SEMICOLON?

rhses <-
    rhs* (PIPE rhs*)*

rhs <-
    symbol named_ref_opt?
    / tag_opt? braces named_ref_opt?
    / "%"? braces
    / "%empty" sp
    / prec
    / "%dprec" sp INT_LITERAL
    / "%merge" sp tag
    / "%expect"[-_]"rr" sp INT_LITERAL
    / "%expect" sp INT_LITERAL

prec <- "%prec" sp symbol

named_ref <-
    '[' sp ID ']' sp

named_ref_opt <-
    named_ref !':'

variable <-
    ID

value <-
    ID
    / STRING
    / braces

id <-
    ID
    / CHAR_LITERAL

id_colon <-
    ID &([:] / named_ref &[:])

symbol <-
    id !':'
    / string_as_id

string_as_id <-
    STRING

~epilogue <-
    "%%" .*

YYEOF <-
    !.

#Tokens

letter <-
    [.a-zA-Z_]

ID <-
    <letter (letter / [-0-9])*> sp

int <-
    [0-9]+ sp

xint <-
    '0'[xX][0-9a-fA-F]+ sp

INT_LITERAL <-
    int
    / xint

eol <-
    [\n][\r]?
    / [\r][\n]?

# UTF-8 Encoded Unicode Code Point, from Flex's documentation.
#mbchar  <-  [\x09\x0A\x0D\x20-\x7E] / [\xC2-\xDF][\x80-\xBF] / \xE0[\xA0-\xBF][\x80-\xBF] / [\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2}) / \xED[\x80-\x9F][\x80-\xBF] / \xF0[\x\90-\xBF]([\x80-\xBF]{2}) / [\xF1-\xF3]([\x80-\xBF]{3}) / \xF4[\x80-\x8F]([\x80-\xBF]{2})

# Zero or more instances of backslash-newline.  Following GCC, allow
#   white space between the backslash and the newline.
splice <-
    ('\\'[ \f\t\v]* eol)*

comment <-
    [/] ([/] (!eol .)* eol? / [*] (!"*/" .)* "*/")

~sp <-
    (
    [ \t\n\r] #[[:space:]]*
    / comment
    )*

# An equal sign, with optional leading whitespaces. This is used in some
#   deprecated constructs.
eqopt <-
    (sp EQUAL)?

COLON <-  ":" sp
EQUAL <-  "=" sp
PIPE <- "|" sp
SEMICOLON <-  ";" sp

~PROLOGUE <- "%{" (!"%}" .)* "%}" sp

# Code in between braces.
~braces <-
    "{" sp <braces_body*> sp "}" sp

braces_body <-
    &[{"'] (braces / STRING)
    / ! '}' .

STRING <-
     ( ['] <( ! ( ['] / eol ) char )*> ['] ) sp
    / ( ["] <( ! ( ["] / eol ) char )*> ["] ) sp

TSTRING <-
    "_(" STRING ")" sp

CHAR_LITERAL <-
    STRING

char <-
     ( '\\' [-abefnrtv'"\[\]\\] )
    / ( '\\' 'x' [0-9A-Fa-f] [0-9A-Fa-f] )
    / ( '\\' 'x' [0-9A-Fa-f] )
    / ( '\\' [0-3] [0-7] [0-7] )
    / ( '\\' [0-7] [0-7]? )
    / ( ! '\\' . )

error <-
    "error" sp
yhirose commented 2 years ago

Thanks for the report. Unfortunately, there is no way to do what you want unless you make your own AST optimizer. It's not too difficult as you see in the default optimizer code.

The default optimizer code can specify no_ast_opt with All mode. By doing this, we can get a similar result. Is it enought?

image
mingodad commented 2 years ago

It's better than before in my point of view.

mingodad commented 2 years ago

With the Lua grammar/input shown bellow I'm getting an unexpected AST (nothing about Return on it): Lua grammar:

# Adapted from https://github.com/edubart/lpegrex/blob/main/parsers/lua.lua

chunk         <- SHEBANG? SKIP Block (!.)

Block         <- ( Label / Return / Break / Goto / Do / While / Repeat / If / 'for' WB (ForNum / ForIn)
                  / FuncDef / 'local' WB (FuncDecl / VarDecl) / Assign / call / ';' SKIP)*
Label         <- '::' NAME '::' SKIP
Return        <- 'return' WB exprlist?
Break         <- 'break' WB
Goto          <- 'goto' WB NAME
Do            <- 'do' WB Block 'end' WB
While         <- 'while' WB expr 'do' WB Block 'end' WB
Repeat        <- 'repeat' WB Block 'until' WB expr
If            <- 'if' WB expr 'then' WB Block ('elseif' WB expr 'then' WB Block)* ('else' WB Block)? 'end' WB
ForNum        <- Id '=' SKIP expr ',' SKIP expr (',' SKIP expr)? 'do' WB Block 'end' WB
ForIn         <- idlist 'in' WB exprlist 'do' WB Block 'end' WB
FuncDef       <- 'function' WB funcname funcbody
FuncDecl      <- 'function' WB Id funcbody
VarDecl       <- iddecllist ('=' SKIP exprlist)?
Assign        <- varlist '=' SKIP exprlist

Number        <- NUMBER SKIP
String        <- STRING SKIP
Boolean       <- 'false' WB / 'true' WB
Nil           <- 'nil' WB
Varargs       <- '...' SKIP
Id            <- NAME
IdDecl        <- NAME ('<' SKIP NAME '>' SKIP)?
Function      <- 'function' WB funcbody
Table         <- '{' SKIP (field (fieldsep field)* fieldsep?)? '}' SKIP
Paren         <- '(' SKIP expr ')' SKIP
Pair          <- '[' SKIP expr ']' SKIP '=' SKIP expr / NAME '=' SKIP expr

Call          <- callargs
CallMethod    <- ':' SKIP NAME callargs
DotIndex      <- '.' SKIP NAME
ColonIndex    <- ':' SKIP NAME
KeyIndex      <- '[' SKIP expr ']' SKIP

indexsuffix   <- DotIndex / KeyIndex
callsuffix    <- Call / CallMethod

var           <- (exprprimary (callsuffix+ indexsuffix / indexsuffix)+) / Id
call          <- (exprprimary (indexsuffix+ callsuffix / callsuffix)+)
exprsuffixed  <- (exprprimary (indexsuffix / callsuffix)*)
funcname      <- (Id DotIndex* ColonIndex?)

funcbody      <- '(' SKIP funcargs ')' SKIP Block 'end' WB
field         <- Pair / expr
fieldsep      <- ',' SKIP / ';' SKIP

callargs      <- '(' SKIP (expr (',' SKIP expr)*)? ')' SKIP / Table / String
idlist        <- Id (',' SKIP Id)*
iddecllist    <- IdDecl (',' SKIP IdDecl)*
funcargs      <- (Id (',' SKIP Id)* (',' SKIP Varargs)? / Varargs)?
exprlist      <- expr (',' SKIP expr)*
varlist       <- var (',' SKIP var)*

opor      <- 'or' WB exprand
opand     <- 'and' WB exprcmp
opcmp     <- ('==' SKIP / '~=' SKIP / '<=' SKIP / '>=' SKIP / '<' SKIP / '>' SKIP) exprbor
opbor     <- '|' SKIP exprbxor
opbxor    <- '~' SKIP exprband
opband    <- '&' SKIP exprbshift
opbshift  <- ('<<' SKIP / '>>' SKIP) exprconcat
opconcat  <- '..' SKIP exprconcat
oparit    <- ('+' SKIP / '-' SKIP) exprfact
opfact    <- ('*' SKIP / '//' SKIP / '/' SKIP / '%' SKIP) exprunary
oppow     <- '^' SKIP exprunary
opunary    <- ('not' WB / '#' SKIP / '-' SKIP / '~' SKIP) exprunary

expr          <- expror
expror        <- (exprand opor*)
exprand       <- (exprcmp opand*)
exprcmp       <- (exprbor opcmp*)
exprbor       <- (exprbxor opbor*)
exprbxor      <- (exprband opbxor*)
exprband      <- (exprbshift opband*)
exprbshift    <- (exprconcat opbshift*)
exprconcat    <- (exprarit opconcat*)
exprarit      <- (exprfact oparit*)
exprfact      <- (exprunary opfact*)
exprunary     <- opunary / exprpow
exprpow       <- (exprsimple oppow*)
exprsimple    <- Nil / Boolean / Number / String / Varargs / Function / Table / exprsuffixed
exprprimary   <- Id / Paren

STRING        <- STRING_SHRT / STRING_LONG
STRING_LONG  <-
      '[' $lsep<'='*> '[' LINEBREAK? <( ! (']' $lsep ']') . )*> ']' $lsep ']'

STRING_SHRT  <-
     $quote<["']> <( ESCAPE_SEQ / ( ! ( $quote / LINEBREAK ) . ) )*> $quote
ESCAPE_SEQ    <- '\\' ESCAPE
ESCAPE        <- [\\'"] /
                  ('n' / 't' / 'r' / 'a' / 'b' / 'v' / 'f') /
                  ('x' HEX_DIGIT HEX_DIGIT) /
                  ('u' '{' HEX_DIGIT HEX_DIGIT+ '}') /
                  ('z' SPACE*) /
                  (DEC_DIGIT DEC_DIGIT !DEC_DIGIT / [012] DEC_DIGIT DEC_DIGIT) /
                  (LINEBREAK)

NUMBER        <- <HEX_NUMBER / DEC_NUMBER>
HEX_NUMBER    <- '0' [xX] HEX_PREFIX ([pP] EXP_DIGITS)?
DEC_NUMBER    <- DEC_PREFIX ([eE] EXP_DIGITS)?
HEX_PREFIX    <- HEX_DIGIT+ ('.' HEX_DIGIT*)? / '.' HEX_DIGIT+
DEC_PREFIX    <- DEC_DIGIT+ ('.' DEC_DIGIT*)? / '.' DEC_DIGIT+
EXP_DIGITS    <- [+-]? DEC_DIGIT+

COMMENT       <- '--' (COMMENT_LONG / COMMENT_SHRT)
COMMENT_LONG  <- STRING_LONG
COMMENT_SHRT  <- (!LINEBREAK .)*

NAME          <- !KEYWORD <NAME_PREFIX NAME_SUFFIX?> SKIP
NAME_PREFIX   <- [_a-zA-Z]
NAME_SUFFIX   <- [_a-zA-Z0-9]+

SHEBANG       <- '#!' (!LINEBREAK .)* LINEBREAK?
~SKIP          <- (SPACE+ / COMMENT)*
~LINEBREAK     <- ('\n' '\r'? / '\r' '\n'?)
SPACE   <-  ' ' / '\t' / LINEBREAK

HEX_DIGIT     <- [0-9a-fA-F]
DEC_DIGIT     <- [0-9]
#EXTRA_TOKENS  <- '[[' '[=' '--' # unused rule, here just to force defining these tokens

~WB <-
    !NAME_PREFIX SKIP

KEYWORD <-
     ('and'
    / 'break'
    / 'do'
    / 'elseif'
    / 'else'
    / 'end'
    / 'false'
    / 'for'
    / 'function'
    / 'goto'
    / 'if'
    / 'in'
    / 'local'
    / 'nil'
    / 'not'
    / 'or'
    / 'repeat'
    / 'return'
    / 'then'
    / 'true'
    / 'until'
    / 'while') WB

Input:

return lpegrex

--[[
The MIT License (MIT)

Copyright (c) 2021 Eduardo Bart
Copyright (c) 2014-2020 Sérgio Medeiros
Copyright (c) 2007-2019 Lua.org, PUC-Rio.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
]]

AST (All):

- chunk/0[NAME] (lpegrex)

AST (Only):

+ chunk
  + Block
    + Return
      + exprlist
        + expr
          + expror
            + exprand
              + exprcmp
                + exprbor
                  + exprbxor
                    + exprband
                      + exprbshift
                        + exprconcat
                          + exprarit
                            + exprfact
                              + exprunary/1
                                + exprpow
                                  + exprsimple/7
                                    + exprsuffixed
                                      + exprprimary/0
                                        + Id
                                          - NAME (lpegrex)
mingodad commented 2 years ago

But with this input then I'm getting something reasonable:

local a
return lpegrex

--[[
The MIT License (MIT)

Copyright (c) 2021 Eduardo Bart
Copyright (c) 2014-2020 Sérgio Medeiros
Copyright (c) 2007-2019 Lua.org, PUC-Rio.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
]]

AST (All):

+ chunk/0[Block]
  - VarDecl/0[NAME] (a)
  - Return/0[NAME] (lpegrex)
yhirose commented 2 years ago

If a node only contains a child, the child will be hoisted. So if you want to see as below, you have to add { no_ast_opt } instruction to Block.

+ chunk/0[Block]
  - Return/0[NAME] (lpegrex)

If you want to control the shape of your AST tree completely, the default AST optimizer shouldn't be used, but you have to create your own AST optimizer instead.

mingodad commented 2 years ago

Thank you for your help ! Could we have an analyzes pass like in C/C++ -Wextra to detect thinks like this and tell the user about it ?

yhirose commented 2 years ago

IMHO, a tool can't judge hoisting is good or not correctly. That's why I put two AST panes in the playground, and I still think it's a user's responsibility.

mingodad commented 2 years ago

I agree with you about user responsibility and a tool can't judge hoisting is good or not correctly but if it can point out areas where mistakes (from user/tool) can happen that would help users on non trivial grammars, that's why I mention similar to -Wextra users often bitten (like me) would appreciate that info.