yhirose / cpp-peglib

A single file C++ header-only PEG (Parsing Expression Grammars) library
MIT License
916 stars 113 forks source link

Symbol table extension #228

Closed mingodad closed 2 years ago

mingodad commented 2 years ago

I've just got this parser https://github.com/edubart/lpegrex/blob/main/parsers/c11.lua converted to peg running with https://github.com/mingodad/peg, it needs a hashtable/dicitionary available to be usable by the end users (I'm using khash, included). Nez (https://github.com/ChrisHixon/chpeg/issues/20#issuecomment-1144808438 ) has this:

Extension
   <- LT addExtension S* GT

addExtension
   <- IF       FlagName #If
    / ON        FlagName Expression #On
    / BLOCK     Expression #Block
    / SYMBOL    Expression #Symbol
    / DEF       Expression #Symbol
    / MATCH     Expression #Match
    / IS        TableName #Is
    / ISA       TableName #Isa
    / EXISTS    TableName #Exists
    / LOCAL     TableName Expression #Local
    / Identifier Expression (COMMA Expression)* #Expand
    / (!('>') .)+ #Undefined

PEG/LEG (https://github.com/mingodad/peg) has this:

A special form of the '&' predicate is provided:

       &{ expression }
              In  this  predicate  the  simple C expression (not statement) is
              evaluated immediately when the parser reaches the predicate.  If
              the  expression  yields non-zero (true) the 'match' succeeds and
              the parser continues with the next element in the  pattern.   If
              the  expression  yields  zero  (false) the 'match' fails and the
              parser backs up to look for an alternative parse of the input.

My implementation (see attached) has this:

typedef_name  <-
     &  (  identifier  )  identifier  &{is_typedef(yy, yytext)}

typedef_identifier  <-
     &  (  identifier  )  identifier &{set_typedef(yy, yytext)}

Lpegrex (https://github.com/edubart/lpegrex) has this:

typedef-name <==
  &(identifier => is_typedef) identifier

typedef-identifier <==
  &(identifier => set_typedef) identifier

Can we have something like this in cpp-peglib ?

c11-lpegrex.zip

yhirose commented 2 years ago

Sorry, but I don't understand what you are mentioning here... Could you explain what's needed more concisely?

mingodad commented 2 years ago

The C11 grammar needs a way to set and query an identifier as typedef_name in it's simplest way to only parse valid code a simple dictionary/hashtable would suffice (like I did in the attached zip using khash form klib and was able to parse sqlite3 amalgamation).

Based on your help/suggestion here https://github.com/yhirose/cpp-peglib/issues/229#issuecomment-1168560217 I can see a possible way extending :

InstructionItem = PrecedenceClimbing | ErrorMessage | NoAstOpt

like:

InstructionItem = PrecedenceClimbing | ErrorMessage | NoAstOpt | SetSymbol | ExistsSymbol

Where SetSymbol and ExistsSymbol exposes a simple dictionary/hashtable to store/query for symbols (mostly identifiers) then the C11 parser could be used on the playground.

yhirose commented 2 years ago

Thanks for the explanation. How do you the same with Nez?

mingodad commented 2 years ago

Nez https://github.com/nez-peg/nez seems to be abaneware, I did some attempts to bring it to life but it still need more work, it's main documentation is here https://arxiv.org/pdf/1511.08307.pdf (worth reading) and bellow is a sample that describes it's main elements, in theory it's very powerful (but the implementation is not there yet).

   PEG                Type             Proc.    Description
 ’ ’                Primary          5        Matches text
 []                 Primary          5        Matches character class
 .                  Primary          5        Any character
 A                  Primary          5        Non-terminal application
 (e)                Primary          5        Grouping
 e?                 Unary suffix     4        Option
 e∗                 Unary suffix     4        Zero-or-more repetitions
 e+                 Unary suffix     4        One-or-more repetitions
 &e                 Unary prefix     3        And-predicate
 !e                 Unary prefix     3        Negation
 e1 e2              Binary           2        Sequencing
 e1 /e2             Binary           1        Prioritized Choice
 AST
 {e}                Construction     5        Constructor
 $(e)               Construction     5        Connector
 {$ e }             Construction     5        Left-folding
 #t                 Construction     5        Tagging
 ‘ ‘                Construction     5        Replaces a string
 Symbols
 <symbol A>         Action           5        Symbolize Nonterminal A
 <exists A>         Predicate        5        Exists symbols
 <exists A x>       Predicate        5        Exists x symbol
 <match A>          Predicate        5        Matches symbol
 <is A>             Predicate        5        Equals symbol
 <isa A>            Predicate        5        Contains symbol
 <block e>          Action           5        Nested scope for e
 <local A e>        Action           5        Isolated local scope for e
 Conditional
 <on c e>           Action           5        e on c is defined
 <on !c e>          Action           5        e on c is undefined
 <if c>             Predicate        5        If c is defined
 <if !c>            Predicate        5        If c is undefined

Table 1. Nez operators: ”Action” stands for symbol mutators and
”predicate” stands for symbol predicates.

    In the first contribution of Nez, we model a single extended
parser state, called a symbol table, to describe a variety of context-
sensitive patterns appearing in programming language constructs.

     The key idea behind the symbol table is simple. We allow any
parsed substrings to be handled as symbols, a specialized word             
in another context. The symbol table is a runtime, growing, and              
recursive storage for such symbols. If we handle typedef-defined             
names as symbols for example, we can realize different parsing               
behavior through referencing the symbol table. Nez provides a                
small set of symbol operators, including matching, containment               
testing, and scoping. More importantly, since the symbol table               
is a single, unified data structure for managing various types of            
symbols, we can easily trace state changes and then automate                 
state management when backtracking. In addition, the symbol table            
itself is simply a stack-based data structure; we can translate it into      
any programming language.

    Another important role of the traditional action code is the con-        
struction of ASTs. Basically, PEGs are just a syntactic specifica-           
tion, not a schematic specification for tree structures that represent       
AST nodes. In particular, it is hard to derive the left-associative          
pair of trees due to the limited left recursion in PEGs. In Nez, we          
introduce an operator to specify tree structures, modeled on cap-            
turing in perl compatible regular expressions (PCREs), and extend            
tree manipulations including tagging for typing a tree, labeling for         
sub-nodes, and left-folding for the left-associative structure. Due          
to these extensions, a Nez parser produces arbitrary representations         
of ASTs, which would leads to less impedance mismatching in se-              
mantic analysis.       

    To evaluate the expressiveness of Nez, we have performed ex-             
tensive case studies to specify various popular programming lan-             
guages, including C, C#, Java8, JavaScript, Python, Ruby, and Lua.           
Surprisingly, the introduction of a simple symbol table improves             
the declarative expressiveness of PEGs in a way that Nez can rec-            
ognize almost all of our studied languages and then produce a prac-          
tical representation of ASTs. Our case studies are not complete,             
but they indicate that the Nez approach is promising for a practical         
grammar specification without any action code. 

    At last but not least, parsing performance is an important fac-
tor since the acceptance of a parser generator relies definitively on        
practical performance. In this light, Nez can generate three types           
of differently implemented parsers, based on the traditional source
generation, the grammar translation, and the parsing machine[11].
From the perspective of open grammars, we highlight the latter
parsing machine that enables a PCRE-style parser library that loads          
a grammar at runtime. To achieve practical performance, Nez op-              
erators are assembled into low-level virtual instructions, includ-
ing automated state modifications when backtracking, transactional           
controls of AST construction, and efficient memoization in packrat           
parsing. We will demonstrate that the resulting parsers are portable         
and achieve very competitive performance compared to other ex-               
isting standard parsers for Java, JavaScript, and XML.  

    The remainder of this paper proceeds as follows. Section 2 is an         
informal introduction to the Nez grammar specification language.             
Section 3 is a formal definition of Nez. Section 4 describes elim-           
inating parsing conditions from Nez. Section 5 describes parser              
runtime and implementation. Section 6 reports a summary of our               
case studies. Section 7 studies the parser performance. Section 8            
briefly reviews related work. Section 9 concludes the paper. The             
tools and grammars presented in this paper are available online at           
http://nez-peg.github.io/  

Bellow is a nez grammar usable on cpp-peglib playground (converted from https://github.com/takerusu/language-nez/blob/master/lib/nez.nez) and as input we can use one of the several grammars shown here https://github.com/nez-peg/nez-grammar like this one for C https://github.com/nez-peg/nez-grammar/blob/master/c.nez .

#/* Nez format */

#public
File
   <-  S* (Statement )* #List

#/* Code Layout  */

~_
   <- (S / COMMENT)*

~S
   <- [\t\n\r ]

~COMMENT
   <- '/*' (!('*/') .)* '*/'
   / '//' (!EOL .)* EOL

EOL
   <- '\n'
   / '\r' ('\n')?
   / EOT

EOT
   <- !(.)

#/* Operators */

LT <- '<' _
GT <- '>' _
COMMA <- ',' _
SEMI <- ';' _
ASSIGN <- '=' _
LPAR <- '(' _
LRAR <- ')' _
LBRAK <- '[' _
RBRAK <- ']' _
LCURLY <- '{' !AT _
RCURLY <- '}' _
OCHOICE <- '/' _
VERTBAR <- '|' _
STAR <- '*' _
DOT <- '.' _
AND <- '&' _
NOT <- '!' _
DOLLAR <- '$' _
TILDE <- '~' _
AT <- '@' _
PLUS <- '+' _
QMARK <- '?' _

#/* Keywords */

PUBLIC   <- 'public'   !W _
INLINE   <- 'inline'   !W _
IMPORT   <- 'import'   !W _
FROM     <- 'from'     !W _
GRAMMAR  <- 'grammar'  !W _
EXAMPLE  <- 'example'  !W _
TEMPLATE <- 'template' !W _
FORMAT   <- 'format'   !W _
TRUE     <- 'true'     !W _
FALSE    <- 'false'    !W _

IF       <- 'if'       !W _
ON       <- 'on'       !W _
BLOCK    <- 'block'    !W _
DEF      <- 'def'      !W _
MATCH    <- 'match'    !W _
IS       <- 'is'       !W _
ISA      <- 'isa'      !W _
EXISTS   <- 'exists'   !W _
LOCAL    <- 'local'    !W _
SYMBOL   <- 'symbol'   !W _

#/* reserved */

#TYPE     <- 'type'    !W _
DEFINE   <- 'define'  !W _

KEYWORD
   <- PUBLIC
   / INLINE
   / IMPORT
   / FROM
   / GRAMMAR
   / EXAMPLE
   / TEMPLATE
   / FORMAT
   / TRUE
   / FALSE
   / DEFINE
  #// / TYPE

NAME
   <- !KEYWORD <LETTER W*>

LETTER
   <- [A-Z_a-z$]

W
   <- [0-9A-Z_a-z$]

Identifier
   <- ( NAME  ) _

#/* Statement */

Statement
   <- Document
   / ExampleStatement
   / ImportStatement
   / FormatStatement
   / TemplateStatement
   / Production

ImportStatement
   <-  IMPORT ImportName FROM (Character / String) #Import

ImportName
   <- ( STAR / NAME (DOT ( STAR / NAME ))? )  _

Document
   <- COMMENT (S* COMMENT)* _

#/* Production */

Production
   <- (Modifiers)? (Identifier / String) _SKIP_ ASSIGN (OldExample)* (Expression) #Production

Modifiers
   <- (Modifier)* #Sequence

Modifier
   <- PUBLIC
    / INLINE

_SKIP_
   <- _ANNOTATION_*

_ANNOTATION_
   <- LBRAK _DOC_ RBRAK _

_DOC_
   <- (!(RBRAK / LBRAK) .)* (LBRAK _DOC_ RBRAK _DOC_)?

#/* Expression */

Expression
   <- Sequence ( (OCHOICE Sequence)+ )?

Sequence
   <- Prefix ( (#/* !(SEMI / RuleHead ) */
        Prefix)+ )?

RuleHead
   <- (Modifiers)? (Identifier / String) _ _SKIP_ ASSIGN

Prefix
   <- ( AND #And
        / NOT #Not
        / DOLLAR (Label)? #Link
        / '@[' _ Integer _ RBRAK #Link
        / AT #Link
        / TILDE #Match
        ) Suffix
   / Suffix

Suffix
   <- Primary ( ( STAR (Integer)? #Repetition
        / PLUS #Repetition1
        / QMARK #Option
        ) )? _

Integer
   <-  INT #Integer

Primary
   <-  TRUE #Empty
   /  FALSE #Failure
   / Character
   / Charset
   /  DOT _ #Any
   / <'0x' HEX HEX> _ #Byte
   / <'U+' HEX HEX HEX HEX> _ #Byte
   / LPAR Expression LRAR
   / Constructor
   / Replace
   / Tagging
   / String !ASSIGN
   / Extension
   / NonTerminal !(ASSIGN / '[example:' / '[bad-example:')

NonTerminal
   <-  NAME (DOT NAME)? _ #NonTerminal

Character
   <- <"'"  ('\\\'' / '\\\\' / !['\n\r] .)*  "'"> _ #Character

String
   <- <'"'  ('\\"' / '\\\\' / !["\n\r] .)*  '"'> _ #String

Charset
   <- <LBRAK ((CHAR #Class
        ( '-' ( CHAR
            #Class
            ) #List
            )?))* RBRAK> _ #Class

CHAR
   <- <'\\u' HEX HEX HEX HEX>
   / <'\\x' HEX HEX>
   / '\\n'
   / '\\t'
   / '\\\\'
   / '\\r'
   / '\\v'
   / '\\f'
   / '\\-'
   / '\\]'
   / !(RBRAK) .

HEX
   <- <[0-9A-Fa-f]>

Label
  <-  <NAME (DOT NAME)?> #Label

Constructor
   <-  ('{@' S #LeftNew
        / '{$' (Label)? S #LeftNew
        / LCURLY #New
        ) _ (Expression)? RCURLY _

Tagging
   <- '#'  <LETTER+ W*> _ #Tagging

Replace
   <- '`' ('\\`' / '\\\\' / ![`\n\r] .)*  '`' _ #Replace

Extension
   <- LT addExtension S* GT

addExtension
   <- IF       FlagName #If
    / ON        FlagName Expression #On
    / BLOCK     Expression #Block
    / SYMBOL    Expression #Symbol
    / DEF       Expression #Symbol
    / MATCH     Expression #Match
    / IS        TableName #Is
    / ISA       TableName #Isa
    / EXISTS    TableName #Exists
    / LOCAL     TableName Expression #Local
    / Identifier Expression (COMMA Expression)* #Expand
    / (!('>') .)+ #Undefined

FlagName
   <- (NOT)? <LETTER W*> _ #Name

TableName
   <- <LETTER W*> _ #Name

#/* Other Statements */

TemplateStatement
   <- TEMPLATE Identifier LT TemplateParameter GT ASSIGN Expression _ #Template

TemplateParameter
   <- Identifier (COMMA Identifier)*

OldExample
   <-  LBRAK ('bad-')? 'example:' (!RBRAK .)* RBRAK _ #OldExample

ExampleStatement
   <-  EXAMPLE  NonTerminal (AND NonTerminal)* (TILDE <W+> #Hash
                _ )? addInputText _

addInputText
   <- '\'\'\'' EOL ( (!('\n' '\'\'\'') .)* ) '\n' '\'\'\''
   / '```' EOL ( (!('\n' '```') .)* ) '\n' '```'
   / '"""' EOL ( (!('\n' '"""') .)* ) '\n' '"""'
   / ( (!EOL .)* ) EOL

FormatStatement
   <- FORMAT '#' Identifier LBRAK FormatSize RBRAK '`' Formatter '`' _ #Format

FormatSize
   <- ( STAR / INT ) #Integer

Formatter
   <- #List
        (!('`') ('${' Identifier RCURLY
   / '$[' _ Index _ ( '`' Formatter '`' _ Index _ #Format
        )? RBRAK
   /  ( '$$' '$' / '\\`' '`'
   / (!('$$' / '${' / '$[' / '\\`' / '`') .)+ ) ))*

Index
   <- ('-')? INT #Integer

INT
   <- <DIGIT DIGIT*>

DIGIT
   <- [0-9]
yhirose commented 2 years ago

@mingodad I have just implemented simple builtin symbol table feature. Please see the following section in README. Hope it helps. Thanks!

https://github.com/yhirose/cpp-peglib#symbol-table

Here is an example:

https://github.com/yhirose/cpp-peglib/blob/5934f0abba875073942b929d323923d7a31808f5/test/test2.cc#L864-L901

yhirose commented 2 years ago

If you would like to do the same with your own hash table, you can use the predicate feature.

Here is an easy sample. (This isn't thread-safe...)

https://github.com/yhirose/cpp-peglib/blob/5934f0abba875073942b929d323923d7a31808f5/test/test2.cc#L903-L971

mingodad commented 2 years ago

Thank you for this feature ! But I'm trying to use it on the C11 grammar and it's not working as expected, see bellow a minimal grammar showing the problem:

S            <- (Decl / Ref / typedef)*
Decl         <- 'decl' symbol(Name)
Ref          <- 'ref' is_symbol(Name)
typedef      <- 'typedef' is_symbol(Name) symbol(Name) ';'
Name         <- < [a-zA-Z_][a-zA-Z0-9_]+ >
%whitespace  <- [ \t\r\n]*

# 'var_table' is a table name.
symbol(s)    <- s { declare_symbol var_table } # Declare symbol instruction
is_symbol(s) <- s { check_symbol var_table }   # Check symbol instruction

Input:

decl long
typedef 
    long 
        __off64_t;
typedef 
    __off64_t 
        __loff_t;

Output error:

4:9 'long' already exists.

I was expecting it to parse without errors.

mingodad commented 2 years ago

And when I try this inlined version I'm getting a syntax error:

S            <- (Decl / Ref / typedef)*
Decl         <- 'decl' symbol(Name)
Ref          <- 'ref' is_symbol(Name)
#typedef      <- 'typedef' is_symbol(Name) symbol(Name) ';'
typedef      <- 'typedef' Name { check_symbol var_table } Name { declare_symbol var_table } ';'
Name         <- < [a-zA-Z_][a-zA-Z0-9_]+ >
%whitespace  <- [ \t\r\n]*

# 'var_table' is a table name.
symbol(s)    <- s { declare_symbol var_table } # Declare symbol instruction
is_symbol(s2) <- s2 { check_symbol var_table }   # Check symbol instruction

Output:

5:64 syntax error
mingodad commented 2 years ago

And maybe adding a third form that only add if not exists would help detect errors. Maybe something like:

symbol_unique(s)    <- s { declare_unique_symbol var_table } # Declare unique symbol instruction
mingodad commented 2 years ago

If this new check_symbol or declare_symbol doesn't consume input it should be documented for the end user to know. I did tried this one and it also failed:

S            <- (Decl / Ref / typedef)*
Decl         <- 'decl' symbol(Name)
Ref          <- 'ref' is_symbol(Name)
typedef      <- 'typedef' is_symbol(Name) Name symbol(Name) Name SEMICOL
Name         <- < [a-zA-Z_][a-zA-Z0-9_]+ >
SEMICOL      <- ';'
%whitespace  <- [ \t\r\n]*

# 'var_table' is a table name.
symbol(s)    <- s { declare_symbol var_table } # Declare symbol instruction
is_symbol(s) <- s { check_symbol var_table }   # Check symbol instruction

Output:

4:18 syntax error, unexpected ';', expecting <Name>.
mingodad commented 2 years ago

It seems that the is_symbol/check_symbol is not consuming input and symbol/declare_symbol already check for duplicate and refuse to insert and give the error.

ChrisHixon commented 2 years ago

@mingodad, I had luck with the following modifications to your c99-mouse-perf.chpeg grammar (modified for cpp-peglib):

Declaration <- SymTypedef
             / DeclarationSpecifiers InitDeclaratorList? ExtensionSpecifier* SEMI

SymTypedef <- TYPEDEF TypeSpecifier symbol(&IdNondigit !Keyword IdNondigit IdChar*)

TypedefName <- &is_symbol(&IdNondigit !Keyword IdNondigit IdChar*) Identifier

symbol(s)    <- s { declare_symbol var_table } # Declare symbol instruction
is_symbol(s) <- s { check_symbol var_table }   # Check symbol instruction

I had to put the Identifer inline without Spacing since < ... > token boundary was not taking effect inside the symbol() / is_symbol(). This may not be completely correct for the C language, but seems to demonstrate the concept.

Input:

typedef struct _foo {int bar; } foo;

foo a;
baz b;

Result:

4:1 'baz' doesn't exist.

If you remove the baz line you'll see it picks up foo:

+ TranslationUnit
  + ExternalDeclaration/1
    + Declaration/0
      + SymTypedef
        + TYPEDEF
        + TypeSpecifier/11
          + StructOrUnionSpecifier
            + StructOrUnion/0
              + STRUCT
            - Identifier (_foo)
            + LRWingStructDecl
              + LWING
              + StructDeclaration
                + SpecifierQualifierList/1
                  + TypeSpecifier/3
                    + INT
                + StructDeclaratorList
                  + StructDeclarator/1
                    + Declarator
                      + DirectDeclarator
                        - Identifier (bar)
                + SEMI
              + RWING
        + IdNondigit/0
        + IdChar/0
        + IdChar/0
  + SEMI
  + ExternalDeclaration/1
    + Declaration/1
      + DeclarationSpecifiers/0
        + TypedefName
          - Identifier (foo)
      + InitDeclaratorList
        + InitDeclarator
          + Declarator
            + DirectDeclarator
              - Identifier (a)
      + SEMI
  - EOT ()
mingodad commented 2 years ago

After @ChrisHixon example I changed mine and it worked but the documentation should clarify about consuming/not conusming input.

S            <- (Decl / Ref / typedef)*
Decl         <- 'decl' symbol(Name)
Ref          <- 'ref' is_symbol(Name)
typedef      <- 'typedef' &is_symbol(Name) Name symbol(Name) ';'
Name         <- < [a-zA-Z_][a-zA-Z0-9_]+ >
%whitespace  <- [ \t\r\n]*

# 'var_table' is a table name.
symbol(s)    <- s { declare_symbol var_table } # Declare symbol instruction
is_symbol(s) <- s { check_symbol var_table }   # Check symbol instruction

Input:

decl long
typedef 
    long 
        __off64_t;
typedef 
    __off64_t 
        __loff_t;

Output:

+ S
  - Decl/0[Name] (long)
  + typedef
    - Name (long)
    - Name (__off64_t)
  + typedef
    - Name (__off64_t)
    - Name (__loff_t)
mingodad commented 2 years ago

But for usability it's better if all of then consume input as implied by the example on the README and this one should work:

S            <- (Decl / Ref / typedef)*
Decl         <- 'decl' symbol(Name)
Ref          <- 'ref' is_symbol(Name)
typedef      <- 'typedef' is_symbol(Name) symbol(Name) ';'
Name         <- < [a-zA-Z_][a-zA-Z0-9_]+ >
%whitespace  <- [ \t\r\n]*

# 'var_table' is a table name.
symbol(s)    <- s { declare_symbol var_table } # Declare symbol instruction
is_symbol(s) <- s { check_symbol var_table }   # Check symbol instruction
mingodad commented 2 years ago

After thinking for awhile I tried the one shown bellow and it worked, but if we need a lot of gues/try and error to use it that's a though sell, I still think it should work with the first example.

S            <- (Decl / Ref / typedef)*
Decl         <- 'decl' symbol(Name)
Ref          <- 'ref' is_symbol(Name)
typedef      <- 'typedef' typename symbol(Name) ';'
typename     <- is_symbol(Name)
Name         <- < [a-zA-Z_][a-zA-Z0-9_]+ >
%whitespace  <- [ \t\r\n]*

# 'var_table' is a table name.
symbol(s)    <- s { declare_symbol var_table } # Declare symbol instruction
is_symbol(s) <- s { check_symbol var_table }   # Check symbol instruction

Input:

decl long
typedef 
    long 
        __off64_t;
typedef 
    __off64_t 
        __loff_t;

Output:

+ S
  - Decl/0[Name] (long)
  + typedef
    - typename/0[Name] (long)
    - Name (__off64_t)
  + typedef
    - typename/0[Name] (__off64_t)
    - Name (__loff_t)
yhirose commented 2 years ago

@mingodad, it seems like my symbol implementation with macros doesn't work... I made a change with regular definition rules instead of macros.

image