Closed mingodad closed 2 years ago
Sorry, but I don't understand what you are mentioning here... Could you explain what's needed more concisely?
The C11
grammar needs a way to set and query an identifier
as typedef_name
in it's simplest way to only parse valid code a simple dictionary/hashtable would suffice (like I did in the attached zip using khash
form klib
and was able to parse sqlite3
amalgamation).
Based on your help/suggestion here https://github.com/yhirose/cpp-peglib/issues/229#issuecomment-1168560217 I can see a possible way extending :
InstructionItem = PrecedenceClimbing | ErrorMessage | NoAstOpt
like:
InstructionItem = PrecedenceClimbing | ErrorMessage | NoAstOpt | SetSymbol | ExistsSymbol
Where SetSymbol
and ExistsSymbol
exposes a simple dictionary/hashtable to store/query for symbols (mostly identifiers) then the C11
parser could be used on the playground.
Thanks for the explanation. How do you the same with Nez?
Nez https://github.com/nez-peg/nez seems to be abaneware, I did some attempts to bring it to life but it still need more work, it's main documentation is here https://arxiv.org/pdf/1511.08307.pdf (worth reading) and bellow is a sample that describes it's main elements, in theory it's very powerful (but the implementation is not there yet).
PEG Type Proc. Description
’ ’ Primary 5 Matches text
[] Primary 5 Matches character class
. Primary 5 Any character
A Primary 5 Non-terminal application
(e) Primary 5 Grouping
e? Unary suffix 4 Option
e∗ Unary suffix 4 Zero-or-more repetitions
e+ Unary suffix 4 One-or-more repetitions
&e Unary prefix 3 And-predicate
!e Unary prefix 3 Negation
e1 e2 Binary 2 Sequencing
e1 /e2 Binary 1 Prioritized Choice
AST
{e} Construction 5 Constructor
$(e) Construction 5 Connector
{$ e } Construction 5 Left-folding
#t Construction 5 Tagging
‘ ‘ Construction 5 Replaces a string
Symbols
<symbol A> Action 5 Symbolize Nonterminal A
<exists A> Predicate 5 Exists symbols
<exists A x> Predicate 5 Exists x symbol
<match A> Predicate 5 Matches symbol
<is A> Predicate 5 Equals symbol
<isa A> Predicate 5 Contains symbol
<block e> Action 5 Nested scope for e
<local A e> Action 5 Isolated local scope for e
Conditional
<on c e> Action 5 e on c is defined
<on !c e> Action 5 e on c is undefined
<if c> Predicate 5 If c is defined
<if !c> Predicate 5 If c is undefined
Table 1. Nez operators: ”Action” stands for symbol mutators and
”predicate” stands for symbol predicates.
In the first contribution of Nez, we model a single extended
parser state, called a symbol table, to describe a variety of context-
sensitive patterns appearing in programming language constructs.
The key idea behind the symbol table is simple. We allow any
parsed substrings to be handled as symbols, a specialized word
in another context. The symbol table is a runtime, growing, and
recursive storage for such symbols. If we handle typedef-defined
names as symbols for example, we can realize different parsing
behavior through referencing the symbol table. Nez provides a
small set of symbol operators, including matching, containment
testing, and scoping. More importantly, since the symbol table
is a single, unified data structure for managing various types of
symbols, we can easily trace state changes and then automate
state management when backtracking. In addition, the symbol table
itself is simply a stack-based data structure; we can translate it into
any programming language.
Another important role of the traditional action code is the con-
struction of ASTs. Basically, PEGs are just a syntactic specifica-
tion, not a schematic specification for tree structures that represent
AST nodes. In particular, it is hard to derive the left-associative
pair of trees due to the limited left recursion in PEGs. In Nez, we
introduce an operator to specify tree structures, modeled on cap-
turing in perl compatible regular expressions (PCREs), and extend
tree manipulations including tagging for typing a tree, labeling for
sub-nodes, and left-folding for the left-associative structure. Due
to these extensions, a Nez parser produces arbitrary representations
of ASTs, which would leads to less impedance mismatching in se-
mantic analysis.
To evaluate the expressiveness of Nez, we have performed ex-
tensive case studies to specify various popular programming lan-
guages, including C, C#, Java8, JavaScript, Python, Ruby, and Lua.
Surprisingly, the introduction of a simple symbol table improves
the declarative expressiveness of PEGs in a way that Nez can rec-
ognize almost all of our studied languages and then produce a prac-
tical representation of ASTs. Our case studies are not complete,
but they indicate that the Nez approach is promising for a practical
grammar specification without any action code.
At last but not least, parsing performance is an important fac-
tor since the acceptance of a parser generator relies definitively on
practical performance. In this light, Nez can generate three types
of differently implemented parsers, based on the traditional source
generation, the grammar translation, and the parsing machine[11].
From the perspective of open grammars, we highlight the latter
parsing machine that enables a PCRE-style parser library that loads
a grammar at runtime. To achieve practical performance, Nez op-
erators are assembled into low-level virtual instructions, includ-
ing automated state modifications when backtracking, transactional
controls of AST construction, and efficient memoization in packrat
parsing. We will demonstrate that the resulting parsers are portable
and achieve very competitive performance compared to other ex-
isting standard parsers for Java, JavaScript, and XML.
The remainder of this paper proceeds as follows. Section 2 is an
informal introduction to the Nez grammar specification language.
Section 3 is a formal definition of Nez. Section 4 describes elim-
inating parsing conditions from Nez. Section 5 describes parser
runtime and implementation. Section 6 reports a summary of our
case studies. Section 7 studies the parser performance. Section 8
briefly reviews related work. Section 9 concludes the paper. The
tools and grammars presented in this paper are available online at
http://nez-peg.github.io/
Bellow is a nez grammar usable on cpp-peglib
playground (converted from https://github.com/takerusu/language-nez/blob/master/lib/nez.nez) and as input we can use one of the several grammars shown here https://github.com/nez-peg/nez-grammar like this one for C
https://github.com/nez-peg/nez-grammar/blob/master/c.nez .
#/* Nez format */
#public
File
<- S* (Statement )* #List
#/* Code Layout */
~_
<- (S / COMMENT)*
~S
<- [\t\n\r ]
~COMMENT
<- '/*' (!('*/') .)* '*/'
/ '//' (!EOL .)* EOL
EOL
<- '\n'
/ '\r' ('\n')?
/ EOT
EOT
<- !(.)
#/* Operators */
LT <- '<' _
GT <- '>' _
COMMA <- ',' _
SEMI <- ';' _
ASSIGN <- '=' _
LPAR <- '(' _
LRAR <- ')' _
LBRAK <- '[' _
RBRAK <- ']' _
LCURLY <- '{' !AT _
RCURLY <- '}' _
OCHOICE <- '/' _
VERTBAR <- '|' _
STAR <- '*' _
DOT <- '.' _
AND <- '&' _
NOT <- '!' _
DOLLAR <- '$' _
TILDE <- '~' _
AT <- '@' _
PLUS <- '+' _
QMARK <- '?' _
#/* Keywords */
PUBLIC <- 'public' !W _
INLINE <- 'inline' !W _
IMPORT <- 'import' !W _
FROM <- 'from' !W _
GRAMMAR <- 'grammar' !W _
EXAMPLE <- 'example' !W _
TEMPLATE <- 'template' !W _
FORMAT <- 'format' !W _
TRUE <- 'true' !W _
FALSE <- 'false' !W _
IF <- 'if' !W _
ON <- 'on' !W _
BLOCK <- 'block' !W _
DEF <- 'def' !W _
MATCH <- 'match' !W _
IS <- 'is' !W _
ISA <- 'isa' !W _
EXISTS <- 'exists' !W _
LOCAL <- 'local' !W _
SYMBOL <- 'symbol' !W _
#/* reserved */
#TYPE <- 'type' !W _
DEFINE <- 'define' !W _
KEYWORD
<- PUBLIC
/ INLINE
/ IMPORT
/ FROM
/ GRAMMAR
/ EXAMPLE
/ TEMPLATE
/ FORMAT
/ TRUE
/ FALSE
/ DEFINE
#// / TYPE
NAME
<- !KEYWORD <LETTER W*>
LETTER
<- [A-Z_a-z$]
W
<- [0-9A-Z_a-z$]
Identifier
<- ( NAME ) _
#/* Statement */
Statement
<- Document
/ ExampleStatement
/ ImportStatement
/ FormatStatement
/ TemplateStatement
/ Production
ImportStatement
<- IMPORT ImportName FROM (Character / String) #Import
ImportName
<- ( STAR / NAME (DOT ( STAR / NAME ))? ) _
Document
<- COMMENT (S* COMMENT)* _
#/* Production */
Production
<- (Modifiers)? (Identifier / String) _SKIP_ ASSIGN (OldExample)* (Expression) #Production
Modifiers
<- (Modifier)* #Sequence
Modifier
<- PUBLIC
/ INLINE
_SKIP_
<- _ANNOTATION_*
_ANNOTATION_
<- LBRAK _DOC_ RBRAK _
_DOC_
<- (!(RBRAK / LBRAK) .)* (LBRAK _DOC_ RBRAK _DOC_)?
#/* Expression */
Expression
<- Sequence ( (OCHOICE Sequence)+ )?
Sequence
<- Prefix ( (#/* !(SEMI / RuleHead ) */
Prefix)+ )?
RuleHead
<- (Modifiers)? (Identifier / String) _ _SKIP_ ASSIGN
Prefix
<- ( AND #And
/ NOT #Not
/ DOLLAR (Label)? #Link
/ '@[' _ Integer _ RBRAK #Link
/ AT #Link
/ TILDE #Match
) Suffix
/ Suffix
Suffix
<- Primary ( ( STAR (Integer)? #Repetition
/ PLUS #Repetition1
/ QMARK #Option
) )? _
Integer
<- INT #Integer
Primary
<- TRUE #Empty
/ FALSE #Failure
/ Character
/ Charset
/ DOT _ #Any
/ <'0x' HEX HEX> _ #Byte
/ <'U+' HEX HEX HEX HEX> _ #Byte
/ LPAR Expression LRAR
/ Constructor
/ Replace
/ Tagging
/ String !ASSIGN
/ Extension
/ NonTerminal !(ASSIGN / '[example:' / '[bad-example:')
NonTerminal
<- NAME (DOT NAME)? _ #NonTerminal
Character
<- <"'" ('\\\'' / '\\\\' / !['\n\r] .)* "'"> _ #Character
String
<- <'"' ('\\"' / '\\\\' / !["\n\r] .)* '"'> _ #String
Charset
<- <LBRAK ((CHAR #Class
( '-' ( CHAR
#Class
) #List
)?))* RBRAK> _ #Class
CHAR
<- <'\\u' HEX HEX HEX HEX>
/ <'\\x' HEX HEX>
/ '\\n'
/ '\\t'
/ '\\\\'
/ '\\r'
/ '\\v'
/ '\\f'
/ '\\-'
/ '\\]'
/ !(RBRAK) .
HEX
<- <[0-9A-Fa-f]>
Label
<- <NAME (DOT NAME)?> #Label
Constructor
<- ('{@' S #LeftNew
/ '{$' (Label)? S #LeftNew
/ LCURLY #New
) _ (Expression)? RCURLY _
Tagging
<- '#' <LETTER+ W*> _ #Tagging
Replace
<- '`' ('\\`' / '\\\\' / ![`\n\r] .)* '`' _ #Replace
Extension
<- LT addExtension S* GT
addExtension
<- IF FlagName #If
/ ON FlagName Expression #On
/ BLOCK Expression #Block
/ SYMBOL Expression #Symbol
/ DEF Expression #Symbol
/ MATCH Expression #Match
/ IS TableName #Is
/ ISA TableName #Isa
/ EXISTS TableName #Exists
/ LOCAL TableName Expression #Local
/ Identifier Expression (COMMA Expression)* #Expand
/ (!('>') .)+ #Undefined
FlagName
<- (NOT)? <LETTER W*> _ #Name
TableName
<- <LETTER W*> _ #Name
#/* Other Statements */
TemplateStatement
<- TEMPLATE Identifier LT TemplateParameter GT ASSIGN Expression _ #Template
TemplateParameter
<- Identifier (COMMA Identifier)*
OldExample
<- LBRAK ('bad-')? 'example:' (!RBRAK .)* RBRAK _ #OldExample
ExampleStatement
<- EXAMPLE NonTerminal (AND NonTerminal)* (TILDE <W+> #Hash
_ )? addInputText _
addInputText
<- '\'\'\'' EOL ( (!('\n' '\'\'\'') .)* ) '\n' '\'\'\''
/ '```' EOL ( (!('\n' '```') .)* ) '\n' '```'
/ '"""' EOL ( (!('\n' '"""') .)* ) '\n' '"""'
/ ( (!EOL .)* ) EOL
FormatStatement
<- FORMAT '#' Identifier LBRAK FormatSize RBRAK '`' Formatter '`' _ #Format
FormatSize
<- ( STAR / INT ) #Integer
Formatter
<- #List
(!('`') ('${' Identifier RCURLY
/ '$[' _ Index _ ( '`' Formatter '`' _ Index _ #Format
)? RBRAK
/ ( '$$' '$' / '\\`' '`'
/ (!('$$' / '${' / '$[' / '\\`' / '`') .)+ ) ))*
Index
<- ('-')? INT #Integer
INT
<- <DIGIT DIGIT*>
DIGIT
<- [0-9]
@mingodad I have just implemented simple builtin symbol table feature. Please see the following section in README. Hope it helps. Thanks!
https://github.com/yhirose/cpp-peglib#symbol-table
Here is an example:
If you would like to do the same with your own hash table, you can use the predicate feature.
Here is an easy sample. (This isn't thread-safe...)
Thank you for this feature !
But I'm trying to use it on the C11
grammar and it's not working as expected, see bellow a minimal grammar showing the problem:
S <- (Decl / Ref / typedef)*
Decl <- 'decl' symbol(Name)
Ref <- 'ref' is_symbol(Name)
typedef <- 'typedef' is_symbol(Name) symbol(Name) ';'
Name <- < [a-zA-Z_][a-zA-Z0-9_]+ >
%whitespace <- [ \t\r\n]*
# 'var_table' is a table name.
symbol(s) <- s { declare_symbol var_table } # Declare symbol instruction
is_symbol(s) <- s { check_symbol var_table } # Check symbol instruction
Input:
decl long
typedef
long
__off64_t;
typedef
__off64_t
__loff_t;
Output error:
4:9 'long' already exists.
I was expecting it to parse without errors.
And when I try this inlined version I'm getting a syntax error:
S <- (Decl / Ref / typedef)*
Decl <- 'decl' symbol(Name)
Ref <- 'ref' is_symbol(Name)
#typedef <- 'typedef' is_symbol(Name) symbol(Name) ';'
typedef <- 'typedef' Name { check_symbol var_table } Name { declare_symbol var_table } ';'
Name <- < [a-zA-Z_][a-zA-Z0-9_]+ >
%whitespace <- [ \t\r\n]*
# 'var_table' is a table name.
symbol(s) <- s { declare_symbol var_table } # Declare symbol instruction
is_symbol(s2) <- s2 { check_symbol var_table } # Check symbol instruction
Output:
5:64 syntax error
And maybe adding a third form that only add if not exists would help detect errors. Maybe something like:
symbol_unique(s) <- s { declare_unique_symbol var_table } # Declare unique symbol instruction
If this new check_symbol
or declare_symbol
doesn't consume input it should be documented for the end user to know.
I did tried this one and it also failed:
S <- (Decl / Ref / typedef)*
Decl <- 'decl' symbol(Name)
Ref <- 'ref' is_symbol(Name)
typedef <- 'typedef' is_symbol(Name) Name symbol(Name) Name SEMICOL
Name <- < [a-zA-Z_][a-zA-Z0-9_]+ >
SEMICOL <- ';'
%whitespace <- [ \t\r\n]*
# 'var_table' is a table name.
symbol(s) <- s { declare_symbol var_table } # Declare symbol instruction
is_symbol(s) <- s { check_symbol var_table } # Check symbol instruction
Output:
4:18 syntax error, unexpected ';', expecting <Name>.
It seems that the is_symbol/check_symbol
is not consuming input and symbol/declare_symbol
already check for duplicate and refuse to insert and give the error.
@mingodad, I had luck with the following modifications to your c99-mouse-perf.chpeg grammar (modified for cpp-peglib):
Declaration <- SymTypedef
/ DeclarationSpecifiers InitDeclaratorList? ExtensionSpecifier* SEMI
SymTypedef <- TYPEDEF TypeSpecifier symbol(&IdNondigit !Keyword IdNondigit IdChar*)
TypedefName <- &is_symbol(&IdNondigit !Keyword IdNondigit IdChar*) Identifier
symbol(s) <- s { declare_symbol var_table } # Declare symbol instruction
is_symbol(s) <- s { check_symbol var_table } # Check symbol instruction
I had to put the Identifer inline without Spacing since < ... >
token boundary was not taking effect inside the symbol()
/ is_symbol()
.
This may not be completely correct for the C language, but seems to demonstrate the concept.
Input:
typedef struct _foo {int bar; } foo;
foo a;
baz b;
Result:
4:1 'baz' doesn't exist.
If you remove the baz line you'll see it picks up foo:
+ TranslationUnit
+ ExternalDeclaration/1
+ Declaration/0
+ SymTypedef
+ TYPEDEF
+ TypeSpecifier/11
+ StructOrUnionSpecifier
+ StructOrUnion/0
+ STRUCT
- Identifier (_foo)
+ LRWingStructDecl
+ LWING
+ StructDeclaration
+ SpecifierQualifierList/1
+ TypeSpecifier/3
+ INT
+ StructDeclaratorList
+ StructDeclarator/1
+ Declarator
+ DirectDeclarator
- Identifier (bar)
+ SEMI
+ RWING
+ IdNondigit/0
+ IdChar/0
+ IdChar/0
+ SEMI
+ ExternalDeclaration/1
+ Declaration/1
+ DeclarationSpecifiers/0
+ TypedefName
- Identifier (foo)
+ InitDeclaratorList
+ InitDeclarator
+ Declarator
+ DirectDeclarator
- Identifier (a)
+ SEMI
- EOT ()
After @ChrisHixon example I changed mine and it worked but the documentation should clarify about consuming/not conusming
input.
S <- (Decl / Ref / typedef)*
Decl <- 'decl' symbol(Name)
Ref <- 'ref' is_symbol(Name)
typedef <- 'typedef' &is_symbol(Name) Name symbol(Name) ';'
Name <- < [a-zA-Z_][a-zA-Z0-9_]+ >
%whitespace <- [ \t\r\n]*
# 'var_table' is a table name.
symbol(s) <- s { declare_symbol var_table } # Declare symbol instruction
is_symbol(s) <- s { check_symbol var_table } # Check symbol instruction
Input:
decl long
typedef
long
__off64_t;
typedef
__off64_t
__loff_t;
Output:
+ S
- Decl/0[Name] (long)
+ typedef
- Name (long)
- Name (__off64_t)
+ typedef
- Name (__off64_t)
- Name (__loff_t)
But for usability it's better if all of then consume input as implied by the example on the README and this one should work:
S <- (Decl / Ref / typedef)*
Decl <- 'decl' symbol(Name)
Ref <- 'ref' is_symbol(Name)
typedef <- 'typedef' is_symbol(Name) symbol(Name) ';'
Name <- < [a-zA-Z_][a-zA-Z0-9_]+ >
%whitespace <- [ \t\r\n]*
# 'var_table' is a table name.
symbol(s) <- s { declare_symbol var_table } # Declare symbol instruction
is_symbol(s) <- s { check_symbol var_table } # Check symbol instruction
After thinking for awhile I tried the one shown bellow and it worked, but if we need a lot of gues/try and error
to use it that's a though sell, I still think it should work with the first example.
S <- (Decl / Ref / typedef)*
Decl <- 'decl' symbol(Name)
Ref <- 'ref' is_symbol(Name)
typedef <- 'typedef' typename symbol(Name) ';'
typename <- is_symbol(Name)
Name <- < [a-zA-Z_][a-zA-Z0-9_]+ >
%whitespace <- [ \t\r\n]*
# 'var_table' is a table name.
symbol(s) <- s { declare_symbol var_table } # Declare symbol instruction
is_symbol(s) <- s { check_symbol var_table } # Check symbol instruction
Input:
decl long
typedef
long
__off64_t;
typedef
__off64_t
__loff_t;
Output:
+ S
- Decl/0[Name] (long)
+ typedef
- typename/0[Name] (long)
- Name (__off64_t)
+ typedef
- typename/0[Name] (__off64_t)
- Name (__loff_t)
@mingodad, it seems like my symbol implementation with macros doesn't work... I made a change with regular definition rules instead of macros.
I've just got this parser https://github.com/edubart/lpegrex/blob/main/parsers/c11.lua converted to
peg
running with https://github.com/mingodad/peg, it needs a hashtable/dicitionary available to be usable by the end users (I'm using khash, included). Nez (https://github.com/ChrisHixon/chpeg/issues/20#issuecomment-1144808438 ) has this:PEG/LEG (https://github.com/mingodad/peg) has this:
My implementation (see attached) has this:
Lpegrex (https://github.com/edubart/lpegrex) has this:
Can we have something like this in
cpp-peglib
?c11-lpegrex.zip