we-like-parsers / pegen_experiments

Experiments for the official PEG parser generator for Python
https://github.com/python/cpython/tree/master/Tools/peg_generator

Flexibility in defining core building blocks of grammar #241

Closed h-vetinari closed 4 years ago

h-vetinari commented 4 years ago

I've been following this project with interest since the original discussion on discuss.python.org, and it's great to see that this will most likely make it into Python 3.9!

I wanted to try using pegen to generate a parser for toml, which has recently reached v1.0.0-rc.1 (background: pytoml is no longer maintained, and I dislike many things about toml-the-python-library, hence wanting to write my own library).

The thing that made using pegen such an obvious idea is that toml itself is defined through a grammar, even if the format is slightly different. I thought: why hand-code a parser if there's all this machinery already?! My first pass was translating this into the particular flavour of PEG notation that pegen supports, but I'm running into problems with defining the basic building blocks. I'd like to replicate toml.abnf exactly if at all possible; assuming no bugs in the parser generator, compliance would then come basically for free.

The first problem is that several rules are defined in terms of Unicode ranges, and it looks to me like this is not supported in pegen at all. Example:

```
comment-start-symbol = %x23 ; #
non-ascii = %x80-D7FF / %xE000-10FFFF
non-eol = %x09 / %x20-7F / non-ascii
```
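For comparison, such range checks are trivial to state as plain code-point predicates; here is a minimal sketch of what the grammar would need to be able to express (the helper names are mine, not pegen's):

```python
# Sketch: the ABNF ranges above as plain code-point checks.
# These helpers are illustrative; pegen currently has no hook for them.
def is_non_ascii(ch: str) -> bool:
    return 0x80 <= ord(ch) <= 0xD7FF or 0xE000 <= ord(ch) <= 0x10FFFF

def is_non_eol(ch: str) -> bool:
    # %x09 is tab; %x20-7F covers printable ASCII (plus DEL)
    return ch == "\t" or 0x20 <= ord(ch) <= 0x7F or is_non_ascii(ch)
```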

But even for an MVP without Unicode ranges, I ran into trouble. Using a first draft of the translated toml.gram (reproduced below), I get

```
# WARNING: This document is a work_in_progress and should not be considered
# authoritative until further notice.

# This is an attempt to define TOML in ABNF according to the grammar defined
# in RFC 5234 (http://www.ietf.org/rfc/rfc5234.txt).

# You can try out this grammar using http://instaparse.mojombo.com/
# To do so, in the lower right, click on Options and change `:input_format` to
# ':abnf'. Then paste this entire ABNF document into the grammar entry box
# (above the options). Then you can type or paste a sample TOML document into
# the beige box on the left. Tada!

# Overall Structure

# note: pegen needs a 'start' rule
start: expression ENDMARKER

toml: expression ( newline expression )*

expression:
    | ws [ comment ]
    | ws keyval ws [ comment ]
    | ws table ws [ comment ]

# Whitespace

ws: wschar*
wschar: ' ' | '\t'  # space & tab

# Newline

newline: '\r\n' | '\n'  # CRLF & LF

# Built_in ABNF terms, reproduced here (and moved up compared to toml.abnf) for clarity

# ALPHA: %x41-5A | %x61-7A  # A_Z | a_z
ALPHA:
    | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M'
    | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z'
    | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm'
    | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z'
DIGIT: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
HEXDIG: DIGIT | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f'

# Comment

comment_start_symbol: '#'  # comment-symbol
# FIXME!
# non_ascii: %x80-D7FF | %xE000-10FFFF
# non_eol: %x09 | %x20-7F | non_ascii
CHAR_NO_SPECIAL:  # without: ' " \
    | ALPHA | DIGIT | '!' | '#' | '$' | '%' | '&' | '(' | ')' | '*' | '+'
    | ',' | '-' | '.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '@'
    | '[' | ']' | '^' | '_' | '`' | '{' | '|' | '}' | '~'
CHAR: CHAR_NO_SPECIAL | '"' | '\'' | '\\'
non_eol: wschar | CHAR
comment: comment_start_symbol non_eol*

# Key_Value pairs

keyval: key keyval_sep val

key: simple_key | dotted_key
simple_key: quoted_key | unquoted_key

unquoted_key: ( ALPHA | DIGIT | '-' | '_' )+
quoted_key: basic_string | literal_string
dotted_key: simple_key ( dot_sep simple_key )+

dot_sep: ws '.' ws  # . Period
keyval_sep: ws '=' ws  # =

val: string | boolean | array | inline_table | date_time | float | integer

# String

string: ml_basic_string | basic_string | ml_literal_string | literal_string

# Basic String

basic_string: quotation_mark basic_char* quotation_mark

quotation_mark: '"'

basic_char: basic_unescaped | escaped
# FIXME!
# basic_unescaped: wschar | '!' | %x23-5B | %x5D-7E | non_ascii
basic_unescaped: wschar | CHAR_NO_SPECIAL | '\''
escaped: escape escape_seq_char

escape: '\\'  # \
escape_seq_char:
    | '"'   # "    quotation mark   U+0022
    | '\\'  # \    reverse solidus  U+005C
    | 'b'   # b    backspace        U+0008
    | 'f'   # f    form feed        U+000C
    | 'n'   # n    line feed        U+000A
    | 'r'   # r    carriage return  U+000D
    | 't'   # t    tab              U+0009
    | 'u' HEXDIG HEXDIG HEXDIG HEXDIG  # uXXXX U+XXXX
    | 'U' HEXDIG HEXDIG HEXDIG HEXDIG HEXDIG HEXDIG HEXDIG HEXDIG  # UXXXXXXXX U+XXXXXXXX

# Multiline Basic String

ml_basic_string: ml_basic_string_delim ml_basic_body ml_basic_string_delim
ml_basic_string_delim: quotation_mark quotation_mark quotation_mark
ml_basic_body: mlb_content* ( mlb_quotes mlb_content+ )* [ mlb_quotes ]

mlb_content: mlb_char | newline | mlb_escaped_nl
mlb_char: mlb_unescaped | escaped
mlb_quotes: (quotation_mark quotation_mark | quotation_mark)
# FIXME!
# mlb_unescaped: wschar | '!' | %x23-5B | %x5D-7E | non_ascii
mlb_unescaped: wschar | CHAR_NO_SPECIAL | '\''
mlb_escaped_nl: escape ws newline ( wschar | newline )*

# Literal String

literal_string: apostrophe literal_char* apostrophe

apostrophe: '\''  # ' apostrophe

# FIXME!
# literal_char: '\t' | %x20-26 | %x28-7E | non_ascii
literal_char: wschar | CHAR_NO_SPECIAL | '"' | '\\'

# Multiline Literal String

ml_literal_string: ml_literal_string_delim ml_literal_body ml_literal_string_delim
ml_literal_string_delim: apostrophe apostrophe apostrophe
ml_literal_body: mll_content* ( mll_quotes mll_content+ )* [ mll_quotes ]

mll_content: mll_char | newline
# FIXME!
# mll_char: '\t' | %x20-26 | %x28-7E | non_ascii
mll_char: wschar | CHAR_NO_SPECIAL | '"' | '\\'
mll_quotes: (apostrophe apostrophe | apostrophe)

# Integer

integer: dec_int | hex_int | oct_int | bin_int

digit1_9: ('1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9')
digit0_7: ('0' | '1' | '2' | '3' | '4' | '5' | '6' | '7')
digit0_1: ('0' | '1')

hex_prefix: '0x'
oct_prefix: '0o'
bin_prefix: '0b'

dec_int: ( '+' | '-' )? unsigned_dec_int
unsigned_dec_int: DIGIT | digit1_9 ( '_'? DIGIT )+

hex_int: hex_prefix HEXDIG ( '_'? HEXDIG )*
oct_int: oct_prefix digit0_7 ( '_'? digit0_7 )*
bin_int: bin_prefix digit0_1 ( '_'? digit0_1 )*

# Float

float:
    | float_int_part ( exp | frac [ exp ] )
    | special_float

float_int_part: dec_int
frac: decimal_point zero_prefixable_int
decimal_point: '.'
zero_prefixable_int: DIGIT ('_'? DIGIT )*

exp: 'e' float_exp_part
float_exp_part: ( '+' | '-' )? zero_prefixable_int

special_float: ( '+' | '-' )? ( 'inf' | 'nan' )

# Boolean

boolean: 'true' | 'false'

# Date and Time (as defined in RFC 3339)

date_time: offset_date_time | local_date_time | local_date | local_time

date_fullyear: DIGIT DIGIT DIGIT DIGIT
date_month: DIGIT DIGIT  # 01-12
date_mday: DIGIT DIGIT  # 01-28, 01-29, 01-30, 01-31 based on month/year
time_delim: 'T' | 't' | ' '
time_hour: DIGIT DIGIT  # 00-23
time_minute: DIGIT DIGIT  # 00-59
time_second: DIGIT DIGIT  # 00-58, 00-59, 00-60 based on leap second rules
time_secfrac: '.' DIGIT+
time_numoffset: ( '+' | '-' ) time_hour ':' time_minute
time_offset: 'Z' | time_numoffset

partial_time: time_hour ':' time_minute ':' time_second [ time_secfrac ]
full_date: date_fullyear '-' date_month '-' date_mday
full_time: partial_time time_offset

# Offset Date_Time

offset_date_time: full_date time_delim full_time

# Local Date_Time

local_date_time: full_date time_delim partial_time

# Local Date

local_date: full_date

# Local Time

local_time: partial_time

# Array

array: array_open [ array_values ] ws_comment_newline array_close

array_open: '['
array_close: ']'

array_values:
    | ws_comment_newline val ws array_sep array_values
    | ws_comment_newline val ws [ array_sep ]

array_sep: ','

ws_comment_newline: ( wschar | [ comment ] newline )*

# Table

table: std_table | array_table

# Standard Table

std_table: std_table_open key std_table_close

std_table_open: '[' ws  # [ Left square bracket
std_table_close: ws ']'  # ] Right square bracket

# Inline Table

inline_table: inline_table_open [ inline_table_keyvals ] inline_table_close

inline_table_open: '{' ws
inline_table_close: ws '}'
inline_table_sep: ws ',' ws

inline_table_keyvals: keyval [ inline_table_sep inline_table_keyvals ]

# Array Table

array_table: array_table_open key array_table_close

array_table_open: "[[" ws  # [[ Double left square bracket
array_table_close: ws "]]"  # ]] Double right square bracket
```
```
AssertionError: ' ' is not a known literal
```

This is somewhat surprising, because metagrammar.gram uses some strings directly (and the exact same ones, like '!', fail for me). I haven't managed to figure out how this works for metagrammar.gram (even after playing around with the `from ast import literal_eval` import).
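If I read the generator right, each grammar string is passed through `literal_eval` and then, unless it is a keyword-like name, must correspond to an exact token type of the stdlib tokenizer; that would explain the assertion. A quick check, assuming Python 3.8+ where `token.EXACT_TOKEN_TYPES` is public:

```python
import token

# Non-keyword literals appear to need an entry in EXACT_TOKEN_TYPES,
# which is why ' ' and '!' fail while '=' or '.' would be accepted.
print(token.EXACT_TOKEN_TYPES.get("="))  # token.EQUAL -> a known literal
print(token.EXACT_TOKEN_TYPES.get("!"))  # None -> "not a known literal"
print(token.EXACT_TOKEN_TYPES.get(" "))  # None -> whitespace is not a token at all
```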

After some more sleuthing, I then understood how pegen basically reuses cpython/Lib/token.py, and I tried to hack something together using that.

```diff
 # Whitespace

 ws: wschar*
-wschar: ' ' | '\t'  # space & tab
+wschar: INDENT  # space & tab

 # Newline

-newline: '\r\n' | '\n'  # CRLF & LF
+newline: NEWLINE  # CRLF & LF

 # Built_in ABNF terms, reproduced here (and moved up compared to toml.abnf) for clarity

 # ALPHA: %x41-5A | %x61-7A  # A_Z | a_z
-ALPHA:
-    | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M'
-    | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z'
-    | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm'
-    | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z'
-DIGIT: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
+ALPHA: STRING
+DIGIT: NUMBER
 HEXDIG: DIGIT | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f'

 # Comment

-comment_start_symbol: '#'  # comment-symbol
+comment_start_symbol: COMMENT  # comment-symbol
 # FIXME!
 # non_ascii: %x80-D7FF | %xE000-10FFFF
 # non_eol: %x09 | %x20-7F | non_ascii
 CHAR_NO_SPECIAL:  # without: ' " \
-    | ALPHA | DIGIT | '!' | '#' | '$' | '%' | '&' | '(' | ')' | '*' | '+'
-    | ',' | '-' | '.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '@'
-    | '[' | ']' | '^' | '_' | '`' | '{' | '|' | '}' | '~'
-CHAR: CHAR_NO_SPECIAL | '"' | '\'' | '\\'
+    | ALPHA | DIGIT | COMMENT | PERCENT | AMPER | LPAR | RPAR | STAR
+    | PLUS | COMMA | MINUS | DOT | SLASH | COLON | SEMI | LESS | EQUAL | GREATER | AT
+    | LSQB | RSQB | CIRCUMFLEX | LBRACE | VBAR | RBRACE | TILDE  # missing '!', '$', '?', '_', '`'
+CHAR: CHAR_NO_SPECIAL  #| '"' | '\'' | '\\'
 non_eol: wschar | CHAR
 comment: comment_start_symbol non_eol*
@@ -62,12 +58,13 @@
 keyval: key keyval_sep val

 key: simple_key | dotted_key
 simple_key: quoted_key | unquoted_key

-unquoted_key: ( ALPHA | DIGIT | '-' | '_' )+
+# unquoted_key: ( ALPHA | DIGIT | '-' | '_' )+
+unquoted_key: ( ALPHA | DIGIT | MINUS )+
 quoted_key: basic_string | literal_string
 dotted_key: simple_key ( dot_sep simple_key )+

-dot_sep: ws '.' ws  # . Period
-keyval_sep: ws '=' ws  # =
+dot_sep: ws DOT ws  # . Period
+keyval_sep: ws EQUAL ws  # =

 val: string | boolean | array | inline_table | date_time | float | integer
```

etc.
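For reference, the names available for this kind of substitution can be listed straight from the stdlib (again assuming Python 3.8+):

```python
import token

# Show every literal the stdlib tokenizer knows together with its token
# name, e.g. '-' -> MINUS, '=' -> EQUAL, '[' -> LSQB.
for literal, toktype in sorted(token.EXACT_TOKEN_TYPES.items()):
    print(f"{literal!r:8} -> {token.tok_name[toktype]}")
```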

Still, I can't get to the end of it, because I need to define `"` and `'`, and those seem to have no direct correspondence in token.py...

I understand that this project is heavily driven by replacing the old Python parser. However, the general machinery you have built seems very close to being able to generate much more general PEG parsers; could that be a worthwhile goal too?

But even barring larger generalisations, could you maybe give me some pointers on how to get literal strings working?

gvanrossum commented 4 years ago

You must write your own main. Check build_parser() in build.py. You can supply your own tokenizer, as long as it duck-types Tokenizer.

h-vetinari commented 4 years ago

Thanks for the info, will try!
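For anyone landing here later, a minimal sketch of what such a duck-typed tokenizer could look like (the method set mirrors what the generated parser seems to call; `Tok` is just a stand-in for `tokenize.TokenInfo`):

```python
from typing import Iterator, List, NamedTuple, Tuple

class Tok(NamedTuple):
    # Stand-in for tokenize.TokenInfo with compatible field names.
    type: int
    string: str
    start: Tuple[int, int]
    end: Tuple[int, int]
    line: str

class MyTokenizer:
    """Duck-types pegen's Tokenizer: the generated parser appears to need
    peek()/getnext(), mark()/reset() for backtracking, and diagnose()."""

    def __init__(self, tokengen: Iterator[Tok]):
        self._tokengen = tokengen          # any iterator of Tok will do
        self._tokens: List[Tok] = []       # cache, so reset() can rewind
        self._index = 0

    def getnext(self) -> Tok:
        tok = self.peek()
        self._index += 1
        return tok

    def peek(self) -> Tok:
        while self._index == len(self._tokens):
            self._tokens.append(next(self._tokengen))
        return self._tokens[self._index]

    def diagnose(self) -> Tok:
        # Report the furthest token seen, for error messages.
        if not self._tokens:
            self.peek()
        return self._tokens[-1]

    def mark(self) -> int:
        return self._index

    def reset(self, index: int) -> None:
        self._index = index
```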

vpoverennov commented 4 years ago

Hi, I was also using pegen for my own project. I moved all the logic out of Parser.expect into the Tokenizer, and then implemented my own tokenizer without depending on tokenize.py and the Python tokens.

    # Parser.expect, reduced to a thin delegation to the tokenizer:
    @memoize
    def expect(self, type: str) -> Optional[tokenize.TokenInfo]:
        return self._tokenizer.expect(type)
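The tokenizer-side `expect` might then look roughly like this (a sketch, meant as a method on a tokenizer like the one sketched above; `TOKEN_TYPES` is a hypothetical mapping owned by the custom tokenizer, which, unlike token.py, can contain entries for `"`, `'` or `\`):

```python
from typing import Dict, Optional

# Hypothetical literal-to-type table; the values are arbitrary ints that
# the custom tokenizer assigns when it emits tokens.
TOKEN_TYPES: Dict[str, int] = {'"': 1, "'": 2, "\\": 3, "=": 4}

def expect(self, type: str) -> Optional["Tok"]:
    # First try a verbatim text match (keywords, multi-char literals),
    # then fall back to matching by the custom token type.
    tok = self.peek()
    if tok.string == type:
        return self.getnext()
    if TOKEN_TYPES.get(type) == tok.type:
        return self.getnext()
    return None
```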

Looking at toml.gram, it seems like it would be easier to implement some of those rules in the tokenizer and have a simpler grammar.
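For instance (purely illustrative, not from this thread), a whole literal string could become a single token, so the grammar side collapses to `literal_string: LITERAL_STRING`:

```python
import re

# Simplified: real TOML also excludes most control characters inside
# literal strings; this only demonstrates the tokenizer-level approach.
LITERAL_STRING = re.compile(r"'[^'\n]*'")

def match_literal_string(src: str, pos: int):
    m = LITERAL_STRING.match(src, pos)
    return (m.group(), m.end()) if m else None

print(match_literal_string("key = 'some value'", 6))  # ("'some value'", 18)
```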

lysnikolaou commented 4 years ago

I feel that this has stalled and can thus be closed. In case there are more questions about this in the future, feel free to reopen it.