we-like-parsers / pegen_experiments

Experiments for the official PEG parser generator for Python
https://github.com/python/cpython/tree/master/Tools/peg_generator
Other
275 stars 29 forks source link

RecursionError in generated Python parser #245

Open gvanrossum opened 4 years ago

gvanrossum commented 4 years ago

The test corpus contained a case where function is called with many (over 255) arguments, each just numbers. This used to be a syntax error in 3.6 and before -- but with more recent versions of the bytecode compiler it works for 10s of 1000s of arguments.

However, the generated parser in Python cannot handle it! It raises a RecursionError for A(0, 1, ..., k) if k is 221 or higher. The traceback seems to be an endless repetition of this fragment:

  File "/Users/guido/pegen/data/python_parser.py", line 5512, in _tmp_108
    if (literal := self.expect(",")) and (c := self.args()):
  File "/Users/guido/pegen/pegen/parser.py", line 65, in memoize_wrapper
    tree = method(self, *args)
  File "/Users/guido/pegen/data/python_parser.py", line 3041, in args
    if (a := self.named_expression()) and (b := self._tmp_108(),):
  File "/Users/guido/pegen/pegen/parser.py", line 65, in memoize_wrapper
    tree = method(self, *args)

I think this is actually about this rule:

args[expr_ty]:
    [...]
    | a=named_expression b=[',' c=args { c }] {
        [...] }

The part [',' c=args {c}] corresponds to _tmp_108. This is a straightforward recursion (not left-).

I can't actually fathom that the C parser doesn't segfault on this for me; it just slows down. (In the C code, it's _tmp_109, but the structure is the same.) The stack must automatically grow (on Mac, at least). But even there it would be nice if we could replace this by a simple loop.

E.g. maybe we could make this work?

args[expr_ty]:
    | a=starred_expression b=[',' c=args { c }] {
        _Py_Call(_PyPegen_dummy_name(p),
                 (b) ? CHECK(_PyPegen_seq_insert_in_front(p, a, ((expr_ty) b)->v.Call.args))
                     : CHECK(_PyPegen_singleton_seq(p, a)),
                 (b) ? ((expr_ty) b)->v.Call.keywords : NULL,
                 EXTRA) }
    | a=kwargs { _Py_Call(_PyPegen_dummy_name(p),
                          CHECK_NULL_ALLOWED(_PyPegen_seq_extract_starred_exprs(p, a)),
                          CHECK_NULL_ALLOWED(_PyPegen_seq_delete_starred_exprs(p, a)),
                          EXTRA) }
    | ','.(named_expression !'=')+ { XXX }

I haven't figured out what should go into XXX yet, but it's probably going to be more efficient, since we don't build up the array of arguments by prepending one at a time.

gvanrossum commented 4 years ago

(Oh, I tried, and that doesn't actually work even for the Python parser. But why? @lysnikolaou any idea?)

gvanrossum commented 4 years ago

I tried this and it works... Now for filling in XXX...

args[expr_ty]:
    | ','.(starred_expression | named_expression !'=')+ [',' kwargs] { XXX }
    | a=kwargs { _Py_Call(_PyPegen_dummy_name(p),
                          CHECK_NULL_ALLOWED(_PyPegen_seq_extract_starred_exprs(p, a)),
                          CHECK_NULL_ALLOWED(_PyPegen_seq_delete_starred_exprs(p, a)),
                          EXTRA) }
pablogsal commented 4 years ago

Here is a draft implementation:

https://github.com/python/cpython/pull/22053

For this script:

import ast

code = "f(" + ",".join(['a' for _ in range(100000)]) + ")"
print("Ready!")
ast.parse(code)

I get these results (release mode):

This patch:  0.17s +- 0.04

Python3.8:  0.27s +- 0.08

Master:
[1]    78689 segmentation fault (core dumped)  ./python lel.py