gvanrossum closed this issue 4 years ago.
A variant of this would have a single rule, where the magic action would transform a tree of the form `(a, *, (b, +, c))` into `((a, *, b), +, c)`. Sure, we've wasted some space for a complicated expression, but for a simple one we've parsed it with much less effort. Thoughts?
Oh, that can't work because the AST doesn't record whether there were parentheses forcing the grouping. So it'll have to be more like

```
bitwise_or:
    | factor (operator factor)* { magic action }
```

where the magic action must act on an array of `(operator factor)` pairs. We already have a similar data structure for the comparison operators. (Heh, this would benefit from generics. :-)
We would collapse 6 levels into 1 this way. (To do more would require more intelligence in the magic action, but it would be possible.)
Yeah, I did consider the parens issue :)
To avoid allocating too many intermediate arrays, you may be able to use that initial array of operators as the operator stack, using a separate read and write index: the read index scans the array from left to right, and the write index is used to push the combined nodes. In other words, you are using the same piece of memory as both a queue and a stack.
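One way to picture that read/write-index trick is the following Python sketch (hypothetical, for illustration only -- the real implementation would be C code working on the parser's own arrays; `PREC` is an assumed priority table):

```python
# Hypothetical sketch: one array serves as both the queue of pending
# (operator, operand) items (read index r) and the stack of partially
# reduced nodes (write index w).  Left-associative operators only.
PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def reduce_in_place(items):
    """items alternates operands and operators: ['a', '+', 'b', '*', 'c'].
    Combined nodes are written back into the same list below the read
    index, so no intermediate arrays are allocated."""
    w = 1                  # items[0] is the first operand; stack is items[:w]
    r = 1
    while r < len(items):
        op, rhs = items[r], items[r + 1]
        r += 2
        # While the operator on top of the stack binds at least as
        # tightly as the incoming one, reduce (left associativity).
        while w > 1 and PREC[items[w - 2]] >= PREC[op]:
            items[w - 3] = (items[w - 3], items[w - 2], items[w - 1])
            w -= 2
        items[w] = op          # push: writes never pass the read index
        items[w + 1] = rhs
        w += 2
    while w > 1:               # reduce whatever remains on the stack
        items[w - 3] = (items[w - 3], items[w - 2], items[w - 1])
        w -= 2
    return items[0]
```

For example, `reduce_in_place(['a', '+', 'b', '*', 'c'])` yields `('a', '+', ('b', '*', 'c'))` while `reduce_in_place(['a', '*', 'b', '+', 'c'])` yields `(('a', '*', 'b'), '+', 'c')`.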
In the context of this particular implementation, I don't mind allocating extra memory if I can free it before returning from the magic action. But writing into the array is tricky due to the way the memoization (packrat parsing) is implemented and how we allocate memory for parsing results (using an arena that is freed only when we delete the AST).
We still need to design the metagrammar syntax for indicating operator precedences.
I wonder if the generator could recognize the pattern and just do the smart thing automatically, without any markup? The downside is that a small tweak to the grammar might accidentally break the optimization and testing wouldn't find this out.
And I would be happy with implementing an ad-hoc solution, given that we're really only planning to use (this version of) pegen for the Python parser.
Given that operator precedence will be a global property of the grammar (I don't see how local precedence rules would be useful), we could declare the precedence itself with some metas at the top of the grammar. Then we could use a flag in the rule, like the one we use for memoization (the "arguments" between parentheses), to activate the "magic". If we go this way, I would recommend that we make it explicit, to reduce the cognitive burden of reading the grammar and reasoning about it.
Assuming that the relative precedence of the binary operators will be the same in all parsing contexts, you could certainly define them globally - that's what bison does:
http://web.mit.edu/gnu/doc/html/bison_8.html#SEC73
There are two ways I can think of to declare the precedence: either numerically, or by an ordered list. If it's an ordered list, you need a way of grouping operators of equal precedence. Bison's metagrammar looks like this:

```
%left '<' '>' '=' NE LE GE
%left '+' '-'
%left '*' '/'
```

The order in which operators are declared determines the relative precedence, with the last ones declared having the highest. %left and %right determine whether the operator is left-associative or right-associative (this affects whether you use '>' or '>=' when doing stack reductions).
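That '>' vs. '>=' distinction can be stated in a few lines (a hypothetical sketch of the reduction test, not Bison's actual implementation):

```python
# Hypothetical sketch: decide whether the operator currently on top of
# the stack should be reduced before pushing an incoming operator.
def should_reduce(stack_prec, stack_left_assoc, incoming_prec):
    if stack_left_assoc:
        # Left-associative: equal precedence reduces, so 'a - b - c'
        # groups as '(a - b) - c'.
        return stack_prec >= incoming_prec
    # Right-associative: equal precedence does NOT reduce, so
    # 'a ** b ** c' groups as 'a ** (b ** c)'.
    return stack_prec > incoming_prec
```

For example, with '-' left-associative at precedence 1, `should_reduce(1, True, 1)` is `True`, while for right-associative '**' at precedence 8, `should_reduce(8, False, 8)` is `False`.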
Okay, we could adapt that to pegen's metagrammar like this:
```
@operators """
left |
left ^
left &
left << >>
left + -
left * / // % @
"""
```

It's a triple-quoted string, each line being a word `left` or `right` followed by one or more operator symbols separated by spaces (we could add quotes around the operators but I think we don't need that, since operators can't contain spaces). I'm not sure yet what to do with unary operators -- Bison's solution feels like a hack. Until I've figured that out I have to stop at the multiplication/division level, because the next level is unary `+`, `-` and `~`. After that comes Python's only right-associative operator, `**`.
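Turning such a declaration into a table could be quite simple. Here is a hypothetical sketch (pegen has no `@operators` meta today; this just models the proposal, with later lines getting higher priority, as in Bison):

```python
# Hypothetical sketch: parse the proposed @operators string into a
# {symbol: (priority, associativity)} table.
def parse_operators(spec):
    table = {}
    lines = filter(None, (line.strip() for line in spec.splitlines()))
    for prio, line in enumerate(lines, 1):   # later lines = higher priority
        assoc, *ops = line.split()
        assert assoc in ('left', 'right'), f"bad associativity: {assoc}"
        for op in ops:
            table[op] = (prio, assoc)
    return table

table = parse_operators("""
left |
left ^
left &
left << >>
left + -
left * / // % @
""")
```

With this input, `table['|']` is `(1, 'left')` and `table['*']` is `(6, 'left')`.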
So now the parser generator has the priorities for those rules. How do we proceed then? Bison's example suggests that we could just write
```
expr:
    | a=expr '|' b=expr { _Py_BinOp(a, BitOr, b, EXTRA) }
    | a=expr '^' b=expr { _Py_BinOp(a, BitXor, b, EXTRA) }
    ...
    | factor
```
and let the parser generator handle the choice of which reduction to use when, but that's kind of against the PEG philosophy of always trying alternatives in the order in which they are written. (In Bison IIRC the order is immaterial since it does all this nice CFG stuff.)
So the other way is what I suggested in an earlier comment, where we first collect the whole expression, like this:
```
expr:
    | a=factor b=(operator factor)+ { _PyPegen_MagicReduce(a, b, EXTRA) }
    | factor
```
The rest is just a matter of programming.
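Modulo the details, the programming amounts to a textbook precedence reduce over the pair list. A hypothetical Python model (the real `_PyPegen_MagicReduce` is C and builds `_Py_BinOp` AST nodes; `PRIORITY` here is an assumed table covering only left-associative operators):

```python
# Hypothetical model of the "magic reduce": fold a first operand plus a
# list of (operator, operand) pairs into a tree, honoring priorities.
PRIORITY = {'|': 1, '^': 2, '&': 3, '<<': 4, '>>': 4,
            '+': 5, '-': 5, '*': 6, '/': 6, '//': 6, '%': 6, '@': 6}

def magic_reduce(first, pairs):
    ops = []        # stack of pending operators
    vals = [first]  # stack of operands / partially built nodes
    def reduce_top():
        right = vals.pop()
        left = vals.pop()
        vals.append((left, ops.pop(), right))
    for op, operand in pairs:
        # Left-associative: reduce while the stacked operator binds at
        # least as tightly as the incoming one.
        while ops and PRIORITY[ops[-1]] >= PRIORITY[op]:
            reduce_top()
        ops.append(op)
        vals.append(operand)
    while ops:
        reduce_top()
    return vals[0]
```

For instance, `magic_reduce('a', [('+', 'b'), ('*', 'c')])` returns `('a', '+', ('b', '*', 'c'))`.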
I have a working first draft of this in this branch: https://github.com/python/cpython/compare/master...we-like-parsers:operator_precedence
Is this more or less what you have in mind for `_PyPegen_MagicReduce`?
Some benchmarks:
```
❯ ./python -m pyperf timeit 'import _peg_parser' '_peg_parser.parse_string("2*3 + 4*5*6\n12 + (2 * 3 * 4 * 5 + 6 + 7 * 8)\n1 + 2 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + ((((((11 * 12 * 13 * 14 * 15 + 16 * 17 + 18 * 19 * 20))))))", ast=False)'
.....................
Mean +- std dev: 27.0 us +- 0.1 us
❯ time ./python Tools/peg_generator/data/xxl.py
./python Tools/peg_generator/data/xxl.py 0.95s user 0.20s system 99% cpu 1.152 total
❯ ./python -m pyperf timeit 'import _peg_parser' '_peg_parser.parse_string("2*3 + 4*5*6\n12 + (2 * 3 * 4 * 5 + 6 + 7 * 8)\n1 + 2 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + ((((((11 * 12 * 13 * 14 * 15 + 16 * 17 + 18 * 19 * 20))))))", ast=False)'
.....................
Mean +- std dev: 25.0 us +- 0.2 us
❯ time ./python Tools/peg_generator/data/xxl.py
./python Tools/peg_generator/data/xxl.py 1.06s user 0.16s system 99% cpu 1.217 total
```
To really see how this compares, try something with several trivial expressions. E.g. I tried `[1, 2, 3, 4, 5, 6, 7, 8, 9, 0]` and got 13 usec on master vs. 10 usec with your patch. That's impressive! (As is how quickly you came up with this patch!)
I'll look over the code later (tomorrow?).
Now a CPython PR (python#20696) and bpo issue (https://bugs.python.org/issue40902).
I think I have an extension to unary operators (surely this isn't new, IIRC I discovered the same several decades ago).
First, the grammar rule becomes essentially

```
bitwise_or: primary (binop unop* primary)*
```

but the parsing algorithm treats each operator as a separate reduce opportunity, with a NULL primary inserted in front of each unary operator.
We use the same stack format as before, alternating expressions and (operator, priority) pairs:

```
expr (op, pri) expr (op, pri) expr (op, pri) expr
```

When a unary operator is encountered, it is given a separate priority, and we push a NULL expression to its left. For an expression like `a+b*-~c*d` we would pretend the input was as follows:

```
a + b * NULL - NULL ~ c * d
```
Unary operators have separate priorities to the left and to the right -- an incoming unary operator gets a certain priority, and once on the stack the same operator is given a different priority. Both are distinct from the priority of the same operator in a binary position.
The priorities ensure that we never try to reduce e.g. `b * NULL` -- instead, when we encounter the second `*`, we reduce `NULL ~ c` to `(~c)` and then `NULL - (~c)` to `(-(~c))`. Then finally we reduce `b * (-(~c))` to `(b*(-(~c)))`, and then we push `*` and `d`.
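The scheme can be modeled like this in Python (a hypothetical sketch: the priorities are illustrative, `None` stands in for the NULL operand, and only left-associative binary operators are handled; the real code in the linked commit does this in C on the parser's stack):

```python
# Hypothetical model of the unary extension: push a NULL (None) operand
# before each unary operator, and give a unary operator a high on-stack
# priority so a NULL is never the left operand of a binary reduce.
BIN = {'+': 5, '-': 5, '*': 6, '/': 6}
UNARY_ON_STACK = 10   # once pushed, reduce before any binary operator

def reduce_with_unary(tokens):
    """tokens: flat list like ['a', '+', 'b', '*', ('u', '-'), ...]
    where a ('u', op) tuple marks a unary operator."""
    vals, ops = [], []
    def reduce_top():
        right = vals.pop()
        left = vals.pop()
        kind, op, _ = ops.pop()
        if kind == 'u':
            assert left is None          # the inserted NULL operand
            vals.append((op, right))     # unary node
        else:
            vals.append((left, op, right))  # binary node
    for tok in tokens:
        if isinstance(tok, tuple):       # incoming unary op: never reduces
            vals.append(None)
            ops.append(('u', tok[1], UNARY_ON_STACK))
        elif tok in BIN:                 # incoming binary op
            while ops and ops[-1][2] >= BIN[tok]:
                reduce_top()
            ops.append(('b', tok, BIN[tok]))
        else:                            # operand
            vals.append(tok)
    while ops:
        reduce_top()
    return vals[0]
```

Running it on the pretend input for `a+b*-~c*d`, i.e. `reduce_with_unary(['a', '+', 'b', '*', ('u', '-'), ('u', '~'), 'c', '*', 'd'])`, reproduces the reductions described above: `(~c)`, then `(-(~c))`, then `(b*(-(~c)))`, then `*` and `d` are pushed and reduced.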
See also https://github.com/python/cpython/pull/20696/commits/882c74095bd0d34e226ee6b9a3bf6a3177ee9ba1.
One question is, can this be extended an additional level downward (by 'downward' I mean in the direction of lower precedence, not stack frames) to handle the comparison and containment operators? It seems to me that the only wrinkle here is the two-word operators, `not in` and `is not` -- which can be handled if they are considered as a single grammatical unit.
And if you are able to handle unary operators with arbitrary precedence, then you might be able to go all the way down to the logical operators 'and' and 'or'.
On the other end of the spectrum, 'await' can be considered a kind of unary operator. So theoretically this could stretch all the way from the ternary operator to subscription.
Another thought: You might be able to skip the entire recursive expression tower if your tokenizer allows lookahead - i.e. if your parse_expression call sees an integer followed by a comma or right paren, you could do an 'early out' and just call parse_primary directly.
After thinking more about it, I think we should not bother with any further refinements. They would make the code more complex (and it is already too complex), and provide relatively little gain. (Another refinement I just tried is using a fixed-size local array instead of malloc()ed storage -- it did not make a difference.)
I observe that no matter what we do, the parser still ends up calling `_PyPegen_expect_token()` for each operator. So this provides a limit on how much speedup we can get from this scheme.
We could even decide not to do this at all -- while a 12% speedup on parsing is nothing to sneeze at, it makes the grammar less readable (when we figure out how to publish the grammar in the reference docs, we'll have to add the table of operator priorities separately, and explain how it applies). Certainly worth pondering a bit more.
So we decided to punt on this, despite a working, productionized implementation that benchmarked favorably. The reason was the complexity of the code and the grammar readability issues.
There's an old parsing technique called "operator precedence" which we might be able to use to speed up parsing expressions.
When we parse an expression we go through a lot of levels of the grammar even if no operators of that level are present. E.g. a simple expression like `a + b` successfully goes through every rule in the expression tower. Many of these (esp. `bitwise_or` through `power`) use similar actions that call `_Py_BinOp(left, op, right, EXTRA)`. For just binary operators there's a pretty straightforward algorithm using a stack and a table of operators and precedences that produces the right AST without so many productions.
Maybe we should explore this and see if it makes a substantial difference?