we-like-parsers / pegen_experiments

Experiments for the official PEG parser generator for Python
https://github.com/python/cpython/tree/master/Tools/peg_generator

Description of my error recovery algorithm #247

Open gvanrossum opened 4 years ago

gvanrossum commented 4 years ago

This was very much inspired by the Rüfenacht thesis (Michael Rüfenacht, Error Handling in PEG Parser, 2016, master's thesis, University of Bern). But honestly I didn't fully understand everything in the thesis, and computing dynamic follow sets seemed to require storing a concrete parse tree, which we don't have, so I figured we could do without it.

Rüfenacht's basic idea is to parse the source twice -- once with a faster version of the parser, optimistically hoping there won't be any errors, and only if an error is encountered, a second time with an error recovery algorithm turned on. The error recovery inserts, replaces or deletes tokens near the error location and then re-parses the source code.

Rüfenacht proposes to reuse the memo cache, making the second parse skip quickly over a correct prefix of the source code, using the following algorithm. During the first parse, for each memoized result, Rüfenacht adds one extra number to the cache entry, which indicates the farthest input token consulted by the parser (but not necessarily consumed!) while parsing the rule that led to the result. (Recall that negative results are also cached.) This is easily computed as a side-effect of tokenizing and parsing, and results in a modest overhead (IIRC Rüfenacht measured a few percent).

This extra number (which I have dubbed "reach") is then used to discard only cache entries representing results that peeked ahead at the token in error. The second parse then skips over much of the input using those cache entries. The relative success of the second parse compared to the initial parse tells us how successful the attempted error recovery was.
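To make the "reach" bookkeeping concrete, here is a rough sketch (not the actual pegen code; MemoEntry and invalidate_for_error are made-up names for illustration) of a memo entry carrying the reach field and of the invalidation step that keeps everything strictly to the left of the error:

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple

@dataclass
class MemoEntry:
    result: Any   # AST node, or None for a cached failure
    endpos: int   # token index where the rule left the parser
    reach: int    # farthest token index consulted (not necessarily consumed)

# The memo cache maps (start_position, rule_name) -> MemoEntry.
Memo = Dict[Tuple[int, str], MemoEntry]

def invalidate_for_error(memo: Memo, error_pos: int) -> None:
    """Discard only the entries whose computation peeked at the token in
    error (or beyond); everything else can be reused by the second parse."""
    for key in [k for k, e in memo.items() if e.reach >= error_pos]:
        del memo[key]
```

The second parse then hits the surviving entries for the unbroken prefix and skips over it quickly.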

At this point I deviate from Rüfenacht. Whereas the thesis uses something called "dynamic follow sets", I just attempt various repairs, and after each repair attempt I restore the parser to its original state. For this purpose, the state of the parser includes the memo cache and the (lazily filled) array of input tokens.

Restoring the memo cache is done in clear_excess. Restoring the array of input tokens is done by reversing the repair action: if the repair inserted token T at position P in the input array, its reversal removes the token at position P; if the repair deleted the token at position P, its reversal re-inserts the deleted token. (I haven't implemented replacements yet, but a replacement is just a deletion plus an insertion.)
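Here is a rough sketch of what such reversible repair actions could look like (the class names are mine, not pegen's; of the identifiers above, only clear_excess comes from the actual code):

```python
from typing import List, NamedTuple, Optional

class Token(NamedTuple):
    type: int
    string: str

class InsertRepair:
    """Insert `token` at index `pos` of the input array; reversal deletes it."""
    def __init__(self, pos: int, token: Token):
        self.pos, self.token = pos, token

    def apply(self, tokens: List[Token]) -> None:
        tokens.insert(self.pos, self.token)

    def reverse(self, tokens: List[Token]) -> None:
        del tokens[self.pos]          # remove exactly what we inserted

class DeleteRepair:
    """Delete the token at index `pos`; reversal re-inserts the saved token."""
    def __init__(self, pos: int):
        self.pos = pos
        self.saved: Optional[Token] = None

    def apply(self, tokens: List[Token]) -> None:
        self.saved = tokens.pop(self.pos)

    def reverse(self, tokens: List[Token]) -> None:
        tokens.insert(self.pos, self.saved)

# A replacement would simply be a DeleteRepair followed by an InsertRepair
# at the same position.
```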

All repair attempts re-parse the input, and any (even minimally) successful repair will tokenize the input stream further. We don't need to undo this added tokenization -- the array of input tokens is filled lazily (see the while loops in getnext and peek).
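In outline, that laziness looks something like this (a simplified sketch; the real tokenizer wrapper does more bookkeeping, e.g. for error reporting):

```python
class Tokenizer:
    """Wraps a token generator and fills the input array on demand."""

    def __init__(self, tokengen):
        self.tokengen = tokengen   # iterator producing tokens
        self.tokens = []           # the lazily filled array of input tokens
        self.pos = 0               # current index into self.tokens

    def peek(self):
        # Pull tokens from the generator only until the array is long enough.
        while self.pos >= len(self.tokens):
            self.tokens.append(next(self.tokengen))
        return self.tokens[self.pos]

    def getnext(self):
        token = self.peek()
        self.pos += 1
        return token
```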

(One further deviation from Rüfenacht: the thesis uses the standard PEG approach of rolling tokenization and parsing into a single algorithm, described by a single grammar. We don't do this for Python. Where Rüfenacht, and a typical PEG parser, moves the offset in the input string around, we move the offset in the array of input tokens around. This means we cannot do recovery below the token level (e.g., our repair algorithm can't transform exept into except, nor can it guess where missing string quotes should go), but since tokenization is one of the slowest passes of a typical compiler, we make up for this by increased speed of re-parsing.)

Let's look at an example. Assume a simple non-left-recursive expression grammar. Consider this broken input:

( x + y + ) + z
          ^

The initial parse fails at the ) token, since a term is missing. At this point the input array is only filled up to and including that token, i.e.:

( x + y + )

Suppose a repair inserts a NAME token before the ) token. The parser will then successfully parse until the end of the input stream, appending + and z to the input array. After the NAME token is removed again, the input array now looks like this:

( x + y + ) + z

Since we always restart the parser from the beginning of the input, the fact that the internal representation of the lazy input array is now two tokens longer doesn't matter to either the tokenizer or the parser.

Now let's look at how I decide what to insert. In particular, how does the repair process know to insert a NAME token in this example? The answer is that we use the expect API that forms the basic interface between the parser and the tokenizer. There are a number of specialized variations such as name and number, but these are just variations on a theme.

At various points during parsing, the parser calls expect with a specific token type. To get an idea of this, see my Medium blog post Visualizing PEG Parsing, in particular the little movie I recorded.

When the parser encounters the erroneous ) token, it will expect various tokens. Let's say that our grammar supports NAME, NUMBER and parenthesized expressions. It follows that after a + the parser will make three expect calls, with token types NAME, NUMBER and LPAR (the token type for ( -- tokens have a token type, which is essentially an enum).

Moreover, we know that the parser will make these expect calls in a consistent order -- there is no source of randomness in the parsing algorithm (not even Python's dictionary key hashing). If the grammar has the rule

term: NAME | NUMBER | '(' expr ')'

then the parser will always call expect(NAME), expect(NUMBER) and expect(LPAR), in that order.

Each of these will be called (and fail) before failing on our particular input. Note that there might also be succeeding expect calls at the error location. We'll ignore these. (Actually, I can prove that there won't be, but the proof depends on how we report errors, and is not relevant to the error recovery process.)
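For illustration, here is roughly what the code for that rule and the underlying expect method might look like in a pegen-style Python parser (a hand-written, simplified sketch, not actual generated code; mark/reset save and restore the token position after a failed alternative):

```python
from token import NAME, NUMBER, LPAR, RPAR

class Parser:
    """Simplified pegen-style parser skeleton (illustrative only)."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def mark(self):
        return self.tokenizer.pos

    def reset(self, pos):
        self.tokenizer.pos = pos

    def expect(self, token_type):
        """Consume and return the next token if it has the requested type;
        otherwise leave the position alone and signal failure with None."""
        tok = self.tokenizer.peek()
        if tok.type == token_type:
            return self.tokenizer.getnext()
        return None

    def term(self):
        # term: NAME | NUMBER | '(' expr ')'
        pos = self.mark()
        if (name := self.expect(NAME)) is not None:       # 1st expect call
            return name
        self.reset(pos)
        if (number := self.expect(NUMBER)) is not None:    # 2nd expect call
            return number
        self.reset(pos)
        if self.expect(LPAR) is not None:                  # 3rd expect call
            if (e := self.expr()) is not None and self.expect(RPAR) is not None:
                return e
        self.reset(pos)
        return None

    def expr(self):
        """The expr rule would go here; omitted from this sketch."""
        raise NotImplementedError
```

The point is simply that the order of the three expect calls is fixed by the order of the alternatives in the rule, which is what makes the repair attempts reproducible.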

The idea then is to attempt three insertion repairs in this case. We could do the following:

  1. Re-parse without any repairs, collecting all failing expect calls at the error location. (The error location is known from the initial, fast parse.)
  2. Given that we collected three failed expect calls, we will make three independent insertion repair attempts, and after each insertion we re-parse the input.
  3. For each re-parse attempt we record how far in the input array the parser proceeded before failing. We consider as interesting only those repairs that reached farthest into the input array. There may be several such repairs, and we consider them all equally successful. (A sketch of this procedure follows the list.)
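A rough sketch of that three-step procedure (driver code I wrote for illustration; collect_failed_expects, make_token and reparse are assumed helpers rather than pegen's actual API, the signature of clear_excess is likewise assumed, and InsertRepair is the class from the earlier sketch):

```python
def best_insertion_repairs(parser, tokens, error_pos):
    """Collect the failing expect() calls at the error location, try one
    insertion per requested token type, and keep the attempts that let the
    parser reach farthest into the input array."""
    # Step 1: re-parse without repairs, recording the token types asked for
    # by every expect() call that failed at the error location.
    wanted_types = collect_failed_expects(parser, error_pos)

    # Step 2: one independent insertion attempt per collected token type.
    attempts = []
    for token_type in wanted_types:
        repair = InsertRepair(error_pos, make_token(token_type))
        repair.apply(tokens)
        parser.clear_excess(error_pos)   # restore the memo cache
        reach = reparse(parser)          # how far did this re-parse get?
        repair.reverse(tokens)           # restore the array of input tokens
        attempts.append((reach, repair))

    # Step 3: keep only the repairs that reached farthest.
    if not attempts:
        return []
    best = max(reach for reach, _ in attempts)
    return [repair for reach, repair in attempts if reach == best]
```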

The actual algorithm is a bit more involved, because I didn't come up with the idea of collecting the failed expect calls until Pablo mentioned it. Instead, I use the following pseudo code (N is a sufficiently large number to prevent runaway repairs):

The effect is the same though:

This algorithm makes 4 extra parses, just like the version presented above (there: the re-parse that collects the failing expect calls, plus the three insertion attempts). (We could attempt to record failing expect calls during the initial, fast parse, but that would just slow down the fast parse, since we don't know a priori where the parse will fail -- but after the initial parse we do know that.)

Goodnight!