we-like-parsers / pegen_experiments

Experiments for the official PEG parser generator for Python
https://github.com/python/cpython/tree/master/Tools/peg_generator

Thoughts on bpo41659 (SyntaxError on x{a}) #242

Open lysnikolaou opened 4 years ago

lysnikolaou commented 4 years ago

First of all, I've missed working with you all. I'm really looking forward to catching up and spending time with you during the sprints.

Error Reporting

As far as Guido's approach to error reporting goes, I've spent some time this last week playing around with the code and trying to understand everything that's changed in this new version. I had never played around with the Python parser too much before, and I like the speed of the refactoring.

So, I think I now have a fair understanding of the approach. I plan to spend some time skimming the paper and trying to identify whether we could do something better. Porting the whole thing to C might indeed be a nice sprint project.

bpo41659

I spent most of my time this last week on bpo41659 and how we could actually fix it, which is not that trivial. My concern is not about this specific issue, but rather about how having these invalid_* rules could make grammar development much more difficult and how they affect the parser's performance. So, I went ahead and started coding up a solution that does all the parsing in two distinct steps (sounds familiar?):

  1. Running the parser with all the invalid_* rules disabled
  2. If step 1 fails, running the parser a second time, this time with the invalid_* rules enabled.

I think that @pablogsal had mentioned something exactly like this in the past and now that the error recovery mechanism follows the same principle, it seemed like the most promising approach.

I've now got something very basic ready that works for everything except the REPL. I did some benchmarking with our time_stdlib target and we actually gain a speedup of about 7% on the first run, which is not huge, but is definitely something.

Not sure if this is going to be relevant after we're done with Guido's error reporting approach, but I guess they could play nicely together: in the first run we just parse, and in the second run we combine both approaches to get both the existing error messages and a suggestion on how to fix the SyntaxError.

pablogsal commented 4 years ago

Not sure if this is going to be relevant after we're done with Guido's error reporting approach, but I guess they could play nicely together: in the first run we just parse, and in the second run we combine both approaches to get both the existing error messages and a suggestion on how to fix the SyntaxError.

I think that this would complement the extended error reporting nicely because we will need to do a second pass (technically, multiple ones) anyway. Ideally, in the first pass, we won't be adding the extra items for the error-reporting tokens into the memo cache, so they don't affect memory and/or speed.
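Roughly what I mean, as a sketch only (the flag and the memo-insert helper here are made-up names, not existing pegen code):

if (p->in_error_reporting_pass) {
    // Only memoize the results of the error-reporting rules in the pass
    // that actually needs them; in the first pass we skip this entirely.
    // Both in_error_reporting_pass and memo_insert() are made-up names.
    memo_insert(p, mark, error_rule_type, result);
}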

The question for bpo41659 is how elegantly we can switch these rules on and off. Could we add some attribute to the parser struct and then check for that attribute in those rules, bailing out if it is not activated? Maybe this is similar to your current approach.

lysnikolaou commented 4 years ago

Could we add some attribute to the parser struct and then check for that attribute in those rules, bailing out if it is not activated?

Exactly. My current approach works exactly like that, except for the fact that the check for this struct field happens in the caller, rather than inside the invalid rule itself. This way we get to skip more.

In more detail:

I've added a new field in the parser, called call_invalid_rules. I set that to 0 by default and if the first run fails, I create a new parser struct with p->call_invalid_rules = 1. And then I've tweaked the generator to do the following:

Remember how every alternative has its own scope block?

{
    // variables
    if (something) {
        // action
    }
    // reset
}

The generator now just looks at the Alt object and if it's only got a single child that starts with 'invalid', it does this instead:

if (p->call_invalid_rules) {
    // variables
    if (some_invalid_rule()) {
        // RAISE_SYNTAX_ERROR alike action
    }
    // reset
}
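At the top level, the two runs are wired together roughly like this (just a sketch; make_parser, run_parser, reset_input and struct input_state are placeholders for the real entry points, call_invalid_rules is the new field):

// Sketch of the two-pass driver; everything except p->call_invalid_rules
// is a placeholder name.
static mod_ty
parse_with_invalid_fallback(struct input_state *input)
{
    Parser *p = make_parser(input);
    p->call_invalid_rules = 0;            // first run: invalid_* alternatives are skipped
    mod_ty result = run_parser(p);
    if (result != NULL || !PyErr_ExceptionMatches(PyExc_SyntaxError)) {
        return result;                    // parsed fine, or failed with a non-syntax error
    }
    PyErr_Clear();
    reset_input(input);                   // new tokenizer for strings, fseek back for files
    Parser *p2 = make_parser(input);
    p2->call_invalid_rules = 1;           // second run: the invalid_* rules raise the real error
    return run_parser(p2);
}
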
pablogsal commented 4 years ago

I am curious, why does this not work out of the box in the REPL?

lysnikolaou commented 4 years ago

Because of the way I reset everything if the parser fails.

For strings, it's pretty simple: we just create a new parser with the same string by calling PyTokenizer_From(String|UTF8). But it gets a bit trickier for files, since one of the pegen C API functions just accepts a FILE *. There I get the position of the file with ftell before the first run and, if the parser fails, I reset it with fseek, which seems to work fine for most files, but not for the REPL.
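The file reset itself is just these two calls (a fragment, with the actual parser runs elided):

long start = ftell(fp);              // before the first run
// ... first run with invalid_* rules disabled fails ...
fseek(fp, start, SEEK_SET);          // rewind so the second run re-tokenizes from the start
// For the REPL fp is stdin, which isn't really seekable, so this rewind
// doesn't behave the way we'd want.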

I've not yet dug deep enough to understand exactly where it goes wrong, but I assume this is because of how REPL parsing works: the pegen C API function gets called with stdin and then waits for input deep within the tokenizer.

I guess I need to rework this from scratch and I only did it the easy way to get some early results.

One solution I quickly gave up on is to just pass the tokens array to the new parser struct, but that won't work by itself, because most invalid_ rules will need to look further than their counterparts that failed during the first run. I think that some hybrid solution, where the second run reuses the tokens array until it needs to reach further into the input, would be the most performant one.
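Something like this is what I have in mind for the hybrid approach (purely a sketch; all the field names are made up):

static Token *
next_token(Parser *p)
{
    // Reuse tokens the failed first run already produced...
    if (p->fill < p->first_run_ntokens) {
        return p->first_run_tokens[p->fill++];
    }
    // ...and only go back to the tokenizer once an invalid_* rule needs to
    // look further than the first run ever got.
    return fetch_token_from_tokenizer(p);
}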

gvanrossum commented 4 years ago

Let me rename this issue to be about just bpo41659. We can discuss the paper and my approach in a separate issue.

gvanrossum commented 4 years ago

Speaking of invalid rules, could we rename the one rule that's named incorrect_* to invalid_* too?

I agree with Pablo that the "invalid" rules and my approach are complementary. An automated approach can never come up with decent errors for things like "cannot assign to literal". But you can't add "invalid" rules that will come up with decent recovery for random things like (a+) or (a:).

We should benchmark -- is the parser without the "invalid" rules significantly (or even just measurably) faster than the one with them?

If it is we should totally do this -- one of the biggest costs of parsing is tokenization, and for the second parse the tokenization is already done. I'm not sure what we can salvage of the memo content -- perhaps the approach from clear_excess will work (but it requires storing an extra int for each memo entry, which is expensive, so I kind of doubt it).