Identifying unexpected token error positions

SeanDS commented 3 years ago

I hope it's ok to ask a general question about pegen here. I've built a parser following the blog posts and code here, and it generally works really nicely but one thing I found missing in the blog is how to handle unexpected tokens. As far as I understand, the recursive descent parser will continue to backtrack on unexpected input until it reaches the first rule again, unless you define an explicit rule to handle particular errors. I assume there is some strategy to identify the token that caused the error, like with how Python's parser knows the error in the following line is the *:

>>> 1*
  File "<stdin>", line 1
    1*
     ^
SyntaxError: invalid syntax

How/where is pegen handling this sort of error? Or, if pegen doesn't handle this error, where is Python's parser handling it (since I know it does!)?

Aside: while trying a different invalid Python syntax example I got something unexpected:

$ echo "hi(" -n | python -

With 3.9.2 this gives the error

  File "<stdin>", line 2

    ^
SyntaxError: unexpected EOF while parsing

which seems to be not showing the line with the error (but it's marking it). The same behaviour happens when hi( is in a file. Is this a bug with Python's new parser's error handling (and if so, should I report it)?

pablogsal commented 3 years ago

> How/where is pegen handling this sort of error? Or, if pegen doesn't handle this error, where is Python's parser handling it (since I know it does!)?

We inject invalid rules and we have a mechanism in the C parser to abort the backtracking. For instance, check:

https://github.com/python/cpython/blob/master/Grammar/python.gram#L106
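To illustrate the idea of an invalid rule that aborts backtracking, here is a toy sketch in Python. It is not CPython's implementation: the `sequence` and `assignment` helpers, the token representation (a plain list of strings) and the error message are all assumptions made for the example. Ordinary alternatives report failure by returning `None`, which lets the caller backtrack, while the "invalid" alternative raises as soon as it recognises a known-bad shape.

```python
def sequence(tokens, pos, *preds):
    """Match each predicate against consecutive tokens; return the end index or None."""
    for pred in preds:
        if pos >= len(tokens) or not pred(tokens[pos]):
            return None  # ordinary failure: the caller may try another alternative
        pos += 1
    return pos


def is_equals(tok):
    return tok == "="


def assignment(tokens, pos=0):
    """Hypothetical rule: assignment: NAME '=' NUMBER | invalid_assignment"""
    # Alternative 1: NAME '=' NUMBER. Failing here only triggers backtracking.
    end = sequence(tokens, pos, str.isidentifier, is_equals, str.isdigit)
    if end is not None:
        return end
    # Alternative 2 (the "invalid" rule): NUMBER '='. Matching this known-bad
    # shape raises immediately, so no further alternatives are tried.
    if sequence(tokens, pos, str.isdigit, is_equals) is not None:
        raise SyntaxError(f"cannot assign to literal {tokens[pos]!r}")
    return None
```

Because the exception propagates through every enclosing rule, the parser stops at the point where the specific mistake was recognised instead of unwinding to the start rule with a generic failure.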

> which seems to be not showing the line with the error (but it's marking it). The same behaviour happens when hi( is in a file. Is this a bug with Python's new parser's error handling (and if so, should I report it)?

That is a tokenizer error: the tokenizer has reached the end of the source while expecting more tokens. This error has been improved in Python 3.10 (when it is possible to retain the source, like when using "-c"):

 ./python.exe -c "hi("
  File "<string>", line 1
    hi(
      ^
SyntaxError: '(' was never closed

gvanrossum commented 3 years ago

Without special error rules you can still do a decent job. Just make the error point at the last token read (assuming your tokenizer is “lazy”, i.e. it only tokenizes as far as the parser needs). In most cases this gives adequate errors.
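A minimal sketch of that approach, assuming a hand-written backtracking parser for arithmetic expressions on top of Python's `tokenize` module. The `LazyTokenizer` class, the `parse_*` functions and the `"<input>"` filename are hypothetical names for this example, not pegen's API. The tokenizer only pulls a token from the generator when the parser asks for one, so when the parse fails the last token in its buffer is the farthest point the parser reached.

```python
import io
import tokenize


class LazyTokenizer:
    """Pulls tokens from the generator only when the parser asks for them."""

    def __init__(self, source):
        self._gen = tokenize.generate_tokens(io.StringIO(source).readline)
        self._tokens = []  # every token read so far
        self._pos = 0      # the parser's current index into self._tokens

    def peek(self):
        while self._pos >= len(self._tokens):
            tok = next(self._gen)
            if tok.type not in (tokenize.NL, tokenize.COMMENT):
                self._tokens.append(tok)
        return self._tokens[self._pos]

    def next(self):
        tok = self.peek()
        self._pos += 1
        return tok

    def mark(self):
        return self._pos

    def reset(self, pos):
        self._pos = pos  # backtracking never discards tokens already read

    def last_read(self):
        return self._tokens[-1]


def parse_atom(tz):
    """atom: NUMBER | '(' expr ')'"""
    pos = tz.mark()
    tok = tz.peek()
    if tok.type == tokenize.NUMBER:
        tz.next()
        return True
    if tok.string == "(":
        tz.next()
        if parse_expr(tz) and tz.peek().string == ")":
            tz.next()
            return True
    tz.reset(pos)
    return False


def parse_binary(tz, operand, operators):
    """operand (operator operand)*, backtracking over a dangling operator."""
    if not operand(tz):
        return False
    while True:
        pos = tz.mark()
        if tz.peek().string in operators:
            tz.next()
            if operand(tz):
                continue
        tz.reset(pos)
        return True


def parse_term(tz):
    return parse_binary(tz, parse_atom, ("*", "/"))


def parse_expr(tz):
    return parse_binary(tz, parse_term, ("+", "-"))


def parse(source):
    """start: expr NEWLINE ENDMARKER"""
    tz = LazyTokenizer(source)
    ok = (
        parse_expr(tz)
        and tz.next().type == tokenize.NEWLINE
        and tz.next().type == tokenize.ENDMARKER
    )
    if not ok:
        # Report the farthest token the parser looked at, not where it backed up to.
        tok = tz.last_read()
        line, col = tok.start
        raise SyntaxError("invalid syntax", ("<input>", line, col + 1, tok.line))
    return True
```

For the input `1 + * 2`, the parser backtracks all the way to the start rule, but the reported position is the `*`, because that is the last token the tokenizer was asked to produce. For `1*` the reported token is the end-of-line token just after the `*`, since that is where an operand was expected.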

SeanDS commented 3 years ago

@pablogsal, I was rather looking for information on how the parser handles cases where there isn't a special error rule. But thanks for the input on the tokeniser error, and glad to hear it's apparently fixed in 3.10.

@gvanrossum Thanks, that works. Earlier I found the part of the code in the repo here that is doing what you say, and I just finished getting it to work for my parser - seems to do the job!

I really enjoyed reading the blogs BTW. If I hadn't found them I'd still be hacking away at my project's old LALR(1) parser.

pablogsal commented 3 years ago

> and glad to hear it's apparently fixed in 3.10.

It is not technically "fixed" but "improved". Notice the old error is still correct: there was an unexpected end-of-file token while parsing. With our parser, it is not always easy to emit the improved version because we don't always have all the text that we parsed (for instance, when reading from stdin).

we-like-parsers / pegen

Identifying unexpected token error positions #7