we-like-parsers / pegen

PEG parser generator for Python
https://we-like-parsers.github.io/pegen/
MIT License

data: make a full Python parser for python grammar #11

Closed MatthieuDartiailh closed 3 years ago

MatthieuDartiailh commented 3 years ago

I am making progress. But I have a couple of questions.

First, I had to include the fix that makes the parser return a single item rather than a list containing a single item. It is the first commit; I did not update the tests, though. I will try to make a separate PR for that.

Second, I am unsure how to make pegen happy regarding the start rule. It was manually set to file before, but that does not seem completely correct. What would be the right way to do this?

Finally, I would be interested in any strategy for testing the parser, and then in how to properly handle line information.

pablogsal commented 3 years ago

Finally, I would be interested in any strategy for testing the parser, and then in how to properly handle line information.

For testing the parser, the best way is to throw a lot of Python files at it (we used to use the whole stdlib or the top 5000 PyPI packages for this) and then possibly compare the output (or just check whether it crashes). If you plan to generate exactly the same AST, then it makes sense to compare the output with the built-in parser.

Second, I am unsure how to make pegen happy regarding the start rule. It was manually set to file before, but that does not seem completely correct. What would be the right way to do this?

Not sure I fully understand the problem. I need some time to investigate exactly what's going on but it would be helpful if you could elaborate a bit more and maybe provide some examples :)

First, I had to include the fix that makes the parser return a single item rather than a list containing a single item

Could you also elaborate on this?

Currently, I am quite overburdened with work on CPython, so apologies if it takes me a bit longer than usual to respond here :)

MatthieuDartiailh commented 3 years ago

Finally, I would be interested in any strategy for testing the parser, and then in how to properly handle line information.

For testing the parser, the best way is to throw a lot of Python files at it (we used to use the whole stdlib or the top 5000 PyPI packages for this) and then possibly compare the output (or just check whether it crashes). If you plan to generate exactly the same AST, then it makes sense to compare the output with the built-in parser.

I will look into how to do that then.
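For reference, a rough sketch of such a differential test might look like the following (parse_source is a hypothetical stand-in for the generated parser's entry point, not an existing pegen function):

import ast
import pathlib

# Hypothetical import: replace with the actual generated parser's entry point.
from my_generated_parser import parse_source


def check_against_builtin(root: str) -> None:
    # Parse every .py file under `root` with both parsers and compare the ASTs.
    for path in pathlib.Path(root).rglob("*.py"):
        source = path.read_text(encoding="utf-8")
        try:
            expected = ast.dump(ast.parse(source))
        except SyntaxError:
            continue  # skip files the running interpreter itself cannot parse
        got = ast.dump(parse_source(source))
        assert got == expected, f"AST mismatch for {path}"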

Second, I am unsure how to make pegen happy regarding the start rule. It was manually set to file before, but that does not seem completely correct. What would be the right way to do this?

Not sure I fully understand the problem. I need some time to investigate exactly what's going on but it would be helpful if you could elaborate a bit more and maybe provide some examples :)

Pegen requires a start rule (or trailers). The Python grammar has multiple entry points (file, interactive, eval). Looking at the original C grammar, I realize now that the start rule is set dynamically in that case. I am not sure what the best way to emulate that would be, but for testing I can simply keep start: file.

First, I had to include the fix that makes the parser return a single item rather than a list containing a single item

Could you also elaborate on this?

I was referring to the point discussed in #6 and for which I opened #14.

Currently, I am quite overburdened with work on CPython, so apologies if it takes me a bit longer than usual to respond here :)

MatthieuDartiailh commented 3 years ago

I have reached a somewhat puzzling point where trying to parse import sys leads to it being matched as an expression, which then just bubbles up that way and at some point gets converted to None. I tried to step through with pdb, but I am not sure I fully follow the logic. I feel like the parser should realize that matching NAME won't go anywhere and then explore different options, but that does not seem to be what happens. Parsing this appears to work, though:

def a():
    pass

a: int
a = 1
a += 1
a *= 1
a /= 1
a //= 1
a %= 1

a + 1
a * 1
a / 1
a // 1
a % 1

a < 1
a > 1
a <= 1
a >= 1
a == 1
a != 1

a = 2
a << 1
a >> 1
a ^ 1
a ^= 1

a = []
a = {}

but the AST test that just checks the str representation complains about the context of some Name nodes being explicitly set to Load. I will look into making the AST comparison more robust at some point.

Any help would be very much appreciated.

lysnikolaou commented 3 years ago

At some point in time, we added hard keywords to the C parser, which means that keywords do not get parsed as NAMEs, but are themselves tokens. After that we moved the star_expressions alternative in the simple_stmt rule all the way to the top, in order to speed up the parsing, since expressions are the most common constructs in Python source code.

That's probably what's creating the problem. We have not implemented hard keywords for the python generator, which means that import gets parsed as a name and therefore as an expression. That's why star_expressions succeeds, way before import_stmt is even reached.

Does that make sense or have I missed something?

If you want to have a better look at how keywords are parsed in the C parser, have a look at the keyword list and how it gets generated.
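As a quick illustration of why this matters (plain stdlib code, not pegen internals): the tokenize module reports keywords as ordinary NAME tokens, so without a dedicated check import looks just like any other name to the parser.

import io
import keyword
import tokenize

# The stdlib tokenizer emits `import` as a plain NAME token, which is why an
# expression alternative can match it before import_stmt is ever tried.
for tok in tokenize.generate_tokens(io.StringIO("import sys\n").readline):
    if tok.type == tokenize.NAME:
        print(tok.string, "-> keyword?", keyword.iskeyword(tok.string))
# import -> keyword? True
# sys -> keyword? False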

MatthieuDartiailh commented 3 years ago

Thanks a lot @lysnikolaou! Yes, that makes sense. Browsing through the C implementation, am I right to assume that the key point is https://github.com/python/cpython/blob/3bb19873abd572879cc9a8810b1db9db1f704070/Parser/pegen.c#L629, which is in charge of transforming the tokens returned by the tokenizer and turning bare NAME tokens into the appropriate keyword token type?

lysnikolaou commented 3 years ago

Exactly, yes! More specifically the function call to _get_keyword_or_name_type here.

MatthieuDartiailh commented 3 years ago

Thanks for the confirmation.

MatthieuDartiailh commented 3 years ago

I now have a better idea of how these things are connected, but I still have a couple of questions:

pablogsal commented 3 years ago
  • Regarding soft keywords (I am not sure Python has any, but I would need them for my project): looking at the c_generator, it looks like a soft keyword is not expected to be surrounded by ''.

With pattern matching Python has match and case as soft keywords.

Am I reading that properly, and does that mean that 'import' NAME would treat import as a keyword, while "import" NAME would mean import is a soft keyword, since it has to be a string?

Yup. Check for instance https://github.com/python/cpython/blob/master/Grammar/python.gram#L212

  • I am trying to figure out the best way to handle the token stream transformation. It seems natural to do that operation in the tokenizer. Pegen depends quite tightly on the Python tokenizer, which makes altering the tokenizer directly cumbersome (which is, however, something I wish could be done to add extra tokens, but I digress). The current approach is to wrap the bare Python tokenizer with a version that implements the proper methods, and handling the keyword transformation there would work fine. However, I like the idea of extracting the keywords from the grammar to avoid duplicating things. As a consequence, since the user is expected to create their own tokenizer, would it make sense to collect keywords on the parser and allow the tokenizer to be initialized with a list of keywords taken from the parser?

Yeah, pegen is tightly coupled with the Python tokenizer and could be quite difficult to untangle, since that is baked into its base design. I think the best approach is to extract the keywords from the grammar and make the generated Python parser check for them, as we are doing with the C parser, as @lysnikolaou mentioned. Maybe I am not understanding your proposal correctly, but if your concern is how to enforce keywords, then I think the best approach is that one. Check for example:

https://github.com/python/cpython/blob/af50c84643ce21cfbdfdabbdfae6bd5e1368c542/Tools/peg_generator/pegen/c_generator.py#L114-L134

  • Related to the previous one: should keywords be declared upfront, or would identifying them as we parse the grammar be fine?

They should be extracted from the grammar.

MatthieuDartiailh commented 3 years ago
  • Regarding soft keywords (I am not sure Python has any, but I would need them for my project): looking at the c_generator, it looks like a soft keyword is not expected to be surrounded by ''.

With pattern matching Python has match and case as soft keywords.

I started from the python.gram in the repo and pattern matching was not there yet, so I missed that. Thanks for the pointer.

Am I reading that properly, and does that mean that 'import' NAME would treat import as a keyword, while "import" NAME would mean import is a soft keyword, since it has to be a string?

Yup. Check for instance https://github.com/python/cpython/blob/master/Grammar/python.gram#L212

Thanks

  • I am trying to figure out the best way to handle the token stream transformation. It seems natural to do that operation in the tokenizer. Pegen depends quite tightly on the Python tokenizer, which makes altering the tokenizer directly cumbersome (which is, however, something I wish could be done to add extra tokens, but I digress). The current approach is to wrap the bare Python tokenizer with a version that implements the proper methods, and handling the keyword transformation there would work fine. However, I like the idea of extracting the keywords from the grammar to avoid duplicating things. As a consequence, since the user is expected to create their own tokenizer, would it make sense to collect keywords on the parser and allow the tokenizer to be initialized with a list of keywords taken from the parser?

Yeah, pegen is tightly coupled with the Python tokenizer and could be quite difficult to untangle, since that is baked into its base design. I think the best approach is to extract the keywords from the grammar and make the generated Python parser check for them, as we are doing with the C parser, as @lysnikolaou mentioned. Maybe I am not understanding your proposal correctly, but if your concern is how to enforce keywords, then I think the best approach is that one. Check for example:

https://github.com/python/cpython/blob/af50c84643ce21cfbdfdabbdfae6bd5e1368c542/Tools/peg_generator/pegen/c_generator.py#L114-L134

My concern is that the token type needs to be converted from NAME to a dedicated type, as is done in _PyPegen_fill_token (along with other operations), and that feels like something that should be done inside the tokenizer rather than the parser; hence my idea to provide a way to do it on the tokenizer wrapper, since we are already adding the ability to peek to the tokenizer. Ideally, I would like to document the requirements on the tokenizer well enough to allow people to use a different one.

  • Related to the previous one: should keywords be declared upfront, or would identifying them as we parse the grammar be fine?

They should be extracted from the grammar.

Thanks

lysnikolaou commented 3 years ago

My concern is that the token type needs to be converted from NAME to a dedicated type, as is done in _PyPegen_fill_token (along with other operations), and that feels like something that should be done inside the tokenizer rather than the parser; hence my idea to provide a way to do it on the tokenizer wrapper, since we are already adding the ability to peek to the tokenizer.

Not doing this in the tokenizer was an explicit design decision, though. While your feeling makes sense (we had this discussion when doing this), the general rule is that the tokenizer should know nothing about reserved keywords. Its only job is to identify different "sentence parts", without knowing what they mean in the context in which they're parsed.

For example, a.import() should be tokenized as:

NAME a
DOT .
NAME import
LPAR (
RPAR )

It is up to the parser to identify that import is a reserved keyword and fail here.
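A minimal sketch of that parser-side check (a hypothetical class, not the code pegen generates): the token stream still contains plain NAME tokens, and the parser only accepts one when its string is the keyword the grammar alternative asked for.

from tokenize import NAME


class KeywordAwareParser:
    # Hypothetical sketch: keyword knowledge lives in the parser, not the tokenizer.

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._index = 0

    def mark(self):
        return self._index

    def reset(self, index):
        self._index = index

    def expect_keyword(self, kw):
        tok = self._tokens[self._index]
        if tok.type == NAME and tok.string == kw:
            self._index += 1
            return tok
        return None  # soft failure: the caller backtracks and tries another alternative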

MatthieuDartiailh commented 3 years ago

Thanks for the clarification @lysnikolaou. Happy to know that my feeling was not completely off. I will see how to implement the logic that way in the Python parser/generator then. Thanks for your advice.

MatthieuDartiailh commented 3 years ago

Good news: I can now parse import statements!

However, I would appreciate a quick review of the commit adding support for keywords, because it is not without issues. Basically, I followed the idea of the C generator and assigned new integers as token types. Even though I agree that this should be done in the parser, I ended up doing it inside the tokenizer wrapper, since that seems the most efficient way to do it (this way the transformation is done once and only once). The keywords and soft keywords are collected from the grammar, but due to how they are collected they are only listed at the very end of the generated parser.

One issue I have is that printing anything involving TokenInfo fails because the type is unknown. Since we are basically doing string matching, I could use the OP type and still get the proper matching, but that does not seem ideal. I am open to suggestions.

When I get more time, I will go back to testing the parser more extensively.

MatthieuDartiailh commented 3 years ago

I found a more elegant way to handle keywords by using the NAME matching rule. This way there is no need to touch the tokenizer, and no modification of the tokens that would make them unprintable.

Since, to achieve this, I had to make a number of changes not directly related to adding a full-blown Python grammar to the project, I would like to know whether you want me to split that part off and make a separate PR for the Python parser.

Regarding testing for the Python parser, should I focus on updating the scripts in scripts, or should I write more systematic tests under tests (maybe tests/python)?

lysnikolaou commented 3 years ago

I found a more elegant way to handle keywords by using the NAME matching rule. This way there is no need to touch the tokenizer, and no modification of the tokens that would make them unprintable.

Happy to hear that.

Since, to achieve this, I had to make a number of changes not directly related to adding a full-blown Python grammar to the project, I would like to know whether you want me to split that part off and make a separate PR for the Python parser.

Truth is that would help a bunch with the review. If that's not too much work for you, I'd be really happy to review a PR with only the changes for keyword support.

Regarding testing for the Python parser, should I focus on updating the scripts in scripts, or should I write more systematic tests under tests (maybe tests/python)?

I guess the answer is both. For now, though, let's just focus on writing as many tests as possible for the constructs the parser can parse. For example, let's write tests checking that the output AST is correct for import statements, and then incrementally build more tests as we support more and more things. When we've got a first working version of the whole grammar, we can see if it makes sense to modify and run something under /scripts.

MatthieuDartiailh commented 3 years ago

OK, then I will try to make a new PR with the change to support keywords, plus other minor stuff (such as not creating a method called async!), each with its own dedicated commit. And I will keep working on adding tests once I am done.

Currently, the one very broken thing is the handling of strings. I need to look into it. I may try a trick similar to https://github.com/nucleic/enaml/blob/main/enaml/core/parser/parser36.py#L98, since we are parsing Python in Python and I do not fancy spending too much time on parsing f-strings (but I could be convinced otherwise).
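One possible shape for that kind of shortcut, sketched under the assumption that delegating to the builtin compiler is acceptable (this is not the enaml helper itself):

import ast


def parse_string_literal(literal: str) -> ast.expr:
    # `literal` is the raw source text of the literal, e.g. 'f"x={x!r}"'.
    # Reuse the builtin compiler instead of re-implementing (f-)string parsing;
    # the result is a Constant or JoinedStr node.
    return ast.parse(literal, mode="eval").body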

MatthieuDartiailh commented 3 years ago

See #16 for a smaller PR for keyword support

MatthieuDartiailh commented 3 years ago

I made some more progress (you can check the files under test/data to get an idea of what we can currently parse successfully). Right now I have an issue with parsing f(a for a in b), while 1 + f(a for a in b) parses without issues. I checked the current Python grammar and noticed that some rules make use of &&, which is not currently supported here. I will try to diff pegen against the version currently in CPython main and see what has not yet been reflected in pegen.

MatthieuDartiailh commented 3 years ago

After adding support for the Forced node in the grammar definition and making some changes found in the current Python grammar, I can now parse the test files in the test directory and parse the source code of pegen! I will try to add more tests and run on some popular PyPI packages to get a better idea of how robust things are.

Next I will try to add pattern matching and line numbers in the AST. Regarding this last point, I am interested in any suggestion to make this process less verbose, and in ways to test the output: since the repr of the AST will not be enough anymore, I imagine I will have to have tests that walk the AST in that case.
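One low-effort way to test locations without hand-walking both trees (a sketch; parse_source again stands in for the generated parser's hypothetical entry point): ast.dump can fold the location attributes into its output, so the comparison stays a plain string comparison.

import ast


def assert_same_ast_with_locations(source: str, parse_source) -> None:
    # include_attributes=True adds lineno/col_offset/end_lineno/end_col_offset.
    expected = ast.dump(ast.parse(source), include_attributes=True)
    got = ast.dump(parse_source(source), include_attributes=True)
    assert got == expected, f"AST/location mismatch for {source!r}"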

MatthieuDartiailh commented 3 years ago

Note that black dislikes test files with Python 3.9+-only syntax. Should we exclude those files or bump the Python version used to run black?

pablogsal commented 3 years ago

Sorry for not participating here these days, but I am very swamped with the beta release of Python 3.10. I just wanted to say that this effort could actually be very useful to black in the future, as we could add actions that create a CFG instead of an AST, allowing black to support Python 3.10+ syntax.

Thanks for all the patience @MatthieuDartiailh! Your work is great! Unfortunately, both @lysnikolaou and I are quite busy, so it will take us a bit longer to review 😅

MatthieuDartiailh commented 3 years ago

No problem. I had to work on other projects impacted by changes in the latest alpha for a while, so I made only slow progress.

MatthieuDartiailh commented 3 years ago

I made some extra progress. I corrected a bunch of issues with async and decorators (I included a version check for the pre-3.9 syntax). The python.gram-based parser can now parse the 100 PyPI packages tested by the included script! If you are aware of additional corner cases worth testing, let me know.

As a side note, for async I had to revert to identifying async and await as keywords, since tokenize sends back NAME tokens rather than ASYNC tokens.

So I am left with pattern matching and line numbers to support. For the former, where can I find test files? I assume some exist within the CPython repo, but I could not find them. For line numbers, I am open to any suggestion on how to make it efficient and how best to test for correctness.

isidentical commented 3 years ago

So I am left with pattern matching and line numbers to support. For the former, where can I find test files? I assume some exist within the CPython repo, but I could not find them.

The file that we use for ast.unparse is Lib/test/test_patma.py (2000+ lines of pattern matching statements).

MatthieuDartiailh commented 3 years ago

Thanks for the pointer @isidentical, I will try to have a look.

MatthieuDartiailh commented 3 years ago

I have not looked at more serious testing yet, but I tried to add line locations in the AST using a tactic similar to what is used in the c_generator. I also mirrored the most recent changes from CPython regarding error reporting, but I still have more work to do to grab the sources, and also to allow selecting the entry-point rule when creating a new parser instance.

As usual I am happy for any comment.

MatthieuDartiailh commented 3 years ago

Regarding testing, I usually use pytest; is that fine with you, before I add too much? @pablogsal @lysnikolaou

I have some basic testing working with correct line numbers. It took some patching for strings but I have no better idea at the moment.

Now that the soft keyword PR is in, let me know which feature it would make sense to extract next from this giant PR.

pablogsal commented 3 years ago

Regarding testing, I usually use pytest; is that fine with you, before I add too much? @pablogsal @lysnikolaou

Absolutely, go ahead with pytest!

MatthieuDartiailh commented 3 years ago

While trying to add support for the latest changes to python.gram, I hit an issue with how to handle SOFT_KEYWORD. I can add a rule to the base parser class, but then the grammar validator complains.

Should I add this to the tokens manually? Also, on the Python side we do not need the keys of token.tok_name, so I am wondering whether this argument should not simply be a list or a set.

Some of the tests I added are not passing, but I will look into this in the next few days and try to add some coverage measurement to be able to add more tests.

As a side note @pablogsal, what is the purpose of the rule invalid_with_stmt, which has no production? I would have expected a syntax error.

pablogsal commented 3 years ago

As a side note @pablogsal, what is the purpose of the rule invalid_with_stmt, which has no production? I would have expected a syntax error.

Ah, that is using one of the sneaky changes we made for syntax errors: the && operator. It fails the parser in place, without backtracking, if that token is not found. Notice the rule is:

| [ASYNC] 'with' '(' ','.(expressions ['as' star_target])+ ','? ')' &&':'
| [ASYNC] 'with' ','.(expression ['as' star_target])+ &&':'

so it is basically a way to emit an expected ":" error. The original idea was to add that in the rule itself instead of in one of the invalid_ rules, but things got complicated over time.

Honestly, I may revert that, because it is very difficult to spot correctly, and just transform it into a normal, more verbose production.
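For readers following along, the gist of a forced expectation in a Python parser can be sketched like this (a hypothetical helper; parser.expect is assumed to be the usual soft expect that returns None on a mismatch):

def expect_forced(parser, token_type: str):
    # The `&&` behaviour: fail the whole parse on the spot instead of
    # returning None and letting the parser backtrack.
    tok = parser.expect(token_type)
    if tok is None:
        raise SyntaxError(f"expected {token_type!r}")
    return tok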

MatthieuDartiailh commented 3 years ago

@pablogsal thanks! I guess I was mostly led astray by the fact that you have explicit rules to raise errors for a missing :, but also use force_expect in other places, and then this rule, which is kind of in between. Actually, when writing tests I was confused by the fact that some constructs had no error for a missing : but did have errors for a missing indent.

I think that using && for forced expectations is fine, but it would need to be used in a more systematic manner to avoid confusion.

pablogsal commented 3 years ago

@pablogsal thanks! I guess I was mostly led astray by the fact that you have explicit rules to raise errors for a missing :, but also use force_expect in other places, and then this rule, which is kind of in between. Actually, when writing tests I was confused by the fact that some constructs had no error for a missing : but did have errors for a missing indent.

I think that using && for forced expectations is fine, but it would need to be used in a more systematic manner to avoid confusion.

That cannot be done: where it is not used, it is because stopping the parsing forcefully would lead to problems. That was the reason we ported some usages of && to explicit checks. Especially when the parser goes into a left-recursive rule, it can advance too far without that being an error.

MatthieuDartiailh commented 3 years ago

@pablogsal

Is there a typo in the following rules, and should it not be t_atom in the t_primary rule? I am asking because, while trying to improve the coverage testing of the parser, I could not figure out how to reach the target rule.

targets[asdl_expr_seq*]: a[asdl_expr_seq*]=','.target+ [','] { a }
target[expr_ty] (memo):
    | a=t_primary '.' b=NAME !t_lookahead { _PyAST_Attribute(a, b->v.Name.id, Store, EXTRA) }
    | a=t_primary '[' b=slices ']' !t_lookahead { _PyAST_Subscript(a, b, Store, EXTRA) }
    | t_atom
t_primary[expr_ty]:
    | a=t_primary '.' b=NAME &t_lookahead { _PyAST_Attribute(a, b->v.Name.id, Load, EXTRA) }
    | a=t_primary '[' b=slices ']' &t_lookahead { _PyAST_Subscript(a, b, Load, EXTRA) }
    | a=t_primary b=genexp &t_lookahead {
        _PyAST_Call(a, CHECK(asdl_expr_seq*, (asdl_expr_seq*)_PyPegen_singleton_seq(p, b)), NULL, EXTRA) }
    | a=t_primary '(' b=[arguments] ')' &t_lookahead {
        _PyAST_Call(a,
                 (b) ? ((expr_ty) b)->v.Call.args : NULL,
                 (b) ? ((expr_ty) b)->v.Call.keywords : NULL,
                 EXTRA) }
    | a=atom &t_lookahead { a }  <------------------- here
t_lookahead: '(' | '[' | '.'
t_atom[expr_ty]:
    | a=NAME { _PyPegen_set_expr_context(p, a, Store) }
    | '(' a=target ')' { _PyPegen_set_expr_context(p, a, Store) }
    | '(' b=[targets] ')' { _PyAST_Tuple(b, Store, EXTRA) }
    | '[' b=[targets] ']' { _PyAST_List(b, Store, EXTRA) }
pablogsal commented 3 years ago

Is there a typo in the following rules, and should it not be t_atom in the t_primary rule? I am asking because, while trying to improve the coverage testing of the parser, I could not figure out how to reach the target rule.

I don't think so. If I remove it from the source:

diff --git a/Parser/parser.c b/Parser/parser.c
index d49bba1549..f38cc14595 100644
--- a/Parser/parser.c
+++ b/Parser/parser.c
@@ -17597,7 +17597,6 @@ target_rule(Parser *p)
 //     | t_primary '[' slices ']' &t_lookahead
 //     | t_primary genexp &t_lookahead
 //     | t_primary '(' arguments? ')' &t_lookahead
-//     | atom &t_lookahead
 static expr_ty t_primary_raw(Parser *);
 static expr_ty
 t_primary_rule(Parser *p)

I get this parsing failure:

 File "/home/pablogsal/github/python/master/./setup.py", line 29
    sys.modules['subprocess'] = _bootsubprocess
                              ^
SyntaxError: invalid syntax
lysnikolaou commented 3 years ago

Nope, it's not a typo. It was done by me in we-like-parsers/pegen_experiments#141 and, if I recall correctly (it's been a while since then), it aims to correctly handle the ctx argument of the AST node.

The rule without actions would just include atom, as you correctly identified. However, because of the left-recursiveness of the thing, we had to use two distinct rules: one for cases where there is a single Attribute (or Subscript) node, which has to have a Store context, and one for cases where there are more nodes (like a.b.c), where the outermost node needs to be a Store while the inner ones are Loads.

Notice how ast.parse('a.b = 0') produces this:

Module(
    body=[
        Assign(
            targets=[Attribute(value=Name(id="a", ctx=Load()), attr="b", ctx=Store())], <-- This is a Store here
            value=Constant(value=0),
        )
    ],
    type_ignores=[],
)

while ast.parse('a.b.c = 0') produces one with a Load attribute nested within a Store one:

Module(
    body=[
        Assign(
            targets=[
                Attribute(
                    value=Attribute(
                        value=Name(id="a", ctx=Load()), attr="b", ctx=Load() <-- The inner attr is a `Load`
                    ),
                    attr="c",
                    ctx=Store(),  <-- but the outermost one is a `Store`
                )
            ],
            value=Constant(value=0),
        )
    ],
    type_ignores=[],
)
MatthieuDartiailh commented 3 years ago

Thanks for the clarification. However, I was not suggesting removing it, but rather having t_atom instead of atom, since otherwise t_atom and target appear unreachable to me. Or am I missing something else?

lysnikolaou commented 3 years ago

Thanks for the clarification. However, I was not suggesting removing it, but rather having t_atom instead of atom, since otherwise t_atom and target appear unreachable to me. Or am I missing something else?

That seems to be correct, yes. We could probably just remove targets, target and t_atom without further changes.

MatthieuDartiailh commented 3 years ago

Thanks @pablogsal!

pablogsal commented 3 years ago

Thanks for the clarification. However, I was not suggesting removing it, but rather having t_atom instead of atom, since otherwise t_atom and target appear unreachable to me. Or am I missing something else?

That seems to be correct, yes. We could probably just remove targets, target and t_atom without further changes.

Oh, excellent point @lysnikolaou and @MatthieuDartiailh!

MatthieuDartiailh commented 3 years ago

On a similar note, I believe the second alternative of kwarg_or_double_starred is also unreachable, since the first alternative of kwargs will always match keyword arguments using the kwarg_or_starred rule. Am I right about this one?

kwargs[list]:
    | a=','.kwarg_or_starred+ ',' b=','.kwarg_or_double_starred+ { a + b }
    | ','.kwarg_or_starred+
    | ','.kwarg_or_double_starred+
starred_expression:
    | '*' a=expression { ast.Starred(value=a, ctx=Load, LOCATIONS) }
kwarg_or_starred:
    | invalid_kwarg { UNREACHABLE }
    | a=NAME '=' b=expression { ast.keyword(arg=a.string, value=b, LOCATIONS) }
    | a=starred_expression { a }

kwarg_or_double_starred:
    | invalid_kwarg { UNREACHABLE }
    | a=NAME '=' b=expression { ast.keyword(arg=a.string, value=b, LOCATIONS) }   # XXX Unreachable
    | '**' a=expression { ast.keyword(arg=None, value=a, LOCATIONS) }
lysnikolaou commented 3 years ago

Didn't exactly understand which rule you think is unreachable. Are you talking about a specific alternative of a rule?

MatthieuDartiailh commented 3 years ago

My bad, for some reason I had convinced myself that f(a=1, **d, e=2) was not legal... Sorry for the noise.

pablogsal commented 3 years ago

Just to double-check, I have checked the C coverage of our parser once we remove the unused rules in https://github.com/python/cpython/pull/26655:

[screenshot of the parser coverage report]

Seems that all of them are used, so we can be happy now :)

The missing line coverage is error checking code.

MatthieuDartiailh commented 3 years ago

Nice! Happy I could help with this. And now I know that I should be able to test at least all of the functions.

MatthieuDartiailh commented 3 years ago

I made some nice progress on testing (current coverage (with branch coverage enabled) is 81% without testing pattern matching). But I still have a bunch of bugs, in particular with syntax error checks, so I will keep working on that.

Along the way I made some well-confined changes to the parser generation on which I would like to have your opinion:

For any of those, I would be happy to extract the relevant changes into a separate PR and add the proper documentation; just let me know.

Finally, a side question: currently the parser generator enforces that every grammar must have a start rule, but for Python we basically have five, and we just force start to be file. Would there be interest in adding a way to declare that a grammar has multiple entry points?

pablogsal commented 3 years ago

For any of those, I would be happy to extract the relevant changes into a separate PR and add the proper documentation; just let me know.

Separate PRs (if possible) are always better :)

Finally, a side question: currently the parser generator enforces that every grammar must have a start rule, but for Python we basically have five, and we just force start to be file. Would there be interest in adding a way to declare that a grammar has multiple entry points?

Yeah, I think that may be interesting. Even better would be the possibility of specifying the entry rule when building the parser or when calling "parse". That would allow testing individual sub-trees rather elegantly.
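Purely to illustrate that idea (toy names throughout, not an existing pegen API), selecting the entry rule by name could look like:

class MultiEntryParser:
    # Toy stand-in for a generated parser exposing several entry rules.
    def __init__(self, source: str):
        self.source = source

    def file(self):
        return ("file", self.source)

    def interactive(self):
        return ("interactive", self.source)

    def eval(self):
        return ("eval", self.source)


def parse(source: str, start: str = "file"):
    # Pick the entry rule per call instead of hard-coding `start: file`.
    return getattr(MultiEntryParser(source), start)()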

Thanks a lot for the great work @MatthieuDartiailh!

Btw, the black maintainers are thinking of moving their parser to something else, and I think they may be able to leverage this work. You might consider writing to them or adding a few sentences to this document:

https://docs.google.com/document/d/1QPqm01E7MIRi_l4jrgCGOFVThP_HINnkm4hmVZC_CSg/edit?fbclid=IwAR03VBxQLhXkrZiO7BDUQerkeQU7MOy51BHtJ9xqRMHonBF7_dF1Tef53fs#

MatthieuDartiailh commented 3 years ago

Thanks, I commented on the Google doc.

I will open the PRs when I get time, but first I would like to track down a nasty bug. I have spent the better part of my evening on it, but it fails to make sense: somehow I believe the behavior I am seeing is right, and I do not understand how it actually works in 3.10b2. The issue is as follows:

I am trying to parse f(i for i in range(10)), which is perfectly valid Python. However, I get a "generator expression must be parenthesized" error. Here is the overall parsing logic:

start() ... (looking at 1.0: NAME:'f')
  file() ... (looking at 1.0: NAME:'f')
    statements() ... (looking at 1.0: NAME:'f')
      _loop1_11() ... (looking at 1.0: NAME:'f')
        statement() ... (looking at 1.0: NAME:'f')
          compound_stmt() ... (looking at 1.0: NAME:'f') -> fails
          simple_stmts() ... (looking at 1.0: NAME:'f')
            simple_stmt() ... (looking at 1.0: NAME:'f')
              assignment() ... (looking at 1.0: NAME:'f')
                name() ... (looking at 1.0: NAME:'f')
                ... name() -> TokenInfo(type=1 (NAME), string='f', start=(1, 0), end=(1, 1), line='f(i for i in range(10))\n')
                expect(':') ... (looking at 1.1: OP:'(')
                ... expect(':') -> None
                _tmp_20() ... (looking at 1.0: NAME:'f') (i.e. '(' single_target ')' | single_subscript_attribute_target)
                expect('(') ... (looking at 1.0: NAME:'f')
                  ... expect('(') -> None
                single_subscript_attribute_target() ... (looking at 1.0: NAME:'f')

And inside single_subscript_attribute_target we have t_primary:

t_primary:
    | a=t_primary '.' b=NAME &t_lookahead { ast.Attribute(value=a, attr=b.string, ctx=Load, LOCATIONS) }
    | a=t_primary '[' b=slices ']' &t_lookahead { ast.Subscript(value=a, slice=b, ctx=Load, LOCATIONS) }
    | a=t_primary b=genexp &t_lookahead { ast.Call(func=a, args=[b], keywords=[], LOCATIONS) }
    | a=t_primary '(' b=[arguments] ')' &t_lookahead {
        ast.Call(
            func=a,
            args=b[0] if b else [],
            keywords=b[1] if b else [],
            LOCATIONS,
        )
     }
    | a=atom &t_lookahead { a }
t_lookahead: '(' | '[' | '.'

The first two alternatives fail right away; the third identifies the call with a generator as argument but bails because of the missing lookahead, so logically we get to a=t_primary '(' b=[arguments] ')' &t_lookahead, and here we complain that a non-parenthesized generator is invalid!

All of this seems perfectly logical, but it is wrong, and 3.10b2, which has the same grammar, passes. So I am beyond confused. Actually, removing the invalid-argument check allows the code to pass, since it then properly backtracks to the right point.

Any help or advice on how to debug that kind of issue will be greatly appreciated.

MatthieuDartiailh commented 3 years ago

After pondering the explanation I just gave, I went to look at the C parser and realized that, since it cannot throw, the exception only bubbles up if no alternative is found. Looking back at my code, I realized that I could not raise eagerly, since that prevented the parser from backtracking properly. I am all good now. Sorry for the noise.

MatthieuDartiailh commented 3 years ago

Testing on 3.10, and after some additions to the tests, I managed to get 94% coverage. I had to juggle a bit with when to raise exceptions and when to delay raising them; I hope I got it right. I would be curious to know how you handled that in the C parser.

Also, among the last things I tried to test were type comments, but it appears that tokenize does not expose TYPE_COMMENT. Is there any way to work around this issue?

I will start putting together the PRs for the extra features I added to the generator.