we-like-parsers / cpython

Here we work on integrating pegen into CPython; use branch 'pegen'
https://github.com/gvanrossum/pegen
Other
1 stars 0 forks source link

f-string parser: Fix multi-line debug expressions #151

Closed pablogsal closed 3 years ago

pablogsal commented 3 years ago

Multi-line debug expression may be broken. We need to either validate that they work in all cases or properly fix them

isidentical commented 3 years ago

In the initial implemention, I simply ignored multiline debug expressions and fetched the actual source from the tokenize buffer. https://github.com/we-like-parsers/cpython/blob/9e05991a7ae2aa3bc6e66eced957a50104f5e791/Parser/pegen.c#L2890-L2902

Though since it is getting replaced at every different line, we need a way of storing each line's contents if we are inside of an f-string expression. I'm experimenting with it right now, though I did not found an easy way. For example this version seem to work but it is a really hacky/complicated implementation, so I need to somehow simplify it.

It currently works by creating a new buffer in the tokenizer_mode, and if we are inside the f-string expression and when we see the initial { we start recording. At each line, while the tok->buf gets replaced, we simply append to our existing buffer. Normally we should free it as soon as we close the expression with } but due to the way we parse debug expressions (we parse the whole f-string expression and then continue) we can't free it until we actually use it's contents. If the expression is multiline, we try to find where it ended by manually searching newlines among lines, and trying to match the position which seem a bit error-prone so this is something that I especially want to get rid of.

Possible further ideas:

pablogsal commented 3 years ago

We could do the same thing we are doing for the interactive mode and place all the parsed source into a buffer and then just record the offsets into said buffer. There is some scaffolding already for this, you would need to activate for non interactive mode as well.

isidentical commented 3 years ago

I initially went with the interactive_src_start (like when we initially implemented the debug expressions) and that approach was more problematic than this. From what I can recall; we had to modify a bunch of places to add the line since they were not part of the interactive flow (we also the same in the current version, but less places AFAICT). Another problem was that, since we always missed the first line (even though that was part of the tokenizer_mod->f_string_multiline_start, it made it complicated to work with) and we still needed to store 2 pointers for each expression (amount of characters relative to the start of the string, and the size of the expression itself). Interactive mode also gets deallocated at the end, but the current approach gets deallocated on each f-string.

I've made some changes (some of the above + others), and I think the code is simple enough for us to go with. Also seems like we pass all the debug expression tests on the suite.