Closed. FFY00 closed this pull request 3 years ago.
Good start! Thanks for helping with the project!
You also need to handle the presence of byte strings and reject all those mixed concatenations. I would recommend doing a first pass over the collection to check whether there is a single f-string. In that case you construct a joined value; otherwise you concatenate all strings/bytes accordingly. If you find a single bytes object alongside some string or f-strings, you need to fail.
First, check all the joining rules in the REPL to make sure you familiarise yourself with the expected behaviour of str + str, str + f-string, f-string + str, f-string + f-string, str/f-string + bytes, and bytes + str/f-string.
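A minimal sketch of that first pass, in plain Python; the classify helper and the (kind, value) pairs are hypothetical stand-ins for the parsed literal nodes (the real logic lives in the C parser):

def classify(parts):
    # parts is a list of (kind, value) pairs; kind is 'str', 'fstring' or 'bytes'
    kinds = {kind for kind, _ in parts}
    if 'bytes' in kinds and kinds != {'bytes'}:
        # bytes mixed with str/f-strings must be rejected
        raise SyntaxError('cannot mix bytes and nonbytes literals')
    if 'fstring' in kinds:
        return 'joined'    # at least one f-string: build a JoinedStr
    return 'constant'      # only plain strings (or only bytes): fold into a Constant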
> You also need to handle the presence of byte strings and reject all those mixed concatenations
Aren't they already handled here?
All cases specified in #150 seem to be handled, no?
import ast

for case in (
    '"hello" "hello"',
    '"hello" f"hello"',
    'f"hello" "hello"',
    'b"hello" "hello"',
    'b"hello" f"hello"',
    '"hello" b"hello"',
    'f"hello" b"hello"',
    'b"hello" b"hello"',
    'f"hello" f"hello"',
):
    try:
        print(f'{case}:', ast.dump(ast.parse(case).body[0].value))
    except Exception as e:
        print(f'{case}: failed ({e})')
$ ./python possible-concatenations.py
"hello" "hello": Constant(value='hellohello')
"hello" f"hello": JoinedStr(values=[Constant(value='hello'), Constant(value='hello')])
f"hello" "hello": JoinedStr(values=[Constant(value='hello'), Constant(value='hello')])
b"hello" "hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
b"hello" f"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
"hello" b"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
f"hello" b"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
b"hello" b"hello": Constant(value=b'hellohello')
f"hello" f"hello": JoinedStr(values=[Constant(value='hello'), Constant(value='hello')])
Though the constant folding does not seem to be working.
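For comparison, this is what working folding looks like on an unpatched interpreter (matching the system-python output shown later in this thread):

import ast

# On an unpatched CPython, the adjacent constant parts are folded into one:
print(ast.dump(ast.parse('"hello" f"hello"').body[0].value))
# -> JoinedStr(values=[Constant(value='hellohello')])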
@FFY00 it's not that they are not handled, it's that they need to return exactly the same AST as on the main branch.
The last commit changes the byte string concatenation to use `_PyBytesWriter`, but I couldn't see any performance difference.
How are you measuring? If you measure a full parse, creating the AST is 80% of the work, so you need to instrument the code instead.
I measured the time it took to parse a file with a bunch of long byte strings being concatenated. I scaled it up a little bit now and it is actually registering worse performance. But I am not sure how meaningful my testing is. How would you recommend benchmarking this?
Use perf counters / time counters at the entry and exit of the function and parse a file with a lot of concatenations. If you measure the entire parse you are going to get a ton of noise, because this operation is very, very small compared with everything else.
On the other hand, taking this into account, if we measure it and it is not that slow/different, we can just call PyUnicode_ConcatAndDel and PyBytes_ConcatAndDel and call it a day.
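For illustration, the suggested enter/exit pattern in Python rather than C (the concatenate function and its body are hypothetical stand-ins; the actual C instrumentation is in the patch below):

import time

total_time = 0.0  # accumulated across all calls

def concatenate(parts):
    """Hypothetical stand-in for the function under measurement."""
    global total_time
    start = time.perf_counter()
    result = ''.join(parts)  # the actual work being timed
    total_time += time.perf_counter() - start
    return result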
I added perf counters to the function itself. When interpreting the file generated by the following script, I get these values for the total time spent in `_PyPegen_concatenate_strings2`:
with open('long-string-concatenation.py', 'w') as f:
    f.write(f'b"{"a" * 1000}"\n' * 25000)
_PyBytesWriter (last commit):
Run 1: 2.099500s
Run 2: 2.046978s
Run 3: 2.114400s
Run 4: 2.073067s
Run 5: 2.078464s

PyBytes_Concat (just the first commit):
Run 1: 0.140555s
Run 2: 0.137755s
Run 3: 0.152179s
Run 4: 0.140775s
Run 5: 0.141847s
The patch for the perf counters:
diff --git a/Parser/pegen.c b/Parser/pegen.c
index 59f7f7620c..8bdbe5f535 100644
--- a/Parser/pegen.c
+++ b/Parser/pegen.c
@@ -2337,11 +2337,14 @@ _PyPegen_seq_delete_starred_exprs(Parser *p, asdl_seq *kwargs)
     return new_seq;
 }
 
+long double total_time = 0;
+
 expr_ty
 _PyPegen_concatenate_strings2(Parser *p, asdl_expr_seq *strings,
                               int lineno, int col_offset, int end_lineno,
                               int end_col_offset, PyArena *arena)
 {
+    clock_t start_time = clock();
     Py_ssize_t len = asdl_seq_LEN(strings);
     assert(len > 0);
@@ -2394,6 +2397,10 @@ _PyPegen_concatenate_strings2(Parser *p, asdl_expr_seq *strings,
         if (_PyArena_AddPyObject(arena, res) < 0) {
             return NULL;
         }
+        long double elapsed_time = (long double)(clock() - start_time);
+        total_time += elapsed_time;
+        printf("_PyPegen_concatenate_strings2 elapsed %Lf seconds (total %Lf seconds)\n",
+               elapsed_time / CLOCKS_PER_SEC, total_time / CLOCKS_PER_SEC);
         return _PyAST_Constant(res, NULL, lineno, col_offset, end_lineno, end_col_offset, p->arena);
     }
@@ -2461,6 +2468,11 @@ _PyPegen_concatenate_strings2(Parser *p, asdl_expr_seq *strings,
         }
     }
 
+    long double elapsed_time = (long double)(clock() - start_time);
+    total_time += elapsed_time;
+    printf("_PyPegen_concatenate_strings2 elapsed %Lf seconds (total %Lf seconds)\n",
+           elapsed_time / CLOCKS_PER_SEC, total_time / CLOCKS_PER_SEC);
+
     if (!f_string_found) {
         assert(n_elements == 1);
         expr_ty elem = asdl_seq_GET(values, 0);
With `_PyBytesWriter`, we need to iterate over the asdl sequence first to calculate the size and then again to add the bytes, so these results do actually make sense. I think we should drop the last patch.
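As a rough Python analogy of the two strategies being compared (hypothetical helpers, for intuition only; the real code is C):

def concat_two_pass(chunks):
    # like the _PyBytesWriter approach: pass 1 computes the total size,
    total = sum(len(c) for c in chunks)
    buf = bytearray(total)
    pos = 0
    # pass 2 copies each chunk into the preallocated buffer
    for c in chunks:
        buf[pos:pos + len(c)] = c
        pos += len(c)
    return bytes(buf)

def concat_incremental(chunks):
    # like the PyBytes_Concat approach: append each chunk to a running result
    out = b''
    for c in chunks:
        out += c
    return out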
This is still a bit surprising, but data is data. Let's go with `PyBytes_ConcatAndDel` and friends then. Thanks for the experiment @FFY00 👍
Anyway, the first patch does properly fix the AST generation.
With the patched interpreter:
$ ./python possible-concatenations.py
"hello" "hello": Constant(value='hellohello')
"hello" f"hello": JoinedStr(values=[Constant(value='hellohello')])
f"hello" "hello": JoinedStr(values=[Constant(value='hellohello')])
b"hello" "hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
b"hello" f"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
"hello" b"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
f"hello" b"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
b"hello" b"hello": Constant(value=b'hellohello')
f"hello" f"hello": JoinedStr(values=[Constant(value='hellohello')])
And with the system Python:
$ python possible-concatenations.py
"hello" "hello": Constant(value='hellohello')
"hello" f"hello": JoinedStr(values=[Constant(value='hellohello')])
f"hello" "hello": JoinedStr(values=[Constant(value='hellohello')])
b"hello" "hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
b"hello" f"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
"hello" b"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
f"hello" b"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
b"hello" b"hello": Constant(value=b'hellohello')
f"hello" f"hello": JoinedStr(values=[Constant(value='hellohello')])
$ diff <(./python possible-concatenations.py) <(python possible-concatenations.py)
(no output, i.e. both interpreters produce identical results)
Is there anything else to take into account?
No, I will review the last version later today and we can land it. Great job!
Thank you! Yes, I think it would make sense; we already have all the necessary information :blush:
I noticed an issue with new lines while testing this PR:
@isidentical Haha, we made the same review :P
Rebased to solve conflicts as #164 touched some whitespace :stuck_out_tongue:
Great work @FFY00! :tada:
I am merging this so we can start building on top.
This patch folds the f-string constants, like they were previously (1), and makes string concatenation construct a Constant instead of JoinedStr (2).
1) `f'hello' f'hello'` will now construct `JoinedStr(values=[Constant(value='hellohello')])`
2) `'hello' 'hello'` will now construct `Constant(value='hellohello')`

Signed-off-by: Filipe Laíns <lains@riseup.net>
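A quick sanity check of the two behaviours described above, using only the stdlib ast module (a sketch; it tolerates both folded and unfolded constants inside the JoinedStr):

import ast

# (1) adjacent f-strings still produce a JoinedStr
joined = ast.parse("f'hello' f'hello'").body[0].value
assert isinstance(joined, ast.JoinedStr)
assert ''.join(v.value for v in joined.values) == 'hellohello'

# (2) adjacent plain strings fold into a single Constant
const = ast.parse("'hello' 'hello'").body[0].value
assert isinstance(const, ast.Constant)
assert const.value == 'hellohello'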