Closed. FFY00 closed this pull request 3 years ago.
Good start! Thanks for helping with the project!
You also need to handle the presence of byte strings and reject all those mixed concatenations. I would recommend doing a first pass over the collection to check whether there is a single f-string. In that case you construct a joined value; otherwise you concatenate all strings/bytes accordingly. If you find a single bytes object alongside some string or f-strings, you need to fail.
First, check all the joining rules in the REPL to make sure you familiarise yourself with the expected behaviour of str + str, str + f-string, f-string + str, f-string + f-string, str/f-string + bytes, and bytes + str/f-string.
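A minimal sketch of that first pass, in plain Python; the classify helper and the (kind, value) pairs are hypothetical stand-ins for the parsed literal nodes (the real logic lives in the C parser):

def classify(parts):
    # parts is a list of (kind, value) pairs; kind is 'str', 'fstring' or 'bytes'
    kinds = {kind for kind, _ in parts}
    if 'bytes' in kinds and kinds != {'bytes'}:
        # bytes mixed with str/f-strings must be rejected
        raise SyntaxError('cannot mix bytes and nonbytes literals')
    if 'fstring' in kinds:
        return 'joined'    # at least one f-string: build a JoinedStr
    return 'constant'      # only plain strings (or only bytes): fold into a Constant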
> You also need to handle the presence of byte strings and reject all those mixed concatenations
Aren't they already handled here?
All cases specified in #150 seem to be handled, no?
import ast

for case in (
    '"hello" "hello"',
    '"hello" f"hello"',
    'f"hello" "hello"',
    'b"hello" "hello"',
    'b"hello" f"hello"',
    '"hello" b"hello"',
    'f"hello" b"hello"',
    'b"hello" b"hello"',
    'f"hello" f"hello"',
):
    try:
        print(f'{case}:', ast.dump(ast.parse(case).body[0].value))
    except Exception as e:
        print(f'{case}: failed ({e})')
$ ./python possible-concatenations.py
"hello" "hello": Constant(value='hellohello')
"hello" f"hello": JoinedStr(values=[Constant(value='hello'), Constant(value='hello')])
f"hello" "hello": JoinedStr(values=[Constant(value='hello'), Constant(value='hello')])
b"hello" "hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
b"hello" f"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
"hello" b"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
f"hello" b"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
b"hello" b"hello": Constant(value=b'hellohello')
f"hello" f"hello": JoinedStr(values=[Constant(value='hello'), Constant(value='hello')])
Though the constant folding does not seem to be working.
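For comparison, this is what working folding looks like on an unpatched interpreter (matching the system-python output shown later in this thread):

import ast

# On an unpatched CPython, the adjacent constant parts are folded into one:
print(ast.dump(ast.parse('"hello" f"hello"').body[0].value))
# -> JoinedStr(values=[Constant(value='hellohello')])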
@FFY00 it's not that they are not handled, it's that they need to return exactly the same AST as on the main branch.
The last commit changes the byte string concatenation to use `_PyBytesWriter`, but I couldn't see any performance difference.
How are you measuring? If you measure a full parse, creating the AST is 80% of the work, so you need to instrument the code instead.
I measured the time it took to parse a file with a bunch of long byte strings being concatenated. I scaled it up a little bit now and it is actually registering worse performance. But I am not sure how meaningful my testing is. How would you recommend benchmarking this?
Use perf counters / time counters at the entry and exit of the function and parse a file with a lot of concatenations. If you measure the entire parse you are going to get a ton of noise, because this operation is very, very small compared with everything else.
On the other hand, taking this into account, if we measure it and it is not that slow/different, we can just call PyUnicode_ConcatAndDel and PyBytes_ConcatAndDel and call it a day.
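For illustration, the suggested enter/exit pattern in Python rather than C (the concatenate function and its body are hypothetical stand-ins; the actual C instrumentation is in the patch below):

import time

total_time = 0.0  # accumulated across all calls

def concatenate(parts):
    """Hypothetical stand-in for the function under measurement."""
    global total_time
    start = time.perf_counter()
    result = ''.join(parts)  # the actual work being timed
    total_time += time.perf_counter() - start
    return result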
I added perf counters to the function itself. When interpreting the file generated by the following script, I get these values for the total time spent in `_PyPegen_concatenate_strings2`:
with open('long-string-concatenation.py', 'w') as f:
    f.write(f'b"{"a" * 1000}"\n' * 25000)
_PyBytesWriter (last commit):
Run 1: 2.099500s
Run 2: 2.046978s
Run 3: 2.114400s
Run 4: 2.073067s
Run 5: 2.078464s

PyBytes_Concat (just the first commit):
Run 1: 0.140555s
Run 2: 0.137755s
Run 3: 0.152179s
Run 4: 0.140775s
Run 5: 0.141847s
The patch for the perf counters:
diff --git a/Parser/pegen.c b/Parser/pegen.c
index 59f7f7620c..8bdbe5f535 100644
--- a/Parser/pegen.c
+++ b/Parser/pegen.c
@@ -2337,11 +2337,14 @@ _PyPegen_seq_delete_starred_exprs(Parser *p, asdl_seq *kwargs)
     return new_seq;
 }
 
+long double total_time = 0;
+
 expr_ty
 _PyPegen_concatenate_strings2(Parser *p, asdl_expr_seq *strings,
                               int lineno, int col_offset, int end_lineno,
                               int end_col_offset, PyArena *arena)
 {
+    clock_t start_time = clock();
     Py_ssize_t len = asdl_seq_LEN(strings);
     assert(len > 0);
@@ -2394,6 +2397,10 @@ _PyPegen_concatenate_strings2(Parser *p, asdl_expr_seq *strings,
         if (_PyArena_AddPyObject(arena, res) < 0) {
             return NULL;
         }
+        long double elapsed_time = (long double)(clock() - start_time);
+        total_time += elapsed_time;
+        printf("_PyPegen_concatenate_strings2 elapsed %Lf seconds (total %Lf seconds)\n",
+               elapsed_time / CLOCKS_PER_SEC, total_time / CLOCKS_PER_SEC);
         return _PyAST_Constant(res, NULL, lineno, col_offset, end_lineno, end_col_offset, p->arena);
     }
@@ -2461,6 +2468,11 @@ _PyPegen_concatenate_strings2(Parser *p, asdl_expr_seq *strings,
         }
     }
 
+    long double elapsed_time = (long double)(clock() - start_time);
+    total_time += elapsed_time;
+    printf("_PyPegen_concatenate_strings2 elapsed %Lf seconds (total %Lf seconds)\n",
+           elapsed_time / CLOCKS_PER_SEC, total_time / CLOCKS_PER_SEC);
+
     if (!f_string_found) {
         assert(n_elements == 1);
         expr_ty elem = asdl_seq_GET(values, 0);
With `_PyBytesWriter`, we need to iterate over the asdl sequence first to calculate the size and then again to add the bytes, so these results do actually make sense. I think we should drop the last patch.
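As a rough Python analogy of the two strategies being compared (hypothetical helpers, for intuition only; the real code is C):

def concat_two_pass(chunks):
    # like the _PyBytesWriter approach: pass 1 computes the total size,
    total = sum(len(c) for c in chunks)
    buf = bytearray(total)
    pos = 0
    # pass 2 copies each chunk into the preallocated buffer
    for c in chunks:
        buf[pos:pos + len(c)] = c
        pos += len(c)
    return bytes(buf)

def concat_incremental(chunks):
    # like the PyBytes_Concat approach: append each chunk to a running result
    out = b''
    for c in chunks:
        out += c
    return out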
This is still a bit surprising, but data is data. Let's go with `PyBytes_ConcatAndDel` and friends then. Thanks for the experiment @FFY00 👍
Anyway, the first patch does properly fix the AST generation.
With the patched interpreter:
$ ./python possible-concatenations.py
"hello" "hello": Constant(value='hellohello')
"hello" f"hello": JoinedStr(values=[Constant(value='hellohello')])
f"hello" "hello": JoinedStr(values=[Constant(value='hellohello')])
b"hello" "hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
b"hello" f"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
"hello" b"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
f"hello" b"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
b"hello" b"hello": Constant(value=b'hellohello')
f"hello" f"hello": JoinedStr(values=[Constant(value='hellohello')])
And with the system Python:
$ python possible-concatenations.py
"hello" "hello": Constant(value='hellohello')
"hello" f"hello": JoinedStr(values=[Constant(value='hellohello')])
f"hello" "hello": JoinedStr(values=[Constant(value='hellohello')])
b"hello" "hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
b"hello" f"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
"hello" b"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
f"hello" b"hello": failed (cannot mix bytes and nonbytes literals (<unknown>, line 1))
b"hello" b"hello": Constant(value=b'hellohello')
f"hello" f"hello": JoinedStr(values=[Constant(value='hellohello')])
$ diff <(./python possible-concatenations.py) <(python possible-concatenations.py)
(no output, i.e. both interpreters produce identical results)
Is there anything else to take into account?
No, I will review the last version later today and we can land it. Great job!
Thank you! Yes, I think it would make sense; we already have all the necessary information :blush:
I noticed an issue with new lines while testing this PR:
@isidentical Haha, we made the same review :P
Rebased to solve conflicts as #164 touched some whitespace :stuck_out_tongue:
Great work @FFY00! :tada:
I am merging this so we can start building on top.
This patch folds the f-string constants, like they were previously (1), and makes string concatenation construct a Constant instead of JoinedStr (2).
1) `f'hello' f'hello'` will now construct `JoinedStr(values=[Constant(value='hellohello')])`
2) `'hello' 'hello'` will now construct `Constant(value='hellohello')`

Signed-off-by: Filipe Laíns <lains@riseup.net>
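A quick sanity check of the two behaviours described above, using only the stdlib ast module (a sketch; it tolerates both folded and unfolded constants inside the JoinedStr):

import ast

# (1) adjacent f-strings still produce a JoinedStr
joined = ast.parse("f'hello' f'hello'").body[0].value
assert isinstance(joined, ast.JoinedStr)
assert ''.join(v.value for v in joined.values) == 'hellohello'

# (2) adjacent plain strings fold into a single Constant
const = ast.parse("'hello' 'hello'").body[0].value
assert isinstance(const, ast.Constant)
assert const.value == 'hellohello'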