Test functions:
m.def("test_many",
[](int, int, int, int, int, int, int, int,
int, int, int, int, int, int, int, int) {},
"param01"_a = 1, "param02"_a = 1, "param03"_a = 3, "param04"_a = 4,
"param05"_a = 5, "param06"_a = 6, "param07"_a = 7, "param08"_a = 8,
"param09"_a = 9, "param10"_a = 10, "param11"_a = 11, "param12"_a = 12,
"param13"_a = 13, "param14"_a = 14, "param15"_a = 15, "param16"_a = 16);
m.def("test_kwargs", [](int, nb::kwargs) {}, "param99"_a = 1, "kwargs"_a);
Microbenchmark script:
from timeit import timeit
import test_functions_ext

ns = {"fn": test_functions_ext.test_many, "fnvar": test_functions_ext.test_kwargs}

def params_str(nums):
    return ",".join(f"param{n:02d}={n}" for n in nums)

results = {}

def run(desc, stmt):
    nsec = timeit(stmt, number=3_000_000, globals=ns) / 3_000_000 * 1_000_000_000
    results.setdefault(desc, []).append(nsec)

# consistent randomly-chosen parameter order
ORDER = [15, 5, 3, 1, 10, 9, 16, 4, 14, 11, 6, 7, 8, 2, 13, 12]

for _ in range(2):
    run(
        "All positional:",
        "fn(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)",
    )
    run(
        "All keyword:",
        "fn({})".format(params_str(ORDER)),
    )
    run(
        "Some keyword, rest defaulted:",
        "fn({})".format(params_str(ORDER[:2])),
    )
    run(
        "All defaulted:",
        "fn()",
    )
    run(
        "Var-kwargs, 16 unmatched:",
        "fnvar({})".format(params_str(ORDER)),
    )
    run(
        "Var-kwargs, 2 unmatched:",
        "fnvar({})".format(params_str(ORDER[:2])),
    )

for desc, times in results.items():
    print(f"{desc:>40}", " / ".join(f"{nsec:.1f}ns" for nsec in times))
Python 3.12, regular ABI:
Stock nanobind:
All positional: 104.7ns / 102.4ns
All keyword: 493.2ns / 493.5ns
Some keyword, rest defaulted: 91.6ns / 91.5ns
All defaulted: 77.3ns / 77.3ns
Var-kwargs, 16 unmatched: 668.1ns / 668.7ns
Var-kwargs, 2 unmatched: 70.0ns / 70.0ns
This PR as uploaded: pretty much a wash with stock nanobind, though var-kwargs + filling in defaults gets somewhat slower
All positional: 104.2ns / 101.7ns
All keyword: 482.3ns / 476.3ns
Some keyword, rest defaulted: 92.8ns / 90.6ns
All defaulted: 73.9ns / 73.8ns
Var-kwargs, 16 unmatched: 752.7ns / 752.6ns
Var-kwargs, 2 unmatched: 85.7ns / 85.6ns
Always compare contents: really bad
All positional: 104.4ns / 102.7ns
All keyword: 1268.2ns / 1264.2ns
Some keyword, rest defaulted: 276.7ns / 275.7ns
All defaulted: 78.9ns / 77.6ns
Var-kwargs, 16 unmatched: 786.4ns / 775.2ns
Var-kwargs, 2 unmatched: 82.2ns / 82.1ns
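For context, a rough reconstruction of what "always compare contents" means per keyword probe (my sketch, not the actual patch): a pointer mismatch can no longer be trusted as a name mismatch, so every miss pays for a real string comparison:

// Hedged sketch of the "always compare contents" idea, not the real diff.
// 'arg_name' is the pre-interned parameter name stored in the function
// record; 'key' is the incoming keyword name from the call site.
static bool name_matches(PyObject *arg_name, PyObject *key) {
    if (arg_name == key)      // fast path: same (interned) object
        return true;
    // Both objects are known to be 'str', so PyUnicode_Compare() cannot
    // raise a type error here; 0 means the contents are equal.
    return PyUnicode_Compare(arg_name, key) == 0;
}

With 16 keywords in scrambled order most probes are misses, and each miss now costs a full string comparison instead of a pointer check, which would explain the "All keyword" numbers above.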
Intern first unconditionally: somewhat worse but maybe acceptable
All positional: 107.7ns / 104.5ns
All keyword: 543.4ns / 540.4ns
Some keyword, rest defaulted: 100.6ns / 100.9ns
All defaulted: 79.1ns / 79.5ns
Var-kwargs, 16 unmatched: 706.0ns / 706.3ns
Var-kwargs, 2 unmatched: 79.9ns / 80.0ns
Intern first if the string object isn't already interned: gets back much of the difference
All positional: 106.3ns / 104.0ns
All keyword: 501.4ns / 500.4ns
Some keyword, rest defaulted: 96.0ns / 95.5ns
All defaulted: 79.1ns / 78.9ns
Var-kwargs, 16 unmatched: 663.3ns / 662.0ns
Var-kwargs, 2 unmatched: 75.1ns / 74.8ns
Fascinating! That's a clear signal: we do not want to run the comparisons by default.
Intern first if the string object isn't already interned
I don't follow what you did here, can you share the diff for this strategy?
One more request: could you run this in a limited API build? Should be as simple as adding STABLE_ABI to the cmake command and ensuring that you have Python 3.12 installed. Be careful to delete the old .so binding file, otherwise Python will prefer the version-specific one.
I am thinking that the "intern first" strategy will win in that situation due to the considerably lower number of tuple indexing operations.
I don't follow what you did here, can you share the diff for this strategy?
Just added one line before InternInPlace:
#if !defined(PYPY_VERSION) && !defined(Py_LIMITED_API)
if (PyUnicode_Check(key) && !(((PyASCIIObject *) key)->state.interned))
#endif
    PyUnicode_InternInPlace(&key_interned);
One more request: could you run this in a limited API build?
Stand by for results there!
I am thinking that the "intern first" strategy will win in that situation due to the considerably lower number of tuple indexing operations.
One thing to note about the PR as currently implemented is that we don't actually enter the second double-loop unless there are unused keywords. That will only happen when there is actually a non-interned keyword, or the function takes **kwargs, or we're about to return failure. My benchmark doesn't capture that case, but I can come up with another one that does.
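For what it's worth, one way to exercise that path could be an extra case appended to the benchmark script above (a sketch I haven't run; whether the names really reach the binding un-interned depends on how CPython forwards the ** call):

# Keyword names built at runtime are typically not interned, unlike the
# literal keyword names the other benchmark statements use.
ns["kw"] = {"param%02d" % n: n for n in ORDER}
run("All keyword, non-interned names:", "fn(**kw)")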
FWIW, I don't think that the PyUnicode_Check(key) could ever fail. The vectorcall protocol spec says that keywords is, quote, "Either NULL or a tuple with the names of the keyword arguments" (source).
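For reference, this is the calling-convention shape that guarantee comes from (a generic vectorcall sketch, not nanobind's actual entry point):

#include <Python.h>

// 'kwnames' is either NULL or a tuple of str objects; the corresponding
// keyword values directly follow the positional arguments in 'args'.
static PyObject *example_vectorcall(PyObject *self, PyObject *const *args,
                                    size_t nargsf, PyObject *kwnames) {
    Py_ssize_t nargs = PyVectorcall_NARGS(nargsf);
    Py_ssize_t nkw = kwnames ? PyTuple_GET_SIZE(kwnames) : 0;
    for (Py_ssize_t i = 0; i < nkw; ++i) {
        PyObject *name  = PyTuple_GET_ITEM(kwnames, i);  // always a str per the spec
        PyObject *value = args[nargs + i];
        // ... match 'name' against the overload's parameter names ...
        (void) name; (void) value;
    }
    (void) self;
    Py_RETURN_NONE;
}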
Stable ABI benchmarks:
Stock nanobind:
All positional: 126.2ns / 123.8ns
All keyword: 706.7ns / 702.7ns
Some keyword, rest defaulted: 150.5ns / 150.5ns
All defaulted: 99.2ns / 99.2ns
Var-kwargs, 16 unmatched: 727.5ns / 712.5ns
Var-kwargs, 2 unmatched: 87.7ns / 87.4ns
PR as uploaded:
All positional: 126.7ns / 123.9ns
All keyword: 692.0ns / 692.2ns
Some keyword, rest defaulted: 165.3ns / 165.6ns
All defaulted: 101.3ns / 101.2ns
Var-kwargs, 16 unmatched: 858.6ns / 839.5ns
Var-kwargs, 2 unmatched: 105.3ns / 105.3ns
Always compare:
All positional: 126.4ns / 123.8ns
All keyword: 1494.5ns / 1474.5ns
Some keyword, rest defaulted: 327.8ns / 327.4ns
All defaulted: 99.2ns / 99.3ns
Var-kwargs, 16 unmatched: 807.9ns / 809.1ns
Var-kwargs, 2 unmatched: 98.6ns / 98.6ns
Intern first:
All positional: 128.4ns / 126.4ns
All keyword: 649.9ns / 651.8ns
Some keyword, rest defaulted: 131.1ns / 130.9ns
All defaulted: 102.0ns / 102.2ns
Var-kwargs, 16 unmatched: 821.0ns / 822.0ns
Var-kwargs, 2 unmatched: 98.9ns / 98.8ns
Great! What about something like this to further accelerate the non-stable ABI variant? (This potentially avoids the alloca and reference counting.)
(Untested, may need some syntax tweaks)
// Ensure that keyword argument names are interned
bool interned;

#if !defined(PYPY_VERSION) && !defined(Py_LIMITED_API)
interned = true;
for (size_t i = 0; i < nkwargs_in; ++i) {
    PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i);
    interned &= ((PyASCIIObject *) key)->state.interned;
}
#else
interned = false;
#endif

PyObject **kwnames = nullptr;
if (NB_LIKELY(interned)) {
    kwnames = kwargs_in;
} else {
    kwnames = (PyObject **) alloca(nkwargs_in * sizeof(PyObject *));
    for (size_t i = 0; i < nkwargs_in; ++i) {
        PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i),
                 *key_interned = key;
        Py_INCREF(key_interned);
        PyUnicode_InternInPlace(&key_interned);
        if (NB_LIKELY(key == key_interned)) // string was already interned
            Py_DECREF(key_interned);
        else
            cleanup.append(key_interned);
        kwnames[i] = key_interned;
    }
}
By the way, is it valid to assume that key is always a PyASCIIObject? I think it is always a unicode object, but IIRC those can be represented by two different data structures depending on whether or not the string is ASCII.
ah nvm, fix incoming
I updated the non-stable-ABI benchmarks with the new var-kwargs tests as well, and updated the comment above to show my current benchmarking script.
I think "intern first" clearly wins on stable ABI. It's less clear on the regular ABI; it depends on whether you'd rather penalize var-kwargs functions with unspecified defaulted args a lot, or everyone a little bit. I'll try your suggested tweak and see if we can close the gap there. I think "intern first" wins overall because it's so much simpler; I'm guessing you feel the same?
By the way, is it valid to assume that key is always a PyASCIIObject? I think it is always a unicode object, but IIRC those can be represented by two different data structures depending on whether or not the string is ASCII.
PyCompactUnicodeObject adds additional fields past the end of PyASCIIObject, so this is safe. I also just noticed there's a macro PyUnicode_CHECK_INTERNED, which I'll use for better clarity here.
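Concretely, with that macro the check in the loop above would read (same behavior; the macro just wraps the same state.interned field access):

for (size_t i = 0; i < nkwargs_in; ++i) {
    PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i);
    // equivalent to ((PyASCIIObject *) key)->state.interned != 0
    interned &= PyUnicode_CHECK_INTERNED(key) != 0;
}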
This works apparently:
def f(**kwargs):
    print(kwargs)

f(ğ=1)
NVM, I see now that the fancier versions all build on the ascii object.
How about this?
// Ensure that keyword argument names are interned. That makes it faster
// to compare them against pre-interned argument names in the overload chain.
PyObject **kwnames = nullptr;

#if !defined(PYPY_VERSION) && !defined(Py_LIMITED_API)
bool kwnames_interned = true;
for (size_t i = 0; i < nkwargs_in; ++i) {
    PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i);
    kwnames_interned &= ((PyASCIIObject *) key)->state.interned != 0;
}

if (NB_LIKELY(kwnames_interned)) {
    kwnames = ((PyTupleObject *) kwargs_in)->ob_item;
    goto traverse_overloads;
}
#endif

kwnames = (PyObject **) alloca(nkwargs_in * sizeof(PyObject *));
for (size_t i = 0; i < nkwargs_in; ++i) {
    PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i),
             *key_interned = key;
    Py_INCREF(key_interned);
    PyUnicode_InternInPlace(&key_interned);
    if (NB_LIKELY(key == key_interned)) // string was already interned
        Py_DECREF(key_interned);
    else
        cleanup.append(key_interned);
    kwnames[i] = key_interned;
}

traverse_overloads:
Perfect.
=== 3.12 ABI ===
stock:
All positional: 104.9ns / 102.2ns
All keyword: 495.5ns / 494.8ns
Some keyword, rest defaulted: 91.5ns / 91.4ns
All defaulted: 77.3ns / 77.3ns
Var-kwargs, 16 unmatched: 659.9ns / 660.4ns
Var-kwargs, 2 unmatched: 69.9ns / 70.0ns
intern before:
All positional: 105.3ns / 102.7ns
All keyword: 489.0ns / 490.5ns
Some keyword, rest defaulted: 93.1ns / 92.9ns
All defaulted: 77.2ns / 77.3ns
Var-kwargs, 16 unmatched: 658.5ns / 657.5ns
Var-kwargs, 2 unmatched: 70.6ns / 70.3ns
Verdict: no noticeable difference.
=== Stable ABI ===
stock:
All positional: 126.2ns / 123.8ns
All keyword: 713.0ns / 715.9ns
Some keyword, rest defaulted: 150.5ns / 150.7ns
All defaulted: 99.2ns / 99.3ns
Var-kwargs, 16 unmatched: 737.5ns / 738.9ns
Var-kwargs, 2 unmatched: 88.7ns / 88.5ns
intern before:
All positional: 128.7ns / 126.4ns
All keyword: 614.0ns / 615.5ns
Some keyword, rest defaulted: 131.1ns / 131.1ns
All defaulted: 102.1ns / 102.1ns
Var-kwargs, 16 unmatched: 757.9ns / 758.5ns
Var-kwargs, 2 unmatched: 98.7ns / 98.8ns
Verdict: clear improvement on a few benchmarks thanks to the reduced number of tuple indexing calls; the cost on the others is not awful.
Great! Can you update the commit with this one and add a changelog entry?
Should be all set! Thanks for your help in figuring out a better approach here.
Thanks 👍
Fixes #468.
Size impact on my laptop: nb_func.cpp.o grows by 368 bytes (text size 22565 before, 22933 after).
Performance impact based on some timeit microbenchmarks (run twice in each configuration):
test_functions_ext.test_02(3, 5) goes from 32.4/32.3 ns to 32.0/31.8 ns
test_functions_ext.test_02(down=3, up=5) goes from 35.3/35.1 ns to 34.4/34.4 ns
test_functions_ext.test_02(down=3) goes from 33.3/33.5 ns to 34.0/34.0 ns
Which feels like basically a wash to me.
I didn't bother clearing the cast_flags::deferred flags used for this bookkeeping before calling the function, because that takes time and it seemed easy enough to just tell type casters to ignore the new flag.