Test functions:
m.def("test_many",
[](int, int, int, int, int, int, int, int,
int, int, int, int, int, int, int, int) {},
"param01"_a = 1, "param02"_a = 1, "param03"_a = 3, "param04"_a = 4,
"param05"_a = 5, "param06"_a = 6, "param07"_a = 7, "param08"_a = 8,
"param09"_a = 9, "param10"_a = 10, "param11"_a = 11, "param12"_a = 12,
"param13"_a = 13, "param14"_a = 14, "param15"_a = 15, "param16"_a = 16);
m.def("test_kwargs", [](int, nb::kwargs) {}, "param99"_a = 1, "kwargs"_a);
Microbenchmark script:
from timeit import timeit
import test_functions_ext

ns = {"fn": test_functions_ext.test_many, "fnvar": test_functions_ext.test_kwargs}

def params_str(nums):
    return ",".join(f"param{n:02d}={n}" for n in nums)

results = {}

def run(desc, stmt):
    nsec = timeit(stmt, number=3_000_000, globals=ns) / 3_000_000 * 1_000_000_000
    results.setdefault(desc, []).append(nsec)

# consistent randomly-chosen parameter order
ORDER = [15, 5, 3, 1, 10, 9, 16, 4, 14, 11, 6, 7, 8, 2, 13, 12]

for _ in range(2):
    run(
        "All positional:",
        "fn(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)",
    )
    run(
        "All keyword:",
        "fn({})".format(params_str(ORDER)),
    )
    run(
        "Some keyword, rest defaulted:",
        "fn({})".format(params_str(ORDER[:2])),
    )
    run(
        "All defaulted:",
        "fn()",
    )
    run(
        "Var-kwargs, 16 unmatched:",
        "fnvar({})".format(params_str(ORDER)),
    )
    run(
        "Var-kwargs, 2 unmatched:",
        "fnvar({})".format(params_str(ORDER[:2])),
    )

for desc, times in results.items():
    print(f"{desc:>40}", " / ".join(f"{nsec:.1f}ns" for nsec in times))
Python 3.12, regular ABI:
Stock nanobind:
All positional: 104.7ns / 102.4ns
All keyword: 493.2ns / 493.5ns
Some keyword, rest defaulted: 91.6ns / 91.5ns
All defaulted: 77.3ns / 77.3ns
Var-kwargs, 16 unmatched: 668.1ns / 668.7ns
Var-kwargs, 2 unmatched: 70.0ns / 70.0ns
This PR as uploaded: pretty much a wash with stock nanobind, though var-kwargs + filling in defaults gets somewhat slower
All positional: 104.2ns / 101.7ns
All keyword: 482.3ns / 476.3ns
Some keyword, rest defaulted: 92.8ns / 90.6ns
All defaulted: 73.9ns / 73.8ns
Var-kwargs, 16 unmatched: 752.7ns / 752.6ns
Var-kwargs, 2 unmatched: 85.7ns / 85.6ns
Always compare contents: really bad
All positional: 104.4ns / 102.7ns
All keyword: 1268.2ns / 1264.2ns
Some keyword, rest defaulted: 276.7ns / 275.7ns
All defaulted: 78.9ns / 77.6ns
Var-kwargs, 16 unmatched: 786.4ns / 775.2ns
Var-kwargs, 2 unmatched: 82.2ns / 82.1ns
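For context, a rough reconstruction of what "always compare contents" means per keyword probe (my sketch, not the actual patch): a pointer mismatch can no longer be trusted as a name mismatch, so every miss pays for a real string comparison:

// Hedged sketch of the "always compare contents" idea, not the real diff.
// 'arg_name' is the pre-interned parameter name stored in the function
// record; 'key' is the incoming keyword name from the call site.
static bool name_matches(PyObject *arg_name, PyObject *key) {
    if (arg_name == key)      // fast path: same (interned) object
        return true;
    // Both objects are known to be 'str', so PyUnicode_Compare() cannot
    // raise a type error here; 0 means the contents are equal.
    return PyUnicode_Compare(arg_name, key) == 0;
}

With 16 keywords in scrambled order most probes are misses, and each miss now costs a full string comparison instead of a pointer check, which would explain the "All keyword" numbers above.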
Intern first unconditionally: somewhat worse but maybe acceptable
All positional: 107.7ns / 104.5ns
All keyword: 543.4ns / 540.4ns
Some keyword, rest defaulted: 100.6ns / 100.9ns
All defaulted: 79.1ns / 79.5ns
Var-kwargs, 16 unmatched: 706.0ns / 706.3ns
Var-kwargs, 2 unmatched: 79.9ns / 80.0ns
Intern first if the string object isn't already interned: gets back much of the difference
All positional: 106.3ns / 104.0ns
All keyword: 501.4ns / 500.4ns
Some keyword, rest defaulted: 96.0ns / 95.5ns
All defaulted: 79.1ns / 78.9ns
Var-kwargs, 16 unmatched: 663.3ns / 662.0ns
Var-kwargs, 2 unmatched: 75.1ns / 74.8ns
Fascinating! That's a clear signal: we do not want to run the comparisons by default.
Intern first if the string object isn't already interned
I don't follow what you did here, can you share the diff for this strategy?
One more request: could you run this in a limited API build? Should be as simple as adding STABLE_ABI to the cmake command and ensuring that you have Python 3.12 installed. Be careful to delete the old .so binding file, otherwise Python will prefer the version-specific one.
I am thinking that the "intern first" strategy will win in that situation due to the considerably lower number of tuple indexing operations.
I don't follow what you did here, can you share the diff for this strategy?
Just added one line before InternInPlace:
#if !defined(PYPY_VERSION) && !defined(Py_LIMITED_API)
if (PyUnicode_Check(key) && !(((PyASCIIObject *) key)->state.interned))
#endif
    PyUnicode_InternInPlace(&key_interned);
One more request: could you run this in a limited API build?
Stand by for results there!
I am thinking that the "intern first" strategy will win in that situation due to the considerably lower number of tuple indexing operations.
One thing to note about the PR as currently implemented is that we don't actually enter the second double-loop unless there are unused keywords. That will only happen when there is actually a non-interned keyword, or the function takes **kwargs, or we're about to return failure. My benchmark doesn't capture that case, but I can come up with another one that does.
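For what it's worth, one way to exercise that path could be an extra case appended to the benchmark script above (a sketch I haven't run; whether the names really reach the binding un-interned depends on how CPython forwards the ** call):

# Keyword names built at runtime are typically not interned, unlike the
# literal keyword names the other benchmark statements use.
ns["kw"] = {"param%02d" % n: n for n in ORDER}
run("All keyword, non-interned names:", "fn(**kw)")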
FWIW, I don't think that the PyUnicode_Check(key) could ever fail. The vectorcall protocol spec says that keywords is, quote, "Either NULL or a tuple with the names of the keyword arguments" (source).
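For reference, this is the calling-convention shape that guarantee comes from (a generic vectorcall sketch, not nanobind's actual entry point):

#include <Python.h>

// 'kwnames' is either NULL or a tuple of str objects; the corresponding
// keyword values directly follow the positional arguments in 'args'.
static PyObject *example_vectorcall(PyObject *self, PyObject *const *args,
                                    size_t nargsf, PyObject *kwnames) {
    Py_ssize_t nargs = PyVectorcall_NARGS(nargsf);
    Py_ssize_t nkw = kwnames ? PyTuple_GET_SIZE(kwnames) : 0;
    for (Py_ssize_t i = 0; i < nkw; ++i) {
        PyObject *name  = PyTuple_GET_ITEM(kwnames, i);  // always a str per the spec
        PyObject *value = args[nargs + i];
        // ... match 'name' against the overload's parameter names ...
        (void) name; (void) value;
    }
    (void) self;
    Py_RETURN_NONE;
}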
Stable ABI benchmarks:
Stock nanobind:
All positional: 126.2ns / 123.8ns
All keyword: 706.7ns / 702.7ns
Some keyword, rest defaulted: 150.5ns / 150.5ns
All defaulted: 99.2ns / 99.2ns
Var-kwargs, 16 unmatched: 727.5ns / 712.5ns
Var-kwargs, 2 unmatched: 87.7ns / 87.4ns
PR as uploaded:
All positional: 126.7ns / 123.9ns
All keyword: 692.0ns / 692.2ns
Some keyword, rest defaulted: 165.3ns / 165.6ns
All defaulted: 101.3ns / 101.2ns
Var-kwargs, 16 unmatched: 858.6ns / 839.5ns
Var-kwargs, 2 unmatched: 105.3ns / 105.3ns
Always compare:
All positional: 126.4ns / 123.8ns
All keyword: 1494.5ns / 1474.5ns
Some keyword, rest defaulted: 327.8ns / 327.4ns
All defaulted: 99.2ns / 99.3ns
Var-kwargs, 16 unmatched: 807.9ns / 809.1ns
Var-kwargs, 2 unmatched: 98.6ns / 98.6ns
Intern first:
All positional: 128.4ns / 126.4ns
All keyword: 649.9ns / 651.8ns
Some keyword, rest defaulted: 131.1ns / 130.9ns
All defaulted: 102.0ns / 102.2ns
Var-kwargs, 16 unmatched: 821.0ns / 822.0ns
Var-kwargs, 2 unmatched: 98.9ns / 98.8ns
Great! What about something like this to further accelerate the non-stable ABI variant? (This potentially avoids the alloca and reference counting.)
(Untested, may need some syntax tweaks)
// Ensure that keyword argument names are interned
bool interned;

#if !defined(PYPY_VERSION) && !defined(Py_LIMITED_API)
interned = true;
for (size_t i = 0; i < nkwargs_in; ++i) {
    PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i);
    interned &= ((PyASCIIObject *) key)->state.interned;
}
#else
interned = false;
#endif

PyObject **kwnames = nullptr;
if (NB_LIKELY(interned)) {
    kwnames = kwargs_in;
} else {
    kwnames = (PyObject **) alloca(nkwargs_in * sizeof(PyObject *));
    for (size_t i = 0; i < nkwargs_in; ++i) {
        PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i),
                 *key_interned = key;
        Py_INCREF(key_interned);
        PyUnicode_InternInPlace(&key_interned);
        if (NB_LIKELY(key == key_interned)) // string was already interned
            Py_DECREF(key_interned);
        else
            cleanup.append(key_interned);
        kwnames[i] = key_interned;
    }
}
By the way, is it valid to assume that key is always a PyASCIIObject? I think it is always a unicode object, but IIRC those can be represented by two different data structures depending on whether or not the string is ASCII.
ah nvm, fix incoming
I updated the non-stable-ABI benchmarks with the new var-kwargs tests as well, and updated the comment above to show my current benchmarking script.
I think "intern first" clearly wins on stable ABI. It's less clear on the regular ABI; it depends on whether you'd rather penalize var-kwargs functions with unspecified defaulted args a lot, or everyone a little bit. I'll try your suggested tweak and see if we can close the gap there. I think "intern first" wins overall because it's so much simpler; I'm guessing you feel the same?
By the way, is it valid to assume that key is always a PyASCIIObject? I think it is always a unicode object, but IIRC those can be represented by two different data structures depending on whether or not the string is ASCII.
PyCompactUnicodeObject adds additional fields past the end of PyASCIIObject, so this is safe. I also just noticed there's a macro PyUnicode_CHECK_INTERNED, which I'll use for better clarity here.
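Concretely, with that macro the check in the loop above would read (same behavior; the macro just wraps the same state.interned field access):

for (size_t i = 0; i < nkwargs_in; ++i) {
    PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i);
    // equivalent to ((PyASCIIObject *) key)->state.interned != 0
    interned &= PyUnicode_CHECK_INTERNED(key) != 0;
}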
This works apparently:
def f(**kwargs):
    print(kwargs)

f(ğ=1)
NVM, I see now that the fancier versions all build on the ascii object.
How about this?
// Ensure that keyword argument names are interned. That makes it faster
// to compare them against pre-interned argument names in the overload chain.
PyObject **kwnames = nullptr;

#if !defined(PYPY_VERSION) && !defined(Py_LIMITED_API)
bool kwnames_interned = true;
for (size_t i = 0; i < nkwargs_in; ++i) {
    PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i);
    kwnames_interned &= ((PyASCIIObject *) key)->state.interned != 0;
}

if (NB_LIKELY(kwnames_interned)) {
    kwnames = ((PyTupleObject *) kwargs_in)->ob_item;
    goto traverse_overloads;
}
#endif

kwnames = (PyObject **) alloca(nkwargs_in * sizeof(PyObject *));
for (size_t i = 0; i < nkwargs_in; ++i) {
    PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i),
             *key_interned = key;
    Py_INCREF(key_interned);
    PyUnicode_InternInPlace(&key_interned);
    if (NB_LIKELY(key == key_interned)) // string was already interned
        Py_DECREF(key_interned);
    else
        cleanup.append(key_interned);
    kwnames[i] = key_interned;
}

traverse_overloads:
Perfect.
=== 3.12 ABI ===
stock:
All positional: 104.9ns / 102.2ns
All keyword: 495.5ns / 494.8ns
Some keyword, rest defaulted: 91.5ns / 91.4ns
All defaulted: 77.3ns / 77.3ns
Var-kwargs, 16 unmatched: 659.9ns / 660.4ns
Var-kwargs, 2 unmatched: 69.9ns / 70.0ns
intern before:
All positional: 105.3ns / 102.7ns
All keyword: 489.0ns / 490.5ns
Some keyword, rest defaulted: 93.1ns / 92.9ns
All defaulted: 77.2ns / 77.3ns
Var-kwargs, 16 unmatched: 658.5ns / 657.5ns
Var-kwargs, 2 unmatched: 70.6ns / 70.3ns
Verdict: no noticeable difference.
=== Stable ABI ===
stock:
All positional: 126.2ns / 123.8ns
All keyword: 713.0ns / 715.9ns
Some keyword, rest defaulted: 150.5ns / 150.7ns
All defaulted: 99.2ns / 99.3ns
Var-kwargs, 16 unmatched: 737.5ns / 738.9ns
Var-kwargs, 2 unmatched: 88.7ns / 88.5ns
intern before:
All positional: 128.7ns / 126.4ns
All keyword: 614.0ns / 615.5ns
Some keyword, rest defaulted: 131.1ns / 131.1ns
All defaulted: 102.1ns / 102.1ns
Var-kwargs, 16 unmatched: 757.9ns / 758.5ns
Var-kwargs, 2 unmatched: 98.7ns / 98.8ns
Verdict: clear improvement on a few benchmarks thanks to the reduced number of tuple indexing calls; the cost on the others is not awful.
Great! Can you update the commit with this one and add a changelog entry?
Should be all set! Thanks for your help in figuring out a better approach here.
Thanks 👍
Fixes #468.
Size impact on my laptop: nb_func.cpp.o grows by 368 bytes (text size 22565 before, 22933 after).
Performance impact based on some timeit microbenchmarks (run twice in each configuration):
test_functions_ext.test_02(3, 5) goes from 32.4/32.3 ns to 32.0/31.8 ns
test_functions_ext.test_02(down=3, up=5) goes from 35.3/35.1 ns to 34.4/34.4 ns
test_functions_ext.test_02(down=3) goes from 33.3/33.5 ns to 34.0/34.0 ns
Which feels like basically a wash to me.
I didn't bother clearing the cast_flags::deferred flags used for this bookkeeping before calling the function, because that takes time and it seemed easy enough to just tell type casters to ignore the new flag.