
[BUG]: nanobind wrapping of Python-arrays slower than pybind11 #367

Closed: jorgensd closed this issue 12 months ago

jorgensd commented 12 months ago

Problem description

In DOLFINx, we are currently seeing some major performance regressions when using nanobind instead of pybind11: https://github.com/FEniCS/dolfinx/issues/2891

I've tried to dig into what is wrong, but haven't been able to explain the 2x slowdown with nanobind. To isolate the issue, I compared wrapping a numpy array with nanobind and with pybind11; the code is available at https://github.com/jorgensd/nanobind_example/tree/dokken/nano_vs_pb

i.e. the pybind11 binding:

#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>
#include <pybind11/pytypes.h>
#include <algorithm>
#include <span>
namespace py = pybind11;
namespace
{
    template <typename T>
    void spanned_function(std::span<T> b)
    {
        std::for_each(b.begin(), b.end(), [](auto &val)
                      { val *= -2; });
    }
}
PYBIND11_MODULE(pybind11_example_ext, m)
{
    m.def(
        "add_vector",
        [](py::array_t<double, py::array::c_style> b)
        {
            spanned_function<double>(std::span<double>(b.mutable_data(), b.size()));
        },
        py::arg("b"));
}

and the nanobind binding:

#include <nanobind/nanobind.h>
#include <nanobind/stl/complex.h>
#include <complex.h>
#include <nanobind/ndarray.h>
#include <algorithm>
#include <span>
namespace nb = nanobind;

using namespace nb::literals;

namespace
{
    template <typename T>
    void spanned_function(std::span<T> b)
    {
        std::for_each(b.begin(), b.end(), [](auto &val)
                      { val *= -2; });
    }
}
NB_MODULE(nanobind_example_ext, m)
{
    m.def(
        "add_vector",
        [](nb::ndarray<double, nb::ndim<1>, nb::c_contig> b)
        {
            spanned_function<double>(std::span<double>(b.data(), b.size()));
        },
        nb::arg("b").noconvert());
}

Running the code in the reproducible example below, I get the timings:

Nanobind: 5.960962e-02
Pybind: 5.023153e-02

for an array with 100 million entries, and

Nanobind: 5.951924e-01
Pybind: 5.107849e-01

for an array with 1 billion entries.

As I stated earlier, I know this is not the whole cause of the regression (but it could be part of it). Do you have any insight for us? The binding that we are currently timing is: https://github.com/FEniCS/dolfinx/commit/0edd19f3cbed5d9acc911610aef4703410123dad#diff-97f6b59b7db9ef802d0a9cff4b9a3780864c3aa8c5efe78a920a7a04c428714aL147-L162

Reproducible example code

import nanobind_example as m
import numpy as np
import time

arr = np.arange(10_000_000, dtype=np.float64)  # 10 million float64 entries

def test_vector():
    # time the nanobind wrapper
    arr_nb = arr.copy()
    start = time.perf_counter()
    m.add_vector(arr_nb)
    end = time.perf_counter()
    print(f"Nanobind: {end-start:5e}")

    # time the pybind11 wrapper
    arr_pb = arr.copy()
    start = time.perf_counter()
    m.add_vector_pb(arr_pb)
    end = time.perf_counter()
    print(f"Pybind: {end-start:5e}")

    # both wrappers must produce the same result
    assert np.allclose(arr_nb, arr_pb)

jorgensd commented 12 months ago

At least for the regression seen in DOLFINx, we have found a resolution (https://github.com/FEniCS/dolfinx/pull/2895#issuecomment-1817294687). I will test it for the case above today. Could you comment on https://github.com/FEniCS/dolfinx/pull/2895#issuecomment-1817294687?

wjakob commented 12 months ago

If you are doing heavy computations in your bindings themselves rather than the more typical setup of forwarding calls to another library that does the heavy lifting, then NOMINSIZE is probably an option you will want to add.
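
For reference, NOMINSIZE is an option of nanobind's nanobind_add_module() CMake command: by default the extension is compiled with optimizations that minimize binary size, and NOMINSIZE turns those off. The distinction described above can be sketched roughly as follows (hypothetical names sketch_ext and apply_kernel; only pattern B keeps the hot loop out of the extension's own translation unit):

#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>
#include <algorithm>
#include <span>

namespace nb = nanobind;

// Hypothetical function, defined in a separately compiled C++ library and
// built with that library's own optimization flags.
void apply_kernel(std::span<double> b);

NB_MODULE(sketch_ext, m)
{
    // Pattern A: the loop lives in the binding lambda, so its machine code is
    // generated when the extension itself is compiled (size-optimized by default).
    m.def("run_inline",
          [](nb::ndarray<double, nb::ndim<1>, nb::c_contig> b)
          {
              std::span<double> s(b.data(), b.size());
              std::for_each(s.begin(), s.end(), [](double &val) { val *= -2; });
          });

    // Pattern B: thin forwarding wrapper; the hot loop was already compiled
    // into the library, so the extension's own flags barely matter.
    m.def("run_forwarded",
          [](nb::ndarray<double, nb::ndim<1>, nb::c_contig> b)
          { apply_kernel(std::span<double>(b.data(), b.size())); });
}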

jorgensd commented 12 months ago

> If you are doing heavy computations in your bindings themselves rather than the more typical setup of forwarding calls to another library that does the heavy lifting, then NOMINSIZE is probably an option you will want to add.

In the DOLFINx wrapper itself, we do not do any heavy lifting; it calls into the DOLFINx C++ library that has already been compiled.

IgorBaratta commented 12 months ago

Maybe the issue is that some of the functions we wrap from dolfinx are templated, so they are compiled together with the bindings.
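
If that is the case, those templates are instantiated inside the binding translation unit and therefore compiled with the bindings' (size-optimized) flags. Besides building with NOMINSIZE, one possible workaround is to instantiate the hot templates explicitly in the C++ library, so the bindings only link against code compiled with the library's flags. A rough sketch with hypothetical file and function names (kernels.h, kernels.cpp, scale):

// --- kernels.h (hypothetical): part of the library's public headers ---
#pragma once
#include <algorithm>
#include <span>

template <typename T>
void scale(std::span<T> b, T factor)
{
    std::for_each(b.begin(), b.end(), [factor](T &val) { val *= factor; });
}

// Ask including translation units (such as the bindings) not to instantiate
// scale<double> themselves; the library ships the compiled instantiation.
extern template void scale<double>(std::span<double>, double);

// --- kernels.cpp (hypothetical): compiled into the library itself ---
#include "kernels.h"

// Explicit instantiation, compiled with the library's optimization flags.
template void scale<double>(std::span<double>, double);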

wjakob commented 12 months ago

In any case, as long as your performance regression disappears with NOMINSIZE, it seems to me that this issue can be closed.