wjakob / nanobind

nanobind: tiny and efficient C++/Python bindings
BSD 3-Clause "New" or "Revised" License
2.32k stars 194 forks source link

[BUG]: Eigen integration silently fails with recent versions of Eigen #746

Open dfm opened 2 weeks ago

dfm commented 2 weeks ago

Problem description

When I build the following simple nanobind module with the current master branch of Eigen:

#include <nanobind/nanobind.h>
#include <nanobind/eigen/dense.h>

namespace nb = nanobind;

NB_MODULE(nanobind_example_ext, m) {
    using Matrix = Eigen::Matrix<int, 1, 1>;
    m.def("add", [](Matrix a, Matrix b) { return a + b; });
}

The output of calling:

import numpy as np
nanobind_example.add(np.int32([0]), np.int32([0]))

is array([17726950], dtype=int32) instead of the expected array([0], dtype=int32).

This can be fixed by adding a return type:

-     m.def("add", [](Matrix a, Matrix b) { return a + b; });
+     m.def("add", [](Matrix a, Matrix b) -> Matrix { return a + b; });

and the example above works properly using the last stable release of Eigen, so I expect it has something to do with a change in the type produced by the addition. Unfortunately I don't know enough about the nanobind (or Eigen!) internals to know where to start debugging this, but I wanted to check here if this was an expected sharp edge, or if this is something that should work.

I note that I originally came across this issue when I discovered a test failure here:

https://github.com/wjakob/nanobind/blob/f3e2796d7bd12f77856eef53772644115a5e8788/tests/test_eigen.py#L303-L305

which has this same syntax.

Reproducible example code

#include <nanobind/nanobind.h>
#include <nanobind/eigen/dense.h>

namespace nb = nanobind;

NB_MODULE(nanobind_example_ext, m) {
    using Matrix = Eigen::Matrix<int, 1, 1>;
    m.def("add", [](Matrix a, Matrix b) { return a + b; });
}
import numpy as np
assert nanobind_example.add(np.int32([0]), np.int32([0])) == 0
wjakob commented 2 weeks ago

I can't reproduce this issue on my machine. What version of Eigen are you using? What OS is this on?

wjakob commented 2 weeks ago

I did sme more checking to see if there could be an issue involving ownership management: the expression caster turns the addition expression into a Eigen::Array<int, 1, 1, 0, 1, 1> and then uses the rv_policy::move RVP to convert it into a Python object. Since the array only encapsulates a small amount of memory, nanobind downgrades this to rv_policy::copy and creates a new NumPy array containing a copy of the result. It all looks good to me, so I don't understand the issue.

The CI also hasn't had shown any failures here in the last months. Could it be that there is a problem on your end? If this is truly a nanobind bug, could I ask you to make a PR that causes a failure to help reproducibility on my end?

dfm commented 2 weeks ago

Thanks for looking into this! I'm seeing this on my macbook with commit 44b16f48cb5c5e7b9f5ed2b4d6a14ed14ab302e1 of Eigen (the current tip of the master from https://gitlab.com/libeigen/eigen). The original error showed up in our internal CI at Google with an upgrade of Eigen to commit 2d4c9b400cca33d2f5cf316efc7151236244edb1 (the nanobind test10_eigen_scalar_default started failing then).

Let me try bisecting Eigen to see if I can narrow down when this occurred. I'll report back ASAP with more info. Thanks!

dfm commented 2 weeks ago

Bisecting was actually fast! It looks like the offending Eigen commit actually is 2d4c9b400cca33d2f5cf316efc7151236244edb1. I'm happy to open a PR using this Eigen commit to try to reproduce in CI if that would be helpful.

hawkinsp commented 2 weeks ago

One possibly relevant thing is that both Google's CI and your macbook likely use clang as a compiler.

I tried to reproduce the same failure on my Linux machine with Eigen from master and was unable, but perhaps I didn't get the compiler stars to align right.

dfm commented 2 weeks ago

I expect @hawkinsp is right here! I've opened https://github.com/wjakob/nanobind/pull/749 demonstrating the issue in CI, and everything works fine using GCC on Ubuntu, but fails on macOS with clang. Hopefully this can be illuminating as to what's going on here!