Closed GaetanLepage closed 1 month ago
Logs:
More precisely, the issue is fixed since/by https://github.com/wjakob/nanobind/commit/eed820109534134dcd5da9ef841159819434fcc1.
Would you be able to provide a backtrace of a crashing build in debug mode? It's really difficult to say what the problem might be based on this information.
I am not sure how to extract such a backtrace, given that it is the Python process itself that is crashing. We will most likely wait for the next release of nanobind and disable the jax-related tests in the meantime.
I'm concerned that there may be another issue. The commit you listed doesn't really explain why one version crashes and the other one works. Could you run pytest with gdb --args python3 -m pytest, after having made a debug build? Then, when you encounter the failure, type "bt" to get a backtrace.
ping @GaetanLepage
Sorry for the delay :/
I have been trying to get this working, but it is not very easy within the nix sandbox.
I did compile the tests using make -d to get the debugging symbols in.
However, when I run the tests with gdb, I don't get any interesting output.
It says, before running the tests:
Reading symbols from python...
(No debugging symbols found in python)
Am I doing something wrong?
Hi @GaetanLepage,
it's expected that Python itself would not have interesting debug symbols; it's the plugin that will provide them. To get CMake to build the nanobind test suite with debug symbols, I don't think that make -d is enough. You need to run the CMake process with -DCMAKE_BUILD_TYPE=Debug and then compile. After starting gdb with the arguments I specified earlier, you need to enter run so that it actually launches the process. This should then reproduce your crash. At that point, you can enter bt to get the backtrace.
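The interactive steps above (start gdb, enter run, then bt) can also be scripted, which may help inside a restricted build environment. The following is a minimal sketch, not something from the nanobind documentation: the helper name gdb_backtrace_command and the tests directory are hypothetical, and it assumes the extension was already configured with -DCMAKE_BUILD_TYPE=Debug. It only builds the command line; -batch, -ex, and --args are standard gdb options.

```python
import shlex


def gdb_backtrace_command(pytest_args=()):
    """Build a gdb command line that runs pytest and prints a backtrace.

    Hypothetical helper for illustration; it does not launch gdb itself.
    """
    return [
        "gdb",
        "-batch",        # exit after the -ex commands have been executed
        "-ex", "run",    # launch the process, same as entering "run"
        "-ex", "bt",     # print the backtrace once the process has stopped
        "--args", "python3", "-m", "pytest", *pytest_args,
    ]


# "tests" is an assumed location of the test suite.
cmd = gdb_backtrace_command(["tests"])
print(shlex.join(cmd))
```

The printed command can be pasted into a shell, or the list can be passed to subprocess.run. If the process exits normally, gdb has no stack to print, so the bt step is only informative when the crash actually reproduces.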
Thank you for those precise instructions. I was able to perform those operations by exiting the sandbox. Fortunately, I was able to replicate the crash.
Here is the backtrace:
========================================================================== 391 passed, 4 skipped in 14.49s ===========================================================================
Thread 1 "pt_main_thread" received signal SIGABRT, Aborted.
0x00007ffff76a2efc in __pthread_kill_implementation () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
(gdb)
(gdb) bt
#0 0x00007ffff76a2efc in __pthread_kill_implementation () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#1 0x00007ffff7652e86 in raise () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#2 0x00007ffff763b935 in abort () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#3 0x00007ffff63aa137 in nanobind::detail::internals_cleanup () at /home/gaetan/temp/nanobind/src/nb_internals.cpp:312
#4 0x00007ffff7b72b71 in Py_FinalizeEx.part.0 () from /nix/store/7hnr99nxrd2aw6lghybqdmkckq60j6l9-python3-3.11.9/lib/libpython3.11.so.1.0
#5 0x00007ffff7b79248 in Py_RunMain () from /nix/store/7hnr99nxrd2aw6lghybqdmkckq60j6l9-python3-3.11.9/lib/libpython3.11.so.1.0
#6 0x00007ffff763d10e in __libc_start_call_main () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#7 0x00007ffff763d1c9 in __libc_start_main_impl () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#8 0x0000000000401075 in _start ()
We are preparing the upgrade to nanobind 2.0 and there, this issue does not occur.
The tests work fine, and pytest exits properly, even though jax and jaxlib are present in the environment.
Awesome. One last question: which version of nanobind is this? Can you tell me what's on nb_internals.cpp line 312 in your version?
This is on the v1.9.2 tag of nanobind. The same process on tag v2.0.0 does not crash.
Here are lines 311-313 of nb_internals.cpp:
#if defined(NB_ABORT_ON_LEAK)
abort(); // Extra-strict behavior for the CI server
#endif
Ok. So this is intentional. There is a reference leak, and the test suite crashes at the end to point everyone's attention to this. (Reference leaks are detected all the way at the end when the interpreter shuts down, and at that point this is the only way to make sure the issue doesn't go unnoticed). I will close this then.
Thanks for helping to localize the issue!
The issue here very likely lies with one of the other tensor frameworks. They sometimes hold on to the last ndarray converted and don't release a reference to a nanobind object by the time this shutdown routine is called. It's a benign issue.
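The mechanism described above (count the objects still alive at shutdown and optionally abort) can be illustrated with a small pure-Python analogy. This is a sketch under stated assumptions, not nanobind's implementation: Tracked, live, make, and leaked_at_shutdown are hypothetical names, and the cache dictionary stands in for a framework that keeps the last converted object alive.

```python
import gc
import os
import weakref


class Tracked:
    """Stand-in for a bound object whose lifetime the binding layer tracks."""


# Hypothetical registry: entries disappear automatically when objects die.
live = weakref.WeakSet()


def make():
    obj = Tracked()
    live.add(obj)
    return obj


def leaked_at_shutdown(abort_on_leak=False):
    """Return the number of tracked objects that are still alive."""
    gc.collect()
    count = len(live)
    if count and abort_on_leak:
        os.abort()  # strict mode, analogous in spirit to NB_ABORT_ON_LEAK
    return count


cache = {"last": make()}  # something else still holds a reference
temp = make()
del temp                  # released normally, so it is not reported
count = leaked_at_shutdown()
print(f"leaked instances: {count}")
```

Clearing the cache before the check makes the count drop to zero, which mirrors why the abort disappears once the other framework releases its reference in time.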
Ok great! Thanks for your patience. So this was caused by JAX somehow? What have you changed since then that makes this issue go away?
Problem description
When running the test suite while the latest jax/jaxlib (v0.4.28) is installed, pytest suddenly crashes with Aborted (core dumped) after the tests have (supposedly) all succeeded. This weird behavior doesn't happen if I uninstall the jax library (the tests are then skipped and pytest quits without error).
More interestingly, pytest runs fine when I use the latest commit (https://github.com/wjakob/nanobind/commit/c5454462e35f29310df05b412b5c48997d634bdd as of today). The crash only occurs on tag v1.9.2.
Context: Updating jax in the nixpkgs repo.
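Because the abort happens after the tests have passed, a packaging script may want to tell "tests failed" apart from "process was killed by a signal during shutdown". The following is a minimal sketch, assuming a POSIX system, where subprocess reports a child terminated by signal N as return code -N. The helper describe_exit is hypothetical, and the child process is a stand-in that simply calls os.abort() rather than the real test suite.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short description (POSIX)."""
    if returncode < 0:
        # A negative value means the child was terminated by that signal.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


# Stand-in child that aborts, in place of: python3 -m pytest
child = subprocess.run([sys.executable, "-c", "import os; os.abort()"])
print(describe_exit(child.returncode))
```

In a real check, the command list would be replaced by the pytest invocation, and a SIGABRT result following a fully passing test summary would point to the shutdown path rather than to a failing test.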
Reproducible example code
No response