Closed: vfdev-5 closed this issue 3 weeks ago
Hi @vfdev-5,
I tried your example and also ran the test suite with TSAN. At first I got a huge number of warnings about races in inst_new_int and inst_new_ext (which made no sense, because these functions acquire a lock). Switching the associated mutex from a Python mutex to a std::mutex made these go away, which made me suspect that Python itself probably needs to be built with TSAN as well.
Compiling Python with ./configure --disable-gil --with-thread-sanitizer resolved these warnings for me.
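For reference, a minimal sketch of the build steps implied above; the configure flags are the ones from this thread, while the install prefix and job count are placeholders, not from the discussion:

```shell
# Sketch: build a free-threaded, TSAN-instrumented CPython.
# --disable-gil and --with-thread-sanitizer are the flags mentioned above;
# the prefix and -j value are placeholders.
./configure --disable-gil --with-thread-sanitizer --prefix="$HOME/cpython-tsan"
make -j8
make install
```

The extension itself must also be compiled and linked with -fsanitize=thread so that both sides of every synchronization are instrumented.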
A question for @colesbury: I suspect that TSAN gets confused by the non-standard Python mutex unless both Python and the extension are compiled with TSAN enabled. Is that possible?
@colesbury: While debugging, I also noticed the following:
We currently immortalize type objects with the following assignments:

o->ob_tid = _Py_UNOWNED_TID;
o->ob_ref_local = _Py_IMMORTAL_REFCNT_LOCAL;
o->ob_ref_shared = 0;
However, these values don't all stay frozen. In one debugging session, I noticed an immortal type object with an o->ob_ref_shared value as large as 3200000. Is this expected behavior?
This is on Python 3.14 master.
I suspect this isn't going to work unless the Python interpreter is built with TSAN, given that the atomics in the PyMutex implementation won't be subject to TSAN instrumentation without it.
I also suspect that PyMutex would probably benefit from tsan annotations: https://github.com/llvm-mirror/compiler-rt/blob/69445f095c22aac2388f939bedebf224a6efcdaf/include/sanitizer/tsan_interface.h#L27 but I don't think that's mandatory.
Yeah, in general you want your whole application compiled with tsan or it will miss synchronizations in the non-instrumented portion. (I don't think the mutex-specific warnings are particularly useful, which is why we don't use them.)
I'm not sure what would cause ob_ref_shared to be in the 3200000 range. Objects that use deferred reference counting have a large constant value (2^62-1, I think) added to ob_ref_shared, but that's much larger than 3200000.
@colesbury: I tracked it down to _PyType_MergeThreadLocalRefcounts. It seems to be called even for immortal objects and increases ob_ref_shared. This is the backtrace:
frame #1: 0x00000001003350bc python3` _PyType_MergeThreadLocalRefcounts(tstate=0x000000010611c000) + 340 at typeid.c:159
frame #2: 0x0000000100335130 python3` _PyType_FinalizeThreadLocalRefcounts(tstate=0x000000010611c000) + 36 at typeid.c:169
frame #3: 0x0000000100308444 python3` PyThreadState_Clear(tstate=0x000000010611c000) + 2044 at pystate.c:1748
frame #4: 0x00000001003c68dc python3` thread_run(boot_raw=0x0000000104c011a0) + 468
frame #5: 0x000000010032df80 python3` pythread_wrapper(arg=0x0000000104a00050) + 68 at thread_pthread.h:243
frame #6: 0x0000000100bc45b0 libclang_rt.tsan_osx_dynamic.dylib` __tsan_thread_start_func + 144
frame #7: 0x000000018525bfa8 libsystem_pthread.dylib` _pthread_start + 148
Thanks a lot @wjakob for checking this, and sorry for the noise. Compiling python3.13t with TSAN definitely removes a lot of the reported data races in the repro code and in my other tests.
@vfdev-5 Can the issue be closed? Or do you still see races that could be a problem in nanobind?
> It seems to be called even for the immortal objects and increases ob_ref_shared

Ah, that makes sense. It's confusing, but I don't think it really matters. Once PyUnstable_Object_EnableDeferredRefcount is made public, we should update nanobind to use it instead of making heap types immortal (for 3.14+).
@wjakob yes, we can close this issue. No problems from nanobind side.
Problem description
Hi @wjakob, I still see some data race reports from TSAN, sometimes mentioning keep_alive or nb_type_put_common or inst_new_ext. For example, here is a report I could get using master (c1bab7e4207566b75bbc51c35079f66ae6f0afc0):

I created a reproducible example below:
C++ extension
CMakeLists.txt:
Can you please confirm the issue and maybe hint at what could be wrong in nanobind or in my usage? Thanks!
Reproducible example code