Setting __release_gil__ to True for a function sometimes breaks calling the function from multiple threads

levy commented 2 years ago

I'm not totally sure that I'm using __release_gil__ properly. In the real example the function takes a lot of time to execute and I need parallelism there.

pip3 list:

cppyy                   2.3.1
cppyy-backend           1.14.8
cppyy-cling             6.25.3
CPyCppyy                1.12.10

test.h:

class Simulation
{
  public:
    virtual void do_something();
};

test.cc:

#include "test.h"

void Simulation::do_something()
{
}

test.py:

import cppyy
import threading

cppyy.add_include_path(".")
cppyy.add_library_path(".")
cppyy.load_library("libtest")
cppyy.include("test.h")

cppyy.gbl.Simulation.do_something.__release_gil__ = True

def test(i):
    simulation = cppyy.gbl.Simulation()
    simulation.do_something()

test(0)

for i in range(0, 10):
    threading.Thread(target=test, args=[i]).start()

compile:

clang++ -fPIC -shared -o libtest.so test.cc

run:

levy@valardohaeris:~$ python3 test.py
Exception in thread Thread-9 (test):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "/home/levy/test.py", line 13, in test
    simulation.do_something()
TypeError: void Simulation::do_something() =>
    TypeError: takes at most 0 arguments (1 given)

The above output pops up only occasionally, but removing the non-threaded test call makes the problem worse. Of course, removing the __release_gil__ setting fixes the problem.

wlav commented 2 years ago

I'm willing to bet that it has the same cause as #61, just more likely with increased parallelism with the GIL released during the C++ portion.

What I observe is that for the failing case, the vectorcall flags the presence of self in the stack, which requires adjustment of the arguments, hence the extra one. What I don't understand is that the arguments sent in from threading are only 2 different stacks at most, while the bit PY_VECTORCALL_ARGUMENTS_OFFSET is set, which per vectorcall rules means they are modifiable (i.e. self can be inserted into slot 0). If I disregard that bit, then this bug is fixed (or at least, appears so), but not the other one. But either way, it's Python setting that bit, so I don't understand how it could be wrong (I can find no bug reports of this) and so I don't think it's the real cause, just a symptom.

levy commented 2 years ago

Yeah, I had the feeling that this bug is related somehow. Without me knowing too much detail about how the Python interpreter works, are you saying that the PY_VECTORCALL_ARGUMENTS_OFFSET bit and the self parameter on the stack is somehow messed up in concurrently running threads? So there must be a lock missing somewhere, either in Python or in the C++ glue code that is used/generated by cppyy?

levy commented 2 years ago

Maybe Python reuses the argument vector for subsequent calls, and perhaps the args vector is even shared between threads, because GIL prevents concurrency problems anyway. Except in this case cppyy releases the lock when the actual C++ call happens?

wlav commented 2 years ago

Maybe; I've been thinking the same thoughts, but haven't been able to pin-point anything.

There is one CPPOverload object created per two threads, so certainly there is some re-use. Before vector calls, I could control re-use in the descriptor and make sure methods were properly re-initialized. Now it's up to the interpreter. I'm finding that in the problematic cases, the two self objects provided are indeed from the current and the previous call: one is passed as the first argument on the stack, the other as a self argument to the vector call.

It's not clear to me how the re-use is causing problems. I.e. whether the re-use of the argument stack is concurrent (which would be a locking problem) or whether the argument stack is inadvertently modified and not properly restored either b/c it's re-used "too quickly" or b/c there's a bug in the cleanup code.

I'm kindof leaning towards the latter: if __release_gil__ is not set, then everything, including the bindings, executes under the GIL. Additionally, all reflection lookups and wrapper creation are locked (the next version, with LLVM13, will see a JIT that can compile in parallel, but right now it has another global lock). Also, since self is provided twice, I can simply null out one of the two selfs in the call when the CPPOverload was created directly (as opposed to through a descriptor) and that solves the problem. Doing that is not completely non-sensical, but I'd like to find some confirmation in the Python sources that that is the proper way of handling vector calls that I simply have missed. (A descriptor will provide a separate self, a vector call will not, so maybe that's really the way it should have been handled. Unfortunately, with a dearth of examples, the vector call code was written through experimental programming. I.e. trial and error. :) )

wlav commented 2 years ago

Yes, the above was pretty much it: each CPPOverload is used in two threads. At the start, they check whether they are bound (they're not for vector calls), then proceed to use their internal self data member b/c that's how things rolled historically in py2. If two threads start at the same time, that's okay: self isn't set until later on, so both threads see nullptr; and it isn't actually used when set, since the vector call will have all arguments, including self on the stack. However, if there are enough threads to saturate the machine, or otherwise timing differences crop up, e.g. by releasing the GIL (Python counts instructions, not execution time, which keeps things pretty much in sync), then one thread may have that internal self in use at the point when the other checks for it. Only then are both the internal self and the one on the stack used, leading to the extra parameter seen.

The solution is not to lock around that self data member, but to use a stack variable (thus per-thread) for the internal use that the self data member served (it's still needed for descriptor-based calls, which simply assigns to the stack variable now).

Fix is here: https://github.com/wlav/CPyCppyy/commit/b4dbd0803bc47c26a81954f1aae85621d42a3138

levy commented 2 years ago

I'm blown away with your support!

I have installed your patch and indeed it fixes the issue. This is the first time that even the more complicated use case works.

I re-installed the cppyy libraries according to the documentation. There was a small issue though when installing the last part (cppyy itself):

ERROR: Disabling PEP 517 processing is invalid: project specifies a build backend of cppyy_monkey_patch:main in pyproject.toml

So perhaps the last command no longer needs the no-use-pep517 parameter?

Thank you for the quick responses, anyway!

wlav commented 2 years ago

I fixed the document. The custom installer (cppyy_monkey_patch) for PEP517 is needed b/c it doesn't recognize PyPy as a different platform from CPython.

wlav commented 2 years ago

Released with 2.4.0 and its dependencies.

wlav / cppyy

Setting __release_gil__ to True for a function sometimes breaks calling the function from multiple threads #62