levy closed this issue 2 years ago
I'm willing to bet that it has the same cause as #61, just made more likely by the increased parallelism when the GIL is released during the C++ portion.
What I observe is that for the failing case, the vectorcall flags the presence of `self` in the stack, which requires adjustment of the arguments, hence the extra one. What I don't understand is that the arguments sent in from threading are only two different stacks at most, while the `PY_VECTORCALL_ARGUMENTS_OFFSET` bit is set, which per the vectorcall rules means they are modifiable (i.e. `self` can be inserted into slot 0). If I disregard that bit, then this bug is fixed (or at least appears to be), but not the other one. Either way, it's Python setting that bit, so I don't understand how it could be wrong (I can find no bug reports of this), and so I don't think it's the real cause, just a symptom.
Yeah, I had the feeling that this bug is related somehow. Without knowing too much detail about how the Python interpreter works: are you saying that the `PY_VECTORCALL_ARGUMENTS_OFFSET` bit and the `self` parameter on the stack are somehow messed up in concurrently running threads? So there must be a lock missing somewhere, either in Python or in the C++ glue code that is used/generated by cppyy?
Maybe Python reuses the argument vector for subsequent calls, and perhaps the args vector is even shared between threads, because the GIL prevents concurrency problems anyway. Except that in this case cppyy releases the lock when the actual C++ call happens?
Maybe; I've been thinking the same thoughts, but haven't been able to pin-point anything.
There is one `CPPOverload` object created per two threads, so certainly there is some re-use. Before vector calls, I could control re-use in the descriptor and make sure methods were properly re-initialized. Now it's up to the interpreter. I'm finding that in the problematic cases, the two `self` objects provided are indeed from the current and the previous call: one is passed as the first argument on the stack, the other as a `self` argument to the vector call.
It's not clear to me how the re-use is causing problems, i.e. whether the re-use of the argument stack is concurrent (which would be a locking problem) or whether the argument stack is inadvertently modified and not properly restored, either b/c it's re-used "too quickly" or b/c there's a bug in the cleanup code.
I'm kind of leaning towards the latter: if `__release_gil__` is not set, then everything, including the bindings, executes under the GIL. Additionally, all reflection lookups and wrapper creation are locked (the next version, with LLVM 13, will have a JIT that can compile in parallel, but right now it has another global lock). Also, since `self` is provided twice, I can simply null out one of the two `self`s in the call when the `CPPOverload` was created directly (as opposed to through a descriptor) and that solves the problem. Doing that is not completely non-sensical, but I'd like to find some confirmation in the Python sources that this is the proper way of handling vector calls, which I may simply have missed. (A descriptor will provide a separate `self`, a vector call will not, so maybe that's really the way it should have been handled. Unfortunately, with a dearth of examples, the vector call code was written through experimental programming, i.e. trial and error. :) )
Yes, the above was pretty much it: each `CPPOverload` is used in two threads. At the start, they check whether they are bound (they're not for vector calls), then proceed to use their internal `self` data member, b/c that's how things rolled historically in py2. If two threads start at the same time, that's okay: `self` isn't set until later on, so both threads see `nullptr`; and it isn't actually used when set, since the vector call will have all arguments, including `self`, on the stack. However, if there are enough threads to saturate the machine, or timing differences otherwise crop up, e.g. by releasing the GIL (Python counts instructions, not execution time, which keeps things pretty much in sync), then one thread may have that internal `self` in use at the point when the other checks for it. Only then are both the internal `self` and the one on the stack used, leading to the extra parameter seen.
The solution is not to lock around that `self` data member, but to use a stack variable (thus per-thread) for the internal use that the `self` data member served (it is still needed for descriptor-based calls, which now simply assign to the stack variable).
Fix is here: https://github.com/wlav/CPyCppyy/commit/b4dbd0803bc47c26a81954f1aae85621d42a3138
I'm blown away with your support!
I have installed your patch and indeed it fixes the issue. This is the first time that even the more complicated use case works.
I re-installed the cppyy libraries according to the documentation. There was a small issue, though, when installing the last part (cppyy itself):

```
ERROR: Disabling PEP 517 processing is invalid: project specifies a build backend of cppyy_monkey_patch:main in pyproject.toml
```

So perhaps the last command no longer needs the `--no-use-pep517` parameter?
Thank you for the quick responses, anyway!
I fixed the document. The custom installer (`cppyy_monkey_patch`) for PEP 517 is needed b/c PEP 517 processing doesn't recognize PyPy as a different platform from CPython.
Released with 2.4.0 and its dependencies.
I'm not totally sure that I'm using `__release_gil__` properly. In the real example the function takes a lot of time to execute and I need parallelism there.

pip3 list:
test.h:
test.cc:
test.py:
compile:
run:
The above output pops up only occasionally, but removing the non-threaded test call makes the problem worse. Of course, removing the `__release_gil__` setting fixes the problem.