amal-khailtash opened this issue 2 weeks ago
Yes, that would be the GIL. If you start the whole machinery from cppyy, then that top-level method should release the GIL, by setting the __release_gil__ flag of the relevant method (see: https://cppyy.readthedocs.io/en/latest/misc.html). Contrary to most binders, cppyy defaults to NOT releasing the GIL, whereas most others do release it when entering C/C++.
With the GIL released, C++ threads can then call the Python callbacks (the generated wrapper that marshalls the function arguments and the return value will automatically re-acquire the GIL). There being only one GIL, all Python code will still serialize; that is, you can only run one Python callback at any time, starting any others will simply block them for the duration of the current one.
There are a couple of examples in the test suite: https://github.com/wlav/cppyy/blob/master/test/test_concurrent.py
As an aside, to run Python on different C++ threads in parallel, sub-interpreters are needed, but those are not supported yet; that is a work in progress.
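For concreteness, a minimal sketch of setting that flag; start_machinery is a made-up stand-in for whatever top-level C++ entry point is bound through cppyy:

import cppyy

# hypothetical long-running entry point, JITed here just for illustration
cppyy.cppdef(r"""
void start_machinery() {
    // ... would spawn C++ worker threads and run the event loop ...
}""")

# cppyy holds the GIL by default; release it for this entry point so that
# C++ threads can (re-)acquire it when they call back into Python
cppyy.gbl.start_machinery.__release_gil__ = True
cppyy.gbl.start_machinery()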
That is great information I was missing. Thank you!
I added __release_gil__ = True to the callback function, and added this inside the function:
if ( PyGILState_Check() ) { printf("GIL is held by the current thread.\n"); }
else { printf("GIL is not held by the current thread.\n"); }
And I see: GIL is held by the current thread.
So, it seems the GIL is still held inside this function?
Does this mean that in my use case, I still need to wait for sub-interpreters implementation? I do not mind if things are done serially at the moment, for I am more interested in functionality and not full performance yet.
I added this at the beginning of the C++ callback method:
void callback()
{
    if ( PyGILState_Check() ) { printf("A - GIL is held\n"); }
    else { printf("A - GIL is NOT held\n"); }

    Py_BEGIN_ALLOW_THREADS
    if ( PyGILState_Check() ) { printf("B - GIL is held\n"); }
    else { printf("B - GIL is NOT held\n"); }

    // Your C code that doesn't need the GIL
    std::cout << " Stored functions count: " << this->processes.size() << std::endl;
    Py_END_ALLOW_THREADS
}
And I see:
A - GIL is held
B - GIL is NOT held
It seems the GIL is still held at the start, but inside Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS it is released. The program still hangs when accessing this->processes (a std::map), with or without Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS.
The word "callback" is doing too much work (you have them internally to your C++ threaded calls, but now it seems to be a top-level one, or maybe not). Let's talk in code, makes it easier. I believe that the following is what we're talking about:
import cppyy

cppyy.cppdef(r"""\
#include "Python.h"

std::map<std::string, std::function<void()>> the_callbacks;

void does_it_have_the_GIL(const char* name) {
    if ( PyGILState_Check() ) { printf("%s - GIL is held\n", name); }
    else { printf("%s - GIL is NOT held\n", name); }
}

void run_callback(const std::string& name) {
    does_it_have_the_GIL("run_callback");
    auto c = the_callbacks.find(name);
    if (c != the_callbacks.end())
        return c->second();
    std::cerr << "function " << name << " not registered" << std::endl;
}""")

def callme():
    cppyy.gbl.does_it_have_the_GIL("callme")

cppyy.gbl.the_callbacks[callme.__name__] = callme

cppyy.gbl.run_callback.__release_gil__ = True
cppyy.gbl.run_callback(callme.__name__)
The above shows how the function that is going to run the callbacks (run_callback), which is the entry point from Python into C++, needs to release the GIL, and the printout shows that it does. Then running the actual Python callbacks from C++ (which in your case will be on some thread) will re-acquire the GIL, as they have to; the printout shows that it does. If said Python callback calls back into C++ again, then yes, you need to release the GIL again as appropriate, either by setting __release_gil__ on the functions in Python that the Python callback calls, or by using the C-API as you do in the example above once you enter the C++ code.
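For that nested case, a minimal sketch; do_cpp_work is a made-up C++ function that the Python callback re-enters with the GIL dropped via the flag:

import cppyy

cppyy.cppdef(r"""
#include "Python.h"

// made-up C++ helper, just to report whether the GIL is held while it runs
void do_cpp_work() {
    if ( PyGILState_Check() ) printf("do_cpp_work - GIL is held\n");
    else printf("do_cpp_work - GIL is NOT held\n");
}""")

# the Python callback releases the GIL again when it re-enters C++
cppyy.gbl.do_cpp_work.__release_gil__ = True

def callme():
    cppyy.gbl.do_cpp_work()    # prints "GIL is NOT held" for the duration of the call

callme()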
Let's continue editing this example code until convergence if there are further questions.
Now, if you want to run those Python callbacks in true parallel mode, then yes, they need to run in a Python sub-interpreter, launched from C++ I'd guess, or maybe I can figure out some higher-level support. That's a WIP (it's a big chunk of work, but I need it myself, as it's critical for a project I'm currently working on).
Yes, I have similar code to what you specified. The registration is done from Python as you have it, but the actual call is made from the C++ code and not from Python. So instead of cppyy.gbl.run_callback(callme.__name__), C++ multithreaded code calls this run_callback. I see:
run_callback - GIL is held
I thought that with cppyy.gbl.run_callback.__release_gil__ = True, the GIL should have been released? And it hangs when accessing the the_callbacks map in the_callbacks.find(name). I have a print after that, and it is not printed.
Not only should it be released, it is in fact; running the code above, I get:
run_callback - GIL is NOT held
callme - GIL is held
But you say "C++ multithreaded code calls this run_callback", so you're probably doing something different, which goes back to the point of talking in code as opposed to in prose, but are you calling that function directly from C++ or are you calling that function through cppyy
either from the interpreter or with (for example) the Python C-API? If the former, than obviously, there's no GIL release by only setting it on run_callback
: it's whatever is the entry point is from Python into C++ that needs to release the GIL.
I will try to create a small example; it is difficult to show the full code for this library. Python creates the functions to be called by the C++ library and registers them in the map. The C++ library calls run_callback directly, looks up the Python function to call, and calls it.
The C++ library calls the run_callback directly
Right, so if the call does not originate from Python, it's C++ that needs to release the GIL. When calling C++ methods from C++ methods, Python-side features such as __release_gil__ are not in play.
Should releasing the GIL happen before the C++ library/thread calls the callback function? My experience is that I cannot release the GIL inside the callback function, and that seems to be the cause of the hang/deadlock.
Even before that. The GIL is held by the main thread (I'm still not sure from your description whether that's a Python or a C++ one). It needs to be released before another thread calls the callback. The thread calling the callback shouldn't itself release the GIL unless it actually holds it.
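To make that concrete, a sketch (names made up, not your library's API) of doing the release with the C-API on the C++ side, for the case where __release_gil__ cannot be used on the entry point:

import cppyy

cppyy.cppdef(r"""
#include "Python.h"
#include <functional>
#include <map>
#include <string>
#include <thread>

std::map<std::string, std::function<void()>> my_callbacks;

void run_from_cpp_thread(const std::string& name) {
    // the calling (main) thread holds the GIL here; release it around the
    // thread launch, so the worker can invoke the Python callback (the
    // cppyy-generated wrapper inside the std::function re-acquires the GIL)
    Py_BEGIN_ALLOW_THREADS
    std::thread worker([name]{ my_callbacks[name](); });
    worker.join();
    Py_END_ALLOW_THREADS
}""")

def callme():
    print("callme ran on a C++ thread")

cppyy.gbl.my_callbacks["callme"] = callme
cppyy.gbl.run_from_cpp_thread("callme")   # note: no __release_gil__ set here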
I think that is where my problem lies. I have no control over the C++ library I am binding to and I do not want to go in and modify it if possible. Not sure what options I have here.
Well, I still don't understand what you are doing. Maybe one final attempt: first, is this a C++ executable with an embedded Python interpreter, or is the application run from the Python interpreter?
This is an external C++ library that is started from Python. The C++ library is multithreaded and defines different classes. I intend to expose the classes and most of the functionality to Python. Basically Python runs/initiates the C++ code.
Would sub-interpreters or the no-GIL version of Python 3.13 help?
So how is it started from Python, through cppyy? If yes, then whatever the top-level function is (clearly not the equivalent of run_callback) is what needs to have __release_gil__ set. Then the main thread holds no GIL.
As for sub-interpreters, those are about running multiple pieces of Python code in parallel from e.g. C++. They still have a GIL, just one each. You are still going to need to release the main thread's GIL first.
Great suggestion. I added __release_gil__ to the top-level C++ function that is the entry point for the C++ library, and in the callback function I now see: [call_process] - GIL is NOT held. Still hangs! One note is that the callbacks for the C++ threads are registered from Python, but the threads that need to call call_process are spawned from C++. I am still missing something and don't quite understand where else I need to release or acquire the GIL.
Okay, next I need to understand how these callbacks are created. Are these Python functions or JITed C++ functions? And then, how are they passed from Python into C++?
If there is no Python involved in the callbacks, they do not need to hold the GIL. If there is, then they do. But e.g. JITed C++ functions that are passed through Python back into C++ do not require the GIL. To illustrate, below is an extension of the example above, executing Python functions and C++ ones, respectively, from threads in C++: the former hold the GIL, the latter don't (this is all automatic). Try the use of C++ JITing to generate callbacks for use by your library. If those still hang, then the issue is not the GIL. If they don't hang, then it most likely still is, somehow, and to say something sensible in that scenario, I'd need to know how the callbacks are passed from Python into C++ and how they are stored in C++.
import cppyy
import textwrap

cppyy.cppdef(r"""\
#include "Python.h"

std::map<std::string, std::function<void()>> the_callbacks;

void does_it_have_the_GIL(const char* name) {
    if ( PyGILState_Check() ) { printf("%s - GIL is held\n", name); }
    else { printf("%s - GIL is NOT held\n", name); }
}

void run_callbacks() {
    does_it_have_the_GIL("run_callback");
    std::vector<std::thread> workers;
    for (auto& c: the_callbacks)
        workers.push_back(std::thread(c.second));
    for (auto& t: workers)
        t.join();
}""")

cppyy.gbl.run_callbacks.__release_gil__ = True

# call Python functions
cppyy.gbl.the_callbacks.clear()

def py_fgen(i):
    name = f"py_callme{i}"
    exec(textwrap.dedent(f"""\
        def {name}():
            cppyy.gbl.does_it_have_the_GIL("{name}")
    """), globals())
    return eval(name)

for i in range(10):
    f = py_fgen(i)
    cppyy.gbl.the_callbacks[f.__name__] = f

cppyy.gbl.run_callbacks()

# call C++ functions
cppyy.gbl.the_callbacks.clear()

def cpp_fgen(i):
    name = f"cpp_callme{i}"
    cppyy.cppdef(textwrap.dedent(f"""\
        void {name}() {{
            does_it_have_the_GIL("{name}");
        }}
    """))
    return getattr(cppyy.gbl, name)

for i in range(10):
    f = cpp_fgen(i)
    cppyy.gbl.the_callbacks[f.__name__] = f

cppyy.gbl.run_callbacks()
I pushed my code to GitHub. You can clone it and do:
$ git clone https://github.com/amal-khailtash/pysystemc
$ make setup-systemc
$ make uv
$ make venv
$ source .venv/bin/activate
$ make uv-sync
This will clone the repo, get the SystemC submodule, check out v2.3.4, and build the SystemC library. SystemC is a hardware description language similar to Verilog/VHDL, with a multithreaded simulation kernel.
Normally the kernel expects an external sc_main() function that is linked in when building a SystemC simulation. I overloaded this sc_main here: src/amal/eda/systemc/sc_core/sc_main.cpp. This acquires the GIL, calls an sc_main counterpart in Python instead, and releases the GIL after that.
The sc_main is responsible for instantiating a design sc_core.sc_module, which I have overloaded as well in src/amal/eda/systemc/sc_core/sc_module_new.cpp. This is to provide the callback capability.
Once sc_main starts, it instantiates modules, which define the design using ports, signals, and processes (methods and (c)threads in SystemC land). There are a few standard fixed callbacks, before_end_of_elaboration, end_of_elaboration, start_of_simulation, and end_of_simulation, that are called when the simulation starts.
sc_main can use sc_start to start simulations. During the definition of sc_modules, new processes can be registered with the kernel using what I named sc_method, sc_thread, and sc_cthread in src/amal/eda/systemc/sc_core/sc_module_new.cpp.
This is where I put the callback function name and function pointer into a std::map. I could also pass self to this class for easy access to the Python counterpart's class members, but that same lock/hang gave me grief.
Once the simulation starts with sc_start, first the standard fixed callbacks are called during the simulation loop. Then any registered method/thread needs to be called based on the defined triggers (sensitivity).
There are some examples of pure SystemC designs that you can run using the provided Python script, which just cppdefs the files and runs the C++ sc_main. If you type make alone, you will see help and examples:
Examples:
=========
example-hello-cpp - Run the hello world example in C++
example-hello-py - Run the hello world example in Python
example-counter-cpp - Run the counter example in C++
example-counter-py - Run the counter example in Python
example-simple_fifo-py - Run the simple_fifo example in Python
example-pipe-cpp - Run the pipe example in C++
example-pipe-py - Run the pipe example in Python
example-pkt_switch-cpp - Run the pkt_switch example in C++
You can use hello-cpp and hello-py to see the problem. Right now, I have commented out the callback lookup and the actual call in src/amal/eda/systemc/sc_core/sc_module_new.cpp, so both the C++ and Python examples should run correctly. If you uncomment line 136 of src/amal/eda/systemc/sc_core/sc_module_new.cpp, where I try to access the processes map size, it locks.
Your help and insight are much appreciated, and I find cppyy awesome for the fact that it seamlessly supports C++ and templates.
I can look in more detail later, but just from looking over sc_main: that's not the way to embed Python. Here's an example of how to set up Python to run embedded from C++: https://github.com/wlav/CPyCppyy/blob/master/src/API.cxx#L39 . At least, I can't find anywhere in your code or in the systemc code where Python is being initialized. I recommend calling Py_IsInitialized() in sc_main to see whether it is. If not, then that might be the problem right there.
I didn't believe I needed to call Py_Initialize! I am not embedding Python in C++, but the other way around, unless I am missing something. Python starts a Python application that accesses the exposed C++ library.
Ah, so the main executable is the python interpreter itself? That starts systemc, which runs sc_main? Yes, that should definitely be fine that way.
Yes, here is how things start:
1. Python starts systemc, which loads libsystemc.so.
2. The standard fixed callbacks before_end_of_elaboration, end_of_elaboration, start_of_simulation, and end_of_simulation are defined in Python, to be called by the C++ kernel.
3. ScModule inherits from the original C++ sc_core.sc_module, but it will replace it on the Python side.
4. sc_method and sc_thread are implemented in the ScModule class.
5. From sc_main, sc_start is called, which starts the SystemC simulation kernel.
I hope that is clear. I need to draw some diagram for all these processes.
I am working with a multithreaded C++ library that I am trying to bind to. I have written a new C++ class that extends one of the original C++ library classes. A Python class extends this class and defines a few callbacks that should be called by the C++ library.
The callback registration is initiated from Python, which passes a Python method name to be called. This method is passed to the extended C++ class, which keeps a std::map of Python method names and function pointers. It also registers a dispatcher method in the extended C++ class that will eventually call the Python method by looking up its name in the map.
In this dispatcher function, as soon as I try to access the std::map or any variables outside of this function, the process hangs. I am familiar with the GIL a bit, but I think this is somehow related to the concurrent, multithreaded aspect of this library. I tried GIL acquire/release, a mutex, and also interpreter swapping (sub-interpreters). But I do not quite understand the problem, what causes it, or how to debug it.
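For reference, a minimal self-contained sketch of the registration/dispatch pattern described above, with the GIL handling discussed in this thread; class and function names are illustrative, not the actual pysystemc code:

import cppyy

# Illustrative stand-in for the extended C++ class described above; it only
# mirrors the name -> callable map idea, not the real SystemC machinery.
cppyy.cppdef(r"""
#include "Python.h"
#include <functional>
#include <map>
#include <string>

class Dispatcher {
public:
    void register_process(const std::string& name, std::function<void()> f) {
        processes[name] = f;
    }

    void dispatch(const std::string& name) {
        // called from the (C++) simulation side; the cppyy-generated wrapper
        // around a registered Python callable re-acquires the GIL when invoked
        auto it = processes.find(name);
        if (it != processes.end()) it->second();
    }

private:
    std::map<std::string, std::function<void()>> processes;
};

void start_simulation(Dispatcher& d) {
    // stand-in for the library entry point (e.g. sc_start); with
    // __release_gil__ set on it from Python, the GIL is free here and any
    // C++ thread can safely run d.dispatch(...)
    d.dispatch("my_process");
}""")

d = cppyy.gbl.Dispatcher()

def my_process():
    print("my_process called from C++")

d.register_process("my_process", my_process)

# release the GIL on the top-level entry point, as discussed in this thread
cppyy.gbl.start_simulation.__release_gil__ = True
cppyy.gbl.start_simulation(d)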