Open ysbaddaden opened 2 years ago
I believe we can use a combo of signals and pthreads to achieve a stop-the-world.
We register two signal handlers on each thread, one to react on a SUSPEND signal, and another to a RESUME signal. We can send a signal to each thread with pthread_kill(3)
.
Thanks to sigaction()
the signal handler will receive a context
parameter which is a ucontext_t
(see getcontext(3)
with the switch context of the thread before it switched to the signal handler, meaning that we should have the CPU registers saved in memory? I assume we can sizeof(ucontext_t)
to know the size of the struct, and can merely register a root to scan for each thread!
Then the threads can sigsuspend(2)
themselves from the SUSPEND signal handler (hopefully?), waiting for the collector to send the RESUME signal, to continue the world.
Once all known threads have registered themselves, the collector can start marking the memory.
From the getcontext(3)
manpage:
uc_mcontext
(mcontext_t
) is the machine-specific representation of the saved context, that includes the calling thread's machine registers.
:tada:
Bonus: we won't need to create a collecting fiber and callback into Crystal to switch to that fiber to save the current context: we'll just need all threads to call GC_init_thread()
when they start (note that we'll need a collector thread).
Proposed design:
GC_init()
registers sigaction handler actions for two user signals (SUSPEND and RESUME), the SUSPEND action asking for 3 arguments to get the interrupted context (SA_SIGINFO);GC_Collector_init()
initializes an Array of thread ids;GC_Collector_init()
starts a collector thread that pauses itself;GC_init_thread()
registers the current thread to the collector;GC_deinit_thread()
removes the current thread from the collector;GC_collect()
shall now be provided by the C library (instead of Crystal), it wakes up the collector thread and suspends itself (i.e. suspends the current thread);That's... a lot of work :open_mouth:
Talking with @ggiraldez we thought that maybe we didn't need to force a stop the world and could probably start marking immediately, as long as we have some safeguards, such as stopping properly on GC_malloc
or when switching context, and probably when spawning a Fiber: it allocates (i.e. Fiber.new
) so it should be stopping, but we also add it to the list of all fibers that may have already been reported to the GC!
Stopping properly means that they call getcontext()
, report their current Fiber + context to the GC, then wait for the RESUME signal. :warning: we must only report non-running Fiber stacks to the GC!
There is one case where we'd need to force a stop-the-world: an optimized long running fiber that doesn't allocate nor switches context running in a thread. We must eventually stop it, for example after some nano/milliseconds or, worst case, when a collector thread has nothing left to scan & mark, but it can't start a sweep because a thread is still running. We thus need a preemptive stop-the-world mecanism, as a last measure.
The only drawback is the worst case where we have to eventually stop the world. That could create some delay, when the point is to avoid an initial delay... yet, that should only happen when you have CPU bound fibers with highly optimized allocations (nearing zero) so maybe it's a rare case and acceptable?
Anyway, we'll need the stop-the-world mechanism, so it's probably best to start with it, then we can try to have the world stop itself (with an eventual force stop).
Oh my, it looks like the combo is working (at least on Linux): https://gist.github.com/ysbaddaden/9a484ef55ed451673176a7614207a1c1
poke @beta-ziliani @ggiraldez :tada:
Nice! I'd test if it works properly when the threads are in not idle (eg. in a tight loop doing some computation), as sleep already puts the thread in a waiting state. It should work fine though.
@ggiraldez You're right, I updated the experiment to do just that (they increment a counter).
Hum, there are many places in the GC that despite being thread safe, must also be made stop-the-world safe:
GC_lock()
and is responsible for calling GC_collect()
so it should be fine (unless we get a manual call to GC_collect()
).We should probably have a RWLOCK.
Counter example to starting to mark before all threads have stopped:
A naive mark & sweep MUST stop the world BEFORE scanning the objects. I guess we'd need tri-color marking to lift this limitation?
We can, however, start scanning the stacks of inactive fibers/threads for direct references to objects to scan while we wait for the world to stop, at which time we can start marking objects from many threads.
When starting a collection, we must tell all crystal threads to stop and save their current fiber context to the stack.