ysbaddaden / gc

A garbage collector for Crystal
93 stars 5 forks source link

Stop-the-world #20

Open ysbaddaden opened 2 years ago

ysbaddaden commented 2 years ago

When starting a collection, we must tell all crystal threads to stop and save their current fiber context to the stack.

ysbaddaden commented 2 years ago

I believe we can use a combo of signals and pthreads to achieve a stop-the-world.

We register two signal handlers on each thread, one to react on a SUSPEND signal, and another to a RESUME signal. We can send a signal to each thread with pthread_kill(3).

Thanks to sigaction() the signal handler will receive a context parameter which is a ucontext_t (see getcontext(3) with the switch context of the thread before it switched to the signal handler, meaning that we should have the CPU registers saved in memory? I assume we can sizeof(ucontext_t) to know the size of the struct, and can merely register a root to scan for each thread!

Then the threads can sigsuspend(2) themselves from the SUSPEND signal handler (hopefully?), waiting for the collector to send the RESUME signal, to continue the world.

Once all known threads have registered themselves, the collector can start marking the memory.

ysbaddaden commented 2 years ago

From the getcontext(3) manpage:

uc_mcontext (mcontext_t) is the machine-specific representation of the saved context, that includes the calling thread's machine registers.

:tada:

Bonus: we won't need to create a collecting fiber and callback into Crystal to switch to that fiber to save the current context: we'll just need all threads to call GC_init_thread() when they start (note that we'll need a collector thread).

ysbaddaden commented 2 years ago

Proposed design:

That's... a lot of work :open_mouth:

ysbaddaden commented 2 years ago

Talking with @ggiraldez we thought that maybe we didn't need to force a stop the world and could probably start marking immediately, as long as we have some safeguards, such as stopping properly on GC_malloc or when switching context, and probably when spawning a Fiber: it allocates (i.e. Fiber.new) so it should be stopping, but we also add it to the list of all fibers that may have already been reported to the GC!

Stopping properly means that they call getcontext(), report their current Fiber + context to the GC, then wait for the RESUME signal. :warning: we must only report non-running Fiber stacks to the GC!

There is one case where we'd need to force a stop-the-world: an optimized long running fiber that doesn't allocate nor switches context running in a thread. We must eventually stop it, for example after some nano/milliseconds or, worst case, when a collector thread has nothing left to scan & mark, but it can't start a sweep because a thread is still running. We thus need a preemptive stop-the-world mecanism, as a last measure.

The only drawback is the worst case where we have to eventually stop the world. That could create some delay, when the point is to avoid an initial delay... yet, that should only happen when you have CPU bound fibers with highly optimized allocations (nearing zero) so maybe it's a rare case and acceptable?

ysbaddaden commented 2 years ago

Anyway, we'll need the stop-the-world mechanism, so it's probably best to start with it, then we can try to have the world stop itself (with an eventual force stop).

ysbaddaden commented 2 years ago

Oh my, it looks like the combo is working (at least on Linux): https://gist.github.com/ysbaddaden/9a484ef55ed451673176a7614207a1c1

poke @beta-ziliani @ggiraldez :tada:

ggiraldez commented 2 years ago

Nice! I'd test if it works properly when the threads are in not idle (eg. in a tight loop doing some computation), as sleep already puts the thread in a waiting state. It should work fine though.

ysbaddaden commented 2 years ago

@ggiraldez You're right, I updated the experiment to do just that (they increment a counter).

ysbaddaden commented 2 years ago

Hum, there are many places in the GC that despite being thread safe, must also be made stop-the-world safe:

  1. We can't have 2 concurrent collections starting concurrently (obviously);
  2. The Collector eventually resets the LocalAllocator's blocks, so we musn't interrupt a small object allocation;
  3. What if the LocalAllocator can't allocate in its current blocks and is requesting blocks from the GlobalAllocator?
  4. GlobalAllocator is protected by GC_lock() and is responsible for calling GC_collect() so it should be fine (unless we get a manual call to GC_collect()).

We should probably have a RWLOCK.

ysbaddaden commented 9 months ago

Counter example to starting to mark before all threads have stopped:

A naive mark & sweep MUST stop the world BEFORE scanning the objects. I guess we'd need tri-color marking to lift this limitation?

We can, however, start scanning the stacks of inactive fibers/threads for direct references to objects to scan while we wait for the world to stop, at which time we can start marking objects from many threads.