Open ysbaddaden opened 5 months ago
Note: more yielding points mean that cooperative shutdown, cooperative stop-the-world, or fiber cancellations could happen faster.
I got the logic implemented:

- `Fiber.maybe_yield` will yield if told to;
- `Fiber#enqueue` always calls `Fiber.maybe_yield` (*).

It works. A busy loop that manually calls `Fiber.maybe_yield` regularly yields :tada:
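For illustration, here is a minimal sketch in Ruby (not the issue's Crystal sources, which aren't shown here) of the idea behind `Fiber.maybe_yield`: a cheap flag test that only suspends the fiber when something else asked it to. The `Checkpoint` class and its `want_yield` flag are hypothetical names invented for this sketch.

```ruby
class Checkpoint
  @want_yield = false

  class << self
    attr_writer :want_yield

    # Analogue of Fiber.maybe_yield: a no-op unless told to yield.
    def maybe_yield
      return unless @want_yield
      @want_yield = false
      Fiber.yield # hand control back to whoever resumed us
    end
  end
end

log = []
worker = Fiber.new do
  6.times do |i|
    log << i
    Checkpoint.maybe_yield # busy loop with a manual yielding point
  end
end

Checkpoint.want_yield = true
worker.resume # suspends after the first iteration: the flag was set
first = log.dup
worker.resume # flag is clear now, so the loop runs to completion
```

The point of the design is that the common path (flag unset) costs only a boolean check, so sprinkling the call into hot paths stays cheap.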
But I quickly get a libevent2 warning:

    [warn] event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.
And an eventual crash:

    FATAL: can't resume a running fiber #<Fiber:0x7fa4608ebe40: DEFAULT:loop> (#<ExecutionContext::SingleThreaded:0x7fa4608f0f20 DEFAULT>)
      from src/single_threaded.cr:124:52 in 'resume'
      from src/single_threaded.cr:172:11 in 'run_loop'
      from src/single_threaded.cr:68:56 in '->'
      from src/core_ext/fiber.cr:151:11 in 'run'
      from src/core_ext/fiber.cr:57:34 in '->'
      from ???
(*) more places could eventually call `Fiber.maybe_yield`, for example uncontended IO methods, buffered channels, mutexes, and so on.
Note about the crash: this is the `EC::Scheduler#run_loop` fiber trying to resume itself :raised_eyebrow:
EDIT: this may happen if the event loop calls `Fiber#enqueue` (it mustn't).

NOTE: had it been a MT context, the thread would have deadlocked, waiting for the current fiber to be resumable.
I think the issue is the monitoring thread telling the run-loop fibers to interrupt, then the event loop calling `Fiber#enqueue` to enqueue fibers.
The fix:

- the event loop calls `ExecutionContext::Scheduler.enqueue(fiber)` instead of `Fiber#enqueue`;
- `Crystal::EventLoop#run` returns a list of fibers that the scheduler enqueues in a batch (instead of one by one), minus one to be resumed immediately (skipping the queue).

That fixed the issue.
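A hypothetical sketch in Ruby of that batching shape: the event loop *returns* the fibers that became runnable instead of enqueuing them itself, and the scheduler enqueues them in one batch while keeping one out to resume directly. `Scheduler`, `event_loop_run`, and `run_once` are made-up names for this sketch, not the actual Crystal API.

```ruby
class Scheduler
  def initialize
    @runnables = [] # the scheduler's run queue
  end

  # Stand-in for an event loop run that returns ready fibers as a list
  # rather than enqueuing them itself.
  def event_loop_run(ready)
    ready
  end

  def run_once(ready)
    batch = event_loop_run(ready)
    direct = batch.pop       # one fiber skips the queue entirely
    @runnables.concat(batch) # the rest are enqueued in a single batch
    direct&.resume           # resumed immediately
    while (fiber = @runnables.shift)
      fiber.resume
    end
  end
end

log = []
ready = 3.times.map { |i| Fiber.new { log << i } }
Scheduler.new.run_once(ready)
log # the directly-resumed fiber runs first, then the batched ones in order
```

Because the event loop never touches the run queue (or the current fiber) itself, the reentrancy that produced the "resume a running fiber" crash can't occur in this shape.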
CPU-bound fibers may never hit a cancellation point, and can block a thread for an unbounded amount of time, preventing other fibers from running.
I'm not even talking about CPU-heavy algorithms (e.g. cryptography): a regular socket may prevent a fiber from yielding, for example when the socket is always ready for read or write (the fiber will never wait on the event loop); a buffered Channel won't suspend the current fiber until the buffer is full or empty, which may take a while to happen if the other end is pushing/consuming quickly.
One goal of execution contexts is to limit this in practice, by taking advantage of the OS to preempt threads. Still, we should have more places that can yield the current fiber.
For example, `Fiber#enqueue` could check how long the current fiber has been running and decide to yield when it has run for N milliseconds. Maybe IO methods could do just that (inside the EventLoop). Instead of checking `Time.monotonic` over and over again, we could have the monitoring thread do the check and mark the fiber (see #5).
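That last idea can be sketched in Ruby: a monitoring thread watches the clock and marks the running fiber once it exceeds its time slice, so the hot path only tests a boolean instead of calling `Time.monotonic` itself. All names here (`Monitor`, `preempt?`) are invented for the sketch.

```ruby
class Monitor
  def initialize(slice_seconds)
    @preempt = false
    # The monitoring thread owns the clock check; after the time slice
    # elapses it marks the fiber as needing to yield.
    @thread = Thread.new do
      sleep slice_seconds
      @preempt = true
    end
  end

  # Cheap check the busy code can afford to call on every iteration.
  def preempt?
    @preempt
  end

  def join
    @thread.join
  end
end

monitor = Monitor.new(0.05) # hypothetical 50ms time slice
iterations = 0
iterations += 1 until monitor.preempt? # CPU-bound loop: only a flag test
monitor.join
```

The trade-off is the same one the issue describes: the busy fiber pays for a boolean read per iteration, while the (single) monitoring thread amortizes the actual clock reads across all fibers.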