wingo / fibers

Concurrent ML-like concurrency for Guile
GNU Lesser General Public License v3.0
311 stars 32 forks source link

High CPU usage on system time change #89

Open jetomit opened 1 year ago

jetomit commented 1 year ago

Since Guix upgraded to guile-fibers 1.3.1, shepherd hangs shortly after boot on systems without a RTC. I believe the problem comes from using get-internal-real-time in the guile-fibers timer wheel implementation. After NTP corrects the system time, this function returns a much larger value, and the CPU load (for one core) goes to 100%.

Profiling suggests the process spends the CPU time in timer-wheel-advance!, so I imagine it is trying to tick through a five-year time diff. I tried increasing the system time manually by N days, which causes shepherd to be unresponsive (e.g. to herd status) for about N×5 seconds. I observed similar behavior with the example from guile-fibers readme.

Replacing all instances of (get-internal-real-time) with (clock-gettime 1) in guile-fibers, and reconfiguring the system with the patched package, fixes this problem. I think using a monotonic clock makes sense, but there is probably a cleaner / more portable way to do it.

Thanks!

civodul commented 1 year ago

Hi @jetomit!

Using CLOCK_MONOTONIC as you suggest seemed like the right choice to me so I started working on it. However, the API of (fibers timers) as well as schedule-task-at-time expect "internal time units"; changing timer-wheel to use CLOCK_MONOTONIC would affect those interfaces similarly, which is not acceptable.

Instead we should probably change timer-wheel-advance! to cope with large gaps.

@wingo, WDYT?

Thanks!

civodul commented 1 year ago

@jetomit Here's a proposed workaround on the Guix side: https://issues.guix.gnu.org/64966

jetomit commented 1 year ago

Here's a proposed workaround on the Guix side: https://issues.guix.gnu.org/64966

This would work for aarch64, but I also encounter this issue on armhf and x86_64 systems. This happens whenever system time is pushed forward by a significant amount (a day or more), either by ntpd or manually.

As I understand it, guile’s internal-time-units only depends on the platform and is the same for all clock types. The bigger problem with using CLOCK_MONOTONIC might be that it doesn’t count time the system is suspended, which would probably break stuff.

civodul commented 1 year ago

Another report of shepherd spinning once system time has changed: https://issues.guix.gnu.org/66684

civodul commented 6 months ago

@wingo Hello! Did you have a chance to look into that? I'd be happy to try and implement any suggestions you might have (I'd love to do that before Shepherd 1.0 is out).

wingo commented 6 months ago

Took a look at it but it requires a bit of concentration to not introduce bugs :) Do have a look if you like!

RJSent commented 6 months ago

Just posting for the records another example "in the wild" of someone working around this issue. https://issues.guix.gnu.org/70892#3

Thanks for all your hard work! :smile:

jfrederickson commented 2 months ago

Apologies if anything is inaccurate in this comment, but I had a look through the timer code to see if I could think of anything. I still don't fully grok the control flow, but from a cursory read it does look like timer-wheel-advance! iterates through all layers of the timer wheel even when "skipping ahead" across a large gap.

I wonder if a solution to this might be to skip iterating through the inner wheels if:

  1. The current slot in an outer wheel contains no events, and
  2. The current time is greater than the start time of the current slot plus the time duration of that slot

This feels like it should allow you to pretty rapidly "warp" through a large space of mostly unset timers (by climbing up the hierarchy and skipping slots at the lowest granularity possible), which is what you would expect to see if the clock has suddenly jumped years into the future.