wlandau / crew

A distributed worker launcher
https://wlandau.github.io/crew/

Always relaunch backlogged inactive workers #79

Closed: wlandau closed this issue 8 months ago

wlandau commented 1 year ago

It is not enough to simply prioritize backlogged workers above non-backlogged ones in controller$scale(). Regardless of the task load, controller$try_launch() must always try to re-launch inactive backlogged workers, i.e. those for which inactive & (controller$router$assigned > controller$router$complete) is TRUE. The test below fails because controller$try_launch() is not aggressive enough on this point. About 8-10 tasks hang because the workers they are assigned to never make it back to launch. This happens to the trailing tasks at the end of a pipeline: by then the task load is too low, and there is no longer explicit demand for those workers because other workers are still online. (Those online workers happen to be the wrong workers for the assigned tasks at that point.) When I manually relaunched just the correct workers, the stuck tasks completed.
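
As a rough sketch (not crew's actual internals: relaunch_worker() and the online vector below are placeholders assumed for illustration), the rule amounts to relaunching every inactive worker whose cumulative assigned count exceeds its cumulative complete count:

# Illustrative pseudologic only, using placeholder names.
assigned <- controller$router$assigned        # cumulative tasks assigned per worker
complete <- controller$router$complete        # cumulative tasks completed per worker
inactive <- !online                           # placeholder: TRUE if the worker is not currently connected
backlogged <- inactive & (assigned > complete)
for (worker in which(backlogged)) {
  relaunch_worker(worker)                     # placeholder for the controller's actual launch method
}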

I think this is a recurrence of #75, though it's odd that I am seeing overt hanging now, as opposed to slowness while the once-stuck tasks get reassigned. (The slowness was because earlier versions of this test capped idle time, which allowed workers to rotate more easily, so backlogged workers could percolate to the top for relaunch.) The only other changes were throttling in crew, which shouldn't affect this, and mirai commit https://github.com/shikokuchuo/mirai/commit/e27005793fa73098e70b5a018b5d45fb9703a220. That doesn't necessarily point to an issue in mirai, but it is odd. (In hindsight, those changes don't matter for this issue; only the change in my test matters, because it exposed more of #75.)

In any case, the solution to this issue will be good to have. Right after @shikokuchuo solved #75, the trailing hanging tasks did end up completing, but only after several long seconds of delay. I think I can eliminate that delay entirely by more aggressively launching backlogged inactive workers on crew's end.

library(crew)
controller <- crew_controller_local(
  workers = 20L,
  tasks_max = 100,
  seconds_exit = 10
)
controller$start()
names <- character(0L)
index <- 0L
n_tasks <- 6000L
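# Submit tasks one at a time and collect results until all n_tasks are done.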
while (index < n_tasks || !(controller$empty())) {
  if (index < n_tasks) {
    index <- index + 1L
    cat("submit", index, "\n")
    controller$push(
      name = as.character(index),
      command = TRUE
    )
  }
  out <- controller$pop()
  if (!is.null(out)) {
    cat("collect", out$name, "\n")
    names[[length(names) + 1L]] <- out$name
  }
}

# unresolved tasks
lapply(controller$queue, function(x) x$handle[[1L]]$data)

# daemons
controller$router$poll()
controller$router$daemons

controller$terminate()
wlandau commented 1 year ago

> I think this is a recurrence of https://github.com/wlandau/crew/issues/75, though it's odd that I am seeing overt hanging now, as opposed to slowness while the once-stuck tasks get reassigned. The only changes were throttling in crew, which shouldn't affect this, and mirai commit https://github.com/shikokuchuo/mirai/commit/e27005793fa73098e70b5a018b5d45fb9703a220. That doesn't necessarily point to an issue in mirai, but it is odd.

Ah, the other change could have been how I was testing it. Previously, I focused on idle time. The test above sets infinite idle time but caps each worker at 100 tasks (tasks_max = 100), which is a better way to expose this remaining problem in crew's auto-scaling logic.
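
For contrast, a minimal sketch of the two styles of controller configuration (the idle-time settings here are illustrative assumptions, not the exact values from my earlier tests):

library(crew)
# Earlier tests: workers exit after a short idle period, so they rotate
# often and backlogged workers eventually percolate back to launch.
idle_based <- crew_controller_local(
  workers = 20L,
  seconds_idle = 2
)
# This test: workers never exit from idleness, only after 100 tasks,
# which keeps the wrong workers online and exposes the backlog problem.
task_based <- crew_controller_local(
  workers = 20L,
  tasks_max = 100,
  seconds_exit = 10
)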

wlandau commented 1 year ago

Tomorrow, I will tackle this one and #78. Maybe there is something I can do for #77, but I am still scratching my head on that one. The latest at https://github.com/ropensci/targets/actions/runs/5009290193 shows an "object closed" error from a task, as well as a 360-second timeout in a different test.

shikokuchuo commented 1 year ago

> It is not enough to simply prioritize backlogged workers above non-backlogged ones in controller$scale(). Regardless of the task load, controller$try_launch() must always try to re-launch inactive backlogged workers, i.e. those for which inactive & (controller$router$assigned > controller$router$complete) is TRUE. The test below fails because controller$try_launch() is not aggressive enough on this point. About 8-10 tasks hang because the workers they are assigned to never make it back to launch. This happens to the trailing tasks at the end of a pipeline: by then the task load is too low, and there is no longer explicit demand for those workers because other workers are still online. (Those online workers happen to be the wrong workers for the assigned tasks at that point.) When I manually relaunched just the correct workers, the stuck tasks completed.

Yes, that's right. I didn't emphasize it at the time, but it's what I meant by "always launch at least those servers" here: https://github.com/wlandau/crew/issues/75#issuecomment-1549273388

> In any case, the solution to this issue will be good to have. Right after @shikokuchuo solved #75, the trailing hanging tasks did end up completing, but only after several long seconds of delay. I think I can eliminate that delay entirely by more aggressively launching backlogged inactive workers on crew's end.

Yes, I experienced this delay as well, which I thought a bit odd. But with the throttling implementation, this is now pretty much instant, at least for the previous tests.

wlandau commented 1 year ago

Now fixed.

wlandau commented 8 months ago

For #124, crew stopped re-launching backlogged workers, but from the discussion in https://github.com/shikokuchuo/mirai/discussions/95, we need to return to re-launching them. Fortunately, the assigned and complete statistics from mirai are cumulative this time around, so this attempt should be straightforward.
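
For illustration, a sketch of how a per-worker backlog could be derived from cumulative statistics; treating controller$router$daemons as a matrix with "assigned" and "complete" columns is an assumption here, not a confirmed layout:

# Sketch only: compute the cumulative backlog per worker.
daemons <- controller$router$daemons
backlog <- daemons[, "assigned"] - daemons[, "complete"]
# Workers with a positive backlog that are no longer connected still hold
# unresolved tasks, so those are the ones the launcher must relaunch.
which(backlog > 0L)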