Closed wlandau closed 8 months ago
I think this is a recurrence of https://github.com/wlandau/crew/issues/75, though it's odd that I am seeing overt hanging now, as opposed to slowness as those once stuck tasks get reassigned. The only changes were throttling in crew, which shouldn't affect this, and mirai https://github.com/shikokuchuo/mirai/commit/e27005793fa73098e70b5a018b5d45fb9703a220. Doesn't necessarily point to an issue in mirai, but it is odd.
Ah, the other change could have been how I was testing it. Previously, I was focused on idle time. The test above sets infinite idle time but max tasks equal to 100, which is a better way to expose this remaining problem in `crew`'s auto-scaling logic.
Tomorrow, I will tackle this one and #78. Maybe there is something I can do for #77, but I am still scratching my head on that one. The latest at https://github.com/ropensci/targets/actions/runs/5009290193 shows an "object closed" error from a task, as well as a 360-second timeout in a different test.
It is not enough to simply prioritize backlogged workers above non-backlogged ones in `controller$scale()`. Regardless of the task load, `controller$try_launch()` must always try to re-launch inactive backlogged workers, as determined by `inactive & (controller$router$assigned > controller$router$complete)`. The following test is failing because `controller$try_launch()` is not aggressive enough on this point. About 8-10 tasks hang because the workers they are assigned to never make it back to launch. This happens to the trailing tasks at the end of a pipeline: by then the task load is too low, and there is no longer explicit demand for those workers because other workers are already online. (Those online workers happen to be the wrong workers for the assigned tasks at that point.) When I manually relaunched just the correct workers, the stuck tasks completed.
Yes, that's right. I didn't emphasize it at the time, but it's what I meant by "always launch at least those servers" here: https://github.com/wlandau/crew/issues/75#issuecomment-1549273388
But anyway, the solution to this issue will be good to have. Right after @shikokuchuo solved #75, the trailing hanging tasks did end up completing, but only after several long seconds of delay. I think I can eliminate that delay entirely by more aggressively launching backlogged inactive workers on `crew`'s end.

Yes, I experienced this delay as well, which I thought a bit odd. But with the throttling implementation, this is now pretty much instant, at least for the previous tests.
Now fixed.
For #124, `crew` stopped re-launching backlogged workers, but from the discussion in https://github.com/shikokuchuo/mirai/discussions/95, we need to return to re-launching them. Fortunately, the `assigned` and `complete` statistics from `mirai` are cumulative this time around, so this attempt should be straightforward.
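Cumulative counters are what keep the backlog check simple: the difference between `assigned` and `complete` never resets when a worker relaunches, so `assigned > complete` holds exactly while tasks remain stranded. A hedged sketch with made-up numbers (not crew's actual code):

```r
# Cumulative counters survive worker relaunches, so detecting a backlog is a
# plain per-worker comparison (illustrative values, not real crew internals).
assigned <- c(w1 = 100L, w2 = 100L) # cumulative tasks ever pushed to each worker
complete <- c(w1 = 100L, w2 = 95L)  # cumulative tasks ever finished by each worker

backlog <- assigned - complete # tasks pushed but never finished
needs_relaunch <- backlog > 0L # w2 still owes 5 tasks, so it must come back up
needs_relaunch
```

If the statistics instead reset on every launch, a relaunched worker would look caught up even while its earlier tasks were still stuck, which is why the cumulative form makes this attempt straightforward.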
I think this is a recurrence of #75, though it's odd that I am seeing overt hanging now, as opposed to slowness (slowness because my earlier versions of this test capped idle time, which allowed workers to rotate more easily, so backlogged workers could percolate to the top for relaunch). The only changes were throttling in `crew` and mirai commit https://github.com/shikokuchuo/mirai/commit/e27005793fa73098e70b5a018b5d45fb9703a220, but those changes don't matter for this issue; only the change in my test exposed more of #75.