Cancelled workflows lead to extra runners

alfred-stokespace commented 2 months ago

This case is specific to two scenarios we have in our org.

The run in question is canceled prior to being scheduled
Dependent runs are cancelled when parent runs fail (I think,... )

In both cases we see that myshoes

is notified
creates a resource
myshoes notices that the runner is idle
- 2024/05/23 19:20:32 7a8d1181-4452-49d4-93d4-272bada8dc76 is idle and not running 6h0m0s, so not will delete (created_at: 2024-05-23 19:14:55 +0000 UTC, now: 2024-05-23 19:20:32.330051792 +0000 UTC)
myshoes waits 6hours before killing the idle runner

We're using some pretty expensive ec2 instances as well as have several contingent runs (that sometimes fail) so having unnecessary instances running for 6 hours is pretty expensive.

Having looked through your code base and understanding the challenges I can see why this hasn't been solved.

I modified my code base of myshoes to handle this, reasonably well. I'll post my solution in a follow up comment.

alfred-stokespace commented 2 months ago

Here is the result (after my local code changes) to both use cases mentioned above...


2024/05/23 19:20:32 will delete runner: 7a8d1181-4452-49d4-93d4-272bada8dc76
2024/05/23 19:20:32 7a8d1181-4452-49d4-93d4-272bada8dc76 is idle and not running 6h0m0s, so not will delete (created_at: 2024-05-23 19:14:55 +0000 UTC, now: 2024-05-23 19:20:32.330051792 +0000 UTC)
2024/05/23 19:20:32 7a8d1181-4452-49d4-93d4-272bada8dc76 is idle and not running 6h0m0s, a recent cancel was found; we will shut down
2024/05/23 19:20:32 will delete runner with GitHub: 7a8d1181-4452-49d4-93d4-272bada8dc76
2024/05/23 19:16:46 7a8d1181-4452-49d4-93d4-272bada8dc76 is not running MustRunningTime
2024/05/23 19:15:46 7a8d1181-4452-49d4-93d4-272bada8dc76 is not running MustRunningTime
2024/05/23 19:14:15 instance create successfully! (job: 7a8d1181-4452-49d4-93d4-272bada8dc76, cloud ID: i-04d5c846a5d0c4adb)
2024/05/23 19:14:14 start create instance (job: 7a8d1181-4452-49d4-93d4-272bada8dc76)
2024/05/23 19:14:14 start job (job id: 7a8d1181-4452-49d4-93d4-272bada8dc76)

the line that indicates new behavior, and hints at how it's implemented is...

2024/05/23 19:20:32 7a8d1181-4452-49d4-93d4-272bada8dc76 is idle and not running 6h0m0s, a recent cancel was found; we will shut down

Basically, I had to expand the webhook code to catch the canceled job case and I chose to put the event details into a Sync.Map.

I then do two things...

periodically clear the cancelled sync.map when no runners exist (prevent memory leak)
if we notice that a host has been idle and is in that 6hr category, we walk the cancelled map and find the first entry that matches the same repo that the runner was created for,
- we then remove that entry from the map.

Problems...

So, I know that GH does scheduling and you don't get to control which runner runs which job, so there's no perfection here. We're just doing best effort to match up a cancel with a stale runner.

So far so good but it's only been running for a couple days.

Let me know you want actual code examples and I can put them in another comment.

alfred-stokespace commented 1 month ago

I just discovered a race-condition I thought would help anyone else that might be trying to address this problem in a similar way.

I had a case where

a cancel happened
the existing delete_runner.go logic detects that a runner has been idle for to long [...] is idle and not running [...]
my suggestion above kicks in [...]a recent cancel was found [...]
delete_runner.go logic continues and attempts to perform the deleteRunnerWithGitHub(...) behaviour
failure happens do to race-condition, we a 422 from GH 422 Bad request - Runner "myshoes-<uuid>" is still running a job"

I went through the GH logs and found that indeed that runner was chosen at the same time and beat the delete code to the finish line.

In my case the ramifications were, we consumed the cancel but didn't reduce the runner count, so we remained in a over-producing state.

One thought I'm having now is to track "in-flight-cancels" and if this race fails again we put the in-flight-cancel back into the cancel pool so it can hopefully consome another one of the idle runners.

whywaita / myshoes

Cancelled workflows lead to extra runners #206