alfred-stokespace opened 2 months ago
Here is the result (after my local code changes) for both use cases mentioned above...
```
2024/05/23 19:20:32 will delete runner: 7a8d1181-4452-49d4-93d4-272bada8dc76
2024/05/23 19:20:32 7a8d1181-4452-49d4-93d4-272bada8dc76 is idle and not running 6h0m0s, so not will delete (created_at: 2024-05-23 19:14:55 +0000 UTC, now: 2024-05-23 19:20:32.330051792 +0000 UTC)
2024/05/23 19:20:32 7a8d1181-4452-49d4-93d4-272bada8dc76 is idle and not running 6h0m0s, a recent cancel was found; we will shut down
2024/05/23 19:20:32 will delete runner with GitHub: 7a8d1181-4452-49d4-93d4-272bada8dc76
2024/05/23 19:16:46 7a8d1181-4452-49d4-93d4-272bada8dc76 is not running MustRunningTime
2024/05/23 19:15:46 7a8d1181-4452-49d4-93d4-272bada8dc76 is not running MustRunningTime
2024/05/23 19:14:15 instance create successfully! (job: 7a8d1181-4452-49d4-93d4-272bada8dc76, cloud ID: i-04d5c846a5d0c4adb)
2024/05/23 19:14:14 start create instance (job: 7a8d1181-4452-49d4-93d4-272bada8dc76)
2024/05/23 19:14:14 start job (job id: 7a8d1181-4452-49d4-93d4-272bada8dc76)
```
The line that indicates the new behavior, and hints at how it's implemented, is:
```
2024/05/23 19:20:32 7a8d1181-4452-49d4-93d4-272bada8dc76 is idle and not running 6h0m0s, a recent cancel was found; we will shut down
```
Basically, I had to expand the webhook code to catch the canceled-job case, and I chose to put the event details into a sync.Map.
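Roughly, the webhook side looks like this. This is a simplified sketch, not my exact code: the handler name, the payload struct, and keying the map by repository full name are all illustrative. GitHub delivers a canceled job as a `workflow_job` event with `action: completed` and `conclusion: cancelled`, which is what we match on:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
	"time"
)

// canceledJobs records recently canceled workflow jobs so the deadline
// sweep can later match them against idle runners.
// Key: repository full name; value: time.Time the cancel arrived.
var canceledJobs sync.Map

// workflowJobPayload decodes only the fields we need from a
// workflow_job webhook event.
type workflowJobPayload struct {
	Action      string `json:"action"`
	WorkflowJob struct {
		Conclusion string `json:"conclusion"`
	} `json:"workflow_job"`
	Repository struct {
		FullName string `json:"full_name"`
	} `json:"repository"`
}

// handleWorkflowJob is an illustrative standalone handler; in myshoes this
// logic would be folded into the existing webhook endpoint, which also
// verifies the X-Hub-Signature-256 header (omitted here).
func handleWorkflowJob(w http.ResponseWriter, r *http.Request) {
	var p workflowJobPayload
	if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	// A canceled job arrives as action=completed, conclusion=cancelled.
	if p.Action == "completed" && p.WorkflowJob.Conclusion == "cancelled" {
		canceledJobs.Store(p.Repository.FullName, time.Now())
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/webhook", handleWorkflowJob)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```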
I then do two things: when the deadline sweep finds an idle runner that hasn't hit MustRunningTime yet, I check the map for a recent cancel; if one is found, I consume it and delete the runner with GitHub.
Problems...
So, I know that GH does the scheduling and you don't get to control which runner runs which job, so there's no perfection here. We're just doing a best effort to match up a cancel with a stale runner.
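Continuing the sketch above, the sweep-side check is roughly this; `tryConsumeCancel` and the ten-minute window are illustrative names/values, not my exact code:

```go
// cancelWindow bounds how old a recorded cancel can be and still justify
// shutting down an idle runner. The value here is just for the sketch.
const cancelWindow = 10 * time.Minute

// tryConsumeCancel reports whether a recent cancel exists for the repo.
// LoadAndDelete removes the entry atomically, so one cancel is spent on
// at most one idle runner.
func tryConsumeCancel(repoFullName string) bool {
	v, ok := canceledJobs.LoadAndDelete(repoFullName)
	if !ok {
		return false
	}
	canceledAt, _ := v.(time.Time)
	// Cancels older than the window are dropped rather than matched.
	return time.Since(canceledAt) <= cancelWindow
}
```

The sweep only calls this for runners that are idle but haven't reached MustRunningTime yet, which is what produces the "a recent cancel was found; we will shut down" log line above.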
So far so good, but it's only been running for a couple of days.
Let me know if you want actual code examples and I can put them in another comment.
I just discovered a race condition that I thought would be worth sharing for anyone else trying to address this problem in a similar way.
I had a case where this sequence fired:

```
[...] is idle and not running [...]
[...] a recent cancel was found [...]
deleteRunnerWithGitHub(...)
```

and the delete came back with:

```
422 Bad request - Runner "myshoes-<uuid>" is still running a job
```
I went through the GH logs and found that the runner had indeed been assigned a job at the same moment and beat the delete call to the finish line.
In my case the ramification was that we consumed the cancel but didn't reduce the runner count, so we remained in an over-producing state.
One thought I'm having now is to track "in-flight cancels": if this race fails again, we put the in-flight cancel back into the cancel pool so it can hopefully consume another one of the idle runners.
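To make that concrete, the retry path might look like the sketch below. `deleteRunnerWithGitHub` is the call visible in the logs above (its signature here is my guess); everything else is illustrative:

```go
// deleteRunnerWithGitHub is assumed to wrap GitHub's delete-self-hosted-
// runner API call, as suggested by the logs above; stubbed for the sketch.
func deleteRunnerWithGitHub(runnerID string) error { /* ... */ return nil }

// shutDownIdleRunner spends a recorded cancel on an idle runner. If the
// delete loses the race (e.g. the 422 "still running a job" above), the
// cancel goes back into the pool so it can consume another idle runner.
func shutDownIdleRunner(repoFullName, runnerID string) error {
	if !tryConsumeCancel(repoFullName) {
		return nil // no cancel to spend; leave the runner alone
	}
	if err := deleteRunnerWithGitHub(runnerID); err != nil {
		// Put the in-flight cancel back. Note this restarts its
		// cancelWindow clock, which seems acceptable for best effort.
		canceledJobs.Store(repoFullName, time.Now())
		return err
	}
	return nil
}
```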
This case is specific to two scenarios we have in our org.
In both cases we see that myshoes logs:

```
2024/05/23 19:20:32 7a8d1181-4452-49d4-93d4-272bada8dc76 is idle and not running 6h0m0s, so not will delete (created_at: 2024-05-23 19:14:55 +0000 UTC, now: 2024-05-23 19:20:32.330051792 +0000 UTC)
```
We're using some pretty expensive EC2 instances and have several contingent runs (that sometimes fail), so keeping unnecessary instances alive for 6 hours gets expensive.
Having looked through your code base and understood the challenges, I can see why this hasn't been solved.
I modified my copy of the myshoes code base to handle this reasonably well. I'll post my solution in a follow-up comment.