whywaita/myshoes

Auto-scaling VirtualMachine runner 🏃 for GitHub Actions

GitHub HTTP 503 response leads to a job not running #207

Open alfred-stokespace opened 1 month ago

alfred-stokespace commented 1 month ago

A user in my org contacted me about a job that never ran.

I found an error message like this ...

2024/05/23 21:16:34 failed to process job: failed to check to register runner (target ID: 10ed1fec-041c-4829-ab1c-b7de7ff9e673, job ID: 6bbe1e7e-9b8c-49c7-bbc3-623eee4ca54c): failed to check existing runner in GitHub: failed to get list of runners: failed to list runners: failed to list organization runners: GET https://REDACTED/actions/runners?per_page=100: 503 OrgTenant service unavailable []

I tracked that error down to the call that lists runners for the org. https://github.com/whywaita/myshoes/blob/cbe7edaf8e54aa48e4acb7d6197d447d1287ef6e/pkg/gh/runner.go#L48

In this particular case the trace starts at starter.go, in the ProcessJob function, where the "Strict" config is true, on a call to "checkRegisteredRunner".

The result of this 503 is that deleteInstance is called in ProcessJob.

The overall impact of that error is that the runner is deleted, which led to the job never getting worked on.
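
To make that path concrete, the behavior is roughly this (a paraphrased, simplified sketch; the names mirror the description above but the signatures are made up, and it assumes the usual context/fmt imports):

func processJobStrict(ctx context.Context, strict bool,
    checkRegisteredRunner func(context.Context) error,
    deleteInstance func(context.Context) error,
) error {
    if strict {
        if err := checkRegisteredRunner(ctx); err != nil {
            // A transient 503 from the runners API lands here: the instance
            // is torn down and the queued job is never picked up.
            if derr := deleteInstance(ctx); derr != nil {
                return fmt.Errorf("failed to delete instance: %w", derr)
            }
            return fmt.Errorf("failed to check to register runner: %w", err)
        }
    }
    return nil
}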

I contacted GitHub Enterprise support and they responded with the following suggestion...

Encountering a 503 error may occur when the server is temporarily overwhelmed and requires a moment to stabilize. This situation could be attributed to high traffic, maintenance activities, or a brief interruption.

In your specific case, the appearance of the error message "OrgTenant service unavailable" indicates a temporary disruption with the service responsible for managing organization actions/runners.

When confronted with a 503 error, it is advisable to establish a retry mechanism. It is important not to attempt immediate retries but rather consider implementing an exponential backoff strategy. This approach involves increasing the wait time between each retry to allow the server sufficient time to recover and mitigate potential complications.
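
In plain Go, the retry-with-exponential-backoff they suggest boils down to something like this (an illustrative sketch with a made-up helper name and a 5-attempt cap, assuming the usual context/net/http/time and go-github imports):

// Illustrative sketch: retry a call while it returns 503, doubling the wait each time.
func retryOn503(ctx context.Context, call func() (*github.Response, error)) error {
    wait := 1 * time.Second
    for attempt := 0; ; attempt++ {
        resp, err := call()
        if err == nil {
            return nil
        }
        // Only the transient "service unavailable" case is worth retrying,
        // and give up after a handful of attempts.
        if resp == nil || resp.StatusCode != http.StatusServiceUnavailable || attempt == 4 {
            return err
        }
        select {
        case <-time.After(wait):
            wait *= 2 // exponential backoff: 1s, 2s, 4s, 8s, ...
        case <-ctx.Done():
            return ctx.Err()
        }
    }
}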

I'll add a comment showing how I mitigated this with a code change.

alfred-stokespace commented 1 month ago

I noticed that your project's dependencies already include github.com/cenkalti/backoff/v4.

I opted to use that existing dependency rather than add a new one.

I wrapped the listRunners call with something like this ...

// RetryAbleListRunners wraps listRunners in the retry/backoff policy below.
func RetryAbleListRunners(ctx context.Context, client *github.Client, owner, repo string, opts *github.ListOptions) (*github.Runners, *github.Response, error) {
    f := func() (*github.Runners, *github.Response, error) {
        return listRunners(ctx, client, owner, repo, opts)
    }

    return RetryingFunction(ctx, f, owner, repo)
}

A call to that function replaces this line: https://github.com/whywaita/myshoes/blob/cbe7edaf8e54aa48e4acb7d6197d447d1287ef6e/pkg/gh/runner.go#L48
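
At the call site that's just a swap, roughly (variable names may differ from the real runner.go):

    // before: runners, res, err := listRunners(ctx, client, owner, repo, opts)
    runners, res, err := RetryAbleListRunners(ctx, client, owner, repo, opts)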

RetryingFunction establishes a backoff timer ...

// GetBackoffTimer builds an exponential backoff (1s initial, doubling, capped at 10s elapsed time) limited to maxRetry attempts.
func GetBackoffTimer(ctx context.Context, maxRetry uint64) backoff.BackOff {
    off := backoff.NewExponentialBackOff()
    off.InitialInterval = 1 * time.Second
    off.Multiplier = 2
    off.MaxElapsedTime = 10 * time.Second
    off.NextBackOff() // burn one, no matter what I do I can't get the initial to be one second!?
    // (the jitter comes from the default RandomizationFactor of 0.5; dropping this call and
    // setting off.RandomizationFactor = 0 would give an exact 1s first wait)
    b := backoff.WithMaxRetries(backoff.WithContext(off, ctx), maxRetry)
    return b
}
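
RetryingFunction then drives the wrapped call through backoff.Retry with that timer, roughly like this (a sketch: the 5-attempt cap is arbitrary, and it assumes the usual context/fmt plus backoff and go-github imports):

func RetryingFunction(ctx context.Context, f func() (*github.Runners, *github.Response, error), owner, repo string) (*github.Runners, *github.Response, error) {
    var (
        runners *github.Runners
        resp    *github.Response
    )

    operation := func() error {
        var err error
        runners, resp, err = f()
        if err == nil {
            return nil
        }
        // Retry only on 5xx responses (like the 503 above); anything else is permanent.
        if resp != nil && resp.StatusCode >= 500 {
            return err
        }
        return backoff.Permanent(err)
    }

    if err := backoff.Retry(operation, GetBackoffTimer(ctx, 5)); err != nil {
        return nil, resp, fmt.Errorf("failed to list runners for %s/%s after retries: %w", owner, repo, err)
    }
    return runners, resp, nil
}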

... so far so good, hope this helps.