whywaita / myshoes

Auto-scaling VirtualMachine runner 🏃 for GitHub Actions
MIT License
128 stars 14 forks source link

Httpcache is causing excessive runner creation #205

Open alfred-stokespace opened 2 months ago

alfred-stokespace commented 2 months ago

Took quite a bit of digging to get a reliable test case that got me to a smoking gun...

First, the symptom....

For very short jobs, as in jobs that only ran for a 20 - 50 seconds (for example a job that errors out super fast on tag check), these jobs would produce a runner, great! The job would complete, good! Buutt... I'd end up eventually getting another runner that would replace that runner. When I'd check in the db i'd find a job in the jobs table for that same job which already ran.

As I turned debug on I kept finding these over and over...

2024/05/13 15:30:36 [DEBUG] myshoes-e06dbeee-2566-46af-86f1-6ea1f933aca3 is not found in GitHub, will retry... (second: 70s)

and these were getting produced despite the fact that I was watch with a gh api ... /repos/$org/$repo/actions/runners cli script the runner get registered, be online, then be removed. yet still I kept seeing that message saying myshoes-e06dbeee-2566-46af-86f1-6ea1f933aca3 is not found in GitHub, will retry... So I threw a bunch of logging in and finally nailed down to a test with to a minimum of this call

func Test(t *testing.T) {
    config.Load()
    config.Config.Debug = true

    client, err := NewClient(os.Getenv("TEST_PAT"))
    if err != nil {
        t.Fatalf("%v", err)
    }
    count := 0
    for count < 300 {
        count++
        ctx := context.Background()
        var opts = &github.ListOptions{
            Page:    0,
            PerPage: 100,
        }
        runners, _, err := client.Actions.ListRunners(ctx, "REDACTED", "REDACTEDREPO", opts)
        if err != nil {
            t.Fatalf("failed to list repository runners %v", err)
        }
        for _, r := range runners.Runners {
            fmt.Println("API Response:" + *r.Name + "\n")
        }
        time.Sleep(400 * time.Millisecond)
    }
}

Running against my shell script...

until [ "" == "!" ]; do
    echo "### SSSTTTAAARRRTTT"
    date
    gh api \
      -H "Accept: application/vnd.github+json" \
      -H "X-GitHub-Api-Version: 2022-11-28" \
      --hostname REDACTED \
      -q '.runners[] | "\(.name, .status)"' \
      /repos/$org/$repo/actions/runners
    echo "### EEENNNDDD"
    sleep 2
done

and sure enough! I saw that the bash process was accurate but the client you are producing was reporting stale information.

So, I dug deeper! and found this...

func NewClient(token string) (*github.Client, error) {
    oauth2Transport := &oauth2.Transport{
        Source: oauth2.StaticTokenSource(
            &oauth2.Token{AccessToken: token},
        ),
    }
    transport := &httpcache.Transport{
        Transport:           oauth2Transport,
        Cache:               httpCache,
        MarkCachedResponses: true,
    }

So, I decide, hey lets deactivate that and see what happens...

this

return github.NewEnterpriseClient(config.Config.GitHubURL, config.Config.GitHubURL, &http.Client{Transport: transport})

became this

return github.NewEnterpriseClient(config.Config.GitHubURL, config.Config.GitHubURL, &http.Client{Transport: oauth2Transport})

difference is, I'm not using the cache transport, I'm just using the outh2Transport now.

and when I put bash against the go test, they matched up! Good.

So I'm thinking is happening is the following...

Fast Job Race Condition
GHE -> MS (new job)
MS -> AWS (new EC2) 2024/05/13 14:02:50 instance create successfully! (job: 80813ad7-dbad-4f4f-aaba-147bbe4057fa, cloud ID: i-0a3d5b43ed7e398ad)
MS -> GHE (poll) is not found in GitHub, will retry... 
EC2 -> GHE (register)
MS -> GHE (poll) is not found in GitHub, will retry... 
GHE -> GHE (schedules on EC2)
MS -> GHE (poll) is not found in GitHub, will retry... 
GHE -> GHE (unregisters worker)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
EC2 -> GHE (starts working on job)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
EC2 -> GHE (finishes the job)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
... 5 minutes of this ...
... myshoes now thiks the job never got picked up and tries to create a new runner ---
MS -> AWS (new EC2) 2024/05/13 14:07:56 instance create successfully! (job: 80813ad7-dbad-4f4f-aaba-147bbe4057fa, cloud ID: i-03b972b4d9ec4ca14)
... if your shoes-aws implementation doesn't catch the return of "80813ad7-dbad-4f4f-aaba-147bbe4057fa" you will get a new ec2 instance that sits idle.
MS -> MS (polling through function sanitizeGitHubRunner) "{...} is idle and not running 6h0m0s, so not will delete ..."

So eventually, after paying for a runner for 6 hours this gets cleaned up.

alfred-stokespace commented 2 months ago

As a mitigation for this, I'm going to do the following in my private fork.

I'm creating a new uncached client method


func NewUncachedClient(token string) (*github.Client, error) {
    oauth2Transport := &oauth2.Transport{
        Source: oauth2.StaticTokenSource(
            &oauth2.Token{AccessToken: token},
        ),
    }

    if !config.Config.IsGHES() {
        return github.NewClient(&http.Client{Transport: oauth2Transport}), nil
    }

    return github.NewEnterpriseClient(config.Config.GitHubURL, config.Config.GitHubURL, &http.Client{Transport: oauth2Transport})
}

which will get called from starter.go's checkRegisteredRunner function

func (s *Starter) checkRegisteredRunner(ctx context.Context, runnerName string, target datastore.Target) error {
    client, err := gh.NewUncachedClient(target.GitHubToken)

which will then be used lower down ...

if _, err := gh.ExistGitHubRunner(cctx, client, owner, repo, runnerName); err == nil {
      // success to register runner to GitHub
      return nil
} else if !errors.Is(err, gh.ErrNotFound) {

and we'll see whether that blows everything up or not :)