webinstall / webi-installers

Primary and community-submitted packages for webinstall.dev
https://webinstall.dev
Mozilla Public License 2.0

[Bug] Every few months the version update hangs #534

Closed: coolaj86 closed this issue 1 year ago

coolaj86 commented 1 year ago

Feb 13, 2023 Update

I've deployed #565 to production.

I'm very confident, after extensive testing, that this is the final fix for the intermittent failures.

Jan 24, 2023 Update

Original

Over the past few years we've had two or three reports like this one; see https://github.com/webinstall/webi-installers/issues/533

Something won't install because it just hangs, and the version JSON can't be fetched.

I think we're either missing a timeout or mishandling one.

Come to think of it, this may be due to https://github.com/therootcompany/request.js/pull/7, and we may just need to update @root/request.
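For illustration, the kind of timeout guard I mean looks roughly like this. This is a sketch only; withTimeout, fetchReleaseJson, and the 15-second limit are made-up names and values, not the installer's actual code:

    // Sketch only: wrap any promise so a hung request rejects instead of
    // blocking forever. The helper name and the limit are hypothetical.
    function withTimeout(promise, ms) {
      let timer;
      let timeout = new Promise(function (_resolve, reject) {
        timer = setTimeout(function () {
          reject(new Error('request timed out after ' + ms + 'ms'));
        }, ms);
      });
      return Promise.race([promise, timeout]).finally(function () {
        clearTimeout(timer);
      });
    }

    // Hypothetical usage:
    // let releases = await withTimeout(fetchReleaseJson('k9s'), 15 * 1000);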

Thankfully the hangup is per-package and not everything at once, but I'd like to get to the bottom of it.

wreszelewski commented 1 year ago

Hello! It's happening again for k9s. It looks exactly like this: https://github.com/webinstall/webi-installers/issues/533

coolaj86 commented 1 year ago

Hmmm, well the previous "fix" (https://github.com/webinstall/webi-installers/pull/535) didn't work.

I've restarted the service. I'll investigate further over Christmas break.

Note to Self

hwatts commented 1 year ago

The k9s install is consistently hanging with no output at the moment.

ncstate01 commented 1 year ago

I'm seeing the same thing.

coolaj86 commented 1 year ago

Restarted again just now.

I'll escalate this to high priority. I will find a more permanent solution by this weekend at the latest.

ncstate01 commented 1 year ago

It works again. Thank you.


coolaj86 commented 1 year ago

I've reworked the release update code.

I believe I discovered (and refactored away) a promise deadlock, and put in better failsafes for when a request hangs or fails.
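To illustrate the class of bug (a simplified sketch, not the actual webi code): if an in-flight promise is cached for deduplication and a failure never clears it, every later caller reuses a promise that will never produce a result.

    // Simplified sketch of a request-deduplication cache that can deadlock.
    // fetchReleases is a hypothetical fetcher, not the real function name.
    let pending = {};

    async function getReleasesBroken(pkg) {
      if (!pending[pkg]) {
        pending[pkg] = fetchReleases(pkg);
        // BUG: if this promise rejects (or never settles), it stays cached
        // forever, so every later call waits on the same dead promise.
      }
      return pending[pkg];
    }

    async function getReleasesFixed(pkg) {
      if (!pending[pkg]) {
        pending[pkg] = fetchReleases(pkg).catch(function (err) {
          delete pending[pkg]; // clear the slot so the next call can retry
          throw err;
        });
      }
      return pending[pkg];
    }

A request that hangs (rather than rejecting) still needs a timeout on top of this so the promise settles at all.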

It's now deployed via #551.

I'll leave this issue open for another month or two just to see if the issue arises again.

P.S. I did have the solution written back when I said I would (or maybe a day late), but I wanted to test it out on the beta site for a bit before I brought it into production.

MP91 commented 1 year ago

It seems to be happening again; at least, I'm not able to install k9s. https://webinstall.dev/api/releases/k9s.json opens, but the response is empty...

I don't know what's causing this.

coolaj86 commented 1 year ago

The good news is that A) I actually saw an error log this time and B) it happened not just to k9s, but also to shfmt.

The bad news is that, mysteriously, only half of the error was logged (I double-checked the code, which is correct, so I'm completely confused by this).

It confirms that I'm getting back some sort of error from the GitHub API, but unfortunately the important part, the actual error message, wasn't there.
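For what it's worth, this is the sort of defensive logging that would have surfaced the missing half; a sketch only, since the fields attached by the HTTP client (response, statusCode, body) are assumptions here:

    // Sketch only: dump everything attached to a failed upstream call,
    // since err.message on its own may omit the interesting part.
    function logApiError(prefix, err) {
      console.error(prefix, err && err.message);
      if (err && err.stack) {
        console.error(err.stack);
      }
      if (err && err.response) {
        // assumed fields; whatever the HTTP client actually attaches
        console.error('status:', err.response.statusCode);
        console.error('body:', JSON.stringify(err.response.body));
      }
    }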

Thinking out loud...

There's only one more thing I can think to do at this point, which is to put the releases requests in a completely separate process with a file system cache:

getReleases()
    -> checkFs
        -> return mostRecentMatchingReleases
    -> checkReleaseApi
        -> if good, safely add to fs cache
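A rough sketch of the cache half of that flow in Node (the paths, names, and checkReleaseApi are placeholders, and the separate-process part is left out):

    // Sketch: read the filesystem cache, try the releases API, and only
    // overwrite the cache when the API response is actually usable.
    let Fs = require('fs').promises;
    let Path = require('path');

    let CACHE_DIR = '/tmp/releases-cache'; // placeholder location

    async function getReleases(pkg) {
      let cachePath = Path.join(CACHE_DIR, pkg + '.json');

      let cached = null;
      try {
        cached = JSON.parse(await Fs.readFile(cachePath, 'utf8'));
      } catch (e) {
        // no cache yet (or unreadable); fall through to the API
      }

      try {
        let fresh = await checkReleaseApi(pkg); // hypothetical fetcher
        if (fresh && fresh.releases && fresh.releases.length) {
          // safely add to fs cache: only overwrite with usable data
          await Fs.writeFile(cachePath, JSON.stringify(fresh), 'utf8');
          return fresh;
        }
      } catch (e) {
        // API hung, failed, or was rate-limited; fall back to the cache
      }

      return cached; // the most recent matching releases, or null
    }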

One of the strangest bits of this is that a restart fixes it, which it shouldn't if this were caused by hitting an API rate limit. And once the problem happens, it persists, which it shouldn't, because there's a known-good cache in place that buffers between good requests and bad ones.

I will work on this again today.

coolaj86 commented 1 year ago

Okay... I spent more time checking error conditions locally this time:

Now, as long as any usable data is available, a 500 will never propagate to the end user.

The worst possible outcome now is that a package silently stops getting release updates and we don't notice.

However, I'm almost certain that this was based on some sort of rate-limit trigger via GitHub Releases API.

I also increased the rate limit cache time, which was set far too low (I accidentally pushed the debug value of 5 seconds rather than 5 minutes last time).
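As a rough sketch of those two guards together (the names, values, and helper are illustrative, not the production code): serve stale-but-usable data instead of a 500, and spell the cache TTL out so a debug value can't silently ship:

    // Illustrative only; names and values are not the real ones.
    // Used wherever rate-limited responses are cached (not shown here).
    let RATE_LIMIT_CACHE_MS = 5 * 60 * 1000; // 5 minutes, not 5 seconds

    async function handleReleasesRequest(res, pkg) {
      try {
        let releases = await getReleases(pkg); // see the earlier sketch
        if (!releases) {
          throw new Error('no usable release data for ' + pkg);
        }
        res.statusCode = 200;
        res.end(JSON.stringify(releases));
      } catch (e) {
        let stale = await readStaleCache(pkg); // hypothetical helper
        if (stale) {
          // prefer old-but-usable data over surfacing a 500
          res.statusCode = 200;
          res.end(JSON.stringify(stale));
          return;
        }
        res.statusCode = 500;
        res.end('{"error":"releases temporarily unavailable"}');
      }
    }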

coolaj86 commented 1 year ago

Closing this out, as my confidence is quite high: the solution was well tested, and we haven't had additional reports in a month (whereas the issue was becoming more frequent before the fix).