webinstall / webi-installers

Primary and community-submitted packages for webinstall.dev
https://webinstall.dev
Mozilla Public License 2.0
1.89k stars 206 forks

[Fixed] Webinstall.dev was down for several minutes. #855

Closed. coolaj86 closed 2 months ago

coolaj86 commented 3 months ago

Problem

17 minutes of downtime today, June 5th, from 18:39 to 18:56 UTC.

Retrospective

What happened:

  1. There was a typo in the Authorization header, so authorization was not sent correctly
  2. Production makes many requests, quickly reaching the rate limits (which cannot easily be mocked in testing)
  3. The error was thrown in an async function, which caused the server to restart
  4. The server refreshes at least one random package on start, which caused it to fail on start
  5. Repeated failures in rapid succession caused systemd to stop relaunching the service

What to do about it:

  1. Fix the typo.
     This passed review without notice. It couldn't reasonably have been caught in testing. The typo was a valid word, so spellcheck didn't catch it either. 🤷‍♂️ As humans, we make mistakes.
  2. Reconsider the error handling.
     Not sure whether this category of error should cause this level of failure. The severity of the failure made it easy to identify and, since a user can't directly invoke this sort of failure remotely, it doesn't seem to present an attack vector.
  3. More time between hotfixes and refactors.
     "While we're here, might as well..." was the real root cause. It was not necessary to switch to using fetch (#852) in order to solve #850 and #851. Even though the commits were distinct, the process was not. If I had waited for a truly separate review of that change, the error would have been more likely to be caught (i.e. less review fatigue).
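
For item 2, one option is to contain the startup refresh so that a rejected promise is logged instead of taking the process down. This is only a minimal sketch: `refreshOnStart`, `refreshPackage`, and the returned `{ ok, error }` shape are hypothetical names for illustration, not taken from the webi codebase.

```javascript
'use strict';

// Hypothetical wrapper around whatever refresh the server runs on start.
// It awaits the refresh inside try/catch, so a failure (for example a
// rate-limit error from the releases API) becomes a logged warning rather
// than an unhandled rejection.
async function refreshOnStart(refreshPackage, log = console.warn) {
  try {
    await refreshPackage();
    return { ok: true, error: null };
  } catch (err) {
    log(`startup refresh failed, continuing startup: ${err.message}`);
    return { ok: false, error: err };
  }
}
```

Whether to swallow the error is the open question in item 2. In Node.js 15 and later an unhandled promise rejection terminates the process by default, so the crash-and-restart behavior is what you get when nothing catches the rejection; a wrapper like this trades that loud failure for a quieter one.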

Status Updates

Not sure why yet. Investigating.

Possibly related to the change in fetching github releases and a difference between the staging and production environment.

coolaj86 commented 3 months ago
{
  "message": "API rate limit exceeded for 128.199.9.106. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)",
  "documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting"
}

This is unexpected, as the adjacent logs also indicate the username, which implies that the token was being used.

Also strange that a restart of the server "fixed" it.
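
One way to check whether a token is actually being applied is to look at the rate-limit headers on the response rather than at local logs. GitHub's REST API returns `x-ratelimit-limit` and `x-ratelimit-remaining`, and documents a limit of 60 requests per hour for unauthenticated calls. A rough sketch, where `describeRateLimit` is a hypothetical helper and `headers` is assumed to be a fetch-style `Headers` object (or a `Map` with lowercase keys):

```javascript
'use strict';

// GitHub documents 60 requests/hour for unauthenticated REST API calls.
const UNAUTHENTICATED_LIMIT = 60;

// Reads one numeric header; returns NaN when the header is absent.
function readNumber(headers, name) {
  const raw = headers.get(name);
  return raw == null ? NaN : Number(raw);
}

// Hypothetical helper: summarizes the rate-limit headers of a response.
function describeRateLimit(headers) {
  const limit = readNumber(headers, 'x-ratelimit-limit');
  const remaining = readNumber(headers, 'x-ratelimit-remaining');
  return {
    limit,
    remaining,
    // A limit at or below the anonymous quota suggests that the request
    // was treated as unauthenticated, whatever the local logs say.
    looksUnauthenticated: Number.isFinite(limit) && limit <= UNAUTHENTICATED_LIMIT,
  };
}
```

The error message above also names an IP address rather than an account, which is consistent with the requests being counted as anonymous.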

coolaj86 commented 3 months ago

Typo in the authorization header.
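
The thread doesn't say which word was mistyped, so this does not reproduce the bug. For reference only, here is a minimal sketch of a well-formed set of GitHub request headers. `githubHeaders` is a hypothetical helper and the `User-Agent` value is a placeholder; neither comes from the webi codebase.

```javascript
'use strict';

// Hypothetical helper: builds request headers for the GitHub REST API.
// GitHub documents the form "Authorization: Bearer <token>".
function githubHeaders(token) {
  const headers = {
    Accept: 'application/vnd.github+json',
    'User-Agent': 'example-release-fetcher', // placeholder value
  };
  if (token) {
    // Only send the header when a token is present; a request without it
    // is counted against the much lower anonymous rate limit.
    headers.Authorization = `Bearer ${token}`;
  }
  return headers;
}
```

It would be passed along as `fetch(url, { headers: githubHeaders(token) })`, with `token` read from wherever the deployment keeps it (an environment variable is a common choice, but that is an assumption here).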