web-platform-tests / wpt.live

A live version of the web-platform-tests project
https://wpt.live/

Investigate and remediate certificate renewal failure #57

Closed jcscottiii closed 11 months ago

jcscottiii commented 2 years ago

@foolip reported that the certificate for wpt.live was getting too close to expiration.


Problem

Screenshot 2022-07-07 2 59 58 PM


Diagnosis

Screenshot 1 - CPU Pegged

Screenshot 2022-07-07 2 51 34 PM

Screenshot 2 - No space left on the device

Screenshot 2022-07-07 2 58 54 PM

Screenshot 3 - Bucket has not been touched in a while

Screenshot 2022-07-07 3 00 48 PM (1)

Screenshot 4 - Logs do not indicate anything wrong

Screenshot 2022-07-07 2 57 54 PM

Screenshot 5 - Unable to log into the instance

image

Summary

CPU was pegged. Some process must have gotten out of control, since the cert renewal is only a certbot script that runs once a day. Something is also causing the server to use up all of its disk space. As a result, I could not log in to do further diagnosis. We need to restart or recreate the instance.
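For anyone who wants to catch the disk filling up before the instance becomes unreachable, here is a minimal sketch of a disk-usage check. It is not what runs on the instance today; the function names and the 90% threshold are assumptions made for illustration.

```python
import shutil


def used_fraction(total: int, free: int) -> float:
    """Return the fraction of the disk that is not free (0.0 to 1.0)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1 - free / total


def check_disk(path: str = "/", threshold: float = 0.9) -> tuple[bool, float]:
    """Report whether the filesystem holding `path` is at or above `threshold`.

    The 0.9 default is an assumed value, not one taken from the project.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    fraction = used_fraction(usage.total, usage.free)
    return fraction >= threshold, fraction


# Example use (prints whatever the current machine reports):
#   nearly_full, fraction = check_disk("/")
#   print(f"disk {fraction:.0%} used, nearly full: {nearly_full}")
```

Run from cron or a similar scheduler, a check like this could log or alert while there is still room to log in and investigate.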


Remediation Steps

Will close this issue after finishing these steps

Other recommendations

If we want to save money, move the cert renewal to a Cloud Run instance that starts up, runs, and terminates on a cron schedule. This will remove the need to keep an instance running constantly.

cc: @DanielRyanSmith

DanielRyanSmith commented 2 years ago

Great write-up @jcscottiii, thanks for looking into this 😊 The thorough documentation and info is a big help to understand the problem well.

Edit: Also just learning now that adding these checkboxes to an issue tracks the number of steps that need to be taken on the issue - very cool!

past commented 2 years ago

There is some investigation of an earlier instance of the disk running out of space in #35 that could be related.

jcscottiii commented 1 year ago

There was another failure to renew the cert recently. I did not check the serial console for the out-of-space error before restarting, but I suspect it is the same problem. I have added some alerts. If this happens again, we should migrate to the Cloud Run instance described in the "Other recommendations" section above. That change would be easier than chasing down what is causing the instance to eventually run out of disk space.
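As an illustration of the kind of check an expiry alert can be built on, here is a minimal sketch that reads a server's certificate and computes the days left until it expires. This is not the alert configuration mentioned above; the function names and the 14-day threshold are assumptions for illustration only.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the certificate's notAfter string, e.g. 'Jul 10 00:00:00 2022 GMT'.

    Uses default verification, so an already expired certificate raises
    ssl.SSLCertVerificationError instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and the notAfter timestamp."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def is_expiring_soon(not_after: str, now: datetime, threshold_days: float = 14) -> bool:
    """True when the cert expires within `threshold_days` (assumed default)."""
    return days_until_expiry(not_after, now) <= threshold_days


# Example use against the live site (requires network access):
#   not_after = fetch_not_after("wpt.live")
#   print(days_until_expiry(not_after, datetime.now(timezone.utc)))
```

Keeping the date arithmetic separate from the network call makes the threshold logic easy to test without contacting the server.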

past commented 1 year ago

There was another such failure over the weekend, and I confirmed that it was the same issue. After restarting the service, the cert was renewed again.