web-platform-tests / wpt.live

A live version of the web-platform-tests project
https://wpt.live/

Report: 2018-08-17 service outage due to failed subprocess #13

Closed jugglinmike closed 4 years ago

jugglinmike commented 6 years ago

Today at approximately 11:15 UTC, the WPT server running in production stopped responding to HTTPS requests made to port 443. Because the server is implemented with a number of independent sub-processes, the production deployment continued to serve HTTP and WebSocket requests.

This project includes a simple recovery mechanism to automatically restart the server in case of failure. That mechanism acts only when the parent process halts, and the parent kept running after its HTTPS sub-process failed, so it was never triggered. The server was eventually restarted (and HTTPS traffic once again enabled) by an independent subsystem (namely, the subsystem which fetches the latest code from WPT and responds to changes by restarting the server).
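To illustrate why a restart-on-exit mechanism can miss this kind of failure, here is a minimal sketch in Python. The report does not say how the project's recovery mechanism is implemented, so the `keep_running` helper and its `max_runs` parameter are hypothetical, and the loop is bounded only to keep the example finite.

```python
import subprocess
import sys


def keep_running(command, max_runs):
    """Run `command` again each time it exits; return the exit codes seen.

    Hypothetical helper, not the project's actual recovery mechanism.
    """
    exit_codes = []
    for _ in range(max_runs):
        # subprocess.call blocks until the launched (parent) process exits.
        # If that process stays alive while one of its own sub-processes is
        # dead, this call never returns and no restart ever happens.
        exit_codes.append(subprocess.call(command))
    return exit_codes


def demo():
    # Stand-in for a server whose parent process exits with status 7.
    failing_server = [sys.executable, "-c", "import sys; sys.exit(7)"]
    return keep_running(failing_server, max_runs=3)
```

A loop like this restarts the server only when the process it launched exits, which is the condition that was never met during this outage.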

Because a server that is not responding to HTTPS traffic is not capable of performing its role, this is an invalid state that should be avoided. I've submitted a simple fix that ensures the parent process halts in response to failure in any sub-process:

https://github.com/web-platform-tests/wpt/pull/12557
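The general idea of a parent that halts when any sub-process fails could look roughly like the following Python sketch. It is not taken from the pull request above; the `supervise` function, the polling approach, and the child names are assumptions made for illustration.

```python
import subprocess
import sys
import time


def supervise(commands, poll_interval=0.1):
    """Start one child per command and stop everything once any child exits.

    `commands` maps a label to an argv list. Returns the label and exit code
    of the first child observed to have exited. Hypothetical helper.
    """
    children = {name: subprocess.Popen(cmd) for name, cmd in commands.items()}
    try:
        while True:
            for name, child in children.items():
                code = child.poll()  # None while the child is still running
                if code is not None:
                    return name, code
            time.sleep(poll_interval)
    finally:
        # Take the surviving children down so the whole server stops together.
        for child in children.values():
            if child.poll() is None:
                child.terminate()
        for child in children.values():
            child.wait()


def demo():
    # Stand-ins for the real servers: two long-running children and one that
    # fails immediately, playing the role of the HTTPS sub-process.
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    crasher = [sys.executable, "-c", "import sys; sys.exit(3)"]
    return supervise({"http": sleeper, "ws": sleeper, "https": crasher})
```

A caller would typically exit with a non-zero status once `supervise` returns, so that an outer restart mechanism notices the failure and brings every sub-process back up together.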

The conditions which initially caused the HTTPS sub-process to fail are not yet known. The fix referenced above will allow us to recover from the problem more rapidly, but it will not address the underlying issue. We'll keep an eye out for more information.

jugglinmike commented 4 years ago

We learned what we could from this incident, and we've since reimplemented the project, so this report can safely be closed.