web-platform-tests / wpt.fyi

web-platform-tests dashboard
https://wpt.fyi/

Postmortem: wpt.fyi status checks not showing up on WPT repo #1661

Open stephenmcgruer opened 5 years ago

stephenmcgruer commented 5 years ago

Owner: @stephenmcgruer
Postmortem Created: 2019-11-22 09:57 EST
Status: Published
Issue: https://github.com/web-platform-tests/wpt.fyi/issues/1660

Impact: For approximately 20.5 hours, PRs to WPT did not have status checks run on them to report the change in pass/fail rate of affected tests. 34 PRs were merged during this time. Affected PRs also did not have Safari or Edge results uploaded to wpt.fyi.

Root Cause: The wpt.fyi app's secret was changed inadvertently. This is believed to have been caused by Chrome's password manager auto-filling the secret field when the app name was changed.

Timeline

Lessons Learnt

Things that went well

Things that went poorly

Where we got lucky

Action Items

Hexcles commented 5 years ago

Action item 1: filed a bug against Chrome Password Manager: https://crbug.com/1027556

stephenmcgruer commented 5 years ago

@foolip @Hexcles postmortem should be ready for review, PTAL

Hexcles commented 5 years ago

> Setup monitoring for 500 errors from wpt.fyi (type=detect, owner=@Hexcles)

Done via Stackdriver. Tentatively set the threshold to 0.02 5XX responses per second (~= 1 error per minute).
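As a rough illustration of the threshold arithmetic above: 0.02 responses per second works out to 1.2 per minute, which is where the "about 1 error per minute" figure comes from. The sketch below is not the Stackdriver configuration; the function names and the strict "greater than" comparison are assumptions made for illustration.

```python
# Threshold taken from the comment above: 0.02 5XX responses per second.
THRESHOLD_PER_SECOND = 0.02


def per_minute(rate_per_second: float) -> float:
    """Convert a per-second rate into a per-minute rate."""
    return rate_per_second * 60


def should_alert(error_count: int, window_seconds: float,
                 threshold_per_second: float = THRESHOLD_PER_SECOND) -> bool:
    """Return True if the observed 5XX rate exceeds the threshold.

    Assumption: the alert fires on a strictly greater rate; the real
    alerting policy may compare or aggregate differently.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return error_count / window_seconds > threshold_per_second
```

Under these assumptions, one 5XX response in a 60-second window stays below the threshold, while two in the same window would trigger an alert.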

stephenmcgruer commented 5 years ago

> File bug on GitHub regarding hidden form (type=prevent, owner=@stephenmcgruer)

GitHub does not have a public issue tracker. Sent email to support@github.com with details of problem.

Hexcles commented 5 years ago

> GitHub does not have a public issue tracker. Sent email to support@github.com with details of problem.

I stumbled upon https://github.community/ just now. Not sure if this is their UserVoice or issue tracker or both.

stephenmcgruer commented 5 years ago

From that page:

> If you want to submit a feature request or feedback about either the GitHub Community Forum or GitHub itself, please use our contact form.

(The contact form is equivalent to emailing support@github.com afaik)

Hexcles commented 5 years ago

> Setup monitoring for 500 errors from wpt.fyi (type=detect, owner=@Hexcles)

Follow-up: turns out Stackdriver cannot send emails to groups, so we have to put individual emails there.

stephenmcgruer commented 5 years ago

> GitHub does not have a public issue tracker. Sent email to support@github.com with details of problem.

Follow-up: I received an email from support saying they will pass it on to the engineering team.

foolip commented 5 years ago

> There is otherwise also no monitoring of 'expected' checks for a PR to WPT

@stephenmcgruer I think this warrants an action item as it would catch problems wherever in the chain they occur, and we could also monitor Taskcluster and Azure Pipelines on PRs with this.

foolip commented 5 years ago

Other than that, this postmortem is great, LGTM!

stephenmcgruer commented 5 years ago

> @stephenmcgruer I think this warrants an action item as it would catch problems wherever in the chain they occur, and we could also monitor Taskcluster and Azure Pipelines on PRs with this.

Added action item to write a design doc for monitoring 'expected' checks on WPT PRs.
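To make the idea of monitoring 'expected' checks concrete, here is a minimal sketch of the core comparison such a monitor might perform. It is not the design from the planned doc: the check names and the `missing_checks` helper are hypothetical, and fetching the reported checks from GitHub is deliberately left out.

```python
from typing import Iterable, Set

# Hypothetical check names, chosen only to mirror the systems mentioned
# in this thread; the real check names on WPT PRs may differ.
EXPECTED_CHECKS = frozenset({"wpt.fyi", "Taskcluster", "Azure Pipelines"})


def missing_checks(reported: Iterable[str],
                   expected: Iterable[str] = EXPECTED_CHECKS) -> Set[str]:
    """Return the expected checks that never reported on a PR."""
    return set(expected) - set(reported)
```

A monitor built on this would flag any PR where the returned set is non-empty after some grace period, regardless of which link in the chain failed.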

stephenmcgruer commented 5 years ago

In terms of the repair task:

> Determine how to re-run checks for PRs that missed it

At this point I don't believe we're going to do this (most of the affected PRs have landed anyway), so marking it as such.

stephenmcgruer commented 4 years ago

We have two outstanding AIs here:

Write design doc for monitoring 'expected' checks on WPT PRs (type=detect, owner=???)
Document /api/webhook/check (type=mitigate, owner=@stephenmcgruer)

The former is on our 2020 OKRs and related to the productionization effort being led by @LukeZielinski, so I think we can expect that to happen this year. The latter is still on me; I'll try to get that done soon so we can close this out :)

Hexcles commented 4 years ago

Re-assigning to folks with action items