Open stephenmcgruer opened 5 years ago
Action item 1: filed a bug to Chrome Password Manager: https://crbug.com/1027556
@foolip @Hexcles postmortem should be ready for review, PTAL
Set up monitoring for 500 errors from wpt.fyi (type=detect, owner=@Hexcles)
Done via Stackdriver. Tentatively set the threshold to 0.02 5XX responses per second (~= 1 error per minute).
File bug on GitHub regarding hidden form (type=prevent, owner=@stephenmcgruer)
GitHub does not have a public issue tracker. Sent email to support@github.com with details of problem.
I stumbled upon https://github.community/ just now. Not sure if this is their UserVoice or issue tracker or both.
From that page:
If you want to submit a feature request or feedback about either the GitHub Community Forum or GitHub itself, please use our contact form.
(The contact form is equivalent to emailing support@github.com afaik)
Set up monitoring for 500 errors from wpt.fyi (type=detect, owner=@Hexcles)
Follow-up: turns out Stackdriver cannot send emails to groups, so we have to put individual emails there.
GitHub does not have a public issue tracker. Sent email to support@github.com with details of problem.
Follow-up: I received an email from support saying they will pass it on to the engineering team.
There is otherwise also no monitoring of 'expected' checks for a PR to WPT
@stephenmcgruer I think this warrants an action item as it would catch problems wherever in the chain they occur, and we could also monitor Taskcluster and Azure Pipelines on PRs with this.
Other than that, this postmortem is great, LGTM!
@stephenmcgruer I think this warrants an action item as it would catch problems wherever in the chain they occur, and we could also monitor Taskcluster and Azure Pipelines on PRs with this.
Added action item to write a design doc for monitoring 'expected' checks on WPT PRs.
In terms of the repair task:
Determine how to re-run checks for PRs that missed it
At this point I don't believe we're going to do this (most of the affected PRs have landed anyway), so marking it as such.
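As a rough illustration of the 'expected' checks idea discussed above, here is a minimal sketch that compares the checks a PR should have against the ones actually reported. The check names and the `missing_checks` helper are hypothetical, chosen only for illustration; the real design doc may take a different approach.

```python
from typing import Iterable, List


def missing_checks(expected: Iterable[str], reported: Iterable[str]) -> List[str]:
    """Return the expected check names that were never reported, sorted."""
    return sorted(set(expected) - set(reported))


# Hypothetical check names; the real list would come from project config.
EXPECTED = ["Taskcluster", "Azure Pipelines", "wpt.fyi"]

# Checks actually seen on a PR (hard-coded here instead of an API call).
reported = ["Taskcluster", "Azure Pipelines"]

missing = missing_checks(EXPECTED, reported)
if missing:
    print("PR is missing expected checks:", missing)
```

A monitor built this way would flag a PR regardless of which link in the chain dropped the check, which is the property asked for above.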
We have two outstanding AIs here:
Write design doc for monitoring 'expected' checks on WPT PRs (type=detect, owner=???)
Document /api/webhook/check (type=mitigate, owner=@stephenmcgruer)
The former is on our 2020 OKRs and related to the productionization effort being led by @LukeZielinski, so I think we can expect that to happen this year. The latter is still on me; I'll try to get that done soon so we can close this out :)
Re-assigning to folks with action items
Owner: @stephenmcgruer
Postmortem Created: 2019-11-22 09:57 EST
Status: Published
Issue: https://github.com/web-platform-tests/wpt.fyi/issues/1660
Impact: Approximately 20.5 hours of PRs to WPT did not have status checks run on them to report the change in pass/fail rate of affected tests. 34 PRs were merged during this time. Any affected PR would not have Safari or Edge results uploaded to wpt.fyi.
Root Cause: The wpt.fyi app's secret was changed inadvertently. This is believed to have been caused by Chrome's password manager auto-filling the secret field when the app name was changed.
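To illustrate why a changed secret produces a signature failure: GitHub signs each webhook payload with an HMAC keyed by the shared secret, and the receiver recomputes it to compare. The sketch below is not the wpt.fyi implementation; it assumes the SHA-256 form of the signature header (`sha256=<hex>`), and the secrets and payload are made up.

```python
import hashlib
import hmac


def sign(secret: bytes, payload: bytes) -> str:
    """Compute a signature in the 'sha256=<hex>' form (assumed header format)."""
    digest = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def verify(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Recompute the signature with our secret and compare in constant time."""
    expected = sign(secret, payload)
    return hmac.compare_digest(expected, signature_header)


# Hypothetical values for illustration only.
payload = b'{"action": "completed"}'
github_secret = b"secret-after-accidental-change"  # what GitHub now signs with
server_secret = b"original-secret"                 # what the server still holds

header = sign(github_secret, payload)

print(verify(server_secret, payload, header))  # secrets differ, so check fails
print(verify(github_secret, payload, header))  # secrets match, so check passes
```

Once the two sides hold different secrets, every delivery fails verification, no matter what the payload contains.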
Timeline
[...] /api/webhook/check hits a 500 error for "payload signature check failed". This goes unnoticed.
[...] /api/webhook/check is logged on the server.
Lessons Learnt
Things that went well
/api/webhook/check was set in GitHub.
Things that went poorly
/api/webhook/check is undocumented, which meant the engineer debugging was not sure what even calls that endpoint.
Where we got lucky
Action Items
Document /api/webhook/check (type=mitigate, owner=@stephenmcgruer)