Postmortem: 310 open PR branches deleted accidentally

stephenmcgruer commented 4 years ago

Owner: @stephenmcgruer
Postmortem Created: 2020-01-25 15:11 EST
Status: In Review
Issue: No specific issue exists.

Impact: Minor, thanks to quick discovery, diagnosis, and recovery. Approximately 310 open PR branches were deleted and their associated PRs closed automatically by GitHub. All were ultimately recovered with no known data loss. The wpt/ repository was put into 'read only' mode for around 2 hours, so the landing of some PRs was delayed, though only minimally.

Root Cause: A pruning git push (git push --prune --all) was accidentally run against the main web-platform-tests/wpt repository rather than the fork it was intended for. As the user had admin access to WPT, this began pruning branches, and GitHub closed the associated pull requests.
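One way to picture a guard against this class of mistake is a small wrapper that works out which remote a push will hit before running it. The following is a minimal sketch, not something that was in place during this incident. The function names and the protected-URL list are assumptions made for illustration, and the argument parsing is deliberately simplistic: it treats the first non-option argument as the remote and does not handle options that take a separate value.

```python
from typing import Mapping, Optional, Sequence

# Flags that can delete branches on the remote (assumed list, not exhaustive).
DESTRUCTIVE_FLAGS = {"--prune", "--mirror", "--delete", "-d"}

# Hypothetical list of remotes that must never receive a destructive push.
PROTECTED_URLS = {
    "https://github.com/web-platform-tests/wpt",
    "git@github.com:web-platform-tests/wpt",
}


def normalize(url: str) -> str:
    """Strip a trailing slash and '.git' so equivalent URLs compare equal."""
    url = url.rstrip("/")
    return url[:-4] if url.endswith(".git") else url


def explicit_remote(push_args: Sequence[str]) -> Optional[str]:
    """Return the first non-option argument, which git treats as the remote."""
    for arg in push_args:
        if not arg.startswith("-"):
            return arg
    return None


def check_push(push_args: Sequence[str],
               remotes: Mapping[str, str],
               default_remote: str) -> str:
    """Return the remote a push would target, or raise if it looks dangerous.

    remotes maps remote names to URLs (for example, parsed from `git remote -v`).
    default_remote is the remote git falls back to when none is named.
    """
    remote = explicit_remote(push_args) or default_remote
    url = normalize(remotes.get(remote, remote))
    destructive = any(arg in DESTRUCTIVE_FLAGS for arg in push_args)
    if destructive and url in PROTECTED_URLS:
        raise PermissionError(
            f"refusing destructive push to protected remote {remote!r} ({url})"
        )
    return remote
```

With a remotes mapping in which origin points at web-platform-tests/wpt, calling check_push(["--prune", "--all"], remotes, "origin") raises, while the same arguments with a fork's remote name appended pass through. Such a guard only helps on machines where it is installed, so it complements rather than replaces server-side protection.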

Timeline (incident occurred and was recovered from during 2020-01-25, EST):

~13:00 EST: gsnedders merges an unrelated PR
~13:00 EST: gsnedders runs git push --prune --all, accidentally leaving out their own remote (gsnedders). Instead, the command runs with the default remote that they cloned from, and since they have write access to web-platform-tests/wpt it begins pruning branches in the main repo.
OUTAGE BEGINS
13:01 EST: The GitHub bot ('BitBot') in the w3c.org IRC #testing channel begins logging the closing PRs (which happens automatically as their branches are deleted).
13:06 EST: stephenmcgruer, jgraham, and gsnedders all near-simultaneously look at IRC and see the logs.
13:08 EST: GitHub closes the last of what will turn out to be 310 PRs.
13:09 EST: gsnedders rules out the possibility of some broken script using one of their auth tokens.
13:15 EST: stephenmcgruer contacts GitHub support. Due to the timing of the unrelated PR merge, the suspicion at this time is a strange GitHub bug.
13:16 EST: Hexcles requests the log from BitBot from jesopo, to determine the scale of the problem and to aid in recovery.
13:19 EST: gsnedders confirms their GitHub security log shows no suspicious activity.
13:23 EST: gsnedders realizes that the git command they ran at ~13:00 was responsible, and communicates this information. With the root cause known, focus shifts to recovery.
13:25 EST: jgraham and gsnedders begin working on restoring the branches from the copy of gsnedders's git repository in their latest hourly backup. It is an hour old, so minimal data loss is expected.
13:28 EST: stephenmcgruer begins working on an announcement to the 'reviewers' web-platform-tests/ team (the largest group of people who likely care about this outage).
13:29 EST: Hexcles suggests and begins working on making wpt/ read-only during the outage.
13:34 EST: jesopo provides the BitBot log for the day, with timestamps.
13:37 EST: Hexcles realizes we also need to stop the Chromium importer, or it will see the closed PRs and overwrite landed Chromium-side changes (that should be exported).
13:39 EST: jgraham and gsnedders finish restoring branches from gsnedders's local git repository, and wait for GitHub to begin processing them.
13:39 EST: stephenmcgruer posts an announcement to the web-platform-tests/ reviewers team.
13:46 EST: jgraham raises a concern about how the CI system will handle having 100s of PRs re-opened simultaneously. Quick discussion, agreement to shut off the CI systems. Some debate on how to do so results in a decision to re-target the CI systems (which are all GitHub apps) to point at an empty repository in the web-platform-tests/ organization. (It appears otherwise impossible to disable a GitHub app without uninstalling it!)
13:47 EST: Hexcles sends out a CL to disable the WPT importer. Unfortunately, stephenmcgruer doesn't have OWNERS in Chromium's infra repository; thankfully Hexcles does and so can 'TBR' the change and land it.
13:58 EST: stephenmcgruer identifies 3 PRs that were not restored from the work that jgraham and gsnedders did, as they did not exist in gsnedders' checkout. stephenmcgruer tries to restore them manually (copying the diffs).
14:10 EST: jgraham filters the BitBot logs to the list of believed-affected PRs, Hexcles confirms it looks good.
14:13 EST: stephenmcgruer discovers that force-pushing to a closed PR will mean it can never be re-opened. Decision is made to just close the 3 Chromium-exported PRs instead and let the exporter re-export them (this is known to work).
14:23 EST: Chromium exporter re-exports the 3 'missing' PRs.
14:24 EST: With all branches restored, Hexcles begins working on a script to re-open the PRs.
14:56 EST: Hexcles finishes their script, tests it, and requests a review from stephenmcgruer.
14:59 EST: stephenmcgruer LGTMs, Hexcles runs the script.
15:03 EST: All PRs are believed to be re-opened. The CI systems are confirmed to have successfully ignored the re-opening.
15:12 EST: Hexcles re-enables the CI systems, confirms they do nothing (correctly!), and tests them on a few new PRs (which the CI systems pick up correctly).
15:42 EST: Hexcles re-opens the tree.
OUTAGE OVER
15:47 EST: stephenmcgruer discovers he was just resubscribed to 'web-platform-tests/'.
15:57 EST: Hexcles realizes that by closing + opening the repo, everyone was forcibly re-subscribed to the repository. Doh!
18:20 EST: Hexcles re-enables the Chromium importer.
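
The re-opening step (14:24 to 14:59) can be sketched against GitHub's REST API, where a pull request is re-opened by sending PATCH /repos/{owner}/{repo}/pulls/{number} with a state of "open". This is not the script Hexcles wrote, only a minimal illustration; the function names, the token handling, and the injectable send parameter are assumptions. As the 14:13 entry notes, it relies on each branch having been restored first, without a force-push while the PR was closed.

```python
import json
import urllib.request
from typing import Callable, Iterable, List

API_ROOT = "https://api.github.com"


def build_reopen_request(owner: str, repo: str, number: int,
                         token: str) -> urllib.request.Request:
    """Build (but do not send) the request that re-opens one pull request."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/pulls/{number}"
    body = json.dumps({"state": "open"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def reopen_all(numbers: Iterable[int], owner: str, repo: str, token: str,
               send: Callable = urllib.request.urlopen) -> List[int]:
    """Try to re-open every PR in numbers; return the ones that failed.

    send is injectable so the loop can be exercised without network access.
    """
    failed = []
    for number in numbers:
        request = build_reopen_request(owner, repo, number, token)
        try:
            with send(request):
                pass  # a 2xx response means the PR was re-opened
        except OSError:
            # urllib raises HTTPError/URLError (both OSError subclasses)
            # for HTTP error statuses and connection problems.
            failed.append(number)
    return failed
```

Returning the failed numbers rather than stopping at the first error lets the operator retry just the stragglers, which matters when several hundred PRs are involved.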

Lessons Learnt

Things that went well

Great teamwork from those that were available in the IRC channel, leading to quickly finding the root cause and recovering.

Things that went poorly

There was some confusion in IRC about who was doing what and who was responsible for what. This didn't have a huge impact but slightly slowed some response.
The method available to make WPT 'read only' caused us to resubscribe every WPT team member to email notifications for the repo -_-.

Where we got lucky

gsnedders had recently run fetch --all and had hourly local backups set up on their laptop, allowing fairly complete and quick recovery (15 minutes after starting to look at recovery).
The knowledge of the people available at the time allowed us to avoid multiple potential footguns: primarily not overwhelming the CI systems and not breaking the Chromium importer.
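
The recovery depended on a clone that still held refs for the deleted branches. The exact commands used are not recorded here. The sketch below shows one plausible shape, assuming the branches survive locally as remote-tracking refs under refs/remotes/origin/ (a later fetch with pruning would remove them). The function name and branch names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def restore_refspecs(deleted_branches: Iterable[str],
                     local_refs: Set[str],
                     remote: str = "origin") -> Tuple[List[str], List[str]]:
    """Plan a push that recreates deleted branches from remote-tracking refs.

    local_refs is the set of full ref names present in the local clone, for
    example the output of `git for-each-ref --format='%(refname)'`.
    Returns (refspecs, missing): refspecs can be passed to `git push <remote>`,
    and missing lists branches with no local copy to restore from.
    """
    refspecs = []
    missing = []
    for branch in deleted_branches:
        tracking_ref = f"refs/remotes/{remote}/{branch}"
        if tracking_ref in local_refs:
            # Push the last-fetched commit back to the branch of the same name.
            refspecs.append(f"{tracking_ref}:refs/heads/{branch}")
        else:
            missing.append(branch)
    return refspecs, missing
```

The missing list corresponds to cases like the 3 PRs found at 13:58, whose branches did not exist in the local checkout and had to be recovered another way.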

Action Items

[x] File a feature request for GitHub to reject branch deletion if the corresponding pull-request is open. (type=prevent, owner=sideshowbarker) (filed)
[ ] Make the BitBot logs available in a self-serve manner. (type=mitigate, owner=jesopo)
[x] Find and document a way to make the WPT tree 'read only' without re-subscribing all WPT team members to notifications (type=mitigate, owner=Hexcles)

jgraham commented 4 years ago

I don't know exactly what command was run in this case, but there are a bunch of commands that will have a similar effect e.g.

git push --mirror
git push :*

In terms of things that went well, branch protection ensured that master wasn't affected. We also got lucky that the irc bot had a full event list; otherwise we could have got this from GitHub but it would have been more work (the webhooks have a list of events they received).

It's not clear to me how to keep this from happening in the future; if we removed push access to branches and only allowed PRs from forks it would have several negative consequences (bots would need to be retooled, and collaboration on PRs would be harder). One can imagine designing a backup solution that listens for pushes and stores a ref for each position of each branch head, so that nothing is GCd and you can restore branches to arbitrary previous revisions. It's likely quite a lot of work though (when reviewable was enabled it did something like this).
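
The core of such a backup listener could be quite small. The sketch below assumes the listener receives GitHub push webhook payloads, which carry ref, before, after, and deleted fields, and decides which commit to pin under a backup ref. The refs/backup/ naming scheme and the function name are hypothetical, and actually creating the ref (for example with git update-ref in a mirror clone) is left out.

```python
from typing import Mapping, Optional, Tuple

ZERO_SHA = "0" * 40  # git's placeholder for "no commit" in ref updates


def backup_ref(event: Mapping[str, object],
               received_at: int) -> Optional[Tuple[str, str]]:
    """Map a push event to (backup_ref_name, sha), or None if nothing to pin.

    received_at is a Unix timestamp used to keep every position of the branch
    head under a distinct ref name.
    """
    ref = str(event["ref"])
    if not ref.startswith("refs/heads/"):
        return None  # ignore tags and other refs
    branch = ref[len("refs/heads/"):]
    # On deletion the new head is all zeros, so pin the last known head instead.
    sha = str(event["before"] if event.get("deleted") else event["after"])
    if sha == ZERO_SHA:
        return None
    return f"refs/backup/{branch}/{received_at}", sha
```

Because every pinned commit stays reachable from a ref, garbage collection cannot drop it, and a deleted branch can be recreated from its most recent backup ref.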

guest271314 commented 4 years ago

> I don't know exactly what command was run in this case, but there are a bunch of commands that will have a similar effect e.g.

Am not a git power user. Would not executing the prospective command with a simulation option that can be piped to all of the relevant stakeholders (and/or programmatic analysis code, though that could also result in potential hazard due to a program alone being incapable of determining the consistency of axioms within its own system) provide a means to observe the result of the command without actually performing the task? E.g., from man apt-get

 -s, --simulate, --just-print, --dry-run, --recon, --no-act
           No action; perform a simulation of events that would occur based on
           the current system state but do not actually change the system.
           Locking will be disabled (Debug::NoLocking) so the system state
           could change while apt-get is running. Simulations can also be
           executed by non-root users which might not have read access to all
           apt configuration distorting the simulation. A notice expressing
           this warning is also shown by default for non-root users
           (APT::Get::Show-User-Simulation-Note). Configuration Item:
           APT::Get::Simulate.

           Simulated runs print out a series of lines, each representing a
           dpkg operation: configure (Conf), remove (Remv) or unpack (Inst).
           Square brackets indicate broken packages, and empty square brackets
           indicate breaks that are of no consequence (rare).

jgraham commented 4 years ago

git has --dry-run for several commands, but failing to use that is exactly the same class of error as we had in this case (per irc, using the wrong remote).
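
The objection above is that a dry run has to be remembered, just like the remote name. A wrapper could remove that step by always running the push with --dry-run --porcelain first and asking for confirmation when the plan deletes many refs. The sketch below covers only the parsing step. It assumes the tab-separated status lines described in the git-push documentation, where a leading "-" flag marks a deleted ref; the function names and the threshold are hypothetical.

```python
from typing import List


def deleted_refs(porcelain_output: str) -> List[str]:
    """Return the remote refs that a push reports as deleted.

    Expects the output of `git push --porcelain` (with or without --dry-run),
    whose status lines are `<flag>\t<from>:<to>\t<summary>`.
    """
    deleted = []
    for line in porcelain_output.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[0].strip() == "-":
            # For deletions <from> is empty, so keep the part after the colon.
            deleted.append(fields[1].strip().split(":", 1)[-1])
    return deleted


def confirmation_needed(porcelain_output: str, max_deletions: int = 3) -> bool:
    """True when the planned push deletes more refs than the allowed threshold."""
    return len(deleted_refs(porcelain_output)) > max_deletions
```

A push that plans to delete hundreds of branches would trip any reasonable threshold, whichever remote it was aimed at.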

Hexcles commented 4 years ago

I read the GitHub help article on branch protection rules. I also don't think there's a way to prevent this from happening (or limit the likelihood). Everyone with push access to the repo (i.e. all reviewers) can do this. Git/GitHub does not have the concept of who "creates" a branch so we can't limit branch removal to only the creator.

The only thing I can think of is about persistent logging. @jesopo IIRC you said event logs were preserved by BitBot. Would it be possible to provide a self-serve portal on BitBot to see those logs so that we don't have to rely on you being available in case this happens again? (And thanks for the help last week! It was critical to get the accurate event logs quickly.)

Hexcles commented 4 years ago

Also for the record, this is the script I wrote: https://gist.github.com/Hexcles/20def95bd19864e8644a2a85c2dd791b

jesopo commented 4 years ago

yes, that is very possible. I'll look in to it in the morning.

gsnedders commented 4 years ago

From IRC:

10:01 < jgraham> The feature we want here is "reject branch deletion iff the corresponding PR is open", which is a GH feature request

This seems like a reasonable form of branch protection to request as a feature.
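
To make the requested rule concrete, here is a sketch written in the style of a git pre-receive hook, which is given one "<old> <new> <ref>" line per updated ref and sees an all-zero new value for a deletion. This is illustrative only: github.com does not let repositories install their own pre-receive hooks, which is why this has to be a feature request, and the function name and the open_pr_branches input are assumptions.

```python
from typing import Iterable, List, Set


def is_zero_sha(value: str) -> bool:
    """True for git's all-zero object id, whatever the hash length."""
    return value != "" and set(value) == {"0"}


def rejected_deletions(updates: Iterable[str],
                       open_pr_branches: Set[str]) -> List[str]:
    """Return branches whose deletion should be refused.

    updates holds '<old> <new> <ref>' lines as a pre-receive hook reads them.
    open_pr_branches is the set of branch names that back an open PR.
    """
    rejected = []
    for line in updates:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines
        _old, new, ref = parts
        if is_zero_sha(new) and ref.startswith("refs/heads/"):
            branch = ref[len("refs/heads/"):]
            if branch in open_pr_branches:
                rejected.append(branch)
    return rejected
```

A hook built on this would reject the whole push whenever the returned list is non-empty, so ordinary branch clean-up after a PR is merged or closed would still go through.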

Hexcles commented 4 years ago

Also this is the Chromium tracking issue: https://crbug.com/1045520

sideshowbarker commented 4 years ago

I went ahead and raised https://github.com/isaacs/github/issues/1723 (“Reject branch deletion if corresponding PR is open” GitHub feature request)

stephenmcgruer commented 4 years ago

Hi all,

Sorry for the delay, but I finally got around to polishing this up and added some lessons learned and action items. I don't think there's a huge amount we would change out of this incident, but please feel free to suggest anything else that comes to mind! (On any part of the post-mortem).

cc @jgraham @Hexcles @sideshowbarker @jesopo

stephenmcgruer commented 4 years ago

Ping @jgraham @Hexcles @sideshowbarker @jesopo - I'm going to give this another week, and then if there are no objections I will consider it complete (and open tracking bugs for the action items to finish things out).

Hexcles commented 4 years ago

There's only one AI remaining now. @jesopo does bitbot have a self-service portal to see event logs now? Thanks!

jesopo commented 4 years ago

it appears #testing already has a public access log https://w3.logbot.info/testing/

Hexcles commented 4 years ago

Yeah that logs the whole IRC channel; I was wondering if bitbot could have a more dedicated logging portal with event details, etc., but this is really just a low-priority/nice-to-have feature request. (And thanks again for making this helpful bot!)

I'm closing this issue now.

Postmortem: 310 open PR branches deleted accidentally #21424