web-platform-tests / wpt

Test suites for Web platform specs — including WHATWG, W3C, and others
https://web-platform-tests.org/
Other
4.99k stars 3.09k forks source link

Postmortem: 310 open PR branches deleted accidentally #21424

Closed stephenmcgruer closed 4 years ago

stephenmcgruer commented 4 years ago

Owner: @stephenmcgruer Postmortem Created: 2020-01-25 15:11 EST Status: In Review Issue: No specific issue exists.

Impact: Minor, thanks to quick discovery, diagnosis, and recovery. Approximately 310 open PR branches were deleted and their associated PRs closed automatically by GitHub. All were ultimately recovered with no known data loss. The wpt/ repository was put into 'read only' mode for around 2 hours, so some PRs had a delayed landing (but minimal).

Root Cause: A git-push prune was accidentally run against the main web-platform-tests/wpt repository rather than the fork it was intended for. As the user had admin access to WPT, this began pruning branches and closing the associated pull requests.

Timeline (incident occurred and was recovered from during 2020-01-25, EST):

Lessons Learnt

Things that went well

Things that went poorly

Where we got lucky

Action Items

jgraham commented 4 years ago

I don't know exactly what command was run in this case, but there are a bunch of commands that will have a similar effect e.g.

git push --mirror
git push :*

In terms of things that went well, branch protection ensured that master wasn't affected. We also got lucky that the irc bot had a full event list; otherwise we could have got this from GitHub but it would have been more work (the webhooks have a list of events they recieved).

It's not clear to me how to keep this from happening in the future; if we removed push access to branches and only allowed PRs from forks it would have several negative consequences (bots would need to be retooled, and collaboration on PRs would be harder). One can imagine designing a backup solution that listens for pushes and stores a ref for each position of each branch head, so that nothing is GCd and you can restore branches to arbitary previous revisions. It's likely quite a lot of work though (when reviewable was enabled it did something like this).

guest271314 commented 4 years ago

I don't know exactly what command was run in this case, but there are a bunch of commands that will have a similar effect e.g.

Am not a git power user. Would not executing the prospective command with a simulation option that can be piped to all of the relevant stakeholders (and/or programmatic analysis code, though that could also result in potential hazard due to a program alone being incapable or determining the consistency of axioms within its own system) provide a means to observe the result of the command without actually performing the task? E.g., from man apt-get

 -s, --simulate, --just-print, --dry-run, --recon, --no-act
           No action; perform a simulation of events that would occur based on
           the current system state but do not actually change the system.
           Locking will be disabled (Debug::NoLocking) so the system state
           could change while apt-get is running. Simulations can also be
           executed by non-root users which might not have read access to all
           apt configuration distorting the simulation. A notice expressing
           this warning is also shown by default for non-root users
           (APT::Get::Show-User-Simulation-Note). Configuration Item:
           APT::Get::Simulate.

           Simulated runs print out a series of lines, each representing a
           dpkg operation: configure (Conf), remove (Remv) or unpack (Inst).
           Square brackets indicate broken packages, and empty square brackets
           indicate breaks that are of no consequence (rare).
jgraham commented 4 years ago

git has --dry-run for several commands, but failing to use that is exactly the same class of error as we had in this case (per irc, using the wrong remote).

Hexcles commented 4 years ago

I read the GitHub help article on branch protection rules. I also don't think there's a way to prevent this from happening (or limit the likelihood). Everyone with push access to the repo (i.e. all reviewers) can do this. Git/GitHub does not have the concept of who "creates" a branch so we can't limit branch removal to only the creator.

The only thing I can think of is about persistent logging. @jesopo IIRC you said event logs were preserved by BitBot. Would it be possible to provide a self-serve portal on BitBot to see those logs so that we don't have to rely on you being available in case this happens again? (And thanks for the help last week! It was critical to get the accurate event logs quickly.)

Hexcles commented 4 years ago

Also for the record, this is the script I wrote: https://gist.github.com/Hexcles/20def95bd19864e8644a2a85c2dd791b

jesopo commented 4 years ago

yes, that is very possible. I'll look in to it in the morning.

gsnedders commented 4 years ago

From IRC:

10:01 < jgraham> The feature we want here is "reject branch deletion iff the corresponding PR is open", which is a GH feature request

This seems like a reasonable form of branch protection to request as a feature.

Hexcles commented 4 years ago

Also this is the Chromium tracking issue: https://crbug.com/1045520

sideshowbarker commented 4 years ago

I went ahead and raised https://github.com/isaacs/github/issues/1723 (“Reject branch deletion if corresponding PR is open” GitHub feature request)

stephenmcgruer commented 4 years ago

Hi all,

Sorry for the delay, but I finally got around to polishing this up and added some lessons learned and action items. I don't think there's a huge amount we would change out of this incident, but please feel free to suggest anything else that comes to mind! (On any part of the post-mortem).

cc @jgraham @Hexcles @sideshowbarker @jesopo

stephenmcgruer commented 4 years ago

Ping @jgraham @Hexcles @sideshowbarker @jesopo - I'm going to give this another week, and then if there are no objections I will consider it complete (and open tracking bugs for the action items to finish things out).

Hexcles commented 4 years ago

There's only one AI remaining now. @jesopo does bitbot have a self-service portal to see event logs now? Thanks!

jesopo commented 4 years ago

it appears #testing already has a public access log https://w3.logbot.info/testing/

Hexcles commented 4 years ago

Yeah that logs the whole IRC channel; I was wondering if bitbot could have a more dedicated logging portal with event details, etc., but this is really just a low-priority/nice-to-have feature request. (And thanks again for the making this helpful bot!)

I'm closing this issue now.