Closed stephenmcgruer closed 4 years ago
I don't know exactly what command was run in this case, but there are a bunch of commands that will have a similar effect e.g.
git push --mirror
git push :*
In terms of things that went well, branch protection ensured that master
wasn't affected. We also got lucky that the irc bot had a full event list; otherwise we could have got this from GitHub but it would have been more work (the webhooks have a list of events they recieved).
It's not clear to me how to keep this from happening in the future; if we removed push access to branches and only allowed PRs from forks it would have several negative consequences (bots would need to be retooled, and collaboration on PRs would be harder). One can imagine designing a backup solution that listens for pushes and stores a ref for each position of each branch head, so that nothing is GCd and you can restore branches to arbitary previous revisions. It's likely quite a lot of work though (when reviewable was enabled it did something like this).
I don't know exactly what command was run in this case, but there are a bunch of commands that will have a similar effect e.g.
Am not a git
power user. Would not executing the prospective command with a simulation option that can be piped to all of the relevant stakeholders (and/or programmatic analysis code, though that could also result in potential hazard due to a program alone being incapable or determining the consistency of axioms within its own system) provide a means to observe the result of the command without actually performing the task? E.g., from man apt-get
-s, --simulate, --just-print, --dry-run, --recon, --no-act
No action; perform a simulation of events that would occur based on
the current system state but do not actually change the system.
Locking will be disabled (Debug::NoLocking) so the system state
could change while apt-get is running. Simulations can also be
executed by non-root users which might not have read access to all
apt configuration distorting the simulation. A notice expressing
this warning is also shown by default for non-root users
(APT::Get::Show-User-Simulation-Note). Configuration Item:
APT::Get::Simulate.
Simulated runs print out a series of lines, each representing a
dpkg operation: configure (Conf), remove (Remv) or unpack (Inst).
Square brackets indicate broken packages, and empty square brackets
indicate breaks that are of no consequence (rare).
git has --dry-run
for several commands, but failing to use that is exactly the same class of error as we had in this case (per irc, using the wrong remote).
I read the GitHub help article on branch protection rules. I also don't think there's a way to prevent this from happening (or limit the likelihood). Everyone with push access to the repo (i.e. all reviewers) can do this. Git/GitHub does not have the concept of who "creates" a branch so we can't limit branch removal to only the creator.
The only thing I can think of is about persistent logging. @jesopo IIRC you said event logs were preserved by BitBot. Would it be possible to provide a self-serve portal on BitBot to see those logs so that we don't have to rely on you being available in case this happens again? (And thanks for the help last week! It was critical to get the accurate event logs quickly.)
Also for the record, this is the script I wrote: https://gist.github.com/Hexcles/20def95bd19864e8644a2a85c2dd791b
yes, that is very possible. I'll look in to it in the morning.
From IRC:
10:01 < jgraham> The feature we want here is "reject branch deletion iff the corresponding PR is open", which is a GH feature request
This seems like a reasonable form of branch protection to request as a feature.
Also this is the Chromium tracking issue: https://crbug.com/1045520
I went ahead and raised https://github.com/isaacs/github/issues/1723 (“Reject branch deletion if corresponding PR is open” GitHub feature request)
Hi all,
Sorry for the delay, but I finally got around to polishing this up and added some lessons learned and action items. I don't think there's a huge amount we would change out of this incident, but please feel free to suggest anything else that comes to mind! (On any part of the post-mortem).
cc @jgraham @Hexcles @sideshowbarker @jesopo
Ping @jgraham @Hexcles @sideshowbarker @jesopo - I'm going to give this another week, and then if there are no objections I will consider it complete (and open tracking bugs for the action items to finish things out).
There's only one AI remaining now. @jesopo does bitbot have a self-service portal to see event logs now? Thanks!
it appears #testing
already has a public access log https://w3.logbot.info/testing/
Yeah that logs the whole IRC channel; I was wondering if bitbot could have a more dedicated logging portal with event details, etc., but this is really just a low-priority/nice-to-have feature request. (And thanks again for the making this helpful bot!)
I'm closing this issue now.
Owner: @stephenmcgruer Postmortem Created: 2020-01-25 15:11 EST Status: In Review Issue: No specific issue exists.
Impact: Minor, thanks to quick discovery, diagnosis, and recovery. Approximately 310 open PR branches were deleted and their associated PRs closed automatically by GitHub. All were ultimately recovered with no known data loss. The wpt/ repository was put into 'read only' mode for around 2 hours, so some PRs had a delayed landing (but minimal).
Root Cause: A git-push prune was accidentally run against the main web-platform-tests/wpt repository rather than the fork it was intended for. As the user had admin access to WPT, this began pruning branches and closing the associated pull requests.
Timeline (incident occurred and was recovered from during 2020-01-25, EST):
git push --prune --all
, accidentally leaving out their own remote (gsnedders
). Instead, the command runs with the default remote that they cloned from, and since they have write access to web-platform-tests/wpt it begins pruning branches in the main repo.infra
repository; thankfully Hexcles does and so can 'TBR' the change and land it.Lessons Learnt
Things that went well
Things that went poorly
Where we got lucky
fetch --all
and had hourly local backups set up on their laptop, allowing fairly complete and quick recovery (15 minutes after starting to look at recovery).Action Items