web-platform-tests / wpt.live

A live version of the web-platform-tests project
https://wpt.live/

pr-preview-detect failing consistently - out of disk space #35

Closed: stephenmcgruer closed this issue 4 years ago

stephenmcgruer commented 4 years ago

See https://github.com/web-platform-tests/wpt/actions?query=workflow%3Apr-preview-detect, pr-preview-detect is failing consistently with: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='wptpr.live', port=443): Max retries exceeded with url: /.git/worktrees/24662/HEAD (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f227ceddd50>: Failed to establish a new connection: [Errno 111] Connection refused',)).

This usually means wptpr.live is in some sort of crash loop.
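
For a quick check from a shell of whether wptpr.live is serving these endpoints at all, something like the following works (the PR number is just the one from the traceback above; curl assumed available):

    # Quick reachability check, mirroring the request the workflow makes.
    # -f makes curl fail on HTTP errors; the body is discarded.
    curl -sSf -o /dev/null https://wptpr.live/.git/worktrees/24662/HEAD \
      && echo "wptpr.live is serving worktree HEADs" \
      || echo "wptpr.live is unreachable or erroring"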

stephenmcgruer commented 4 years ago

I see the following as the only suspicious thing in the logs. It repeats consistently:

A 2020-07-22T11:46:32.013707166Z 2020-07-22 11:46:32,013 INFO reaped unknown pid 298
A 2020-07-22T11:46:32.013764397Z 2020-07-22 11:46:32,013 INFO reaped unknown pid 299
A 2020-07-22T11:46:32.016807317Z open: 100
A 2020-07-22T11:46:32.019015701Z trusted: 101
A 2020-07-22T11:46:32.021600161Z active: 99
A 2020-07-22T11:46:32.032476938Z 2020-07-22 11:46:32,032 INFO reaped unknown pid 316
A 2020-07-22T11:46:32.032864487Z 2020-07-22 11:46:32,032 INFO reaped unknown pid 317
A 2020-07-22T11:46:32.035403004Z 2020-07-22 11:46:32,035 INFO reaped unknown pid 319
A 2020-07-22T11:46:32.035672515Z 2020-07-22 11:46:32,035 INFO reaped unknown pid 320
A 2020-07-22T11:46:32.038154028Z 2020-07-22 11:46:32,038 INFO reaped unknown pid 322
A 2020-07-22T11:46:32.039077945Z 2020-07-22 11:46:32,038 INFO reaped unknown pid 323
A 2020-07-22T11:46:32.040061728Z to delete:
A 2020-07-22T11:46:32.040082902Z to update: 10350 10449 11668 1234 14993 18228 18830 20165 20187 20239 20329 20351 20411 2092 21628 21646 21660 21696 21821 21959 22292 22348 22379 22504 22521 22566 22852 22854 22964 23137 23149 23281 23295 23336 23347 23369 23431 23504 23516 23527 23528 23541 23561 23562 23593 23594 23599 23638 23652 23691 23729 23789 23796 23845 23848 23853 23901 24010 24114 24133 24149 24155 24196 24200 24202 24220 24235 24342 24345 24352 24374 24394 24459 24511 24518 24521 24535 24556 24560 24566 24568 24599 24600 24601 24603 24608
A 2020-07-22T11:46:32.040092531Z to create: 24617 24618 24619 24633 24634 24658 24660 24662 24680 3060 4089 4354 4504
A 2020-07-22T11:46:32.042688789Z fatal: Unable to create '/root/wpt/.git/worktrees/10350/index.lock': File exists.
A 2020-07-22T11:46:32.042714968Z 
A 2020-07-22T11:46:32.042744895Z Another git process seems to be running in this repository, e.g.
A 2020-07-22T11:46:32.042753066Z an editor opened by 'git commit'. Please make sure all processes
A 2020-07-22T11:46:32.042758463Z are terminated then try again. If it still fails, a git process
A 2020-07-22T11:46:32.042764044Z may have crashed in this repository earlier:
A 2020-07-22T11:46:32.042788287Z remove the file manually to continue.
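
For reference, the remediation git itself suggests there is deleting the stale lock file once no git process is actually running; a rough sketch of clearing any leftovers across the mirrored worktrees (paths taken from the error above):

    # Only safe once we're sure no git process is still running in the repo.
    # Removes any leftover index.lock files under the mirrored worktrees.
    find /root/wpt/.git/worktrees -maxdepth 2 -name index.lock -print -delete
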
stephenmcgruer commented 4 years ago

Going looking for the first occurrence by searching the logs for "Another git process seems to be running in this repository".
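
For the record, the kind of query this amounts to, assuming the Cloud Logging CLI (the exact filter and flags may need tweaking):

    # Find the earliest occurrence of the lock error in the last 30 days.
    gcloud logging read \
      'textPayload:"Another git process seems to be running in this repository"' \
      --freshness=30d --order=asc --limit=1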

stephenmcgruer commented 4 years ago

The first failure logged on the GitHub side was 12 days ago, July 10th, around 10:46 EST.

stephenmcgruer commented 4 years ago

Earliest log I can find of this nature is:

2020-07-10 17:26:36.492 EDT Another git process seems to be running in this repository, e.g.

stephenmcgruer commented 4 years ago

Ah, in the period immediately before it:

Checking out files: 11% (7572/68833) Checking out files: 12% (8260/68833) Checking out files: 13% (8949/68833) Checking out files: 14% (9637/68833) Checking out files: 15% (10325/68833) Checking out files: 15% (10633/68833) Checking out files: 16% (11014/68833) Checking out files: 17% (11702/68833) Checking out files: 18% (12390/68833) Checking out files: 19% (13079/68833) Checking out files: 20% (13767/68833) Checking out files: 21% (14455/68833) Checking out files: 22% (15144/68833) Checking out files: 23% (15832/68833) Checking out files: 24% (16520/68833) Checking out files: 25% (17209/68833) Checking out files: 25% (17689/68833) DEBUG:web-platform-tests:GET /
error: unable to create file css/CSS2/selectors/first-letter-punctuation-116-ref.xht: No space left on device
error: unable to create file css/CSS2/selectors/first-letter-punctuation-116.xht: No space left on device
...
stephenmcgruer commented 4 years ago

As far as I can tell from looking at the VM instances, they have a 100GB persistent SSD attached. On my local machine, wpt is ~1 GB, so we would hit the limit at O(100) deployed WPT PRs; I'm not sure how many we're at today. From terraform.tf:

  # The "submissions" deployment requires significantly more disk space because
  # it creates a new git working directory of the WPT repository for every
  # qualifying submission.
  wpt_server_disk_size = 100
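
Back-of-envelope version of that budget (both numbers are the rough estimates above, not measurements):

    # Rough capacity estimate: ~1 GB per full wpt checkout on a 100 GB disk,
    # ignoring the base checkout and other overhead.
    disk_gb=100
    per_pr_gb=1
    echo "rough capacity: ~$((disk_gb / per_pr_gb)) mirrored PRs"
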
stephenmcgruer commented 4 years ago

Ah, the logs actually report how many; it's written right above:

A 2020-07-22T11:46:32.016807317Z open: 100
A 2020-07-22T11:46:32.019015701Z trusted: 101
A 2020-07-22T11:46:32.021600161Z active: 99

So we're out of disk space, for a start. cc @Hexcles - do you know how much more disk space would cost? I'm not sure that's the answer, but I want to get our facts together.

stephenmcgruer commented 4 years ago

To stop the bleeding I am doubling the disk space to 200GB. The instances have just come back up, and hopefully this will restore PR previews for now.

Ping @Hexcles again re disk space cost. Based on https://cloud.google.com/compute/disks-image-pricing#disk I think it's an additional $0.204 / GB / month, and we've increased the two submission servers by 100GB each, so 200 * 0.204 = $40.80 more a month. That's small but not trivial.

Longer term, I suspect we will have to start dropping PRs that haven't been updated recently. I will do some digging into the current set of actively mirrored PRs to see why there are so many.

Hexcles commented 4 years ago

Sorry for the late reply. Yeah, we are using SSD in us-central1, which costs $0.170 USD / GB / month (~$0.2+ CAD). Regarding long-term remediation, in addition to the GC we already do:

https://github.com/web-platform-tests/wpt.live/blob/42e51efc6129a66f842c1093434d0ee79a0cf6f7/terraform.tf#L91-L93

I wonder if we could use a shallow clone.
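
Something like this for the base checkout, as a rough sketch (whether a shallow object store plays nicely with the per-PR worktrees would need verifying):

    # Shallow clone of the base repository: fetch only the most recent commit,
    # so the servers don't carry the full wpt history in .git.
    git clone --depth 1 https://github.com/web-platform-tests/wpt.git /root/wpt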

Also @stephenmcgruer could you send a PR regarding the disk size?

stephenmcgruer commented 4 years ago

Also @stephenmcgruer could you send a PR regarding the disk size?

Whoops, I thought I had done that already 🤦. (Actually, I was just going to commit it directly, like the terrible person I am.)

Hexcles commented 4 years ago

Oh committing to master is also fine. I just don't want prod and master to diverge.

Hexcles commented 4 years ago

Also, reading the source code, I see each submission is a git worktree: https://github.com/web-platform-tests/wpt.live/blob/7aed8ca6d295896c9384cdedc0a5909193d276ff/src/mirror-pull-requests.sh#L77

which does not contain its own copy of the .git object store and should only be <400MB. The math doesn't quite seem to check out here.
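
i.e. something like this shape per submission, as a sketch (the exact refspec and paths in the real script may differ):

    # One working directory per PR, all sharing /root/wpt/.git's object store,
    # so each extra PR should only cost roughly the size of a checkout.
    git -C /root/wpt fetch origin refs/pull/24662/head
    git -C /root/wpt worktree add --detach /root/wpt/24662 FETCH_HEAD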

stephenmcgruer commented 4 years ago

which does not contain its own copy of the .git object store and should only be <400MB. The math doesn't quite seem to check out here.

Yeah, I discovered that on Friday too; doing a worktree add locally only adds ~350MB for me, so 100 worktrees should still fit in 100GB.

I am going to add some logging to dump disk usage and see if we can track this down. (Actually, first I'll see if I can get cloud-ssh to work, but that has historically been a pain...)
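
For reference, the cloud-ssh path I mean is roughly the following; instance name, zone, and project are placeholders:

    # SSH into one of the submissions instances via the gcloud wrapper;
    # instance name, zone, and project are placeholders.
    gcloud compute ssh <instance-name> --zone=<zone> --project=<project>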

stephenmcgruer commented 4 years ago

OK, I got SSH working (had to enable the rule). In the Docker container that's running on one of the GCE instances:

root@wpt-submissions-wpt-servers-4b9b:~/wpt# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay         193G   42G  152G  22% /
...
root@wpt-submissions-wpt-servers-4b9b:~/wpt# du -hs 24511
391M    24511

So the worktrees are ~400MB, which means in 100GB we should be able to fit O(250) of them (minus some space for the original WPT checkout and misc stuff). Right now we have 103 of them and are using 42GB, which is approximately what that math predicts.

Unfortunately this means it's very unclear why we ever got to a full disk. I'll ponder on this, but I don't see many options currently other than adding some logging to dump disk usage when it's nearly full.
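
Concretely, the logging I have in mind is something like this sketch, run periodically inside the container (the threshold and paths here are arbitrary):

    # Log overall disk usage, and dump per-worktree sizes once usage crosses
    # an arbitrary threshold, so we have data the next time this happens.
    THRESHOLD=80  # percent
    usage=$(df --output=pcent / | tail -1 | tr -dc '0-9')
    echo "disk usage: ${usage}%"
    if [ "$usage" -ge "$THRESHOLD" ]; then
      echo "disk nearly full, per-worktree sizes:"
      du -sh /root/wpt/[0-9]* 2>/dev/null | sort -rh
    fi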

There's definitely a bug where we're mirroring old and closed PRs; for example, 21696 is a directory despite https://github.com/web-platform-tests/wpt/pull/21696 being (a) closed and (b) from February. That needs to be looked at too.
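
A sketch of the kind of check I mean, via the public GitHub API (jq assumed available; unauthenticated API calls are rate-limited, so this is only for spot checks, and the real fix belongs in the mirroring script):

    # For each mirrored worktree (directories named after PR numbers),
    # report any whose PR is no longer open on GitHub.
    for dir in /root/wpt/[0-9]*; do
      pr=$(basename "$dir")
      state=$(curl -s "https://api.github.com/repos/web-platform-tests/wpt/pulls/$pr" \
                | jq -r .state)
      [ "$state" != "open" ] && echo "PR $pr is $state but still mirrored"
    done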

PS: For my future self: once SSH'd into the instance you need to connect to the Docker container, so run docker ps -a and then docker exec -it <container> bash.

stephenmcgruer commented 4 years ago

I'm returning the servers to 100GB rather than 200GB. There's no evidence that we need more than that: as noted, each open PR is currently ~400MB, and there are ~60 open PRs now that the old ones have been closed, so we should be (and are, confirmed via SSH) using ~24GB.

There are some steps we could take here, such as using shallow clones, but realistically, unless this reproduces, we are unlikely to do anything. If this recurs in the future, we now have serial port and SSH access enabled to debug it.