whatwg / misc-server

Miscellaneous resources for the servers hosting *.whatwg.org domains

Move ever-growing *.spec.whatwg.org storage off of the VM disk #107

Open foolip opened 4 years ago

foolip commented 4 years ago

This week marquee, which hosts all static whatwg.org sites, grew its disk usage past 80% of its 30GB and triggered an alert. I've increased the size to 50GB for now.

The constant increase is because of commit snapshots. We could compress on disk or deduplicate more, but it would still slowly grow, indefinitely. We shouldn't store these files on a fixed-size block device, but in an object store where there is no fixed upper limit.

DigitalOcean Spaces is a solution we could use, by letting nginx forward requests to it.

However, by still having all requests hit nginx we wouldn't be making full use of a solution like this. Spaces has a CDN feature with certificate handling, but it requires control over the DNS and is thus blocked by https://github.com/whatwg/misc-server/issues/75.

annevk commented 4 years ago

To clarify, request forwarding is a backend matter and does not involve redirects?

foolip commented 4 years ago

DigitalOcean Spaces doesn't support serving a website from it directly, but this is tracked in https://ideas.digitalocean.com/ideas/DO-I-318.

The smallest change that would work is to let nginx continue to handle redirects, and for requests that don't redirect proxy that to an internal Spaces endpoint. Spaces wouldn't itself ever respond with a redirect, at least not until https://ideas.digitalocean.com/ideas/DO-I-318 is fixed.
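A minimal sketch of that shape (the bucket name and redirect rule are made up; the real config would extend the existing nginx site configs):

```nginx
server {
    server_name html.spec.whatwg.org;

    # Redirect rules stay in nginx, exactly as today, e.g.:
    location = /some-old-path { return 301 /new-path/; }

    # Everything that doesn't redirect is proxied to a (hypothetical)
    # Spaces bucket instead of being read from the local disk.
    location / {
        proxy_pass https://whatwg-static.nyc3.digitaloceanspaces.com/html.spec.whatwg.org/;
        proxy_set_header Host whatwg-static.nyc3.digitaloceanspaces.com;
    }
}
```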

For all of the static sites, I think our requirements are:

foolip commented 4 years ago

The most elaborate redirect rules are in https://github.com/whatwg/misc-server/blob/master/debian/marquee/nginx/sites/whatwg.org.conf.

annevk commented 4 years ago

Sorry, to restate my question, will our end-user-visible response URLs remain unchanged?

foolip commented 4 years ago

Yes, of course; I'd rule out any solution that doesn't give us full control of the URL layout :)

foolip commented 4 years ago

Numbers in https://github.com/whatwg/meta/issues/161#issuecomment-598046081 suggest that everything would easily fit in a Git repo, but you can't serve a website from a repo so that doesn't solve everything here.

foolip commented 4 years ago

Hijacking this issue to drop some notes about using a CDN, which isn't the same problem as running out of disk space...

Some numbers based on using goaccess to analyze /var/log/nginx/access.log.{2,3,4}.gz, which seems to be about a day's worth of requests. With all hosts mixed together, we get 872.72 GiB of requests for /. Filtering to just html.spec.whatwg.org it's 721.76 GiB. So most of our traffic is serving https://html.spec.whatwg.org/. That's what I would have expected. If we were to use a CDN, we should do it for https://html.spec.whatwg.org/ first and see what that does for us.

I'm not sure about these numbers. I'm pretty sure they're the compressed size, but we're not using 30*872 GiB ~= 26 TiB of transfer per month, more like 4-5 TiB. So this analysis is probably all wrong :)
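The mismatch can be made concrete with a quick back-of-the-envelope check (numbers taken from the comment above):

```python
# Numbers from the goaccess log analysis above.
gib_per_day = 872.72      # logged bytes served per day, all hosts
observed_tib_month = 4.5  # roughly what the transfer bill shows (4-5 TiB)

# If the logged sizes reflected billable transfer, a month would be:
implied_tib_month = gib_per_day * 30 / 1024
print(f"implied: {implied_tib_month:.1f} TiB/month, "
      f"observed: ~{observed_tib_month} TiB/month")
# implied is ~25.6 TiB, a ~5x discrepancy, so the logged figure
# can't be the on-the-wire transfer.
```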

foolip commented 3 years ago

It looks like https://www.digitalocean.com/products/app-platform/ could be something to look into for this. From a cursory view, it seems more like AppEngine, in that it supports Node.js and other languages, static content, and you don't manage the servers yourself.

foolip commented 3 years ago

I have looked into using DigitalOcean spaces with nginx in front, using proxy_pass to forward requests. This would allow us to keep all the redirects, which is nice.

The main problem this runs into is that a S3-like storage bucket is just a set of named objects whose names are paths, it's not a file system. The following can't be done in the usual way and needs some other solution:

I think that if the first problem could be solved, then the second can be done with a location directive handling anything with a trailing slash, and we could generate static directory listings where we want them.
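That trailing-slash location directive could look roughly like this (a sketch; the bucket endpoint is made up):

```nginx
# Map "GET /foo/" onto the "foo/index.html" object in the bucket,
# since the bucket itself won't resolve directory URLs.
location ~ ^(?<dir>.*/)$ {
    proxy_pass https://whatwg-static.nyc3.digitaloceanspaces.com${dir}index.html;
    proxy_set_header Host whatwg-static.nyc3.digitaloceanspaces.com;
}
```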

domenic commented 3 years ago

It looks like DigitalOcean Spaces is maybe particularly bad at this: S3 has a whole "website hosting mode", see e.g. their docs on index.html files. Whereas https://www.digitalocean.com/community/questions/spaces-set-index-html-as-default-landing-page seems to have seen no activity. Maybe using S3 (which we already do for PR preview) would be the right way to go here?

foolip commented 3 years ago

Hmm, I hadn't considered just using AWS S3, but that would probably solve most of this. What's not great about it is that we'd depend on both DigitalOcean and S3 being healthy at all times.

What mystifies me is that neither S3 nor Spaces seems to have a way to set a Location header for a specific object, even though you can customize Content-Type and friends. If that were possible, this would be easy enough in Spaces too.

domenic commented 3 years ago

S3 has a complicated system: https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-page-redirect.html . It is a bit mystifying why they don't allow something simpler. E.g. the most flexible option, the JSON rules, is capped at 50. And the per-object redirect doesn't seem to let you choose the status codes.
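For concreteness, those JSON routing rules take roughly this shape (adapted from the AWS bucket-website docs; the validator example is illustrative), and a bucket is capped at 50 of them:

```json
[
  {
    "Condition": { "KeyPrefixEquals": "validator" },
    "Redirect": {
      "ReplaceKeyPrefixWith": "validator/",
      "HttpRedirectCode": "301"
    }
  }
]
```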

domenic commented 3 years ago

Probably a bad idea to diversify even further, but there's also Netlify which has very straightforward _redirects and _headers files. I can't tell if they're really meant to scale in the same way as S3, but they seem serious...

foolip commented 3 years ago

If we could put objects in the bucket which the nginx front end turns into a redirect to add a slash, then I think we'd be set. (We'd also need to generate file listings but that could be a deploy step, not too hard I think.)

@domenic do you know if S3 when hosting a static web site will redirect "directories" with no trailing slash to add a slash?

One option we could look into is "deprecating" URLs with a trailing slash and writing redirect rules for the ones we currently have. But I don't love having to muck around with our URLs because we're changing the storage solution.
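The file-listing deploy step mentioned above could be a small script run before upload, something like this sketch (the markup is just illustrative):

```python
import html
import os
import pathlib

def write_listings(root):
    """Write a minimal index.html listing into every directory under
    root that doesn't already have one, so the bucket can serve
    directory URLs as static pages."""
    for dirpath, dirnames, filenames in os.walk(root):
        if "index.html" in filenames:
            continue
        # Directories first (with trailing slash), then files.
        entries = sorted(d + "/" for d in dirnames) + sorted(filenames)
        items = "".join(
            '<li><a href="{0}">{0}</a></li>'.format(html.escape(name))
            for name in entries
        )
        pathlib.Path(dirpath, "index.html").write_text(
            "<!DOCTYPE html>\n<ul>" + items + "</ul>\n"
        )
```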

domenic commented 3 years ago

Do you know if S3 when hosting a static web site will redirect "directories" with no trailing slash to add a slash?

From https://docs.aws.amazon.com/AmazonS3/latest/userguide/IndexDocumentSupport.html :

For example, the following URL, with a trailing slash, returns the photos/index.html index document.

http://bucket-name.s3-website.Region.amazonaws.com/photos/

However, if you exclude the trailing slash from the preceding URL, Amazon S3 first looks for an object photos in the bucket. If the photos object is not found, it searches for an index document, photos/index.html. If that document is found, Amazon S3 returns a 302 Found message and points to the photos/ key. For subsequent requests to photos/, Amazon S3 returns photos/index.html. If the index document is not found, Amazon S3 returns an error.

So, it sounds like it will 302 redirect them. That appears to be similar to what we have today (e.g. https://whatwg.org/validator currently 301 redirects to https://whatwg.org/validator/.)

foolip commented 3 years ago

https://github.com/aws-samples/amazon-cloudfront-secure-static-site looks fairly promising for this.

foolip commented 7 months ago

I won't be able to make time from WHATWG infra work this year, so here's a brain dump.

The /var/www/html.spec.whatwg.org/ directory on marquee is 29 GB, that's the biggest problem in any migration. As a Git repository it's 6GB, so that rules out any solution of the shape "put everything in Git and deploy on every commit". That's unfortunate, because there are many options for that.

A solution would take the shape of a storage bucket that deploys write into, and a frontend/CDN that just serves from that bucket. The hard part is preserving all of our redirects, and I've seen no storage bucket with built-in redirect support that's expressive enough. (S3 has some stuff, not enough.) We would need something like https://developers.cloudflare.com/rules/url-forwarding/bulk-redirects/reference/csv-file-format/ I think.
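That Cloudflare format is roughly one redirect per CSV row, source URL then target URL then options (exact column set per their docs; these whatwg.org rows are illustrative, not our real redirect list):

```csv
https://whatwg.org/validator,https://whatwg.org/validator/,301
https://whatwg.org/some-old-page,https://html.spec.whatwg.org/multipage/,302
```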

This problem ought to be easy for someone who has experience maintaining large websites and migrating between hosting... if they were meticulous about preserving redirects.

That's all.