whatwg / misc-server

Miscellaneous resources for the servers hosting *.whatwg.org domains

Figure out the server logging situation #89

Closed annevk closed 4 years ago

annevk commented 6 years ago

It'd be good to know the facts on what DigitalOcean ends up storing from visitors and what we store ourselves (presumably whatever Nginx and Apache default to).

And Amazon S3 for whatpr.org.

foolip commented 6 years ago

DigitalOcean itself doesn't store any access logs AFAICT, or at least I can't find anything in their dashboard.

For nginx (marquee) I thought we didn't have any logs at all, since I never included logging in any configuration file and in fact removed it from examples, but it seems there's default logging. It includes the IP, date, UA string, and the HTTP request line, like "GET / HTTP/2.0".

For Apache (multicol) we also get access and error logs.

For nginx on noembed (the node server) we also get the nginx logs. @domenic, is there any extra logging at the node level?

domenic commented 6 years ago

No access logs. pm2 maintains error logs and when-did-we-restart-the-server logs.

othermaciej commented 5 years ago

The Steering Group is working on a privacy policy for WHATWG.org. It would be really useful to know in more detail what the various web servers log and how long that information is retained. Can anyone provide samples of the logs for the various web servers? If posting them publicly is not a good idea, privately emailing them to me would be OK, and I'll show them to the folks drafting the privacy policy.

sideshowbarker commented 5 years ago

For blog.whatwg.org and wiki.whatwg.org, it looks like the default Debian Apache2 logging is being used. That seems to amount to 14 days of log files. Logs older than 14 days are removed. So as of today (March 20), the oldest log file is for March 7. The logs are just in the standard Apache log format: IP, date, HTTP request method and URL path, HTTP response code, UA string.

For all the other domains, it looks like the default Debian nginx logging is being used. As with Apache, that seems to amount to 14 days of log files. Logs older than 14 days are removed. So as of today (March 20), the oldest log file is for March 7. The logs are in the same format as Apache logs: IP, date, HTTP request method and URL path, HTTP response code, UA string.
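The retention arithmetic in the comment above can be checked with a small sketch. The year (2019) is taken from the log sample posted later in this thread, and the cutoff rule (a log is kept while it is at most 14 days old) is an assumption, not something read from the server configuration. The function names are hypothetical.

```python
from datetime import date, timedelta

# 14 days comes from the comment above; the exact cutoff rule is an assumption.
RETENTION_DAYS = 14


def oldest_possible_log_date(today, retention_days=RETENTION_DAYS):
    """Return the earliest date a log file could still have under the assumed rule."""
    return today - timedelta(days=retention_days)


def is_retained(log_date, today, retention_days=RETENTION_DAYS):
    """Return True if a log file for log_date would still be on disk on today."""
    return log_date >= oldest_possible_log_date(today, retention_days)


today = date(2019, 3, 20)                        # "as of today (March 20)"
cutoff = oldest_possible_log_date(today)         # 14 days before March 20
observed_oldest_ok = is_retained(date(2019, 3, 7), today)
```

Under this assumed rule, an oldest log file dated March 7 is consistent with a 14-day window ending on March 20.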

annevk commented 5 years ago

For whatpr.org, https://aws.amazon.com/compliance/data-privacy-faq/ might help, though on a quick skim I couldn't find the information we're looking for. Also note that we currently do not have access ourselves: @tobie still has the keys for the backing S3 bucket.

tobie commented 5 years ago

How about we take that as an opportunity to transfer the AWS account?

FYI: PR Preview runs on Heroku and logs a number of things on https://papertrailapp.com/. I think those logs are retained for a week only.

foolip commented 5 years ago

I can confirm what @sideshowbarker says for marquee, which serves whatwg.org itself and all specs, everything static really. The oldest current log entry is March 8. Here's a sample of the access logs with IPs changed:

1.2.3.4 - - [08/Mar/2019:06:25:15 +0000] "HEAD /specs/web-apps/current-work/ HTTP/1.1" 301 0 "-" "Java/1.7.0_80"
1.2.3.5 - - [08/Mar/2019:06:25:16 +0000] "GET /standard-shared-with-dev.css HTTP/2.0" 200 2922 "https://encoding.spec.whatwg.org/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"
1.2.3.6 - - [08/Mar/2019:06:25:16 +0000] "GET /file-issue.js HTTP/2.0" 200 4981 "https://encoding.spec.whatwg.org/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"

foolip commented 5 years ago

@othermaciej are there other logs you would like samples of as well? They'd all be very similar to this.
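To make explicit which fields the access log sample above carries, here is a minimal parsing sketch. It assumes the lines follow the common "combined" log layout, which the sample appears to match; the regular expression and the `parse_line` helper are illustrative and not part of any server configuration in this repository.

```python
import re

# Assumed layout: IP, identity, user, [timestamp], "request line",
# status, response size, "referer", "user agent".
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of the logged fields, or None if the line does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


# First line of the sample posted above (IP already changed by the poster).
sample = (
    '1.2.3.4 - - [08/Mar/2019:06:25:15 +0000] '
    '"HEAD /specs/web-apps/current-work/ HTTP/1.1" 301 0 "-" "Java/1.7.0_80"'
)
fields = parse_line(sample)
```

Besides the IP, date, request line, status code, and UA string summarized earlier in the thread, the sample lines also include the response size and the Referer header, which may matter for the privacy policy.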

othermaciej commented 5 years ago

I think that sample is sufficient to cover all the Apache and nginx servers. I'll share that sample, a summary of what it contains, and the 14-day retention window with the people drafting the privacy policy.

It sounds like the only remaining case where we don't have a definitive answer yet is whatpr.org.

annevk commented 5 years ago

Until #75 is fixed there's also lists.whatwg.org, which isn't really accessible due to HSTS and is still on DreamHost.

tobie commented 5 years ago

  1. PR Preview relies on Heroku (for hosting the application), Papertrail (for logs), and GitHub's API. The application is stateless beyond that.

    1. Papertrail logs:
    2. Heroku:
  2. whatpr.org relies on the following AWS solutions

    1. AWS S3 (two S3 buckets hosted in Northern Virginia; logging is disabled on both).
    2. AWS Route 53
    3. AWS CloudFront

I hope this helps.

tobie commented 5 years ago

And here's what those PR Preview logs look like:

Mar 23 09:34:08 pr-preview heroku/router: at=info method=POST path="/github-hook" host=pr-preview.herokuapp.com request_id=36393b5e-346c-4e99-a5fe-1ff9f703bd56 fwd="192.30.252.39" dyno=web.1 connect=0ms service=3ms status=200 bytes=219 protocol=https 
Mar 23 09:34:08 pr-preview app/web.1: Currently running: [] 
Mar 23 09:34:08 pr-preview app/web.1: Found repo config file { src_file: 'index.bs', 
Mar 23 09:34:08 pr-preview app/web.1:   type: 'bikeshed', 
Mar 23 09:34:08 pr-preview app/web.1:   params:  
Mar 23 09:34:08 pr-preview app/web.1:    { 'md-status': 'LS-COMMIT', 
Mar 23 09:34:08 pr-preview app/web.1:      'md-warning': 'Commit {{ sha }} {{ pull_request.head.repo.html_url }}/commit/{{ sha }} replaced by {{ config.ls_url }}', 
Mar 23 09:34:08 pr-preview app/web.1:      'md-title': '{{ config.title }} (Pull Request Snapshot #{{ pull_request.number }})', 
Mar 23 09:34:08 pr-preview app/web.1:      'md-Text-Macro': 'SNAPSHOT-LINK {{ config.back_to_ls_link }}' }, 
Mar 23 09:34:08 pr-preview app/web.1:   ls_url: 'https://streams.spec.whatwg.org/', 
Mar 23 09:34:08 pr-preview app/web.1:   title: 'Streams Standard', 
Mar 23 09:34:08 pr-preview app/web.1:   back_to_ls_link: '<a href="https://streams.spec.whatwg.org/" id="commit-snapshot-link">Go to the living standard</a>', 
Mar 23 09:34:08 pr-preview app/web.1:   post_processing: { name: 'emu-algify', options: { throwingIndicators: true } } } 
Mar 23 09:34:09 pr-preview app/web.1: s3: Bucket name: whatpr.org. 
Mar 23 09:34:09 pr-preview app/web.1: Fetch: https://api.csswg.org/bikeshed/?url=https%3A%2F%2Fraw.githubusercontent.com%2Fsurma-dump%2Fstreams%2Feafd8637479cad13bb1f3bdec917efc762131b1e%2Findex.bs&md-status=LS-COMMIT&md-warning=Commit%20eafd8637479cad13bb1f3bdec917efc762131b1e%20https%3A%2F%2Fgithub.com%2Fsurma-dump%2Fstreams%2Fcommit%2Feafd8637479cad13bb1f3bdec917efc762131b1e%20replaced%20by%20https%3A%2F%2Fstreams.spec.whatwg.org%2F&md-title=Streams%20Standard%20(Pull%20Request%20Snapshot%20%23999)&md-Text-Macro=SNAPSHOT-LINK%20%3Ca%20href%3D%22https%3A%2F%2Fstreams.spec.whatwg.org%2F%22%20id%3D%22commit-snapshot-link%22%3EGo%20to%20the%20living%20standard%3C%2Fa%3E 
Mar 23 09:34:09 pr-preview app/web.1: s3: Bucket name: whatpr.org. 
Mar 23 09:34:09 pr-preview app/web.1: Fetch: https://api.csswg.org/bikeshed/?url=https%3A%2F%2Fraw.githubusercontent.com%2Fwhatwg%2Fstreams%2Fa7f62107f12d223f093f6bb64a197c7489f25765%2Findex.bs&md-status=LS-COMMIT&md-warning=Commit%20a7f62107f12d223f093f6bb64a197c7489f25765%20https%3A%2F%2Fgithub.com%2Fsurma-dump%2Fstreams%2Fcommit%2Fa7f62107f12d223f093f6bb64a197c7489f25765%20replaced%20by%20https%3A%2F%2Fstreams.spec.whatwg.org%2F&md-title=Streams%20Standard%20(Pull%20Request%20Snapshot%20%23999)&md-Text-Macro=SNAPSHOT-LINK%20%3Ca%20href%3D%22https%3A%2F%2Fstreams.spec.whatwg.org%2F%22%20id%3D%22commit-snapshot-link%22%3EGo%20to%20the%20living%20standard%3C%2Fa%3E 
Mar 23 09:34:22 pr-preview app/web.1: s3: Attempting to cache streams/999/a7f6210.html. 
Mar 23 09:34:30 pr-preview app/web.1: s3: Attempting to cache streams/999.html. 
Mar 23 09:34:30 pr-preview app/web.1: s3: Succesfully cached streams/999/a7f6210.html. 
Mar 23 09:34:30 pr-preview app/web.1: s3: Available at https://whatpr.org/streams/999/a7f6210.html. 
Mar 23 09:34:30 pr-preview app/web.1: s3: Succesfully cached streams/999.html. 
Mar 23 09:34:30 pr-preview app/web.1: s3: Available at https://whatpr.org/streams/999.html. 
Mar 23 09:34:30 pr-preview app/web.1: s3: Bucket name: whatpr.org. 
Mar 23 09:34:30 pr-preview app/web.1: Fetch: https://services.w3.org/htmldiff?doc1=https%3A%2F%2Fwhatpr.org%2Fstreams%2F999%2Fa7f6210.html&doc2=https%3A%2F%2Fwhatpr.org%2Fstreams%2F999.html 
Mar 23 09:34:38 pr-preview app/web.1: s3: Attempting to cache streams/999/a7f6210...eafd863.html. 
Mar 23 09:34:38 pr-preview app/web.1: s3: Succesfully cached streams/999/a7f6210...eafd863.html. 
Mar 23 09:34:38 pr-preview app/web.1: s3: Available at https://whatpr.org/streams/999/a7f6210...eafd863.html. 

And it turns out I can't successfully spell "successfully" in a log.

foolip commented 4 years ago

Is there anything left to do here? Should the answer be documented and kept up to date somewhere, or was this a one-time audit?

foolip commented 4 years ago

This has now been figured out and is covered by https://whatwg.org/privacy-policy.