web-platform-tests / wpt.live

A live version of the web-platform-tests project
https://wpt.live/
15 stars 11 forks source link

wpt.live is down due to a new HTTPS port in wptserve #32

Closed stephenmcgruer closed 4 years ago

stephenmcgruer commented 4 years ago

Reported on #testing on irc.w3.org. Conversation appears to have found multiple problems:

6:50 AM Huh, fascinating. Logs are complaining about unmerged files blocking pulling.
6:51 AM Looks like src/supervisord.conf assumes the tree will be clean when it pulls
6:51 AM But why isn't the tree clean...
6:52 AM Starts at 2020-06-29 15:38:08.374 from the logs
6:53 AM Hah, it hit a merge conflict, interesting
6:53 AM `+ 95cdf1fdf4...a2c943ce89 master -> origin/master (forced update)`
6:54 AM `A 2020-06-29T19:37:16.430689458Z Auto-merging html/cross-origin-opener-policy/resources/coop-coep.py`
6:54 AM `A 2020-06-29T19:37:16.430739765Z CONFLICT (add/add): Merge conflict in html/cross-origin-opener-policy/resources/coop-coep.py`
6:56 AM This was also a  fatal loop, btw, so now looking for the beginning of that
6:57 AM Ahah! Ok, so the root cause (and one that would cause a fatal loop anyway) was one I saw coming and had on my list to fix today - wpt.live doesn't know about the new HTTPS port and that causes wptserve to fail.
6:58 AM Because the new HTTPS port isn't optional
6:58 AM So that caused a crash, but why that then caused a merge error I am not sure, nor am I sure why the code is trying to pull rather than just reset to origin/master
6:58 AM *hard reset
6:59 AM I can't imagine a case where we have or want local changes

First thing to fix will be to teach wpt.live about the new HTTPS port. Then probably fixup the scripts to git reset --hard rather than pull.

stephenmcgruer commented 4 years ago

cc @Hexcles as an FYI

stephenmcgruer commented 4 years ago

Argh, adding a new HTTPS port is much more complicated than I wanted. It relies on a Bocoup-owned literal fork-of-a-fork of a terraform config (https://github.com/bocoup/terraform-google-multi-port-managed-instance-group/blob/master/README.md), which I think will have to be changed to make this properly work. There's a chance that maybe we can just make wptserve start the port however, even if external traffic won't be allowed access...

stephenmcgruer commented 4 years ago

Attempting to push an image with 856cd76 to wpt.live. I ran:

make wpt-server-tot
make wpt-server-submissions
make publish-wpt-server-tot
make publish-wpt-server-submissions
terraform apply
stephenmcgruer commented 4 years ago

The added HTTPS port broke the Chromium WPT Importer and is being reverted in https://github.com/web-platform-tests/wpt/pull/24394

stephenmcgruer commented 4 years ago

Ok, so adding a HTTPS port eventually worked. It seemed to take a really long time for the server to redeploy; I deployed around 7:30am EST, and the change didn't take hold till ~9:00am EST.

As such, we're actually only reverting the javascript part of the change, to unbreak Chromium. So this issue can be closed - right after we file follow-up issues for the various things we've learnt (not to use git pull and to look into how long things take to deploy.)

Hexcles commented 4 years ago

We probably also need to set up new firewall rules to forward/load-balance the new port correctly, and maybe need to update the cert, too.

stephenmcgruer commented 4 years ago

🤦 oh right, I forgot I had only fired it up on wptserve but hadn't done the other parts of the problem.

Hexcles commented 4 years ago

and maybe need to update the cert, too.

Probably not. I checked our cert and the common name and DNS names don't include port numbers.

sideshowbarker commented 4 years ago

Incidentally, this is another case where it seems reasonable to expect we should have some way to ensure changes to wpt like the one that caused this don’t get merged without first notifying the wpt.live maintainers — and also me, because the change broke w3c-test.org too

stephenmcgruer commented 4 years ago

cc @web-platform-tests/wpt-core-team

We have a process for notifying such maintainers - it is the RFC process. This had an RFC - https://github.com/web-platform-tests/rfcs/pull/57, which went through the full 7 day waiting period after approval. I'm open to suggestions to improve this process, but I want to make it clear we do have a process.

Putting on my wpt.live hat, the tricky thing for me as a downstream maintainer was realizing the true impact of the change. I wrote the wptserve change, but despite that I still failed to realize that downstream configs (wpt.live, Chromium, w3c-test.org) would exist and be different, and thus cause an assert in wpt. I don't have good ideas for how to solve that better in the abstract - as long as a downstream consumer is using a live version of a project versus a pinned version, it seems likely that downstream consumer will be broken at some point. Of course for wpt.live (and w3c-test.org) using a live version is the exact point...

stephenmcgruer commented 4 years ago

I've pushed a new terraform config with 8443 exposed. That will take a bit of time to apply, and then hopefully we should be good here.

stephenmcgruer commented 4 years ago

This appears to not have fully worked; something like https://wpt.live:8443/origin-isolation/resources/send-origin-isolation-header.py?header=%3F1 doesn't respond.

I did just find a list of ports in the gce load balancing config which didn't include 8443, so I added it there and we'll see if that helps.

stephenmcgruer commented 4 years ago

(Also updated the firewall rules in the VPC network section, to add 8443 again)

stephenmcgruer commented 4 years ago

No luck so far.

gcloud compute firewall-rules list --project=wpt-live shows 8443 as allowed.

SSH-ing into one of the tot instances and running netstat, I see:

$ sudo netstat -plunt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      798539/python       
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      358/sshd            
tcp        0      0 0.0.0.0:8443            0.0.0.0:*               LISTEN      798542/python       
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      798541/python       
tcp        0      0 127.0.0.1:34683         0.0.0.0:*               LISTEN      305/containerd      
tcp        0      0 0.0.0.0:8000            0.0.0.0:*               LISTEN      798540/python       
tcp        0      0 0.0.0.0:9000            0.0.0.0:*               LISTEN      798536/python       
tcp6       0      0 :::8001                 :::*                    LISTEN      798537/python       
tcp6       0      0 :::8002                 :::*                    LISTEN      798538/python       
udp        0      0 127.0.0.53:53           0.0.0.0:*                           260/systemd-resolve 
udp        0      0 10.127.0.24:68          0.0.0.0:*                           240/systemd-network 

Weirdly the docker instances don't appear to report any ports being mapped? (But likely not relevant, given that other ports work)

$ docker ps
CONTAINER ID        IMAGE                                                                COMMAND                  CREATED             STATUS              PORTS               NAMES
a0fa4e58f675        gcr.io/wpt-live/wpt-live-wpt-server-tot                              "/usr/bin/supervisord"   47 hours ago        Up 47 hours                             klt--faah
e0b5325d65c2        gcr.io/stackdriver-agents/stackdriver-logging-agent:0.2-1.5.33-1-1   "/entrypoint.sh /usr…"   47 hours ago        Up 47 hours                             stackdriver-logging-agent
$ docker port a0fa4e58f675
$ docker port e0b5325d65c2
$

cc @Hexcles - any ideas?

Hexcles commented 4 years ago

I took a look around and believe we forgot to add the new port to the load balancer: https://github.com/web-platform-tests/wpt.live/blob/6e13edaf30860b319b5a6a29f9bd437a49835da8/infrastructure/web-platform-tests/load-balancing.tf#L16

stephenmcgruer commented 4 years ago

Ok, I pushed a new terraform update with that change and I think we're good now. That was 'fun', may we never add a new port again.