web-platform-tests / pulls.web-platform-tests.org

[Deprecated] Some functionalities are now provided by wpt-pr-bot https://github.com/web-platform-tests/wpt-pr-bot
Frequent downtime on pulls.web-platform-tests.org #47

Open foolip opened 6 years ago

foolip commented 6 years ago

https://pulls.web-platform-tests.org/ is now returning 504 Gateway Time-out.

@mdittmer set up https://bit.ly/ecosystem-infra-status a while ago, and from that it's clear that downtime is pretty frequent, with some downtime almost every day. This matches my experience: every so often when I take a look, it's slow or down. Recent reports of the same kind: https://github.com/w3c/wpt-pullresults/issues/39 https://github.com/w3c/wpt-pullresults/issues/42 https://github.com/w3c/wpt-pullresults/issues/46

I'm calling this a roadmap issue, because apparently there's something not quite right about the setup causing it to frequently go down. Let's call this resolved when we've seen a week with no downtime.

@mdittmer, can you increase the checking rate to 5 minutes for these checks?
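To make the "week with no downtime" criterion concrete, here is a minimal sketch of a 5-minute check and the resolution test. It is not how https://bit.ly/ecosystem-infra-status works; `probe`, `record`, `is_up`, and `resolved` are hypothetical names, and treating any 2xx/3xx response as "up" is an assumption.

```python
import urllib.error
import urllib.request
from datetime import datetime, timedelta

CHECK_INTERVAL = timedelta(minutes=5)  # the checking rate requested above
REQUIRED_STREAK = timedelta(days=7)    # "a week with no downtime"


def probe(url, timeout=30):
    """Return the HTTP status for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 504 Gateway Time-out
    except OSError:
        return None  # connection refused, DNS failure, socket timeout


def is_up(status):
    """Assumption: any 2xx or 3xx status counts as up."""
    return status is not None and 200 <= status < 400


def record(checks, url):
    """Append one (timestamp, status) observation to the check log."""
    checks.append((datetime.now(), probe(url)))


def resolved(checks, required=REQUIRED_STREAK):
    """checks: list of (datetime, status), oldest first.

    True if the trailing run of successful checks spans at least `required`.
    """
    streak_start = None
    for when, status in checks:
        if is_up(status):
            if streak_start is None:
                streak_start = when
        else:
            streak_start = None  # any failure resets the streak
    if streak_start is None:
        return False
    return checks[-1][0] - streak_start >= required
```

A scheduler (cron, or a loop that sleeps for `CHECK_INTERVAL`) would call `record` for each URL being watched, and `resolved` would be evaluated over the accumulated log.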

foolip commented 6 years ago

@lukebjerring FYI

boazsender commented 6 years ago

This appears to be caused by long SELECT times in postgres.

This usually causes the web server to hang, which makes the application appear down, but results still get aggregated.

In the case of #56, this may have caused the results to never be populated.

Two possible solutions:

1) Increase CPU resources on the server (postgres selects appear to be CPU-bound according to htop).
2) Separate the web server from the database server, and consider using a managed database product, such as Amazon RDS, in production.

If/when this second option is taken, we should also consider how the pullresults services will share resources, data models, and programs with the [w3c/wptdashboard] and http://wpt.fyi constellation of services.

foolip commented 6 years ago

It's surprising that there are selects that take anything more than milliseconds given the small amount of data in the system still. What are those queries?

boazsender commented 6 years ago

I'm not sure; I'll have to chase this a bit more through the Flask ORM. I'll likely do so when we're closer to implementing a solution, though a better-tuned computer is probably what is actually in order.

For what it's worth, I observed multi-second postgres selects in htop for each load of the home page. When I loaded it several times, the server became non-responsive.
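One way to find out which statements are slow, without waiting on the ORM investigation, is to time each statement at the cursor level. The following is a rough sketch only: `TimedCursor` and `SLOW_THRESHOLD_S` are hypothetical names, the 0.5 second threshold is an assumption, and it wraps a generic Python DB-API cursor rather than hooking into the application's actual ORM layer. Any DB-API cursor (including one from a postgres driver) should fit the same shape.

```python
import logging
import time

log = logging.getLogger("slow_queries")
SLOW_THRESHOLD_S = 0.5  # hypothetical threshold; tune to taste


class TimedCursor:
    """Wrap a DB-API cursor and record statements slower than a threshold."""

    def __init__(self, cursor, threshold=SLOW_THRESHOLD_S,
                 clock=time.perf_counter):
        self._cursor = cursor
        self._threshold = threshold
        self._clock = clock
        self.slow = []  # (elapsed_seconds, sql) pairs

    def execute(self, sql, params=None):
        start = self._clock()
        try:
            if params is None:
                return self._cursor.execute(sql)
            return self._cursor.execute(sql, params)
        finally:
            # Runs whether the statement succeeded or raised.
            elapsed = self._clock() - start
            if elapsed >= self._threshold:
                self.slow.append((elapsed, sql))
                log.warning("slow query (%.3fs): %s", elapsed, sql)

    def __getattr__(self, name):
        # Delegate fetchone(), fetchall(), close(), etc. to the real cursor.
        return getattr(self._cursor, name)
```

Wrapping the cursor used for a home page load and inspecting `slow` afterwards would show which selects account for the multi-second times seen in htop.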

foolip commented 6 years ago

This continues to be a serious problem. I am getting 504 Gateway Time-out on https://pulls.web-platform-tests.org/job/23710.13 and other URLs today, and https://bit.ly/ecosystem-infra-status shows very frequent downtime.

jgraham commented 6 years ago

I recommend hooking it up to New Relic to understand which queries are slow and what the downtime is like.

foolip commented 6 years ago

This is a problem today as well: I need to look at https://pulls.web-platform-tests.org/job/24794.11 to understand what's wrong with https://github.com/w3c/web-platform-tests/pull/9641, but it's returning 504 Gateway Time-out.