web-platform-tests / results-collection


Extensive delays due to degraded Firefox performance #602

Closed: jugglinmike closed this issue 6 years ago

jugglinmike commented 6 years ago

Over the past four days, the time to collect results has steadily increased. This project is no longer capable of collecting results at the normal cadence (4 times per day for Firefox and Chrome, once per day for Edge and Safari).

In the following chart, each data point describes a partial execution of WPT (called a "chunk"). The horizontal axis describes relative time, and the vertical axis describes "chunk" execution duration in minutes. Each cluster represents a collection attempt (for Firefox and Chrome, we make four attempts per day):

[image: screenshot from 2018-09-07 18-28-37] https://user-images.githubusercontent.com/677252/45248967-27843c80-b2e6-11e8-842a-c45555a62cad.png

Firefox typically requires 45 minutes to execute a "chunk." When a chunk's results are missing tests, this project retries the collection attempt up to two additional times.
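(For context, here is a minimal sketch of the chunked-and-retried collection pattern described above. It is not the project's actual runner code: the `--this-chunk`/`--total-chunks` flags are wptrunner's, while the chunk count, report filename, and exit-code retry condition are simplified assumptions.)

```python
import subprocess

TOTAL_CHUNKS = 100   # illustrative; the real chunk count varies per browser
MAX_ATTEMPTS = 3     # one initial attempt plus up to two retries


def run_chunk(product, this_chunk, total_chunks=TOTAL_CHUNKS):
    """Run one WPT 'chunk' and report whether the runner exited cleanly."""
    result = subprocess.run([
        "./wpt", "run", product,
        "--this-chunk", str(this_chunk),
        "--total-chunks", str(total_chunks),
        "--log-wptreport", "report-%d.json" % this_chunk,
    ])
    return result.returncode == 0


def collect(product):
    for chunk in range(1, TOTAL_CHUNKS + 1):
        for _attempt in range(MAX_ATTEMPTS):
            # The real runner retries when a chunk's results are missing
            # tests; checking the exit code here is a simplification.
            if run_chunk(product, chunk):
                break
```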

I have downloaded the logs from a handful of the outliers, and all of them exhibited the crash reported in gh-592. Part of the explanation for their extended execution time is that they are re-trying. I'm still not convinced of a correlation, though, because this delay has only become apparent over the past four days (and because some of those delays are far greater than the 200% increase we'd expect from two retries).

That chart shows how the delay has caused collection efforts to overlap--we are currently scheduling collections faster than we can execute them. This is also visible in the build history for Edge and Safari. We collect from these browsers once per 24 hours, and we do so using many more "chunks":

[image: screenshot from 2018-09-07 20-25-17] https://user-images.githubusercontent.com/677252/45248968-2b17c380-b2e6-11e8-8552-51211500fe68.png

As noted in gh-592, a patch has been filed for crash recovery. I am unable to merge that patch because it was submitted to a downstream project. Besides, it's now Friday evening, local time, so I'd be reluctant to perform the merge even if I could.

I'm hopeful that the patch will solve this problem, but given that we don't fully understand the underlying issue (e.g. what has changed in the past four days, why the crash is occurring in the first place), I'm not taking it for granted.

mdittmer commented 6 years ago

Do we have the resources to do a bisect over WPT commits in the neighbourhood of the ¼ mark in the first graph in a day or so? @jugglinmike do you think the results of such a bisect would be useful?


whimboo commented 6 years ago

First, the patch won't help us reduce the execution time of the tests with Firefox, but it can help us figure out what exactly is crashing. Maybe this is one of the shutdown hangs we experience here, where the background thread monitor kills Firefox after the 60s maximum shutdown duration.

If you want to experiment and can reproduce this on Travis with a PR, maybe try checking the toolkit.asyncshutdown.crash_timeout preference and shortening the time until Firefox gets killed. If that makes the tests faster, then we have an indication.
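(A minimal sketch of that experiment, assuming the runner builds its Firefox profile with mozprofile, as the wptrunner stack does; the 10-second value is only an example:)

```python
from mozprofile import Profile

# Shorten the async-shutdown watchdog (the value is in milliseconds) so a
# hung Firefox shutdown is killed sooner; 10000 ms is an arbitrary example.
profile = Profile(preferences={
    "toolkit.asyncshutdown.crash_timeout": 10000,
})
print(profile.profile)  # path of the generated profile directory
```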

@jugglinmike, do you know which tests exactly take that long?

jugglinmike commented 6 years ago

Here's a graph with some more recent data:

[image: screenshot from 2018-09-10 12-53-02]

As of this weekend, outliers are exceeding the expected build time by an order of magnitude. This is a familiar pattern (see gh-545), and checking the AWS CPU utilization graphs confirms it. Here is a graph detailing per-machine CPU credit availability for the past week or so:

[image: screenshot from 2018-09-10 17-40-31]

I suspect that gh-592 created a CPU credit deficit. Just like in gh-545, we were unaware of this initially because we generally keep many credits in reserve. It wasn't until we spent that reserve that we began to witness the more general performance degradation described by this issue, one machine at a time. I was initially thrown off by Chrome's relatively stable performance: a problem like worker CPU throttling should affect Chrome and Firefox evenly. My guess is that Chrome's sub-30-minute build time makes it the least susceptible to reduced resources; it may even be able to complete testing before spending the few credits we manage to accrue between builds.

If that's right, then resolving gh-592 will address the deficit, allow the system to recover CPU credits, and ultimately return us to the build times we were enjoying at the end of August.
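(For reference, a minimal sketch of pulling the per-instance CPUCreditBalance metric behind the credit-availability graph above from CloudWatch with boto3; the instance ID is a placeholder:)

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# CPUCreditBalance is the metric plotted in the credit-availability graph
# above; "i-0123456789abcdef0" is a placeholder instance ID.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```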

(@whimboo Since this problem appears to be a symptom of gh-592, I'll respond to your comment about the crashes in that thread.)

jugglinmike commented 6 years ago

Do we have the resources to do a bisect over WPT commits in the neighbourhood of the ¼ mark in the first graph in a day or so? @jugglinmike do you think the results of such a bisect would be useful?

That could be useful. Unfortunately, we don't have the resources to run this kind of experiment (otherwise, I'd also suggest testing a few different releases of Nightly). I think, for now, we have to pin our hopes on the proposed workaround for the crash.

jugglinmike commented 6 years ago

I upgraded all workers to Ubuntu 18.04 on 2018-09-20, and as expected, we haven't experienced a Mir-related build error since then. Thanks for the tip, @jgraham!

jugglinmike commented 6 years ago

Almost forgot: here's the commit which implements the upgrade: https://github.com/web-platform-tests/results-collection/commit/c5dcfc317f51630ad182eef5f4f1d0947f80c023

whimboo commented 6 years ago

That's great to hear! So I assume all the extensive delays are gone now.