web-platform-tests / pulls.web-platform-tests.org

[Deprecated] Some of this functionality is now provided by wpt-pr-bot: https://github.com/web-platform-tests/wpt-pr-bot

add performance metrics to pulls dashboard #28

Closed · bobholt closed this 6 years ago

bobholt commented 6 years ago

cc @foolip

This adds a performance-tracking page at https://pulls.web-platform-tests.org/performance to demonstrate that pull requests are being tested in a timely manner. It is meant to aid in scoring the Google OKR that PRs are tested in each browser within 30 minutes, but it is generally useful for verifying the timeliness of the CI process.

[screenshot: metrics]

Features:

This also includes some cleanup of templates and config I encountered while testing.



foolip commented 6 years ago

Looks great! Did we really have a PR that took 7 days to get started?

bobholt commented 6 years ago

That's on my branch, and is because I restarted that job over and over again for a week. Travis CI re-uses job and build IDs (which we share), which will skew the percentage down in cases where builds have been re-triggered. But, as we discussed, that's a good thing if builds are re-triggered because of CI errors. I'll probably file an enhancement issue to move to our own IDs so that we can distinguish initial builds from rebuilds.
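As a rough illustration of what that could look like, here is a minimal sketch that approximates "initial build" latency by keeping only the first-seen record per re-used Travis build ID; the `build_id` field name and the record shape are hypothetical, not from the dashboard code:

```python
# Hypothetical sketch: approximate "initial build" latency by keeping
# only the first-seen record per re-used Travis build ID.
# The 'build_id' field name and record shape are illustrative.
def initial_builds(records):
    first_seen = {}
    for record in records:
        # setdefault keeps the earliest record for each build ID.
        first_seen.setdefault(record['build_id'], record)
    return list(first_seen.values())
```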

foolip commented 6 years ago

On the metrics question, I love that "Jobs completed in under 30 minutes" maps directly to our OKR scoring, and would like to land that first, at least until Q3 is over. But I also think that as we improve, the number will creep closer to 100%, and it'll cease to be meaningful for setting goals. Could you experiment with calculating the 50th percentile (median) and 90th percentile latencies?
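For reference, a minimal sketch of a nearest-rank percentile over job durations in minutes; the function and the sample data are illustrative, not taken from the dashboard code:

```python
import math

def percentile(latencies, pct):
    """Nearest-rank pct-th percentile of a non-empty list."""
    ordered = sorted(latencies)
    rank = math.ceil(pct / 100.0 * len(ordered))
    return ordered[max(rank - 1, 0)]

durations_min = [12, 18, 25, 31, 9, 27, 44, 16]  # made-up durations
print(percentile(durations_min, 50))  # median -> 18
print(percentile(durations_min, 90))  # p90    -> 44
```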

@Hexcles, do you have any thoughts on how to set and score OKRs based on latency, from your import/export OKR planning?

Hexcles commented 6 years ago

That looks quite neat!

@foolip Regarding latency-based KRs, we currently use two variants in Q3 (sketched in code after the list):

  1. all jobs under X min (score = % of jobs under X min)
  2. 50% jobs (median) under Y min (score = min(1, 2 * % of jobs under Y min))
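To make the two scoring formulas concrete, a hypothetical sketch in Python (function names are illustrative, not from any existing tooling):

```python
def score_all_under(latencies, x_min):
    # Variant 1: score = fraction of jobs finishing under X min.
    return sum(1 for m in latencies if m < x_min) / len(latencies)

def score_median_under(latencies, y_min):
    # Variant 2: score = min(1, 2 * fraction under Y min); the score
    # reaches 1.0 exactly when the median drops under Y min.
    under = sum(1 for m in latencies if m < y_min)
    return min(1.0, 2.0 * under / len(latencies))
```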

Some thoughts:

I haven't materialized the actual numbers for import/export yet, but will propose some next week.

foolip commented 6 years ago

@Hexcles, thanks, that's very helpful. I think the "all jobs under X min" variant better captures what we're aiming for with import, export, and PR results alike: something highly reliable, where people can count on the delay we aim for. It's usually going to be much faster than that, though, so it wouldn't tell people what to typically expect.

Since we're not really free to change the shape of the delay distribution however we like, a single metric that we make slightly aggressive compared to past performance is probably OK. @bobholt, WDYT?

foolip commented 6 years ago

@jgraham is OOO; @gsnedders or I will review this when it's done.

foolip commented 6 years ago

@bobholt, I saw you pushed some changes; is this ready for review?

bobholt commented 6 years ago

I have added a bunch to this PR:

There's a lot there now. You can check it all out at https://pulls-staging.web-platform-tests.org/performance?start=2017-08-15&end=2017-09-30. This uses data dumped from the production database yesterday morning, so it should be a fairly accurate representation of what it will look like in production.
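For anyone curious how the start/end query parameters in that URL might be handled, here is a minimal self-contained sketch assuming a Flask app; the route body is illustrative, not the actual dashboard code:

```python
from datetime import date, timedelta

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/performance')
def performance():
    # Default to the last 30 days when no explicit range is given.
    today = date.today()
    start = request.args.get('start') or (today - timedelta(days=30)).isoformat()
    end = request.args.get('end') or today.isoformat()
    # A real handler would query job records in [start, end] here.
    return jsonify({'start': start, 'end': end})
```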