web-platform-tests / pulls.web-platform-tests.org

[Deprecated] Some of this functionality is now provided by wpt-pr-bot: https://github.com/web-platform-tests/wpt-pr-bot

add performance metrics to pulls dashboard #28

Closed · bobholt closed this 6 years ago

bobholt commented 6 years ago

cc @foolip

This adds a performance-tracking page at https://pulls.web-platform-tests.org/performance to demonstrate that pull requests are being tested in a timely manner. It is meant to aid in scoring the Google OKR that PRs are tested in each browser within 30 minutes, but it is generally useful for verifying the timeliness of the CI process.

[screenshot: metrics]

Features:

This also includes some cleanup of templates and config I encountered while testing.



foolip commented 6 years ago

Looks great! Did we really have a PR that took 7 days to get started?

bobholt commented 6 years ago

That's on my branch, and is because I restarted that job over and over again for a week. Travis CI re-uses job and build IDs (which we share), which will skew the percentage down in cases where builds have been re-triggered. But, as we discussed, that's a good thing if builds are re-triggered because of CI errors. I'll probably file an enhancement issue to move to our own IDs so that we can distinguish initial builds from rebuilds.
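As a rough illustration of what that could look like, here is a minimal sketch that approximates "initial build" latency by keeping only the first-seen record per re-used Travis build ID; the `build_id` field name and the record shape are hypothetical, not from the dashboard code:

```python
# Hypothetical sketch: approximate "initial build" latency by keeping
# only the first-seen record per re-used Travis build ID.
# The 'build_id' field name and record shape are illustrative.
def initial_builds(records):
    first_seen = {}
    for record in records:
        # setdefault keeps the earliest record for each build ID.
        first_seen.setdefault(record['build_id'], record)
    return list(first_seen.values())
```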

foolip commented 6 years ago

On the metrics question, I love that "Jobs completed in under 30 minutes" maps directly to our OKR scoring, and would like to land that first, at least until Q3 is over. But I also think that as we improve, the number will creep closer to 100%, and it'll cease to be meaningful for setting goals. Could you experiment with calculating the 50th percentile (median) and 90th percentile latencies?
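For reference, a minimal sketch of a nearest-rank percentile over job durations in minutes; the function and the sample data are illustrative, not taken from the dashboard code:

```python
import math

def percentile(latencies, pct):
    """Nearest-rank pct-th percentile of a non-empty list."""
    ordered = sorted(latencies)
    rank = math.ceil(pct / 100.0 * len(ordered))
    return ordered[max(rank - 1, 0)]

durations_min = [12, 18, 25, 31, 9, 27, 44, 16]  # made-up durations
print(percentile(durations_min, 50))  # median -> 18
print(percentile(durations_min, 90))  # p90    -> 44
```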

@Hexcles, do you have any thoughts on how to set and score OKRs based on latency, from your import/export OKR planning?

Hexcles commented 6 years ago

That looks quite neat!

@foolip Regarding latency-based KRs, we currently use two variants in Q3 (sketched in code after the list):

  1. all jobs under X min (score = % of jobs under X min)
  2. 50% jobs (median) under Y min (score = min(1, 2 * % of jobs under Y min))
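To make the two scoring formulas concrete, a hypothetical sketch in Python (function names are illustrative, not from any existing tooling):

```python
def score_all_under(latencies, x_min):
    # Variant 1: score = fraction of jobs finishing under X min.
    return sum(1 for m in latencies if m < x_min) / len(latencies)

def score_median_under(latencies, y_min):
    # Variant 2: score = min(1, 2 * fraction under Y min); the score
    # reaches 1.0 exactly when the median drops under Y min.
    under = sum(1 for m in latencies if m < y_min)
    return min(1.0, 2.0 * under / len(latencies))
```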

Some thoughts:

I haven't materialized the actual numbers for import/export yet, but will propose some next week.

foolip commented 6 years ago

@Hexcles, thanks, that's very helpful. I think the "all jobs under X min" variant better captures what we're aiming for with import, export, and PR results alike: something highly reliable, where people can count on the delay we aim for. It's usually going to be much faster than that, though, so it wouldn't tell people what to typically expect.

Since we're not really free to change the shape of the delay distribution however we like, a single metric that we make slightly aggressive compared to past performance is probably OK. @bobholt, WDYT?

foolip commented 6 years ago

@jgraham is OOO; @gsnedders or I will review this when it's done.

foolip commented 6 years ago

@bobholt, I saw you pushed some changes; is this ready for review?

bobholt commented 6 years ago

I have added a bunch to this PR:

There's a lot there now. You can check it all out at https://pulls-staging.web-platform-tests.org/performance?start=2017-08-15&end=2017-09-30. This uses data dumped from the production database yesterday morning, so it should be a fairly accurate representation of what it will look like in production.
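For anyone curious how the start/end query parameters in that URL might be handled, here is a minimal self-contained sketch assuming a Flask app; the route body is illustrative, not the actual dashboard code:

```python
from datetime import date, timedelta

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/performance')
def performance():
    # Default to the last 30 days when no explicit range is given.
    today = date.today()
    start = request.args.get('start') or (today - timedelta(days=30)).isoformat()
    end = request.args.get('end') or today.isoformat()
    # A real handler would query job records in [start, end] here.
    return jsonify({'start': start, 'end': end})
```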