Talked with @noahgibbs and I think the next step to take is to remove the "stop on statistical significance" feature from this code path.
Here's what can happen: the data is really noisy but bounces into significance, so we stop comparing. If I had let it run, it might have bounced back out of significance.
Rather than making the developer decide ahead of time how many samples they want before the comparison begins, we should run the full duration of the test by default and output the statistical significance every X iterations, for example every 50 iterations. If a developer is trying to iterate and needs faster results, they can treat the intermediate output as a weak indicator that their efforts are working. If they want to know the "real" result, they let the whole thing run.
A "best" solution would be to also run the same A/B test against the same test twice to really make sure that it doesn't accidentally give us statistically significant results as well. Right now we have a codepath to guard against this change, but we could consider adding an env var to disable it when desired.
This was "solved" by adding histograms.
The goal of `perf:library` was to provide a pretty solid single point where benchmarks could be run and a single, trustworthy output could be generated. I've made progress on that goal, but we're not quite there yet. For example, I ran the same test 10 different times last night and here's what I saw (on code triage):
Executing it 10 times gave me the following (results of the 10 runs):
So these results aren't too bad. They're consistent in that the runs that say "statistically significant" all point in the right direction (they say the code got faster). Based on this set you could say "if it's not significant, discard the results," and you would end up with 3 tests showing the same thing, which is good.
However, here's an older case I had with the exact same code commits: it showed a statistically significant result, and that result did not agree with the other 3 tests.
Yikes, in the prior 10 runs even the statistically significant results are saying that it's only around a 1% difference while this is claiming things got 30% slower.
This was before I switched from average to median. Doing that makes the results less dramatic: 30% slower becomes 9.4% slower. But the problem persists, and the result is still very far from the "consensus" picture that running the tests 10 times shows.
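As a toy illustration of why that switch helps (the timings below are invented, not taken from any of these runs): a single slow outlier drags the average a long way, while the median barely moves.

```ruby
timings = [0.98, 0.99, 1.00, 1.01, 1.02, 4.50] # one big GC pause / noisy outlier

mean   = timings.sum / timings.length.to_f
median = timings.sort[timings.length / 2] # good enough for this toy case

puts "mean:   #{mean.round(3)}"   # 1.583 -- dominated by the outlier
puts "median: #{median.round(3)}" # 1.01  -- barely moves
```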
I think it's statistically significant because these are definitely two different curves; however, the outliers on the "new" data set are massive. Here is a comparison of tail numbers:
While we could run this `perf:library` test multiple times to see if the results agree, I want to only have to run this task once and get a fairly confident 👍 or 👎. My question is essentially: could we massage all of this data in a way so that the -29.1292% result is discredited (or at least known to be suspicious)? Or flagged somehow?

One thing that stood out to me here is that "old" has a variance of 0.16 while "new" has a variance of 3.5. While it's okay to say that newer is slower, we cannot conclusively say that it is 9.4% slower (which is a lot).
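One possible way to answer my own question, sketched below (this is not something `perf:library` does today, and the helper names, thresholds, and sample data are all made up): bootstrap a confidence interval for the percent difference of the medians, then treat a very wide interval, or one that crosses zero, as "don't trust the headline number." With variances as lopsided as 0.16 vs 3.5, the interval ought to come out wide, which is exactly the kind of signal that would mark the 9.4% figure as suspicious rather than conclusive.

```ruby
def median(xs)
  s = xs.sort
  s.length.odd? ? s[s.length / 2] : (s[s.length / 2 - 1] + s[s.length / 2]) / 2.0
end

# Resample with replacement -- a plain percentile bootstrap.
def resample(xs)
  Array.new(xs.length) { xs.sample }
end

def bootstrap_pct_diffs(old_times, new_times, rounds: 2_000)
  rounds.times.map do
    old_m = median(resample(old_times))
    new_m = median(resample(new_times))
    (new_m - old_m) / old_m * 100.0
  end.sort
end

# Fake data shaped roughly like the problem: "old" is tight, "new" has a heavy tail.
old_times = Array.new(200) { 1.00 + rand * 0.10 }
new_times = Array.new(200) { 1.00 + rand * 0.10 + (rand < 0.20 ? rand * 3.0 : 0.0) }

diffs    = bootstrap_pct_diffs(old_times, new_times)
lo, hi   = diffs[(diffs.length * 0.025).floor], diffs[(diffs.length * 0.975).floor]
headline = diffs[diffs.length / 2]

puts format("median diff: %+.1f%% (95%% CI %+.1f%% .. %+.1f%%)", headline, lo, hi)

# Thresholds are arbitrary -- the point is to flag, not to be exact.
suspicious = (hi - lo) > 5.0 || (lo < 0 && hi > 0)
puts(suspicious ? "flag: don't trust the headline number" : "interval is tight: headline number looks trustworthy")
```

The exact technique matters less than the shape of the output: report a range (and a warning when the range is huge) instead of a single percent number.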
Here's the "good" result set to compare to: https://www.dropbox.com/s/awlb395xnsl8f6g/derailed-11-runs.zip?dl=0 (with the "bad" data also annotated).
Here's a spreadsheet showing the distribution of the "bad" data: https://docs.google.com/spreadsheets/d/19W5PP656bns9K7Oo4bPcu-F7K4IVn0mS306sih9IBMk/edit?usp=sharing