Talked with @noahgibbs and I think the next step to take is to remove the "stop on statistical significance" feature from this code path.
Here's what can happen: the data is really noisy but bounces into significance, so we stop comparing. If I had let it run, it might have bounced back out of significance.
Rather than making the developer decide ahead of time how many samples they want before the comparison begins, we should run the full duration of the test by default and output the statistical significance every X iterations, for example every 50 iterations. If a developer is trying to iterate and needs faster results, they can treat the intermediate output as a weak indicator that their efforts are working. If they want to know the "real" result, they let the whole thing run.
A "best" solution would be to also run the same A/B test against the same test twice to really make sure that it doesn't accidentally give us statistically significant results as well. Right now we have a codepath to guard against this change, but we could consider adding an env var to disable it when desired.
This was "solved" by adding histograms.
The goal of `perf:library` was to provide a pretty solid single point where benchmarks could be run and a single, trustworthy output could be generated. I've made progress on that goal, but we're not quite there yet. For example, I ran the same test 10 different times last night and here's what I saw (on code triage):
Executing it 10 times gave me the following (results of the 10 runs):
So these results aren't too bad. They're consistent in that the runs that say "statistically significant" all point in the right direction (they say the code got faster). Based on this set you could say "if it's not significant, discard the results," and you would end up with 3 tests showing the same thing, which is good.
However, here's an older case I had with the exact same code commits: it showed a statistically significant result, and that result did not agree with the other 3 tests.
Yikes, in the prior 10 runs even the statistically significant results are saying that it's only around a 1% difference while this is claiming things got 30% slower.
This was before I switched from average to median. Doing that makes the results less dramatic: 30% slower becomes 9.4% slower. But the problem persists, and the result is still very far from the "consensus" picture that running the tests 10 times shows.
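As a toy illustration of why that switch helps (the timings below are invented, not taken from any of these runs): a single slow outlier drags the average a long way, while the median barely moves.

```ruby
timings = [0.98, 0.99, 1.00, 1.01, 1.02, 4.50] # one big GC pause / noisy outlier

mean   = timings.sum / timings.length.to_f
median = timings.sort[timings.length / 2] # good enough for this toy case

puts "mean:   #{mean.round(3)}"   # 1.583 -- dominated by the outlier
puts "median: #{median.round(3)}" # 1.01  -- barely moves
```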
I think it's statistically significant because these are definitely two different curves; however, the outliers on the "new" data set are massive. Here is a comparison of tail numbers:
While we could run this `perf:library` test multiple times to see if the results agree, I want to only have to run this task once and get a fairly confident 👍 or 👎. My question is essentially: could we massage all of this data in a way so that the -29.1292% result is discredited (or at least known to be suspicious)? Or flagged somehow?

One thing that stood out to me here is that "old" has a variance of 0.16 while "new" has a variance of 3.5. While it's okay to say that newer is slower, we cannot conclusively say that it is 9.4% slower (which is a lot).
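One possible way to answer my own question, sketched below (this is not something `perf:library` does today, and the helper names, thresholds, and sample data are all made up): bootstrap a confidence interval for the percent difference of the medians, then treat a very wide interval, or one that crosses zero, as "don't trust the headline number." With variances as lopsided as 0.16 vs 3.5, the interval ought to come out wide, which is exactly the kind of signal that would mark the 9.4% figure as suspicious rather than conclusive.

```ruby
def median(xs)
  s = xs.sort
  s.length.odd? ? s[s.length / 2] : (s[s.length / 2 - 1] + s[s.length / 2]) / 2.0
end

# Resample with replacement -- a plain percentile bootstrap.
def resample(xs)
  Array.new(xs.length) { xs.sample }
end

def bootstrap_pct_diffs(old_times, new_times, rounds: 2_000)
  rounds.times.map do
    old_m = median(resample(old_times))
    new_m = median(resample(new_times))
    (new_m - old_m) / old_m * 100.0
  end.sort
end

# Fake data shaped roughly like the problem: "old" is tight, "new" has a heavy tail.
old_times = Array.new(200) { 1.00 + rand * 0.10 }
new_times = Array.new(200) { 1.00 + rand * 0.10 + (rand < 0.20 ? rand * 3.0 : 0.0) }

diffs    = bootstrap_pct_diffs(old_times, new_times)
lo, hi   = diffs[(diffs.length * 0.025).floor], diffs[(diffs.length * 0.975).floor]
headline = diffs[diffs.length / 2]

puts format("median diff: %+.1f%% (95%% CI %+.1f%% .. %+.1f%%)", headline, lo, hi)

# Thresholds are arbitrary -- the point is to flag, not to be exact.
suspicious = (hi - lo) > 5.0 || (lo < 0 && hi > 0)
puts(suspicious ? "flag: don't trust the headline number" : "interval is tight: headline number looks trustworthy")
```

The exact technique matters less than the shape of the output: report a range (and a warning when the range is huge) instead of a single percent number.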
Here's the "good" result set to compare to: https://www.dropbox.com/s/awlb395xnsl8f6g/derailed-11-runs.zip?dl=0 (with the "bad" data also annotated).
Here's a spreadsheet showing the distribution of the "bad" data: https://docs.google.com/spreadsheets/d/19W5PP656bns9K7Oo4bPcu-F7K4IVn0mS306sih9IBMk/edit?usp=sharing