Re-scoring previous test runs causes confusion

jgraham commented 1 year ago

Recent changes to motion-path and URL tests caused a noticeable overall change in the Firefox score. This is fine; those tests changes were agreed and the score change was predictable. However, what caused some problems was that people saw an overall score of X on one day, and then on the next day saw that not only had the score dropped to < X, but that the graph suggested that the score had never been as high as X in the first place. That caused a lot of confusion.

This happens because we try to rescore previous runs as if we had the current test set (zero filling results for tests that weren't in the previous runs). That's reasonable; it means drops in the graph usually (but not always e.g. in the case that existing tests are edited to have different pass conditions) correspond to actual browser regression. But there are a couple of problems:

There's a lack of documentation explaining exactly what the system is. To actually understand what's going on in detail the best source is the scoring code itself (which is well commented!), but it's unreasonable to expect most people to find that.
The fact that old scores are silently changed makes it very difficult to quote a specific score. Consider a press article that says "at time of writing browser B had score Y". Then someone reading the article some time later tries to verify that, and finds a graph that shows B never getting a score of Y. In that situation the reader would likely conclude that the article author had made an error, rather than digging in to the discrepancy.

I don't think the rescoring system is necessarily bad, but I do think we need to do more to make it clear what's going on. In particular the following seems like it would help:

Clearly document that we re-score previous runs using the current set of tests, backfilling zero where the test doesn't exist, and explain the set of tradeoffs that led to this system.
Generate, and publish, the actual measured results for each run, with the test set at the time of the run. Functionally the backend for this would just be a new CSV file that we append the latest results to every time a new aligned run is processed. On the frontend, having some way to switch the graphs to show either the re-scored results, or the historic point-in-time results would make it much more transparent what's going on.

foolip commented 1 year ago

How about documenting this in a README in https://github.com/web-platform-tests/results-analysis/tree/main/interop-scoring and linking it from https://wpt.fyi/interop-2023?

For keeping historical results unchanged, an alternative way to achieve that is to move labels into WPT itself somehow, in a way where it's possible to get the labels either for an arbitrary commit or at least for all tagged commits. That would be a lot more work, however.

cc @DanielRyanSmith

jgraham commented 1 year ago

For keeping historical results unchanged, an alternative way to achieve that is to move labels into WPT itself somehow, in a way where it's possible to get the labels either for an arbitrary commit or at least for all tagged commits. That would be a lot more work, however.

I think there's a lot to be said for just putting all of the metadata directly in to web-platform-tests rather than having a separate repo. For example it would allow people to update tests and metadata in the same commit. But this would indeed mean revisting a lot of tooling that's based on the current separation. If we did this we could publish the combined metadata as an artifact and maybe do something similar to the results cache for long-term storage.

DanielRyanSmith commented 1 year ago

It is true that there's an inherent risk of "rewriting history" with the current scoring process we have in place. We're at least making progress in freezing the scoring of previous years, which should be live soon.

My fear with maintaining each score historically as they're written, rather than re-aggregating, is that we run the risk of solidifying scoring mistakes from non-finalized metadata or broken test suites. I could see a situation where metadata changes cause an increase in questions raised about, "Why has the score jumped drastically from yesterday?", and "What caused this score drop last week?" This could permanently add these scoring calibration changes to the historical data, and it's something we don't realize is happening much today because they're retroactively corrected.

I don't know the full risk of the above scenario, but it seems that metadata and test suite changes are not infrequent, even as the interop year progresses.

(I know this issue is not advocating for removal of the rescoring process - just documenting my thought process here)

How about documenting this in a README in https://github.com/web-platform-tests/results-analysis/tree/main/interop-scoring and linking it from https://wpt.fyi/interop-2023?

This seems like the easiest way to do some explaining about what's happening behind the scenes. Although I agree with @jgraham that people who notice scoring discrepancies will likely assume a blog post or the dashboard has made some mistake rather than reading deeper into the scoring process (and I am likely one of those people 😅).

jgraham commented 1 year ago

Right, the proposal is not to display the graphs of historic scores by default, but to have an option to display them instead of the current graph, so that in case someone does quote a score in a blog post or press article, and that score is later changed, there's a clear way to confirm that the number they gave was accurate at the time of publication.

DanielRyanSmith commented 1 year ago

Sorry, I realized I rambled more about the current scoring process more than the problem at hand in my above comment.

so that in case someone does quote a score in a blog post or press article, and that score is later changed, there's a clear way to confirm that the number they gave was accurate at the time of publication.

I agree that it would be useful to have some historical accuracy. I'm wondering if having any access to a different view of historical scores on the dashboard could serve to confuse a greater audience, since I imagine it's not easy to explain the discrepancies in these scores concisely to a general user.

There's a bit of a conflict in making this historical score easy to find, because ideally it's not exposed to users who don't need it since it could be confusing, but In the blog post scenario described, it would need to be easy enough to find that a user could determine the veracity of the blog post's score.

I'm not a UX expert here, so I will defer any suggestions outside of explaining our current process in a link on the dashboard, which seems easy and useful. I am indifferent about metadata storage location.

web-platform-tests / interop

Re-scoring previous test runs causes confusion #356