Re-scoring previous test runs causes confusion #356

Open jgraham opened 1 year ago

jgraham commented 1 year ago

Recent changes to motion-path and URL tests caused a noticeable overall change in the Firefox score. That is fine in itself; those test changes were agreed upon and the score change was predictable. What caused problems was that people saw an overall score of X on one day, and on the next day saw not only that the score had dropped below X, but that the graph now suggested the score had never been as high as X in the first place. That caused a lot of confusion.

This happens because we re-score previous runs as if they had used the current test set, zero-filling results for tests that weren't present in those runs (sketched in code at the end of this comment). That's reasonable: it means drops in the graph usually correspond to actual browser regressions (though not always, e.g. when existing tests are edited to have different pass conditions). But there are a couple of problems:

I don't think the rescoring system is necessarily bad, but I do think we need to do more to make it clear what's going on. In particular, the following seems like it would help:
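To make the zero-filling concrete, here is a minimal sketch of the idea (this is not the actual results-analysis code; names and the scoring granularity are simplified):

```python
# Minimal sketch of the re-scoring behaviour described above: every historical
# run is re-scored against the *current* test list, and any test missing from
# an old run is zero-filled. Not the real results-analysis implementation.

from typing import Dict, List

def rescore_run(old_results: Dict[str, float], current_tests: List[str]) -> float:
    """Score a historical run as if it had used today's test set.

    old_results maps test name -> pass fraction (0.0..1.0) for that run;
    tests that did not exist at the time count as 0.0.
    """
    if not current_tests:
        return 0.0
    total = sum(old_results.get(test, 0.0) for test in current_tests)
    return total / len(current_tests)

# Example: adding (or renaming) a test today drags down yesterday's score,
# so the published history changes retroactively.
yesterday = {"css/motion-path/a.html": 1.0, "url/b.html": 1.0}
print(rescore_run(yesterday, ["css/motion-path/a.html", "url/b.html"]))                 # 1.0
print(rescore_run(yesterday, ["css/motion-path/a.html", "url/b.html", "url/c.html"]))   # ~0.67
```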

foolip commented 1 year ago

How about documenting this in a README in https://github.com/web-platform-tests/results-analysis/tree/main/interop-scoring and linking it from https://wpt.fyi/interop-2023?

As for keeping historical results unchanged, an alternative would be to move the labels into WPT itself somehow, in a way that makes it possible to get the labels either for an arbitrary commit or at least for all tagged commits. That would be a lot more work, however.
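To make that concrete, here is a rough sketch of what "labels for an arbitrary commit" could look like, assuming a hypothetical interop/labels.json file checked into wpt (the path and format here are made up):

```python
# Hedged sketch: if the interop labels lived inside the wpt repo itself, a
# scorer could read the labels as they existed at any given commit or tag,
# so historical runs would be scored with the labels of their own era.
# "interop/labels.json" is a hypothetical path, not something wpt has today.

import json
import subprocess

def labels_at_commit(wpt_checkout: str, commit: str,
                     path: str = "interop/labels.json") -> dict:
    """Return the (hypothetical) label file as it existed at `commit`."""
    blob = subprocess.run(
        ["git", "-C", wpt_checkout, "show", f"{commit}:{path}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(blob)

# e.g. score a run from a tagged wpt revision using the labels at that revision:
# labels = labels_at_commit("/path/to/wpt", "merge_pr_12345")
```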

cc @DanielRyanSmith

jgraham commented 1 year ago

> As for keeping historical results unchanged, an alternative would be to move the labels into WPT itself somehow, in a way that makes it possible to get the labels either for an arbitrary commit or at least for all tagged commits. That would be a lot more work, however.

I think there's a lot to be said for just putting all of the metadata directly into web-platform-tests rather than having a separate repo. For example, it would allow people to update tests and metadata in the same commit. But this would indeed mean revisiting a lot of tooling that's based on the current separation. If we did this, we could publish the combined metadata as an artifact and maybe do something similar to the results cache for long-term storage.
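As a very rough illustration of the "combined metadata artifact" idea, assuming a hypothetical layout where each area keeps a LABELS.json file next to its tests (not how wpt is organised today):

```python
# Sketch only: walk a wpt checkout, merge every (hypothetical) LABELS.json
# into one mapping of test path -> interop labels, and write it out as a
# per-commit artifact that could live alongside the results cache.

import json
import pathlib

def build_combined_metadata(wpt_root: str) -> dict:
    combined: dict = {}
    for label_file in pathlib.Path(wpt_root).rglob("LABELS.json"):
        area_labels = json.loads(label_file.read_text())
        for test, labels in area_labels.items():
            combined.setdefault(test, []).extend(labels)
    return combined

if __name__ == "__main__":
    artifact = build_combined_metadata("/path/to/wpt")
    pathlib.Path("combined-metadata.json").write_text(json.dumps(artifact, indent=2))
```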

DanielRyanSmith commented 1 year ago

It is true that there's an inherent risk of "rewriting history" with our current scoring process. We're at least making progress on freezing the scoring of previous years, which should be live soon.

My fear with keeping each score as it was originally written, rather than re-aggregating, is that we risk solidifying scoring mistakes from non-finalized metadata or broken test suites. I could see metadata changes prompting more questions like "Why has the score jumped drastically since yesterday?" and "What caused this score drop last week?" Those scoring-calibration changes would be baked into the historical data permanently; today we barely notice them happening because they're retroactively corrected.

I don't know the full risk of the above scenario, but it seems that metadata and test suite changes are not infrequent, even as the interop year progresses.

(I know this issue is not advocating for removal of the rescoring process - just documenting my thought process here)

> How about documenting this in a README in https://github.com/web-platform-tests/results-analysis/tree/main/interop-scoring and linking it from https://wpt.fyi/interop-2023?

This seems like the easiest way to explain what's happening behind the scenes, although I agree with @jgraham that people who notice scoring discrepancies will likely assume a blog post or the dashboard made a mistake rather than reading deeper into the scoring process (and I am likely one of those people 😅).

jgraham commented 1 year ago

Right, the proposal is not to display the graphs of historic scores by default, but to offer an option to display them instead of the current graph, so that if someone does quote a score in a blog post or press article, and that score is later changed, there's a clear way to confirm that the number they gave was accurate at the time of publication.

DanielRyanSmith commented 1 year ago

Sorry, I realize my comment above rambled more about the current scoring process than about the problem at hand.

> so that if someone does quote a score in a blog post or press article, and that score is later changed, there's a clear way to confirm that the number they gave was accurate at the time of publication.

I agree that it would be useful to preserve some historical accuracy. I'm wondering, though, whether exposing a second, different view of historical scores on the dashboard might confuse a wider audience, since I imagine the discrepancies between the two views aren't easy to explain concisely to a general user.

There's a bit of a tension in how easy to find this historical view should be: ideally it isn't exposed to users who don't need it, since it could be confusing, but in the blog post scenario described above it would need to be easy enough to find that a reader could verify the blog post's score.

I'm not a UX expert, so I'll defer on any suggestions beyond linking an explanation of our current process from the dashboard, which seems easy and useful. I'm indifferent about where the metadata is stored.