
Scoring: variants and multi-globals #256

Open gsnedders opened 1 year ago

gsnedders commented 1 year ago

One suggestion for webcodecs was to use all the tests which match the search video: https://wpt.fyi/results/webcodecs?label=master&label=experimental&aligned&view=subtest&q=video

However, due to extensive use of multi-global tests and variants, this ends up not working particularly well. Most obviously, webcodecs/videoDecoder-codec-specific.https.any.js ends up contributing about 20% of the overall score, because that one file generates ten of the 48 tests in total.

This isn't the first time we've had problems like this with our scoring, but I think this is a much more extreme case than we've had otherwise.
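For concreteness, a minimal sketch of the arithmetic (assuming, for the grouped case, that each of the remaining 38 tests comes from its own file):

```js
// Flat per-test weighting: ten generated tests from one source file
// dominate the 48-test total.
const totalTests = 48;
const testsFromOneFile = 10;
console.log(testsFromOneFile / totalTests); // ≈ 0.208, i.e. ~20% of the score

// If variants/globals were instead grouped per source file, that file
// would count once among 39 groups (48 - 10 + 1).
console.log(1 / (totalTests - testsFromOneFile + 1)); // ≈ 0.026, i.e. ~2.6%
```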

foolip commented 1 year ago

I agree this isn't great, and it also causes a bit of inflation in general for multi-global tests on wpt.fyi. There I've often thought that we should use the manifest and group these tests under the filename somehow, perhaps going so far as to call the filename "the test" and treat all of the variants as subtests. But that'd be a lot of work.
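To illustrate the combinatorics (the variant names below are hypothetical, chosen only so the counts line up with the webcodecs example above): a .any.js file runs once per global scope and once per variant, so its test IDs multiply.

```js
// One source file, videoDecoder-codec-specific.https.any.js, expands into
// (globals × variants) distinct test IDs on wpt.fyi.
const globals = ['.https.any.html', '.https.any.worker.html'];
const variants = ['?av1', '?vp8', '?vp9', '?h264_avc', '?h264_annexb']; // hypothetical
const testIds = globals.flatMap(g =>
  variants.map(v => `/webcodecs/videoDecoder-codec-specific${g}${v}`));
console.log(testIds.length); // 10 test IDs from a single file
```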

For the problem at hand, we could just not label the worker variants and reduce the size of the problem, but that isn't a very reusable approach...

gsnedders commented 1 year ago

I guess the challenge is our current implementation of the scoring is JS, versus the rest of the WPT infra (including all the manifest stuff) being Python… hmm.

must… not… rewrite… this… while… on… holiday…

foolip commented 1 year ago

I'm not aware of any hurdles we'll run into fetching and using the manifest from JS. The main issue is that it would be very slow. Storing all manifests in a tree-deduplicating setup more like https://github.com/web-platform-tests/results-analysis-cache would make it faster.
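As a sketch of what fetching it could look like (assuming wpt.fyi's /api/manifest endpoint, which serves the manifest JSON for a given commit; the manifest's testharness items are what map source files to their generated test URLs):

```js
// Fetch the WPT manifest for a commit SHA. Each full manifest is large,
// which is the slowness concern above; fetch follows the redirect and
// decompresses the response transparently.
async function fetchManifest(sha) {
  const resp = await fetch(`https://wpt.fyi/api/manifest?sha=${sha}`);
  if (!resp.ok) throw new Error(`manifest fetch failed: ${resp.status}`);
  return resp.json();
}
```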

foolip commented 1 year ago

This came up again in https://github.com/web-platform-tests/interop/issues/281. We have cases (URL) where we want to include some variants but not others, and that rules out the "clean" approach of labeling file names, using the manifest to figure out which test names to include, and treating each file as one test scored 0-1.

The more complex solution then is:

Label test names, but use the manifest to figure out which tests are defined in the same file. Treat those as a group and score them 0-1.

jgraham commented 1 year ago

So, the situation so far has been that each variant is its own top-level test for the purposes of scoring, and the proposal is that we define things based on the file rather than on the test id?

FWIW I don't feel especially strongly either way; I think "the score just doesn't quite match reality" is an inevitable feature of the setup. It's also possible to have the opposite problem: one file containing many subtests exercises a lot of the feature, whereas a few tests that were moved to separate top-level files only cover edge cases but end up dominating the scoring. But if people feel that, de facto, treating variants as a single test is the better tradeoff today, I think it's reasonable to change.

foolip commented 1 year ago

Summary from the notes:

When we have variants, we can group them using information from the manifest, score them individually, and divide by the number of variants in the group. Similarly for multi-global tests.

So yes, it would be based on the file, but importantly we need to handle the case where we've only labeled some of the variants or multi-global tests.

To be robust we need to use the manifest, so this isn't trivial to implement.

I also think it would be very good if we could do the same grouping on wpt.fyi, otherwise we can't make the interop score view match.
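A minimal sketch of that grouping, assuming `results` maps each labeled test ID to a 0-1 score and `sourceFileOf` resolves a test ID to its source file via the manifest (both are placeholders, not existing APIs):

```js
// Group labeled tests by source file; each group scores 0-1 as the mean
// of its members, so a file's variants/globals count once overall. Groups
// where only some variants are labeled average over just the labeled ones.
function groupedScore(results, sourceFileOf) {
  const groups = new Map();
  for (const [testId, score] of Object.entries(results)) {
    const file = sourceFileOf(testId);
    if (!groups.has(file)) groups.set(file, []);
    groups.get(file).push(score);
  }
  let total = 0;
  for (const scores of groups.values()) {
    total += scores.reduce((a, b) => a + b, 0) / scores.length;
  }
  return groups.size ? total / groups.size : 0;
}
```

Under this scheme, the ten variants of videoDecoder-codec-specific.https.any.js would collapse into one group and contribute roughly 2.6% rather than roughly 20%.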

foolip commented 1 year ago

We've discussed this in a meeting. We have a pretty good idea of what we'd change to address this, but nobody has been assigned to do the work.