web-platform-tests RFCs

RFC 114: Change results aggregation on wpt.fyi and visualization on Interop-20** views #114

Closed DanielRyanSmith closed 1 year ago

DanielRyanSmith commented 2 years ago

This is a proposal for a larger change to the results view of wpt.fyi.

Screenshot examples (with links)

Tests with harness errors display warnings with title text next to results (link)

[screenshot]

Interop-2022 results are viewed with an aggregation that is more indicative of actual interop-2022 scores (link)

[screenshot]

Harness status will no longer count toward subtest totals (link)

The run in the left column is aggregated using the new method, compared to the old totals displayed in the right column. This change more accurately represents the scope of test failures.

[screenshot]
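To make the aggregation difference concrete, here is a rough, illustrative sketch of the two methods. This is not the wpt.fyi implementation; the status strings and result structure are assumptions for illustration only:

```python
# Illustrative sketch only — not wpt.fyi's actual aggregation code.

# Old method: the harness status counted as one extra "subtest", so a test
# with 3 of 4 subtests passing and an OK harness status showed as 4 / 5.
def old_counts(harness_status, subtest_statuses):
    passing = sum(s == "PASS" for s in subtest_statuses)
    return passing + (harness_status == "OK"), len(subtest_statuses) + 1

# New method: only subtests are counted; each test contributes the fraction
# of its subtests that pass, which then rolls up into directory-level scores.
def new_score(subtest_statuses):
    if not subtest_statuses:
        return 0.0
    return sum(s == "PASS" for s in subtest_statuses) / len(subtest_statuses)

print(old_counts("OK", ["PASS", "PASS", "PASS", "FAIL"]))  # (4, 5)
print(new_score(["PASS", "PASS", "PASS", "FAIL"]))         # 0.75
```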

zcorpan commented 2 years ago

The proposed changes make sense to me.

Does this change affect the scoring for Interop 2022?

foolip commented 2 years ago

Does this change affect the scoring for Interop 2022?

No, but I would like us to pick a scoring method that is as close as possible. And if it's not exactly the same, I'd like to use the scoring method we pick here for Interop 2023.

foolip commented 2 years ago

@DanielRyanSmith I love how this resolves https://github.com/web-platform-tests/wpt.fyi/issues/421, nice!

On the precision, I'm leaning towards no decimal points, because it's more compact, and a difference of <1% rarely matters.

jgraham commented 2 years ago

I haven't looked at this in detail, but I'm concerned that we originally avoided percentage scores because they're so easy to misinterpret into some kind of ranking. They're also lossy; you can't tell whether a score of 50% is based on passing 1/2 tests or 500/1000. The latter is at least suggestive of a more complete testsuite.

For Interop-* we are putting a lot of care into selecting the tests so that everyone agrees that it's OK to publish this kind of simplified score. I'm not sure who the intended audience is for doing that for all tests; I don't see that it's especially helpful to browser developers who I consider to be the main target for the raw test scores.

foolip commented 2 years ago

@jgraham note that the number of tests is preserved next to directory names. In an earlier prototype it was next to the percentage, but since it's the same for each cell it was moved outside. Some other way of showing the number of tests would be OK too, I think.

DanielRyanSmith commented 2 years ago

@zcorpan

Does this change affect the scoring for Interop 2022?

This will not affect Interop 2022 scores, but it will move wpt.fyi a bit closer to mirroring them.

@jgraham

They're also lossy; you can't tell whether a score of 50% is based on passing 1/2 tests or 500/1000. The latter is at least suggestive of a more complete testsuite.

This is true for a single test with multiple subtests. Like @foolip mentioned, an attempt to mitigate this has been to make the number of tests in a directory visible next to the path name. Not displaying subtests is a potential risk, but hopefully mitigated by the test numbers display.

davidsgrogan commented 2 years ago

I'd like to use the staging instance to see what this long diff query looks like, but I get an error when I try to just substitute the staging domain into that url

'diff' was requested, but is only valid when comparing two runs.

Are those runs specified at the end of the URL not available to the staging instance?

DanielRyanSmith commented 2 years ago

I'd like to use the staging instance to see what this long diff query looks like

Thanks @davidsgrogan for taking a look 😊 - the original plan was to only change the results view to percentages and leave the diff query view with its current implementation. Perhaps this is something that people have opinions about?

The view you linked is not available because the run IDs differ between the staging and production environments, but the view should look mostly the same as it currently is in production.

foolip commented 2 years ago

Not displaying subtests is a potential risk, but hopefully mitigated by the test numbers display.

I think that counting subtests in aggregate, like for all of css/ or editing/ doesn't make sense, and am glad to see that go away. But if anyone thinks it's important to continue to show subtest counts at the test level, an option would be to continue to show "3 / 4" instead of 75% for the cell for an individual test.

zcorpan commented 2 years ago

Or for tests that have subtests, show the number of subtests next to the filename

foolip commented 2 years ago

The tricky part of that is that the number of subtests can be different between browsers, in cases like these:

https://wpt.fyi/results/media-capabilities/decodingInfo.webrtc.html?label=master&label=experimental&product=chrome&product=firefox&product=safari&aligned&q=status%3Amissing

https://wpt.fyi/results/touch-events/multi-touch-interactions.html?label=master&label=experimental&product=chrome&product=firefox&product=safari&aligned

We can just pick the biggest number to show, since it would be purely informational and not used for scores. I think that'd be OK, and it's how the single "Showing 18 subtests" count in the blue banner is currently computed.
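A minimal sketch of that "pick the biggest number" idea, with made-up per-browser counts (the data and names here are purely illustrative):

```python
# Illustrative only: when browsers report different subtest counts for the
# same test, display the largest count, as the blue banner does today.
subtest_counts = {"chrome": 18, "firefox": 16, "safari": 17}  # hypothetical data
print(f"Showing {max(subtest_counts.values())} subtests")  # "Showing 18 subtests"
```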

@DanielRyanSmith do you think displaying that next to test names would be OK? What do we do about names that are too long? (This can also happen with directories, but much more rarely.)

DanielRyanSmith commented 2 years ago

Or for tests that have subtests, show the number of subtests next to the filename

This is a quick and easy addition, and here is an example of the change. If any subtests exist in the test, the number will be displayed next to the name.

In terms of text wrapping, it would happen the same as it does for long test file names now. Here is an example of long test names. Some are already wrapping without the new subtest numbers.

One possible issue I could see is if there is a need to differentiate the numbers displayed for a directory (tests) from the numbers displayed for a single test (subtests), but it seems relatively intuitive to me. Great suggestion!

davidsgrogan commented 2 years ago

Thanks @davidsgrogan for taking a look 😊 - the original plan was to only change the results view to percentages and leave the diff query view with its current implementation. Perhaps this is something that people have opinions about?

I have opinions on it! 😊

I will need to calculate the Interop 2022 score difference caused by engines changing behavior for a large subset of tests. It will be absurdly tedious to do manually based on the existing diff view because of subtests. So showing score change in the diff view would eliminate a lot of manual calculation on my part.

Optimally we could have a toggle button that lets users switch the view between score difference and the existing view that shows # test differences, but maybe that would add too much complexity on your side?

mfreed7 commented 2 years ago

Overall, thanks for implementing this! It'll be helpful as we work through improving Interop2022 scores.

On the precision, I'm leaning towards no decimal points, because it's more compact, and a difference of <1% rarely matters.

I think we need to keep the decimal precision. In the third screenshot example, "backgrounds" has 480 sub-tests and a pass rate of 99.8%, meaning a "combined" score of 480 × 0.998 ≈ 479. So there's a full point to pick up there, which wouldn't be visible if you rounded that to 100%.
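As a quick illustration of that rounding concern, assuming 479 of 480 subtests pass:

```python
# Illustrative only: one decimal place keeps the remaining failure visible.
passed, total = 479, 480
print(f"{passed / total:.0%}")  # "100%"  — the last failing subtest disappears
print(f"{passed / total:.1%}")  # "99.8%" — the gap stays visible
```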

DanielRyanSmith commented 2 years ago

@davidsgrogan

I have opinions on it! 😊

Your pain point is an important one, and is tangentially related. I think it would be very useful to describe your situation in an issue on wpt.fyi if it's not already there. Because it's a separate view with a use case that is typically different than this results view it might take some more discussion, but this feels like very useful information.

foolip commented 2 years ago

@jgraham this change is related to (motivated by) these wpt.fyi issues, some very old, some new:

The intention is of course that it should be an improvement for users of wpt.fyi, and not regress any important use cases.

That last issue is about being able to browse (and sort) the Interop 2022 tests. The original suggestion was a fraction like "43.95 / 90", and that would actually work better when drilling down into a part of the test suite, since a percentage would be normalized at each level, while a test count would not.

One could split this all up into the following considerations:

Which parts of this would be fine and which need more discussion?

jgraham commented 2 years ago

I think that wanting the results on wpt.fyi to look like Interop 2022 scores is unnecessarily constraining the solution space. If we drop that requirement then there are more options to surface relevant information in a more usable way, without it having to look like a single numerical score.

For reftests I agree that it makes sense to inline the actual test result in the directory level view.

For tests with subtests, the limiting assumption seems to be that we have to combine both the overall harness status and the subtest scores into a single meaningful number. I don't think that's true. For example, for tests with both some passing subtests and a non-OK status we could display the result like "6/7 ⚠", with title text on the warning triangle to convey the problematic status. For directory-level views we could display the test count, the subtest pass rate, and also the warning triangle with details of any non-OK overall statuses. Maybe clicking the triangle could take you through to a view of just the problematic tests.

From that thought, I wonder if we should actually be displaying the subtest failure count rather than the pass count, and making it possible to click through to a view only containing the failures (or tests containing failures at the directory level). Maybe that's too confusing, and I'm certainly not wedded to the failure-triangle suggestion; I'm all in favour of a better way to convey the same information. But in general, setting up the dashboard to be useful to developers looking for problems that need to be solved involves surfacing different information than setting it up as a source of metrics.
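For illustration, a sketch of what that per-cell format could look like (the helper, statuses, and layout here are assumptions, not an actual wpt.fyi API):

```python
# Illustrative sketch of the "6/7 ⚠" idea: show the subtest fraction and
# append a warning marker when the overall harness status is not OK.
def format_cell(harness_status, passing, total):
    cell = f"{passing} / {total}"
    if harness_status != "OK":
        cell += " ⚠"  # title text on the marker would carry the actual status
    return cell

print(format_cell("ERROR", 6, 7))  # "6 / 7 ⚠"
print(format_cell("OK", 7, 7))     # "7 / 7"
```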

I also understand that there are requests to be able to reflect Interop metrics directly in the wpt.fyi display. Generally I think this kind of thing is a bit misguided; knowing that fixing case A will improve the score by 0.5% whereas B will improve it by 1% doesn't mean that B is going to be twice as useful as A; we just haven't calibrated the scores down to that level, and I don't think we could even if we tried. That's different from performance benchmarks, where a 1% win is pretty much twice as desirable as a 0.5% win.

Of course once you have a metric that's a proxy for some underlying goal (e.g. "allow developers to use without running into interop problems") people are going to want to optimise for the metric rather than the goal. I don't think we should encourage that, but to the extent that people disagree with me there, I think we should at least make sure that any view designed to support the metric is gated on only being available when displaying tests that are part of the metric. So anything specifically for surfacing Interop 2022 scores should be contingent on first having filtered down to one of the interop-2022 labels.

mfreed7 commented 2 years ago

I think that wanting the results on wpt.fyi to look like Interop 2022 scores is unnecessarily constraining the solution space. If we drop that requirement then there are more options to surface relevant information in a more usable way, without it having to look like a single numerical score.

As a browser implementer who raised one of the issues leading to this change, I just wanted to chime in that I do find changes along these lines helpful. The old display improperly (in my opinion) weighted sub-test failures over test-file failures. My request (which I believe this change fixes) was to report fractional files-passing statistics, rather than subtest-passing statistics. I'm generally agnostic to the many different ways that can be achieved, but I do think it's an important change to make. This was motivated by Interop2022, which, for the first time, set out a methodology for how to score a set of tests. While we don't need to be tied to that exact methodology, I think it is more valuable to use that approach than to stick to just counting sub-tests.

I also understand that there are requests to be able to reflect Interop metrics directly in the wpt.fyi display. Generally I think this kind of thing is a bit misguided; knowing that fixing case A will improve the score by 0.5% whereas B will improve it by 1% doesn't mean that B is going to be twice as useful as A; we just haven't calibrated the scores down to that level, and I don't think we could even if we tried. That's different from performance benchmarks, where a 1% win is pretty much twice as desirable as a 0.5% win.

Yes, Goodhart's law applies. But Goodhart doesn't offer a better solution to the problem. We need to measure things to try to improve them, and to do that, we need a well-defined metric. Certainly the Interop2022 methodology isn't perfect, but it is very practically helpful. Compat2021 was fairly successful I think, using a similar methodology.

DanielRyanSmith commented 2 years ago

Hello everyone, I've drafted some changes to the view and have made it available to test in the staging environment.

To summarize:

Some views that have a new look as a result:

I think this looks like a promising direction, but criticisms are encouraged here. Thanks @jgraham for suggesting parts of this, and I think this is closer to your original suggestion if I'm not mistaken @mfreed7. I think we're moving closer to something with the most utility for users. 😎

mfreed7 commented 2 years ago

I think this is closer to your original suggestion if I'm not mistaken @mfreed7. I think we're moving closer to something with the most utility for users. 😎

This definitely still solves my problem. And I think I actually like it better - there's more information available now. Not just the fraction of tests passing but also the total number, in the same place.

I do kind of think the parentheses might not be enough to distinguish these. The difference between "19 / 19" and "(10 / 10)" is pretty subtle, especially if you haven't been in this conversation. I wish I had a great alternative suggestion to make, but I don't. Perhaps it's too verbose, but what about "0.8/1 (8/10 subtests)" or something like that?

zcorpan commented 2 years ago

Is there a need to differentiate? The fact that it's a folder vs a file seems clear to me. You can click through to get detailed results of the subtests.

foolip commented 2 years ago

Thanks for the update @DanielRyanSmith!

I think it will be hard to distinguish between test and subtest fractions in a way that is clear without explanation. It's already pretty clear what's a directory and what is a test, but neither that nor the parentheses make it obvious what the fractions mean.

Maybe we should have a "how to read these results" explanation somewhere? And/or we could have a tooltip that says "8 of 10 subtests pass"?

foolip commented 2 years ago

On harness errors, I'm not sure about the warning triangles. I would prefer to inline the status into test listing, making it even more obvious that there's a problem. But this depends on how scores are aggregated, and makes the most sense if we count a test as 0 if there's a harness error, which I think we should since the test didn't run to completion.
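For illustration only, if that policy were adopted the per-test score from the earlier sketch would change roughly like this (again an assumption, not the proposal's actual code):

```python
# Illustrative variant: treat a non-OK harness status as a score of 0,
# on the grounds that the test did not run to completion.
def score_with_harness_penalty(harness_status, subtest_statuses):
    if harness_status != "OK" or not subtest_statuses:
        return 0.0
    return sum(s == "PASS" for s in subtest_statuses) / len(subtest_statuses)

print(score_with_harness_penalty("ERROR", ["PASS", "PASS", "FAIL"]))  # 0.0
print(score_with_harness_penalty("OK", ["PASS", "PASS", "FAIL"]))     # ~0.67
```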

DanielRyanSmith commented 2 years ago

Hello everyone - I have been OOO and haven't been able to respond here. I was hoping the wording in the blue banner would shed some light on how to interpret the subtest numbers:

[screenshot]

@jgraham do you have any further opinions on this implementation? Soon I'll update the RFC proposal with more info and focus on the user we're targeting here (which has been the browser engineers).

jgraham commented 2 years ago

I had a discussion about this with @foolip @gsnedders and @past

My conclusions:

DanielRyanSmith commented 2 years ago

I still think it's very misleading to make the entire dashboard look like an interop metric. Changes to display something that looks like Interop-202x scores should be gated on being in a mode that only shows the labelled tests which we've agreed are suitable for use in that kind of metric. For all other cases, the fractional scores aren't adding any useful information, but are confusing.

There seems to be a consensus that this new view would be more useful specifically for Interop scores, so I've been working to make that so.

I'm happy with not counting harness scores as part of the overall summary (for test types with subtests), but think we should bubble overall status information upwards in some way (e.g. the warning triangles). I think it's actively unhelpful to "score" as 0 those tests which run some subtests; that's a clear example of where the current proposal makes a tradeoff in favour of something that looks like a score rather than something that actually helps browser engineers understand the state of the test.

There also seems to be a consensus on not counting the harness status toward the subtest count and displaying harness errors above in some way. 👌 I don't think I explained this very well, but my rationale for marking tests with a harness error as 0 was less about mirroring the Interop scoring and more about increasing the visibility of possibly larger failures (a test marked as 0 will catch more eyes than one with a better pass rate, because missing tests aren't weighted). However, making the harness error more visible above is a solid middle ground there.

I've kept in the warning triangles with title text currently.

For tests without subtests (reftest, crashtests), having seen the current state I think we should present the result in directory views as a score out of 1, rather than directly as PASS, FAIL, etc. Mixing the two kinds of presentation in a single view is visually noisy.

I agree that the mixture of numbers and PASS/FAIL can look cluttered in some views where they show together. I've removed this from all views as of now.

In terms of helping browser engineers, I think we'd get more mileage out of making some common filters easily accessible (e.g. "all tests in the current directory that don't pass in [browser]" or "[browser]-specific failures"), rather than putting more effort into tweaking the presentation of the existing data. Although it's currently possible to construct queries for those cases, it's not trivial, and so I suspect we're leaving a lot of value on the table.

Perhaps this is something to look into for the future if there is enough demand - I think it sounds like a valuable idea.


Here is a demonstration of the newest iteration. Our staging environment has some resource issues so an error message might occur, but here is an example of an Interop score with:

Adding the view param to the query string switches to this view as desired. The subtest view will be the default everywhere, but the links on the Interop 2022 page will direct to a different view (I've set them to use the percentage view because that is how scores are displayed on the Interop 2022 page).
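For example (the parameter value here is an illustrative assumption, not a confirmed name), a link like `https://wpt.fyi/results/css?q=label%3Ainterop-2022&view=interop` would select the percentage view, while the same URL without the `view` parameter would fall back to the default subtest view.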

The main page should look mostly unchanged with the exception of removing the harness status from subtest aggregation and having harness error notices.

Thanks for the detailed thoughts @jgraham 😊 Maybe having these alternate hidden views and adding them where it makes sense is a workable approach.

jgraham commented 2 years ago

I haven't fully looked at this, but a quick comment: I'd really like views like the percentage view to be gated on having a filter that selects interop-* labels, i.e. if you don't have such a filter the view parameter should be ignored (alternatively the view parameter could apply a prefix to the specified filter so that only interop-* tests are visible). I know the generic filtering mechanism makes this tricky, but I think it's important that the policy is actually enforced in the code.
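A sketch of the kind of gating being asked for here (a hypothetical helper, not wpt.fyi's actual query handling): ignore the percentage view parameter unless the query already filters on an interop-* label.

```python
# Illustrative only: fall back to the default view when the requested
# percentage view is not accompanied by an interop-* label filter.
import re

def effective_view(query_params):
    requested = query_params.get("view", "subtest")
    has_interop_filter = re.search(r"label:interop-\d{4}", query_params.get("q", ""))
    if requested == "interop" and not has_interop_filter:
        return "subtest"
    return requested

print(effective_view({"view": "interop", "q": "label:interop-2022"}))  # "interop"
print(effective_view({"view": "interop"}))                             # "subtest"
```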

DanielRyanSmith commented 2 years ago

I've updated the RFC, as well as the screenshots and links in the PR description (see these for examples of the current iteration). I believe I've responded to the concerns that have arisen at this point. Are there any more items anyone would like to raise?

foolip commented 2 years ago

If there is no more feedback on this by Friday I'll consider this RFC accepted and merge it. (Per our process.)

gsnedders commented 2 years ago

If there is no more feedback on this by Friday I'll consider this RFC accepted and merge it. (Per our process.)

I disagree that's our process; I think it's clear there was substantive disagreement from @jgraham's comments above (and I broadly agree with them).

foolip commented 2 years ago

@gsnedders the requested changes were made though, right? Do you have additional changes you'd like to see?

foolip commented 2 years ago

The part of the process I was referring to is this step:

When no new information is forthcoming, a decision should be made within 1 additional week.

This isn't explicit about what happens if the person requesting a change doesn't get back to the RFC within a week. An out-of-band ping would be helpful, and I didn't do that here. I'll do that now.

foolip commented 2 years ago

@gsnedders do you want to review/approve as well, given https://github.com/web-platform-tests/rfcs/pull/114#issuecomment-1171747244?

foolip commented 1 year ago

@gsnedders said in https://github.com/web-platform-tests/wpt/issues/34752#issuecomment-1179209069 they'd be out this week, so I'll go ahead and declare this RFC accepted due to a lack of further feedback. Sam, feel free to raise any issues when you're back; nothing is set in stone.