web-platform-tests RFCs

Enable view=test mode in wpt.fyi #190

Closed: jcscottiii closed this 3 weeks ago

jcscottiii commented 2 months ago

Enable view=test mode in wpt.fyi



mathiasbynens commented 2 months ago

I’m really excited about this proposal, as it solves the use case described in https://github.com/web-platform-tests/wpt.fyi/issues/3740 without affecting the default wpt.fyi view.

jgraham commented 2 months ago

This feels extremely sketchy to me in terms of the incentive structure it creates.

In particular, having entire tests show up as FAIL if one subtest is failing feels misleading, and is likely to incentivise people to either remove/not submit tests that have a single failing subtest, or arbitrarily split out the parts of the test that they don't pass into a separate test.

To some extent we already do this for Interop, of course, but it's a massive pain and really something that I'd like to change if possible. And Interop is already better because it gives partial credit for cases where some subtests are passing and others are failing.
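To make the two tallying schemes concrete, here is a minimal sketch (hypothetical Python, not wpt.fyi's actual scoring code) of a test-level rollup versus an interop-style partial-credit score for one test's subtest statuses:

```python
# Hypothetical illustration only, not wpt.fyi's actual scoring code.

def test_view_status(subtest_statuses):
    """Proposed view=test rollup: one failing subtest marks the whole test FAIL."""
    return "PASS" if all(s == "PASS" for s in subtest_statuses) else "FAIL"

def interop_style_score(subtest_statuses):
    """Interop-style partial credit: fraction of subtests that pass."""
    if not subtest_statuses:
        return 0.0
    return sum(s == "PASS" for s in subtest_statuses) / len(subtest_statuses)

subtests = ["PASS", "PASS", "FAIL", "PASS"]
print(test_view_status(subtests))     # FAIL  (the single failing subtest dominates)
print(interop_style_score(subtests))  # 0.75  (partial credit for the passing subtests)
```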

foolip commented 2 months ago

I like this simplified view, I'm often trying to understand the big picture in terms of tests that need attention. Currently, I have to use a status:!ok status:!pass query and look at the "Showing 172 tests" text at the top while navigating around to get an idea of where there are the most tests with failures.
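Concretely, that workflow means loading a URL along these lines (the exact parameter encoding here is my assumption; the same query can simply be typed into the wpt.fyi search box):

```
# Query-based workflow today (query taken from the comment above; encoding is approximate):
https://wpt.fyi/results/?label=master&label=experimental&aligned&q=status%3A%21ok%20status%3A%21pass
```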

It would resolve https://github.com/web-platform-tests/wpt.fyi/issues/421 but also for testharness.js failures. This view shows it nicely I think: https://staging.wpt.fyi/results/infrastructure/expected-fail?label=master&label=experimental&aligned&view=test

This will necessarily hide some information when there are multiple types of failures, but drilling down would still reveal it.

@jgraham I agree this simplification comes with downsides, but since the view can be useful I'd rather have it than not. I'd also like to have the interop-style calculation be possible for any set of tests. There's discussion in https://github.com/web-platform-tests/rfcs/pull/114, and IIRC the sticking point was percentages.

How about if we also added a view that's like view=interop but without the percentages? Then the results could be understood given your choice of tallying, while leaving the default unchanged.
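As a rough sketch of what such a view might tally for a directory (hypothetical Python; the test names and numbers are invented):

```python
# Invented example data: each test maps to its list of subtest statuses.
tests = {
    "a.html": ["PASS", "PASS"],
    "b.html": ["PASS", "FAIL", "FAIL"],
    "c.html": ["PASS", "PASS", "PASS", "FAIL"],
}

# view=test style tally: count whole tests with no failing subtests.
fully_passing = sum(all(s == "PASS" for s in subs) for subs in tests.values())

# Interop-style partial credit, reported as a fraction of tests rather than a percentage.
credit = sum(sum(s == "PASS" for s in subs) / len(subs) for subs in tests.values())

print(f"{fully_passing} / {len(tests)} tests fully passing")  # 1 / 3
print(f"{credit:.2f} / {len(tests)} with partial credit")     # 2.08 / 3
```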

jcscottiii commented 2 months ago

We met about this today; @jgraham will comment on this RFC.

jgraham commented 2 months ago

A concern I have here is that the original motivation is specifically "we want to track a target at the test level", and I worry that's going to cause problems. For example, what happens if we reach the end of the quarter and another contributor adds extra subtests to the existing tests, which makes it look like the goal has been missed? In some cases that could mean adding one subtest to many top-level tests (historical examples of this kind of thing would be idlharness tests that check where on the prototype chain attributes end up, or WebDriver classic tests for the interaction of user prompt handling with all other features). I'd predict in that case people end up being annoyed because they have the choice of either explaining why their dashboard now shows that they are failing all the tests, or unilaterally taking action to back out the change, or reorganising the tests to make their dashboard look right. I don't think any of these is a good option.
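A worked example of that effect, with invented numbers: suppose a team tracks 50 tests of 10 subtests each, all passing, and a contributor then adds one new, currently failing subtest to every test.

```python
# Invented numbers, purely to illustrate the tracking concern described above.
tests = 50
subtests_before = 10  # per test, all passing before the change

# After the change, each test gains one failing subtest.
subtest_pass_rate = (tests * subtests_before) / (tests * (subtests_before + 1))
tests_shown_as_pass = 0  # every test now has at least one failing subtest

print(f"subtest-level pass rate: {subtest_pass_rate:.0%}")                 # 91%
print(f"tests shown as PASS in view=test: {tests_shown_as_pass}/{tests}")  # 0/50
```

The subtest-level picture barely moves, while the test-level picture goes from all green to all red.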

For @foolip's infrastructure test example, I think that's markedly worse than the current default view, even though that's nearly the best case where each test has one subtest (it's not quite true because window-onload-test.html has 6, and one is passing in all browsers, which is itself somewhat unexpected in an expected-fail directory; the "test" view totally hides that fact).

If we want a way to find any test that has subtest failures, I think we should consider whether we can solve it with search options rather than a new view. I also think there could be mileage in adding more sorting options to wpt.fyi; e.g. being able to sort by subtest pass rate would make it possible to identify tests with lots of subtest failures, in case that's ever useful.
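A minimal sketch of that sorting idea (hypothetical Python with invented data; wpt.fyi does not currently offer this option):

```python
# Hypothetical: sort tests by subtest pass rate, lowest first, to surface
# the tests with the most subtest failures.
def pass_rate(subs):
    return sum(s == "PASS" for s in subs) / len(subs)

results = {
    "a.html": ["PASS", "FAIL", "FAIL"],
    "b.html": ["PASS", "PASS"],
    "c.html": ["PASS", "FAIL", "PASS", "PASS"],
}
for name, subs in sorted(results.items(), key=lambda kv: pass_rate(kv[1])):
    print(name, f"{pass_rate(subs):.0%}")
# a.html 33%
# c.html 75%
# b.html 100%
```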

mathiasbynens commented 2 months ago

> I'd predict in that case people end up being annoyed because they have the choice of either explaining why their dashboard now shows that they are failing all the tests, or unilaterally taking action to back out the change, or reorganising the tests to make their dashboard look right. I don't think any of these is a good option.

For what it’s worth, as the original requester of this feature, I don't think these are the only possible outcomes. The scenario of subtests being added has already happened several times with WebDriver BiDi tests in the past, and it was never a problem for us — if anything, it’s a good thing when a gap in test coverage is identified and addressed by adding more subtests. We do want to track our implementation progress against a set of tests over a long period of time, but that doesn’t mean there’s no room for nuance in our tracking.

jgraham commented 2 months ago

> We do want to track our implementation progress against a set of tests over a long period of time, but that doesn’t mean there’s no room for nuance in our tracking.

My concern with this proposal is exactly that it removes nuance from the dashboard; a view where any subtest failure marks the entire test as FAIL is more unstable than one that shows the subtests individually.

I also agree there are use cases where you think you probably should pass all the tests for a given feature and want some way to find out when new tests are added that you don't pass. But I don't think this view is necessary or sufficient for that use case. For example, with that goal I care about a new test being added to a directory just as much as I care about a new subtest, so labeling individual tests to include in the tracked set doesn't really work; you either need to be confident that you will never have a reason to ignore a test in some directory, or you need to label exclusions. But labeling exclusions falls down because we can't label subtests. Also, once features are in maintenance mode you really want push notifications for changes rather than having to poll a dashboard, because people will stop polling the dashboard once they're mostly working on something else.

foolip commented 2 months ago

> I'd predict in that case people end up being annoyed because they have the choice of either explaining why their dashboard now shows that they are failing all the tests, or unilaterally taking action to back out the change, or reorganising the tests to make their dashboard look right.

I think it's a small risk, but yes, there is a risk that some people will be annoyed. If good changes start getting reverted to meet a metric, I think the wpt core team should step in and say no, don't do that. We can document this risk and the expected response in this RFC, or even preemptively put something in our code of conduct or general READMEs to point to.

> For @foolip's infrastructure test example, I think that's markedly worse than the current default view

I find it helpful to see more information about the kind of failure elevated one level. But I'm not claiming it is a perfect view, better than view=subtest in every situation. It's just a useful view that I think I'll often use.

@jgraham Would documentation of how to understand each view, directly on wpt.fyi, address your concern? Even the default view=subtest is quite difficult to understand, so I think this would be good.

past commented 2 months ago

I understand the concern about the incentive structure and I think it is valid. A key point to remember is that incentives are mostly shaped by the defaults; optional views, flags, search queries, or configurations only serve narrow use cases to get a particular job done. I am of the opinion that we should try to accommodate as many of these use cases as possible, so that WPT remains a core workflow for people and they don't have to migrate to other solutions.

past commented 1 month ago

> it's not appropriate to ask for correct changes to be reverted or refactored due to their effect on the test view or derived metrics

I want to underscore this point, which I agree with 100%. We are trying to make a great conformance suite for the web platform here, not play games with the metrics.

jcscottiii commented 3 weeks ago

I appreciate the thoughtful discussions and input on this RFC. I am now merging the pull request.