web-platform-tests Interop project
https://wpt.fyi/interop

Expanding 2021 and 2022 focus areas for 2023 #119

Closed: foolip closed this issue 1 year ago

foolip commented 2 years ago

I'm filing this issue to get the thoughts of @web-platform-tests/interop on an idea I haven't discussed with anyone yet.

For Interop 2022, we included the 5 focus areas from Compat 2021, with some additional test review, but without adding any new tests.

As we keep repeating the process, I think it could make sense to expand the test lists of already included features, to keep them solid over time.

A small example might be the :modal pseudo-class, which we carved out in https://github.com/web-platform-tests/interop/issues/79, but it could make sense to consider :modal part of the Dialog Element focus area in the future.

A bigger example would be looking through the CSS Grid tests that aren't part of Interop 2021/22 and seeing whether we should add some of them: https://wpt.fyi/results/css/css-grid?label=master&label=experimental&product=chrome&product=firefox&product=safari&aligned&view=interop&q=%21label%3Ainterop-2021-grid

The Interop 2023 process doesn't talk about this kind of proposal, but I think we could simply treat it as any other proposal, one per focus area we'd like to expand.

In terms of implementation, I think we'd label such tests interop-2023-grid, so that we could still produce the 2021 score based on labels, but for a 2023 dashboard we'd show it as just "Grid".
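
As a rough sketch of how that labeling could work (this is not the actual wpt.fyi scoring code; the data shapes and names below are invented for illustration), the same labelled tests can feed both the original 2021 score and a combined 2023 "Grid" view:

```python
# Hypothetical sketch: derive a 2021 score and a combined 2023 "Grid" score
# from per-test labels. Data shapes are invented for illustration; the real
# Interop scoring is computed elsewhere.
from dataclasses import dataclass

@dataclass
class TestResult:
    path: str
    labels: set[str]   # e.g. {"interop-2021-grid"} or {"interop-2023-grid"}
    passing: bool      # simplification: one boolean per test

def score(results: list[TestResult], wanted: set[str]) -> float:
    """Pass rate over tests carrying any of the wanted labels."""
    selected = [r for r in results if r.labels & wanted]
    if not selected:
        return 0.0
    return sum(r.passing for r in selected) / len(selected)

results = [
    TestResult("css/css-grid/a.html", {"interop-2021-grid"}, True),
    TestResult("css/css-grid/b.html", {"interop-2023-grid"}, False),
]

# The 2021 score stays reproducible from its original label...
print(score(results, {"interop-2021-grid"}))
# ...while a 2023 "Grid" dashboard unions the old and new labels.
print(score(results, {"interop-2021-grid", "interop-2023-grid"}))
```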

karlcow commented 2 years ago

That would be neat, but I have a few questions.

  1. Is there a risk of an ever-growing pile of work that would, little by little, make it harder to include new work? I'm always a bit worried about taking on more work without letting go of other things. Not healthy for people or projects ;)
  2. When do we consider that a feature is "finished" (as in stable enough to no longer be part of the Interop project)?
foolip commented 2 years ago
  1. Is there a risk of an ever-growing pile of work that would, little by little, make it harder to include new work? I'm always a bit worried about taking on more work without letting go of other things. Not healthy for people or projects ;)

I think most test suites stop growing as the feature becomes robust. But for features that are relatively new when first included, the test suite might grow with implementation experience, and I'd like to review such additions.

  2. When do we consider that a feature is "finished" (as in stable enough to no longer be part of the Interop project)?

This is a question of how we weigh past focus areas in the metric, if at all. For 2022 we gave the 2021 focus areas 1/3 of the score. For 2023 this is still undecided. I think there's value in keeping them to avoid regressions, but there are multiple ways this could be achieved.
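
As a rough sketch of that weighting (an illustration of the idea, not the official scoring formula), treating each year's focus areas as an equally weighted, averaged block:

$$
S_{2022} \approx \tfrac{1}{3}\,\overline{S}_{\text{2021 areas}} + \tfrac{2}{3}\,\overline{S}_{\text{2022 areas}}, \qquad \overline{S} = \frac{1}{n}\sum_{i=1}^{n} s_i
$$

where $s_i$ is a browser's pass rate on focus area $i$.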

karlcow commented 2 years ago

This is a question of how we weigh past focus areas in the metric, if at all. For 2022 we gave the 2021 focus areas 1/3 of the score. For 2023 this is still undecided. I think there's value in keeping them to avoid regressions, but there are multiple ways this could be achieved.

The score is not the part that worries me. 😄 The work is (or, more precisely, the time it takes to handle a growing pile).

It's always good to know how to finish work, just as one knows how to take on new work.

foolip commented 2 years ago

I suppose my ideal outcome is that we reach 100% and just stay there without doing any additional work. Sticky Positioning is pretty close, with https://bugzilla.mozilla.org/show_bug.cgi?id=1676564 being the remaining work before all browsers score 100%.

If we get to that point, I think there's a special argument for leaving those areas alone at 100% and setting a higher bar for adding tests, since it's the difference between having to do something and having to do nothing.

foolip commented 2 years ago

Another plausible case of test expansion is viewport, where the outcome of https://github.com/web-platform-tests/interop-2022-viewport might be a bunch of new tests. Those would probably make more sense grouped with the existing Viewport Units focus area than as a new group.

cc @bramus

karlcow commented 2 years ago

That makes sense. But it looks more and more like a scoring system based on features (which is a good thing for web developers) rather than on browsers.

The growing pile of tests could help web developers figure out whether a technology is stable and mature enough to be usable. caniuse currently says "yes, it's implemented", but we know that the devil is in the details of how much is actually implemented. WPT helps figure out a bit more, but the results are hard to decipher.

I wonder how we can be more helpful as a collective in signaling to web developers: "Yes, we checked this feature, and we are now confident that this part and that part of viewport are usable in an interoperable way."

gsnedders commented 2 years ago

Much of this is really a question of how we want to define the focus areas: are they defined by the scope of the feature and what's immediately around it, or by the exact set of tests within them?

jensimmons commented 2 years ago

The growing pile of tests could help web developers figure out whether a technology is stable and mature enough to be usable.

I hope web developers do not start doing this. Automated test suite results are very limited in what they can do. Deciding as a developer whether or not your use case is well served is a different question from whether or not engines pass tests. Developers need to try their site in multiple browsers and see if they are getting the intended result.

Grid, for example, was usable in March 2017, even if there were edge-case bugs in browsers and in the spec itself. Nothing about nailing down the details of edge cases affected typical real-world uses of Grid.

Meanwhile another technology might pass 98% of tests, but still not work for a particular site. Perhaps because the problem is untestable by automation. Perhaps because the problem falls into the 2% that is failing.

astearns commented 1 year ago

If we kept Color Spaces and Functions as a focus area (it still has low scores in two browsers), I'd like to see some tests added around support for color spaces in WebGL canvases. From what I understand, there are tests for this in the Khronos WebGL conformance tests, but it isn't covered by WPT.

jensimmons commented 1 year ago

@astearns Color Spaces and Functions will be a focus area for 2023, since all focus areas roll over into the next year.

Adding more scope to that area is a fine proposal. Will you open an issue for this, so it can be considered? You can mention that you believe it should be added to the existing Color Spaces and Functions focus area. List any existing tests to be added.

astearns commented 1 year ago

@jensimmons https://github.com/web-platform-tests/interop/issues/168 - thanks!

jgraham commented 1 year ago

It's not clear to me that "all focus areas roll over into the next year" is decided. We incorporated the Compat 2021 scores into Interop 2022 at a lower weight, but I don't think that should be taken as firm precedent meaning we need to do the same in the future.

It might be interesting to examine how that played out; did we see substantial improvements from any implementation on those focus areas (ideally we might have some kind of control features to see if being in Interop was responsible for any observed improvement, but I don't know how one would pick reasonable controls).

For features which have already reached a good score, including them only serves to raise the baseline score and reduce the effect of work on new focus areas. So I don't think it's obvious that including those going forward is helpful (I assume browsers already have mature systems to avoid regressing tests, but that might certainly be a good reason to continue computing the scores from earlier years, to test the hypothesis that once a feature scores well, that score is unlikely to be lost).

For features that haven't reached a good score, that's arguably evidence that merely being in the metric was insufficient to incentivise that work. In such cases it would be good to reassess the priority compared to features from the new year. We don't want to create a situation in which implementors are incentivised to prefer working on a less-useful-to-users/authors feature because it was accepted in Interop Year X, and is automatically rolled over to Year X+1, and so have to deprioritise work on a more useful feature that might otherwise have been included in Year X+1.

In the case of new test additions, those would clearly require renewed consensus (as always).

So overall I'm more inclined to consider rollover on a case-by-case basis (although I'd certainly welcome discussions on ways to optimise that process or have some special low-weight category for rollover).

jgraham commented 1 year ago

We've rather dropped the ball on this: we're now well into the decision-making period and we don't have a clear idea of how we'll handle carry-over. In the absence of a decision to the contrary, I interpret that as there being no carry-over.

If we do want carry-over, here is a proposal:

  1. Create a focus area for all the tests that were in Interop 2022 but don't yet pass in three engines.
  2. Give vendors a chance to exclude tests/areas where they believe that those fixes are lower priority than other work they would displace.
  3. Consider that like any other proposal.

Technically we should already have created the proposal and be nearly done with step 2 if we want to take that approach. But given that people were maybe making some implicit assumptions about what would happen in this case, perhaps we should carve out a small exception to the timeline to give us the chance to do this?

foolip commented 1 year ago

My preference/proposal is that we take all 15 focus areas to date, with the same test lists, and let those count for a fixed share of the score, say 15% or 30%. We would decide the exact percentage once we see how many new focus areas we'll have for 2023; for example, if there are 18, we might give each 4%, leaving 28% for the carryover.
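
Spelling out the arithmetic in that example:

$$
18 \times 4\% = 72\% \ \text{for the new 2023 areas}, \qquad 100\% - 72\% = 28\% \ \text{left for the carried-over areas}.
$$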

jgraham commented 1 year ago

That approach seems operationally simple, which might be a winning argument given the timeline, but it doesn't address any of the concerns in https://github.com/web-platform-tests/interop/issues/119#issuecomment-1265660721.

gsnedders commented 1 year ago

It might be interesting to examine how that played out; did we see substantial improvements from any implementation on those focus areas (ideally we might have some kind of control features to see if being in Interop was responsible for any observed improvement, but I don't know how one would pick reasonable controls).

For simplicity's sake, here's a link to the results of 31 Dec 2021 vs. today.

If we exclude those that pass in all six of those runs, we get this, which shows ongoing progress.
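
For anyone wanting to reproduce that kind of filtering offline, here is a rough sketch (the data layout and results below are invented; the actual comparison was done through wpt.fyi query links):

```python
# Hypothetical sketch of the filtering described above: given results for the
# same tests across six runs (three browsers at two dates), drop tests that
# pass everywhere and keep the rest to look at ongoing progress.

runs = {
    ("chrome", "2021-12-31"):  {"a.html": True, "b.html": True},
    ("firefox", "2021-12-31"): {"a.html": True, "b.html": False},
    ("safari", "2021-12-31"):  {"a.html": True, "b.html": False},
    ("chrome", "today"):       {"a.html": True, "b.html": True},
    ("firefox", "today"):      {"a.html": True, "b.html": True},
    ("safari", "today"):       {"a.html": True, "b.html": False},
}

tests = set.union(*(set(r) for r in runs.values()))

# Keep only tests that fail in at least one of the six runs.
still_interesting = sorted(
    t for t in tests if not all(run.get(t, False) for run in runs.values())
)
print(still_interesting)  # ['b.html']
```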

chrishtr commented 1 year ago

+1 to including all 15 areas with a small percent each.

gsnedders commented 1 year ago

FWIW, in the meeting yesterday I tried to break this down into a variety of smaller questions:

  1. Should we carry over focus areas from last year?
  2. Should we carry over tests in those focus areas that pass everywhere?
  3. Should we increase the scope of those focus areas?
  4. Should we increase the test coverage of those focus areas?
  5. Should that increased test coverage for those focus areas also affect Interop 2021/22 scoring?
  6. How should we weigh carried over focus areas?

My understanding is:

  1. We have consensus on "yes".
  2. I think we're strongly leaning towards "yes"?
  3. We have consensus on "no".
  4. I think we have consensus on "yes"?
  5. I think we have consensus on "no"?
  6. I don't think we really got anywhere on this one.
foolip commented 1 year ago

Thanks for that summary, @gsnedders! When it comes to increasing the test coverage within the current score, did you discuss how we'd do that? Is it by filing test change review issues?

jgraham commented 1 year ago

I'm not convinced that we have consensus on 2, or at least I think we didn't really resolve the underlying questions that led to the discussion.

From my point of view, carrying over focus areas from last year, including all the tests that are already passing, and just giving those a reduced score doesn't make a lot of sense. Clearly, already-passing tests aren't a real "focus area"; assuming browsers' regression tracking works well, people aren't going to focus on them, and the only effect of carrying over the tests is to dilute the impact of further improvements on the metric.

Conversely it seems like there ought to be a higher bar for adding new tests for any features that are automatically brought forward into the new interop area. That's because fixing bugs in features that are not otherwise receiving attention has a higher context-switching cost. In some cases that's well worth it: if the bug is actively causing compat problems, or blocking authors from using features. But for general differences it's unclear why features that were in a previous iteration of Interop should automatically be valued over features that were already pretty (but not entirely) interoperable in pre-Interop times. Looking at wpt.fyi and bug trackers, it's clear that there are differences in CSS1/2 era features that still affect browsers today, but presumably haven't been considered important enough to prioritise in 20 years.

An edge case here is the Web Compat focus area: I re-submitted it as a proposal for this year specifically because I didn't assume that previous focus areas should be carried forward. Even if previous focus areas were carried forward, I wouldn't want to down-weight it, because it's supposed to consist of bugs that are observed to cause problems in the real world, and that are therefore worth context-switching to fix.

I think part of this difference in approach comes from how different people view the Interop-20xx score. I think my point of view makes sense if you consider it as the result of work over a time-bounded period of agreed scope. But if you think it's actually more about providing an overall metric for the interoperability of browsers on the platform, with some kind of weighting toward novel areas where interop is on average worse, then I can see why you'd assume that past focus areas are obviously still going to be part of the future score.

So given that we didn't sort out these differences in time to affect the proposals period, it's clear some compromise is needed. I think on that basis 1 is fine, since that's clearly what people want. I don't really agree with the model that leads to 2, but if that's what's needed to make progress, and we really can't agree on something else, it's OK for this iteration but not as precedent. For 4, I think we should at least have a higher bar for adding coverage to older areas: for compat-affecting bugs I'd prefer they be added to the Web Compat focus area, and for non-compat-affecting bugs there should be a clear rationale for why the proposed change is worth adding in terms of improving the platform (i.e. "this test is for a part of this feature that was defined at the time it was originally included in Interop, therefore add the test" should be considered insufficient).

Going forward, I think we need to resolve what interop "is" and then pick an approach for carry over that fits that model of what we want the metric to represent.

nt1m commented 1 year ago

Should we carry over tests in those focus areas that pass everywhere?

I'm leaning towards yes, if the answer to "Should that increased test coverage for those focus areas also affect Interop 2021/22 scoring?" is also yes. We've done this in 2022 for the 2021 focus areas, and I think it reflects a reality of how interop works:

  1. Implementation work uncovers edge cases that need clarification.
  2. Those clarifications cause tests that were previously passing to start failing.

I've noticed 2 happening a lot for flexbox, where recent GitHub activity in the flexbox folder has caused tests that we thought were fairly stable to start failing in both Firefox and Safari. The transforms and grid focus areas are similarly unstable, and we can all agree that interop in those areas is fairly important. Part of the reason is that those focus areas are mainly covered by reftests, which means each test covers a lot of things at once, but another part is that spec issues were recently resolved as implementation work was being done (https://github.com/w3c/csswg-drafts/issues/6683 and many flexbox issues, for instance).

However, I also realize that we have a lot more focus areas in 2022 compared to 2021, so carrying over passing tests from 15 focus areas may make the measure meaningless and de-prioritize needed work. To mitigate this, we could consider these possibilities on a per-focus-area basis:

  1. fully carrying over focus areas that are important and whose tests are unstable, including expanding those focus areas and changing scores retroactively (Flexbox/Grid/Transforms are good examples IMO)
  2. creating a bucket for failing tests in areas that need a small but not major amount of work, where we know the tests are stable and passing tests are unlikely to turn into failing ones (potentially Cascade Layers/Forms/Web Compat 2022?)
  3. dropping focus areas where we are pretty confident no work is needed (Viewport Units, depending on the investigation)
  4. using the Web Compat category if some issues not covered by 1/2/3 affect websites in a major way.

I do want to emphasize that we should figure this out on a per-focus-area basis, and if we do not have time this year to do that analysis, I think the default option would be 1 & 3, which is what we've done for the 2021 carry-over.

nt1m commented 1 year ago

Again, I do think the gist of the disagreement here is the weighting of various things, which I think is more productive to discuss after all the 2023 focus areas have been figured out.

jgraham commented 1 year ago

One thing I realised when trying to articulate the problem space here is that there's a technical concern which I don't think has been addressed.

If we make the scores for previous years "live", we need to continue to be careful about which test changes are allowed in those files. This isn't such a concern for reftests, but for testharness tests we typically have many tests in a single file, and we are unable to select tests for Interop at a granularity finer than the file. Updates to those files could add new tests which are out of scope for the original focus area. Given that updates often happen in vendor repos, it's likely that the developer updating the file would be unaware that the file was part of an Interop metric.

A more concrete example of the problem is an idlharness.js test: if in Interop-20XX we decide to include the Foo interface as part of Interop, we would naturally include the autogenerated idlharness tests for that interface. If at a later time a new feature extends Foo with new properties, those will automatically show up in the score for Interop-20XX, even though they weren't part of the original scope.

Obviously we could work around this, but it seems like we'd need either complex subtest-level test selection, or to put in place some heavyweight processes to ensure we don't ever make scope-affecting changes to any test that was part of a previous Interop effort.
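
A toy illustration of the concern, with invented file and subtest numbers (real testharness files can contain many subtests, and Interop selection works per file):

```python
# Toy sketch, with invented numbers: if a focus area can only select whole
# files, an automated change that appends a subtest to an already-included
# file silently changes that focus area's score.

def file_score(subtest_results: list[bool]) -> float:
    """Score a testharness file as the fraction of its subtests that pass."""
    return sum(subtest_results) / len(subtest_results)

# Original scope: an idlharness file with four in-scope subtests, all passing.
original = [True, True, True, True]
print(file_score(original))           # 1.0

# An automated IDL update appends a subtest for a new, out-of-scope feature
# (e.g. navigator.userActivation) that some browser doesn't implement yet.
after_auto_update = original + [False]
print(file_score(after_auto_update))  # 0.8 -- the "frozen" metric moved
```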

chrishtr commented 1 year ago

If we make the scores for previous years "live", we need to continue to be careful about which test changes are allowed in those files. This isn't such a concern for reftests, but for testharness tests we typically have many tests in a single file, and we are unable to select tests for Interop at a granularity finer than the file. Updates to those files could add new tests which are out of scope for the original focus area. Given that updates often happen in vendor repos, it's likely that the developer updating the file would be unaware that the file was part of an Interop metric.

I think we should just deal with this problem if/when it occurs, via the principle of only introducing new infra complexity if we really need it. I'm happy to commit Chrome resources in 2023 to refactoring tests to fix any unintended/misleading/unfair score implications as necessary, on behalf of all browsers. Or also resources to add some infra if we really need it.

To that end, I think the simplest thing to do is also the best thing to do (in part because it's simple):

I don't think it will be too hard to explain this publicly to those who don't know the inside baseball. We can just say that the yearly scores mean "the cumulative progress on focus areas up to that year", and that "Interop is an inherently ever-evolving thing, and as with all software systems, we're always finding new spec and implementation bugs. That means apparent scores may temporarily decrease as we deepen interop in those areas by fixing these bugs."

jgraham commented 1 year ago

I think we should just deal with this problem if/when it occurs, via the principle of only introducing new infra complexity if we really need it.

It's not a very hypothetical problem; it's something that already happens. For example https://github.com/web-platform-tests/wpt/pull/35830 was a totally automated PR that added tests for the navigator.userActivation API to the existing html/dom/idlharness.https.html test. Fortunately we're not using that specific IDL test for anything in Interop, but the general concept of automated updates to existing tests is incompatible with maintaining the scope of those tests.

https://github.com/web-platform-tests/wpt/pull/34560 is a case from this year where we had to split out some parts of a generated test in an awkward way in order to get the scope in line with Interop 2022; it took until the end of June before someone noticed and made the change. If we decide to add new form controls, then it's totally reasonable for someone to include them in the main file again, without realising that this breaks a metric.

None of this is a blocker to any specific handling of previous Interop focus areas, but we should be aware that it's in general not "free" to keep a metric running: we have to actively put in work to ensure that the tests remain confined to the original scope of the metric, and we currently have systems and processes that actively work against that. We also have additional technical limitations on Interop tests (like the inability to handle timeouts that change the total number of tests) which mean we have to be especially careful updating such tests in the future.

chrishtr commented 1 year ago

It's not a very hypothetical problem; it's something that already happens. For example web-platform-tests/wpt#35830 was a totally automated PR that added tests for the navigator.userActivation API to the existing html/dom/idlharness.https.html test. Fortunately we're not using that specific IDL test for anything in Interop, but the general concept of automated updates to existing tests is incompatible with maintaining the scope of those tests.

web-platform-tests/wpt#34560 is a case from this year where we had to split out some parts of a generated test in an awkward way in order to get the scope in line with Interop 2022; it took until the end of June before someone noticed and made the change. If we decide to add new form controls, then it's totally reasonable for someone to include them in the main file again, without realising that this breaks a metric.

These are good examples. I certainly didn't mean to imply that it was hypothetical (sorry if I implied otherwise), just that my hypothesis was that it's not common enough to be a big problem.

None of this is a blocker to any specific handling of previous Interop focus areas, but we should be aware that it's in general not "free" to keep a metric running

Agreed, it's not free, but based on the 2022 experience, not a huge cost. (Let me know if I'm off base.) And I'm happy to sponsor Chromium representatives to pitch in and take care of this, because I think it's really valuable to be able to keep growing and deepening interop, and to measure that progress year by year.

foolip commented 1 year ago

Interop 2023 launched yesterday, so closing this. We ended up expanding the flex and grid test suites.