
RFC 122: Remove browser specific failures graph #122

jgraham opened this RFC 1 year ago (status: Open)

jgraham commented 1 year ago

Rendered.

DanielRyanSmith commented 1 year ago

I agree that there is likely a better way to leverage these metrics, and it would be an outright improvement to developer utility if the graph were replaced with links that display queries of all BSFs for a given browser.

foolip commented 1 year ago

I think we should remove the graph from the top of /results, but I don't think we should just remove it. We have triaged Chrome-only failures to keep our BSF number under 500, and I see it might be time to do that again. And based on PRs from @gsnedders to the metrics code I assume they've looked at it too.

/insights already has "Anomalies" which allows getting to views for browser-specific failures, like this one: https://wpt.fyi/results/?label=master&label=experimental&product=chrome&product=firefox&product=safari&view=subtest&q=%28chrome%3A%21pass%26chrome%3A%21ok%29%20%28firefox%3Apass%7Cfirefox%3Aok%29%20%28safari%3Apass%7Csafari%3Aok%29

(Although it's buggy, I filed https://github.com/web-platform-tests/wpt.fyi/issues/2964.)

If I can make a wishlist, it would be:

foolip commented 1 year ago

Maybe a view like this would be the most friendly: https://wpt.fyi/results/?label=master&label=experimental&product=chrome&product=firefox&product=safari&view=subtest&q=%28chrome%3A%21pass%26chrome%3A%21missing%29%20firefox%3Apass%20safari%3Apass
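For reference, a minimal sketch of how such a browser-specific-failure query URL could be generated for any of the three browsers, using only the wpt.fyi search syntax visible in the URLs above; the helper name and structure are illustrative, not part of wpt.fyi:

```python
from urllib.parse import urlencode

BROWSERS = ["chrome", "firefox", "safari"]

def bsf_query_url(target: str) -> str:
    """Build a wpt.fyi results URL listing failures unique to `target`,
    using the same search syntax as the links above."""
    others = [b for b in BROWSERS if b != target]
    # Target browser neither passes nor is missing a result; all others pass.
    query = f"({target}:!pass&{target}:!missing) " + " ".join(f"{b}:pass" for b in others)
    params = (
        [("label", "master"), ("label", "experimental")]
        + [("product", b) for b in BROWSERS]
        + [("view", "subtest"), ("q", query)]
    )
    return "https://wpt.fyi/results/?" + urlencode(params)

print(bsf_query_url("chrome"))
```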

jgraham commented 1 year ago

My view is that if people want to use the concept of browser specific failures as an internal tool for understanding areas of interop difficulty that's good, and I fully support that. But I don't think we have widespread agreement on its use as a public-facing metric, and the reasoning in the RFC suggests that the lack of curation makes the numbers difficult to interpret.

If specific vendors want a number to look at I think it's reasonable to make that number an internal metric instead. That has the additional advantage that it allows some customisation e.g. it allows filtering the inputs to exclude tests that aren't considered a priority/problem for whatever reason, or dividing up the score into team-specific metrics rather than just having one top-level number. That isn't something we can do with a purely shared metric.
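To make that concrete, here is a hedged sketch of what such an internal, filtered metric might look like; the input shape, excluded directories, and team mapping below are all hypothetical and not part of wpt.fyi:

```python
# Hypothetical input: one record per test that currently fails only in the
# browser of interest, e.g. {"path": "/html/canvas/offscreen/foo.html"}.
EXCLUDED_PREFIXES = ("/html/canvas/offscreen/",)   # e.g. not currently a priority
TEAM_PREFIXES = {
    "layout": ("/css/",),
    "media": ("/media-source/", "/webcodecs/"),
}

def team_scores(browser_specific_failures):
    """Count browser-specific failures per team, skipping excluded paths."""
    scores = {team: 0 for team in TEAM_PREFIXES}
    scores["other"] = 0
    for record in browser_specific_failures:
        path = record["path"]
        if path.startswith(EXCLUDED_PREFIXES):
            continue  # filtered out of the internal metric entirely
        for team, prefixes in TEAM_PREFIXES.items():
            if path.startswith(prefixes):
                scores[team] += 1
                break
        else:
            scores["other"] += 1
    return scores
```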

gsnedders commented 1 year ago

While I've certainly looked at the metric, it's far from the only data derived from WPT results that I've looked at. I think I otherwise agree with @jgraham here.

gsnedders commented 1 year ago

To be clear, as the RFC says, there are a variety of biases with this metric, and some of these get quite extreme:

Looking at the Safari data, /html/canvas/offscreen accounts for 32.48% of Safari's current score, and /css/css-ui/compute-kind-widget-generated 7.56%.

I don't personally believe 40.04% of Safari's "incompatibility" or "web developer pain" (or however we want to define the goal of the BSF metric) is down to those two features.
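A rough sketch of how such per-directory shares can be computed from per-test BSF contributions; the input mapping is assumed for illustration and is not the actual metrics pipeline:

```python
from collections import defaultdict

def directory_shares(per_test_scores, depth=3):
    """per_test_scores: mapping of test path -> that test's contribution to
    the browser's BSF score (an assumed shape, for illustration only)."""
    totals = defaultdict(float)
    for path, score in per_test_scores.items():
        # e.g. "/html/canvas/offscreen/foo.html" -> "/html/canvas/offscreen"
        directory = "/".join(path.split("/")[: depth + 1])
        totals[directory] += score
    grand_total = sum(totals.values()) or 1.0  # avoid dividing by zero
    return {
        d: s / grand_total
        for d, s in sorted(totals.items(), key=lambda kv: -kv[1])
    }
```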

If we look at the graph over the past year with those two directories removed, we see a very different graph:

[Graph: the browser-specific-failure graph with Safari slowly decreasing over the first six months, stabilising afterwards]
foolip commented 1 year ago

@gsnedders thanks, that clearly demonstrates the outsized impact of test suites with lots of individual tests. For comparison/posterity, here's the current BSF graph on wpt.fyi:

[Image: the current BSF graph on wpt.fyi]

A few options for improving the metric:

I disagree with deleting the graph outright, but would be happy with both moving it to /insights and tweaking it.

jgraham commented 1 year ago

I think a proposal for a new interop metric, even if based on BSF, would clearly be something for the Interop team to consider.

past commented 1 year ago

Improving the BSF metric seems like a worthwhile goal, either through ideas like the ones Sam and Philip propose or through a reimagined Interop metric based on BSF as James suggests. I would encourage the Interop team to explore that path.

However, since we don't have that yet, removing the metric entirely would be a step backwards. In Chromium we do pay attention to the overall score and invest considerably in improving interoperability over time. Hiding that number in favor of team-specific metrics will regress that effort. It will reduce visibility of Chromium interoperability issues at the organizational level and will pass the burden to individual teams with different priorities.

From my perspective, removing things that are currently in use without a suitable replacement is wrong. But perhaps moving the graph to /insights as an interim step before we have an improved metric would be a reasonable compromise.

karlcow commented 1 year ago

A not fully matured idea: if the graph is a kind of barometer of web technology support across browsers, would it make sense for it to include only the technologies that all 3 browsers represented in the graph have uniformly signalled support for (via standards positions)?

foolip commented 1 year ago

@karlcow I've also toyed with the idea of allowing filtering by spec status or implementation status, and I think that would be valuable. I think at least the following filters would be worth trying out:

I would not describe the current graph as a barometer of web technology support across browsers. Rather, the idea is to surface browser-specific failures, problems that occur in just one of the 3 tested browsers, which would ideally trend towards zero. A barometer of cross-browser support should instead grow as the size of the interoperable web platform grows. It's an old presentation by now, but I looked at that in The Interop Update, where I teamed up with @miketaylr.

If we work on filtering and weighting we'll have to see which defaults then make the most sense, but I think it's important to be able to see Chrome-only failures over time, including features Chrome hasn't implemented at all, such as https://wpt.fyi/results/storage-access-api, MathML (until recently) or fastSeek().

karlcow commented 1 year ago

@past

It will reduce visibility of Chromium interoperability issues at the organizational level and will pass the burden to individual teams with different priorities.

What are the audiences for the graph?

And, depending on that, what are the useful views for each specific audience?

past commented 1 year ago

The audience is senior leaders who are making sure Chromium remains interoperable and competitive with other browser engines over time. The current view of overall browser specific failures is still useful in that task.

jgraham commented 1 year ago

Whilst I'm happy that Chrome's leadership are finding the graph useful, that usefulness as a metric is not a consensus position among browser vendors, and therefore it seems more appropriate to host it at a Chromium-specific location.

foolip commented 1 year ago

@jgraham how do you see this RFC interacting with https://github.com/web-platform-tests/rfcs/pull/120? Per that RFC the interop team will take ownership of this.

And I now see that RFC should be considered passed, given approvals and several weeks passing. I'll hold off merging for a bit though.

gsnedders commented 1 year ago

And, as I think the above slightly-modified graph shows, the experience of WebKit leadership has been that understanding the graph has been very difficult. There's no intuitive way to discover that those two directories account for such a disproportionate weight of the metric.

If you look at a view of WPT such as this, which shows Safari has fixed over 10k browser-specific failures (2102 tests (10512 subtests)) over the past year, it seems reasonable to ask "why has the score continued to creep upwards, with no notable improvement at any point?".

On the face of it, there are a number of potential explanations:

  1. The tests which we've fixed have had little to no impact on the metric,
  2. Tests which fail only in Safari have been added at a rate greater than that of our fixes,
  3. Other browsers are fixing two-browser failures making them lone-browser failures.

Of these:

  1. is a hard hypothesis to test short of adding lots of debug info to the scripts that generate the BSF metric
  2. roughly maps to this query: 977 tests (3436 subtests)
  3. roughly maps to this query: 1927 tests (4630 subtests)

Even from all these, it's hard to understand how we end up at the graph currently on the homepage.
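One way to separate those three buckets, sketched here under an assumed snapshot format (a subtest id mapped to per-browser status); this is not the actual wpt.fyi data model, just an illustration of the bookkeeping involved:

```python
def classify_changes(old, new, target="safari", others=("chrome", "firefox")):
    """old/new: {subtest_id: {browser: "pass" | "fail" | "missing"}}."""
    def lone_failure(results):
        return (bool(results)
                and results.get(target) == "fail"
                and all(results.get(b) == "pass" for b in others))

    fixed = added = became_lone = 0
    for subtest in set(old) | set(new):
        was, now = lone_failure(old.get(subtest)), lone_failure(new.get(subtest))
        if was and not now:
            fixed += 1                      # our fixes (input to explanation 1)
        elif now and not was:
            if subtest not in old:
                added += 1                  # newly added Safari-only failure (2)
            elif old[subtest].get(target) == "fail":
                became_lone += 1            # other browsers fixed their failure (3)
            else:
                added += 1                  # Safari regressed on an existing subtest
    return fixed, added, became_lone
```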

[Edited very slightly later to actually use properly aligned runs]

past commented 1 year ago

While still supporting improvements to the graph, I will say that adding up the numbers in your three bullets above seems to reasonably explain the lack of impact of the improvements you made.

gsnedders commented 1 year ago

While still supporting improvements to the graph, I will say that adding up the numbers in your three bullets above seems to reasonably explain the lack of impact of the improvements you made.

The subtest numbers do, yes. But that's a complete coincidence, given the "normalisation" to tests.

If you look at the actual directory-level diff, it becomes very apparent that the overwhelming majority of the change is in /html/canvas/offscreen. And if you look at that directory, you'll see there are only 1805 tests (1910 subtests), which account for almost the entire lack of overall progression.

Again, the problem is, to a large degree, that every test is weighted the same.
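A toy example of that weighting, assuming (per the "normalisation" to test described above) that each test contributes its failing subtests divided by its total subtests, so any one test is capped at 1:

```python
def test_contribution(failing_subtests, total_subtests):
    """Each test contributes failing/total, so any one test is worth at most 1."""
    return failing_subtests / total_subtests if total_subtests else 0.0

# A single 1000-subtest test that fails entirely in one browser is worth 1 point,
# so fixing it barely moves the graph.
big_test = test_contribution(1000, 1000)               # = 1.0

# /html/canvas/offscreen has ~1805 tests but only ~1910 subtests, i.e. mostly
# single-subtest tests; if most fail only in Safari, that one directory alone
# is worth on the order of 1800 points.
offscreen_estimate = 1805 * test_contribution(1, 1)    # ≈ 1805 points
```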

gsnedders commented 1 year ago

@jgraham how do you see this RFC interacting with #120? Per that RFC the interop team will take ownership of this.

And I now see that RFC should be considered passed, given approvals and several weeks passing. I'll hold off merging for a bit though.

For anyone confused, I believe we (the WPT Core Team) decided to defer this RFC until the Interop Team had time to consider it.

foolip commented 1 year ago

We never resolved (merged) https://github.com/web-platform-tests/rfcs/pull/120 but indeed that seems like the best way to handle this.