web-platform-tests / wpt

Test suites for Web platform specs — including WHATWG, W3C, and others
https://web-platform-tests.org/

Support fuzzy matching for reftests #7543

foolip closed this issue 5 years ago

foolip commented 7 years ago

At https://webengineshackfest.org/ talking to @jdm, @emilio and others.

There are a lot of tests in http://searchfox.org/mozilla-central/search?q=fuzzy&case=false&regexp=false&path=*.list that could be shared if only wpt supported fuzzy matching.

We have heard the same from Chromium teams, see web-platform-tests pain points and "it is not always feasible to write stable/cross-platform reftests when dealing with rasterization features."

This ought to be fairly easy to implement.

@bobholt @jgraham

emilio commented 7 years ago

The big problem is that fuzziness usually depends on the platform, so I suppose this should be integrated at the test expectation level rather than at the test level itself?

foolip commented 7 years ago

Do you mean out-of-band metadata, or what is the test expectation level?

jdm commented 7 years ago

Emilio is referring to files like https://github.com/servo/servo/blob/master/tests/wpt/metadata/FileAPI/FileReaderSync.worker.js.ini that are not shared between vendors.

foolip commented 7 years ago

So would that just be a matter of tweaking the fuzziness factor, or would the actual algorithm to check fuzzy equality also have to be different?

jdm commented 7 years ago

The reftest equality algorithm right now is to hash two images and compare the hashes. Adding support for fuzziness will require comparing pixels and keeping track of how many differ.
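
For illustration, here is a minimal sketch of the difference between the two approaches (this is not the actual wptrunner code; it assumes the images arrive as equally sized raw RGB byte strings):

```python
import hashlib

def images_equal_by_hash(a: bytes, b: bytes) -> bool:
    # The current check: hash both images and compare. Any single
    # differing pixel changes the hash, so there is no "close enough".
    return hashlib.sha1(a).digest() == hashlib.sha1(b).digest()

def count_differing_pixels(a: bytes, b: bytes) -> int:
    # The fuzzy check: walk the images pixel by pixel (3 bytes per
    # pixel for RGB) and count how many pixels differ at all.
    assert len(a) == len(b) and len(a) % 3 == 0
    return sum(a[i:i + 3] != b[i:i + 3] for i in range(0, len(a), 3))
```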

jdm commented 7 years ago

https://github.com/w3c/web-platform-tests/blob/217f45791a19ea87d11473681bb8b26cb57bd189/tools/wptrunner/wptrunner/executors/base.py#L261-L265

foolip commented 7 years ago

Right, but can we just have a <meta name="fuzzy" content="how-fuzzy">, or does the amount of fuzziness need to be configurable on a per-test, per-engine basis?

jdm commented 7 years ago

<meta name="fuzzy" content="how-fuzzy"> would work; it would likely end up having to be increased as each engine discovers that its results are slightly more fuzzy than the previous engine that investigated that failing test.

gsnedders commented 7 years ago

See also https://lists.w3.org/Archives/Public/www-svg/2016Nov/0020.html and https://lists.w3.org/Archives/Public/public-test-infra/2016OctDec/0041.html for some earlier discussion about this. Even https://lists.w3.org/Archives/Public/public-test-infra/2014JulSep/0000.html too, really.

There are two different issues here:

- fuzziness that is inherent to the test itself, where no implementation is expected to match the reference exactly, and
- fuzziness that is specific to a particular implementation or platform.

We only really want to deal with the former, as the latter needs some sort of per-implementation data.

gsnedders commented 7 years ago

(Also I remember someone around Mozilla at some point pointing out that you really only want to allow fuzziness in specific places.)

(Also looking through the tests with annotations in mozilla-central, a lot of them are things like gradients where you don't actually want unlimited fuzziness, you just want to allow for roughly different values across the gradient.)

foolip commented 7 years ago

@chrishtr, @eaenet, is there anything like this in Blink, and is there a body of tests you could contribute to wpt if only this were supported? I suppose for such tests we might still want to save exact -expected.png baselines in the imported tests to catch regressions, but that seems doable.

chrishtr commented 7 years ago

is there anything like this in Blink, and is there a body of tests you could contribute to wpt if only this were supported?

Blink doesn't have any fuzzy matching tests. Instead we have platform-specific baselines. The idea has been discussed in the Blink community many times, and each time we concluded that we don't want fuzzy checkers because they can mask important off-by-one or subpixel errors.

foolip commented 7 years ago

Hmm, so for that kind of test, I guess what we'd need is per-browser/platform baselines in web-platform-tests and a way of updating them, much like in LayoutTests.

@emilio @jdm, how much of your use case would be covered by that? Do you have cases of non-deterministic rasterization?

gsnedders commented 7 years ago

Blink doesn't have platform-specific baselines which are references (i.e., -expected.html), does it?

foolip commented 7 years ago

It would probably work if you put files in the right place, but -expected.html can't be generated automatically, so in practice, no, that's not a thing.

youennf commented 7 years ago

This should really be a last-resort kind of test. I remember some WebKit members not liking that idea, as it could quickly be abused by marking any reftest as fuzzy. Having a mask/script to define where errors happen might help.

glennw commented 7 years ago

If some form of fuzziness is introduced, I think it needs to be specified per-engine.

Some engine implementations may know that they expect pixel perfect (or near) results for a test, and wouldn't want to have fuzziness added due to implementation details of a different engine.

In terms of specifying fuzziness, even just per-engine isn't enough. For example, Gecko can specify that fuzziness only occurs on a certain platform, or if a certain preference is enabled.

Instead, it seems like it should be possible to specify a string (or similar) per-engine that is interpreted by the reftest harness for that engine, allowing each engine to customize how it specifies fuzziness.

jdm commented 7 years ago

The wptrunner metadata that exists outside of tests already supports per-platform and other configuration variables; the only downside there is that those metadata values aren't available to wpt.fyi.

emilio commented 7 years ago

@emilio @jdm, how much of your use case would be covered by that? Do you have cases of non-deterministic rasterization?

Pixel tests with images have been frowned upon before too... I guess per-platform baselines should work for CI purposes, but there may be fuzziness on the developer's machine due to different drivers or whatnot; how does Blink deal with that?

Does the test just not pass on the dev's machine but pass on CI? How do you update per-platform expectations if you don't have access to a machine on that platform?

foolip commented 7 years ago

We all frown upon pixel tests, but it sounds like we also all think they're so useful that we have them in our non-shared tests. @chrishtr, do you see any path to get rid of pixel tests for paint features in Blink entirely?

If we did this (with or without fuzziness), then I think the baselines would have to be stored out of band. wpt.fyi would generate them for all configurations it runs, and would need some infrastructure for vetting changes. (Remember Opera's SPARTAN? Heh.)

I don't think this is the most urgent thing to fix in web-platform-tests; the bigger question is how to upstream "all" engine-specific tests to wpt. If there's a big chunk that really couldn't be shared without this kind of infrastructure, then I think we just have to do it at some point. How big is that chunk? I don't know yet.

chrishtr commented 7 years ago

Pixel tests with images have been frowned upon before too... I guess per-platform baselines should work for CI purposes, but there may be fuzziness on the developer's machine due to different drivers or whatnot; how does Blink deal with that?

Does the test just not pass on the dev's machine but pass on CI? How do you update per-platform expectations if you don't have access to a machine on that platform?

We only record baselines on canonical machines per platform, and use software rendering of GL for composited content. An automated bot can update the baselines on behalf of a developer by running the test on the canonical machine in a datacenter, and add the changed pixel output to a patch. Code reviewers can then review the changed images and verify whether they look correct.

There is quite a bit of pain when updating the canonical machine from one to another (e.g. a new machine for a new version of Windows or Mac), due to changes in the platform implementation.

We all frown upon pixel tests, but it sounds like we also all think they're so useful that we have them in our non-shared tests. @chrishtr, do you see any path to get rid of pixel tests for paint features in Blink entirely?

Not a reasonable path to getting to zero, no. There are a number of cases where it has been very hard to use a reftest, and the extreme effort is not worth it. In these cases we are saving a whole lot of developer time by not forcing a reftest.

One thing we could do to mitigate the issue is to only run tests on one platform (e.g. Linux, since that is what 90% of Chromium developers use to write code). It's sometimes the case that we can make a reftest work on one platform but others have issues. The rest of the platforms would be in a Chromium-internal location so we don't lose test coverage. I'm not super happy about this idea, but it's one idea...

foolip commented 7 years ago

There is quite a bit of pain when updating the canonical machine from one to another (e.g. a new machine for a new version of Windows or Mac), due to changes in the platform implementation.

Yeah, that makes sense. I suppose that for wpt.fyi, as long as we're using Sauce (or BrowserStack) to run tests, we might have the issue that we're not always getting the same machine, or that a minor OS upgrade happens which breaks everything.

In an area like SVG, how much of the testing can we reasonably get done without something like this? I'm not exactly eager to have it, but what would it take to share 90% of our tests?

chrishtr commented 7 years ago

I'm not exactly eager to have it, but what would it take to share 90% of our tests?

Not sure. What is the current percent? @fsoder

fsoder commented 7 years ago

I think our currently shared percentage is roughly zero. There are some (I think) fairly obvious problem areas, but there's also a lot of overlap with what FXTF specs would need to test. It's hard to comment on what it would take to achieve a certain percentage, but my gut feeling is that 90% would be very difficult to reach.

eaenet commented 7 years ago

@foolip Like @chrishtr said we have nothing like this in Blink and I'm not sure we would want to add it either. A single pixel difference can be very significant, especially when it comes to borders, paint invalidation, and text rasterisation.

We have talked about adding support for matching only a specific region in a test, however, ignoring differences outside the designated region. It's nothing more than an idea at the moment, though.
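
Purely as an illustration of that idea (hypothetical code; nothing like this exists in the harness, and all names here are made up):

```python
def region_match(pixels_a, pixels_b, width, region):
    # Compare only the pixels inside the designated rectangle;
    # differences outside it are ignored entirely.
    x0, y0, x1, y1 = region  # rectangle in pixel coordinates
    for i, (pa, pb) in enumerate(zip(pixels_a, pixels_b)):
        x, y = i % width, i // width
        if x0 <= x < x1 and y0 <= y < y1 and pa != pb:
            return False
    return True
```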

gsnedders commented 7 years ago

cc/ @thejohnjansen who I know has thoughts on this

gsnedders commented 7 years ago

I don't think this is the most urgent thing to fix in web-platform-tests; the bigger question is how to upstream "all" engine-specific tests to wpt. If there's a big chunk that really couldn't be shared without this kind of infrastructure, then I think we just have to do it at some point. How big is that chunk? I don't know yet.

There's also the issue of tests where the spec doesn't define enough to actually have reftests. How do you test border-style: dashed? The spec requires you draw a "dashed" border (that's… it), which you obviously can't test without a human in any general way (though obviously each implementation could have pixel or other references to make sure they draw the dashing as they expect).

In an area like SVG, how much of the testing can we reasonably get done without something like this? I'm not exactly eager to have it, but what would it take to share 90% of our tests?

IIRC, the big problem is testing the various types of curves, and things that depend on subpixel rendering (like scaling and transforms to things that don't end up on round pixel numbers).

gsnedders commented 7 years ago

When it comes to SVG, see discussion in #4015 too.

emilio commented 7 years ago

This came up again on IRC on #servo:

23:44 — gw wonders if WPT has any coverage for elliptical border radii on box shadows
23:44 — gw assumes not
23:56 gw: my experience with WPT is that as soon as you enter rare-case-land, there are no tests... :(
23:57 gw: But that shouldn't prevent you from adding them!
23:57 emilio: I think it's probably not feasible for elliptical border radii on box shadows, due to the lack of fuzziness support in WPT. But that might change in the future :)

jgraham commented 7 years ago

I think this is something we should discuss at TPAC since it seems like different vendors have different concerns here. It would be useful to get a clear list of use cases in the next couple of weeks to guide that discussion.

foolip commented 7 years ago

we have nothing like this in Blink and I'm not sure we would want to add it either. A single pixel difference can be very significant, especially when it comes to borders, paint invalidation, and text rasterisation.

@eaenet, I was mainly thinking about a definition of fuzzy that allows each pixel's RGB values to be off by a tiny amount, where shifting a border by 1px would fail.

I suppose that SSIM (or similar) over the whole image might also be useful in some scenario, but it isn't what I created the issue for.

foolip commented 7 years ago

@gsnedders, can you add this to the TPAC agenda? Is there a doc somewhere for it already?

jdm commented 7 years ago

Concrete test example from servo's suite right now: https://github.com/servo/servo/issues/19271

jgraham commented 7 years ago

There was some discussion of this at TPAC and I didn't get any strong objections to implementing something gecko-like with N pixels allowed different and maximum difference per channel of M. I guess I will implement this next year sometime if no one does it first.

gsnedders commented 6 years ago

ping @liamquin, who mentioned some stuff that had been done previously while at TPAC, when I was talking to him about testing SVG.

liamquin commented 6 years ago

@gsnedders, I think I mentioned using a Gaussian blur, e.g. for gradient dithering comparisons. But I didn't mean it had been done (to my knowledge) for SVG or in WPT.

Also, for testing PostScript rendering, where the exact pixels set for arcs weren't well-defined for thick lines, one approach was converting the strokes to outlines and then to straight-line segments with a given tolerance. But CSS border arcs aren't exposed in the same way as PostScript paths; there's no equivalent of pathforall. You might get some mileage by scaling down an image of the result, though, using a specified scaling algorithm.

foolip commented 6 years ago

I came across https://wpt.fyi/css/css-masking/clip-path today, where it seems like many of the failures are due to anti-aliasing differences.

jgraham commented 6 years ago

I'm going to work on this.

The tentative plan is to make this work with a pair of properties on each reference link statement: one that defines the maximum difference per channel, and one that defines the maximum total number of differing pixels. Ranges will also be supported (this is in the Gecko feature set). It will be possible to define these in the test for cases where we think that the reftest is intrinsically imperfect, and also to override them in the expectation metadata for cases where browser- or platform-specific behaviour causes differences (e.g. anti-aliasing that depends on the gfx stack).

In terms of the detailed syntax, I'm thinking that in-test data could be specified as query parameters on the test file, e.g. &fuzzy=50;100, specifying the maximum difference per channel and the maximum number of differing pixels, respectively. Or we could adopt something more explicit/wordy if people prefer.
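
As a sketch of how such a value might be parsed (hypothetical code, assuming the &fuzzy=50;100 form above, with each part optionally given as a min-max range):

```python
def parse_fuzzy(value: str) -> tuple[range, range]:
    # Split "channel;pixels", where each part is either a single
    # number or a "min-max" range of acceptable values.
    def parse_part(part: str) -> range:
        low, _, high = part.partition("-")
        return range(int(low), int(high or low) + 1)

    channel, _, pixels = value.partition(";")
    return parse_part(channel), parse_part(pixels)

# parse_fuzzy("50;100")    -> (range(50, 51), range(100, 101))
# parse_fuzzy("0-2;0-100") -> (range(0, 3), range(0, 101))
```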

For the expectation metadata it's less clear what to do since we don't currently have per-link metadata.

foolip commented 6 years ago

Is "maximum difference per channel" the sum of absolute differences in that channel for all pixels? And "maximum total pixels difference" the number of pixels for which there is any difference at all?

How would one use this feature for testing a box with a gradient, where the gradient color might be slightly off, but the box's position mustn't move around?

staktrace commented 6 years ago

In the Gecko reftest syntax, the "maximum difference per channel" is the maximum difference allowed in the channel value for any given pixel. It's not the sum. And the "maximum total pixels difference" refers to the number of pixels, yes.

For a 100x100 box with a gradient where all the pixels are slightly off, you'd use something like (2, 10000). So each RGB component of each of the 10000 pixels would be allowed to vary by up to 2.

Edited to add: if the box's position moved, then there would be pixels that changed from e.g. white (background) to blue (assuming a blue gradient) and the blue channel value difference of that pixel would exceed the "2" threshold.
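
A sketch of that check (not Gecko's actual implementation; pixels_a and pixels_b are assumed to be parallel sequences of (r, g, b) tuples for the two rendered images):

```python
def fuzzy_match(pixels_a, pixels_b, max_channel_diff, max_differing_pixels):
    differing = 0
    for (r1, g1, b1), (r2, g2, b2) in zip(pixels_a, pixels_b):
        # Largest per-channel difference for this pixel.
        diff = max(abs(r1 - r2), abs(g1 - g2), abs(b1 - b2))
        if diff > max_channel_diff:
            return False  # some channel of this pixel is off by too much
        if diff > 0:
            differing += 1
    return differing <= max_differing_pixels

# The (2, 10000) gradient example passes, while a shifted box edge
# changes some channel by far more than 2 and fails immediately.
```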

foolip commented 6 years ago

That makes sense, thanks @staktrace!

Is there tooling that suggests what the minimum values would be to pass based on a failure, or how do people pick the numbers? Also, I presume that they are CSS pixels, so that on hidpi screens you multiply these numbers by 4 or whatever the ratio is for the screen?

jgraham commented 6 years ago

In general reftests don't work on HiDPI screens, so I wouldn't worry too much about getting this right for HiDPI.

The gecko reftest harness spits out the relevant values for failing tests, so presumably one picks something based on that.

gsnedders commented 6 years ago

In general reftests don't work on HiDPI screens, so I wouldn't worry too much about getting this right for HiDPI.

This is a comparison of Safari TP on lodpi v. hidpi, which shows surprisingly few differences. This is basically #5498 (we should probably change the title of that…).

jgraham commented 6 years ago

I just opened https://github.com/web-platform-tests/wpt/pull/12187 which implements gecko-like reftest fuzziness in a wpt-friendly way. It isn't ready to land, but is ready for feedback.

gsnedders commented 5 years ago

Fixed by #12187.