
web-platform-tests RFCs

RFC #165: expected-fail-message in test metadata

Closed. WeizhongX closed this 7 months ago.

WeizhongX commented 9 months ago

Chromium developers want to have a way to track the expected failure message, since a change in how a test fails can also be unintended and suggestive of a problem.

Made a proposal for discussion.

Bug: 42034
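
For readers unfamiliar with wptrunner metadata, a minimal sketch of how such a field might look in a test's .ini expectation file. The field name comes from the RFC title; its exact placement, quoting, and the sample message are assumptions for illustration, not the RFC's normative syntax:

```ini
[example-test.html]
  [Subtest that is known to fail]
    expected: FAIL
    # Hypothetical: also pin the message the failure is expected to produce,
    # so that a change in *how* the subtest fails shows up as a new diff.
    expected-fail-message: "assert_equals: expected 250 but got 312"
```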

foolip commented 9 months ago

I am somewhat concerned that this puts additional constraints on test authors. For example, how does this work with a message like `expected 250 got 312`, where the latter can vary? Does the existing baseline setup allow some kind of wildcards?

Currently in Chromium that would lead to a flaky test, which needs to be disabled in TestExpectations; or, if the difference is just between platforms, we have platform-specific -expected.txt baselines. (There are no wildcards.) However, this all seems surprisingly rare. By keeping the same basic scheme, it won't add new constraints on Chromium developers or other test authors.
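
For context, Chromium's -expected.txt baselines record the full testharness output of a test, including failure messages, so any change in a message shows up as a baseline diff. A rough sketch of what such a baseline looks like (the exact formatting has varied between versions of the harness):

```
This is a testharness.js-based test.
PASS first subtest
FAIL second subtest assert_equals: expected 250 but got 312
Harness: the test ran to completion.
```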

jgraham commented 9 months ago

> However, this all seems surprisingly rare.

That is very surprising to me, because I can think of multiple features with a lot of tests with failure messages like that. And certainly the audio tests that dump floating-point numbers into the test names (despite the fact that this is a violation of the test suite semantics) cause no end of problems for us.

Anyway, my point of view is that this is broadly up to you, in the sense that if Chromium wants this feature as proposed then it's fine to go ahead and implement it. My only concern is that it shouldn't later be used as a reason to make test failure messages less useful / more consistent (e.g. by removing variable data from the output). If you do find yourselves wanting that, you should instead figure out a way to relax these conditions (e.g. having a way to not update the annotations for tests where the output is not consistent, or a way to specify a regex to match against the expected output).

Another concern is that people start treating the test failure message as an actual part of the API, so instead of actually writing subtests they just dump a bunch of textual data into the failure message and assume that's good enough to distinguish which cases passed and which failed. I hope the risk of that is small, since it hasn't happened so far, but if it does happen I think we'd need to consider how to prevent it.
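
To illustrate the anti-pattern being described versus the intended use of subtests (the `cases` and `check` helpers are made up for the example):

```js
// Anti-pattern: a single test whose failure message carries per-case data,
// relying on the message text to tell which cases failed.
test(() => {
  const failing = cases.filter(c => !check(c));   // hypothetical helpers
  assert_equals(failing.length, 0, "failing cases: " + failing.join(", "));
}, "all cases behave correctly");

// Intended use: one subtest per case, so each result is tracked separately.
for (const c of cases) {
  test(() => assert_true(check(c)), "case: " + c);
}
```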

WeizhongX commented 9 months ago

> My only concern is that it shouldn't later be used as a reason to make test failure messages less useful / more consistent (e.g. by removing variable data from the output).

"expected-fail-message" is an optional field, so when not used there is not any requirement on the fail messages.

> Another concern is that people start treating the test failure message as an actual part of the API, so instead of actually writing subtests they just dump a bunch of textual data into the failure message

There is such a risk, but the failure message is implementation-specific, so tests written this way would not work in a different browser and would thus be surfaced pretty soon?

As I said in the bug (42034), we are already running WPTs this way in Chromium, and the error message is mandatory there, so this is unlikely to make the situation worse.

jgraham commented 9 months ago

"expected-fail-message" is an optional field, so when not used there is not any requirement on the fail messages.

The concern is that people intentionally sacrifice usefulness for consistency, e.g. replacing `assert_greater_than(x, 200)` with `assert_true(x > 200)`, because the latter produces a message that's independent of the value of x, whereas the former embeds useful (but possibly variable) data in the message.
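
A concrete sketch of the two styles in testharness.js; the exact failure-message wording in the comments is an assumption based on typical testharness output, and `measureSomething` is a made-up helper:

```js
// Embeds the actual value in the failure message, e.g.
// "assert_greater_than: expected a number greater than 200 but got 150".
// More useful for debugging, but the message changes whenever x changes.
test(() => {
  const x = measureSomething();
  assert_greater_than(x, 200);
}, "value exceeds 200 (informative message)");

// Always fails with the same message ("assert_true: expected true got false"),
// which is stable to match against but discards the actual value of x.
test(() => {
  const x = measureSomething();
  assert_true(x > 200);
}, "value exceeds 200 (stable message)");
```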

This is more of a concern when the failure message looks like an official part of the wpt API vs when it's clearly only part of vendor embeddings.

WeizhongX commented 9 months ago

"expected-fail-message" is an optional field, so when not used there is not any requirement on the fail messages.

> The concern is that people intentionally sacrifice usefulness for consistency, e.g. replacing `assert_greater_than(x, 200)` with `assert_true(x > 200)`, because the latter produces a message that's independent of the value of x, whereas the former embeds useful (but possibly variable) data in the message.

> This is more of a concern when the failure message looks like an official part of the wpt API vs when it's clearly only part of vendor embeddings.

Our automation tool will not make changes to the test; it will simply mark the test as flaky. Will developers favor assert_true over assert_greater_than? I think that is up to each developer. But as I said, we are already using baselines in Chromium, so developers may already be using that mechanism to deflake a test. Because expected-fail-message is optional, we now give them a second option: continue to use assert_greater_than and simply omit expected-fail-message from the ini file. If developers think the error message is not interesting, they should just not use expected-fail-message, rather than changing the test.

The intention of expected-fail-message is to provide a way to assert behavior. In that sense assert_greater_than carries more information and would be preferred when using expected-fail-message. This matches the case I mentioned in bug 42034.
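
In other words, under this proposal both of the following would remain valid for known-failing subtests, and whether to pin the message is left to the developer (an illustrative .ini fragment, with the same caveats about exact syntax as the sketch above):

```ini
[example-test.html]
  [subtest kept with assert_greater_than, message not pinned]
    expected: FAIL

  [subtest whose failure behavior we want to assert]
    expected: FAIL
    expected-fail-message: "assert_greater_than: expected a number greater than 200 but got 150"
```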

mfreed7 commented 9 months ago

> I am somewhat concerned that this puts additional constraints on test authors. For example, how does this work with a message like `expected 250 got 312`, where the latter can vary? Does the existing baseline setup allow some kind of wildcards? But once we have wildcards we can't auto-update the metadata (or at least can only update to literals, which seems insufficient).

This is precisely the thing we'd like to be able to test for. If the test "usually" says `expected 250 got 312` but, as the result of some code change, starts saying `expected 250 got 313`, then we'd like to catch that and triage it appropriately. I don't think it places additional constraints on test authors - they should keep doing this! It's helpful. Note that in both cases this will still be a failure, just one whose change can now be detected.

mfreed7 commented 9 months ago

> Anyway, my point of view is that this is broadly up to you, in the sense that if Chromium wants this feature as proposed then it's fine to go ahead and implement it. My only concern is that it shouldn't later be used as a reason to make test failure messages less useful / more consistent (e.g. by removing variable data from the output). If you do find yourselves wanting that, you should instead figure out a way to relax these conditions (e.g. having a way to not update the annotations for tests where the output is not consistent, or a way to specify a regex to match against the expected output).

Totally agree with this - adding this capability shouldn't be used as a reason to change test output text.

jgraham commented 9 months ago

If the test "usually" says expected 250 got 312 but as the result of some code change starts saying expected 250 got 313 then we'd like to catch that and triage it appropriately.

The cases I was thinking of are things like timing tests, where the output is inherently unstable (e.g. if some timing property is supposed to be 0 but due to a bug you set it to some actual time interval, the actual interval is going to be highly dependent on the specific environmental conditions and so will vary from run to run). With e.g. layout tests it seems likely that things are more stable, but even for fuzzy annotations we allow a range of values.
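
A minimal sketch of the kind of timing test being described, where the value that ends up in the failure message is inherently run-dependent (the entry-fetching helper is made up):

```js
promise_test(async () => {
  const entry = await getTimingEntrySomehow();  // hypothetical helper
  // If a bug makes this non-zero, the reported value depends on machine load,
  // so the failure message ("expected 0 but got 3.2" vs "... but got 17.9")
  // differs from run to run and cannot be pinned to a single string.
  assert_equals(entry.duration, 0, "duration should not be exposed");
}, "duration is zero");
```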

As I've said, I'm not confident that this doesn't create pressure to elide such information from tests if the implementation is both doing the failure-message comparison and happens to fail the test at the point of creation. But if you're already doing this anyway, and are going to continue to do so, it doesn't seem obviously worse for it to be in wptrunner itself.