Reftest simplification - Githubissues

gsnedders commented 5 years ago

jgraham commented 5 years ago

@fantasai will likely have opinions here.

From my point of view; the original design in wpt was that we just considered AND to be the only condition and didn't have multi-file chains at all. That's much simpler since you only need to examine a single file to determine what to run.

If there is consensus that cases involving both AND and OR shouldn't be allowed, then I would much rather regain that simplicity by requiring that all links go in the test file using a property in the file to determine whether we are using AND or OR semantics e.g. <meta name="ref-combine" content="AND">. Having everything about the test determined by reading a single file is a huge simplication and performance win for several pieces of infrastructure; the test runner, the lint, the manifest generator, etc.

Even if the consensus is that we do have valid use cases for both AND and OR in a single test, I'd like to explore ways to allow expressing that within the confines of a single file e.g.

<link rel="match" href="ref1-1.html" title=group1>
<link rel="match" href="ref1-2.html" title=group1>
<link rel="match" href="ref2-1.html" title=group2>

Yes this is a bit of a semantic abuse, but it's a strawman proposal to give an idea of the kinds of syntax that would make this feature easier to support and easier to author without losing flexibility, if we determine that flexibility is actually required.

gsnedders commented 5 years ago

After landing https://github.com/web-platform-tests/wpt/pull/15523, we now only have three tests with more than two outward links (all having three):

/css/CSS2/box/rtl-linebreak.xht
/css/css-fonts/font-default-04.html
/css/css-transforms/transform-compound-001.html

The first and the last of these look like they're again meant to be using AND and not OR; only the second looks like the correct use of the feature. The first had the link added in https://github.com/web-platform-tests/wpt/commit/2820fcbf0751eece2cdc2a557002bdd6cf0d571a (and this represents the multiple lines in the reftest.list file at the time), so that seems like that conversion was wrong (depending on how the semantics of the link was specified then); the third seems to have been wrong from when it was originally authored.

It might be worthwhile going through the other 108 tests with alternates and seeing how many of those should also be AND.

Certainly we should expect alternates to be rare, because cases where we allow multiple behaviours should be rare.

gsnedders commented 5 years ago

Myself and @jgraham had some discussion whereby both of us are in principle okay with getting rid of the AND option entirely (AFAIK, nothing outside of WPT has this; all the browser-specific things are two-page only); we essentially have the AND option there as a matter of making explicit some dependencies between tests, which we don't have in general, and doesn't seem to have ever posed much actual problem.

jgraham commented 5 years ago

To be clear, whilst I'm definitely OK with that, I'm not convinced it's my favoured option compared to allowing complex tests to be defined in a single file. But I think feedback from reftest authors on requirements is the key thing we're missing here.

jgraham commented 5 years ago

https://lists.w3.org/Archives/Public/public-test-infra/2014JulSep/0036.html is the thread from the last time we discussed this and changed to the current system.

gsnedders commented 5 years ago

https://lists.w3.org/Archives/Public/public-test-infra/2016OctDec/0027.html was some earlier discussion from chains, through from memory we didn't actually do anything on that occasion.

jgraham commented 5 years ago

This reached the initial timeout for going ahead with uncontroversial RFCs. In this case I think the fact we know we are missing input from key constituencies around requirements means that we should take the position there are substantive remaining issues that prevent the RFC merging by default.

fantasai commented 5 years ago

Heya. So, I have to admit I don't really understand what's happening in this issue. Here's a few comments that might be relevant?

I've seen cases where having OR was necessary (we were in fact discussing one just last week), and also cases where having AND was useful (to help ensure a single bug didn't affect both the test and its reference, iirc there were some text-indent tests set up like that). Don't recall any cases where both relationships were necessary, but I expect most of those could split out into multiple tests (by duplicating one of the files with different metadata) if necessary.

I think using a meta tag to distinguish AND vs OR listings of match links is error-prone and confusing.

Putting all AND references into the test file seems not great because we have many cases where a reference file is shared across multiple tests: if it has its own verification, any link into the reference should be picking that up. We don't want to have to update lists of references in multiple tests if someone adds a backup reference to a reference.

FWIW, the Mozilla reftest infrastructure (last time I looked at it which was years ago) treated each edge in the graph as a test, rather than treating a particular node as a test, and the convention was the first node named the test and the second the reference. (In our system it would be the outward-pointing file of the edge names the test.) Maybe it's helpful to take that approach if the current one seems complicated? You'd have to group alternates together, but otherwise each test would simply be a pair of files.

jgraham commented 5 years ago

Heya. So, I have to admit I don't really understand what's happening in this issue. Here's a few comments that might be relevant?

Thanks, this is helpful.

I've seen cases where having OR was necessary (we were in fact discussing one just last week), and also cases where having AND was useful (to help ensure a single bug didn't affect both the test and its reference, iirc there were some text-indent tests set up like that). Don't recall any cases where both relationships were necessary, but I expect most of those could split out into multiple tests (by duplicating one of the files with different metadata) if necessary.

I think using a meta tag to distinguish AND vs OR listings of match links is error-prone and confusing.

One of the key things that @gsnedders has found is that the current setup is misused very often, presumably because people are bad at understanding what chain of refs is being built up across multiple files. At least to me the evidence is convicing that localising the test metadata into a single file would be an improvement over the multi-file graph approach.

Putting all AND references into the test file seems not great because we have many cases where a reference file is shared across multiple tests: if it has its own verification, any link into the reference should be picking that up. We don't want to have to update lists of references in multiple tests if someone adds a backup reference to a reference.

I think I agree that there's a tradeoff here, just not what the correct option is :) There might be something we could do to help like lint to ensure that all ref chains with a common node have the same subtree rooted on that node. Or there could be some way to include an external file with a list of references like

test.html

<link rel="match" href="reftest.reflist">

reftest.reflist

match ref1.html
  match ref1-and.html
match ref1-or.html
  mismatch ref2.html

That's kind of unfortunate because it creates a whole different syntax for complex cases, and means reading 2 files rather than 1, but it is a class of approaches to address this kind of concern.

FWIW, the Mozilla reftest infrastructure (last time I looked at it which was years ago) treated each edge in the graph as a test, rather than treating a particular node as a test, and the convention was the first node named the test and the second the reference. (In our system it would be the outward-pointing file of the edge names the test.) Maybe it's helpful to take that approach if the current one seems complicated? You'd have to group alternates together, but otherwise each test would simply be a pair of files.

That seems like the "don't support AND at all, just model each pair as a seperate test" option? Certainly having multiple links mean OR and no support for single tests with multiple required matches is a workable system that addresses the main requirements, as long as people are happy with the fact that in some set of tests that currently look like A == B == C, D == B == C, E == B == C we end up with A == B, B == C, D == B, E == B so a failure of B == C starts failing one test not all tests.

gsnedders commented 5 years ago

I think using a meta tag to distinguish AND vs OR listings of match links is error-prone and confusing.

As @jgraham says, people are misusing the current setup, so we probably need something better than the status quo.

Putting all AND references into the test file seems not great because we have many cases where a reference file is shared across multiple tests: if it has its own verification, any link into the reference should be picking that up. We don't want to have to update lists of references in multiple tests if someone adds a backup reference to a reference.

Yes, there are definite benefits to having the links in references. Some of the transforms tests were wanting to do things like A == A-ref && A == B && A != C (&& B == B-ref && B-ref == B-ref-ref).

FWIW, the Mozilla reftest infrastructure (last time I looked at it which was years ago) treated each edge in the graph as a test, rather than treating a particular node as a test, and the convention was the first node named the test and the second the reference.

Then we don't really solve half the problems the current system was designed to solve: notably, we don't have any concept of AND at all then (because then a single bug can affect both test and reference, we'll just have a separate failure for the reference not matching). Now with a decade of hindsight, that might be an acceptable trade-off, especially given the rare usage of it, but it is a much larger change.

gsnedders commented 5 years ago

As an aside, is it worthwhile splitting out the "Test Detection" side of this? All that does is cause the manifest to contain more tests and makes it possible to implement various optimisations. It'd be nice to get that at least landed.

gsnedders commented 5 years ago

As an aside, is it worthwhile splitting out the "Test Detection" side of this? All that does is cause the manifest to contain more tests and makes it possible to implement various optimisations. It'd be nice to get that at least landed.

From IRC:

15:17 < jgraham> gsnedders: Is that what users expect? It seems pretty weird to me 15:21 < gsnedders> jgraham: people get very confused by tests like /css/CSS2/floats/float-nowrap-1.html not appearing, because float-nowrap-2.html links to it as a match 15:22 < gsnedders> jgraham: whereas in m-c you get results for both tests, because obviously they're separate tests 15:24 < jgraham> Are they? Is this a general subset of "people are confused about multi-file graphs"

@emilio upstreamed those tests in bugzilla bug 1485669, so let's ask him:

You converted the following reftests to be WPT tests:

!= float-nowrap-1.html float-nowrap-1-notref.html
== float-nowrap-2.html float-nowrap-1.html
== float-nowrap-3.html float-nowrap-3-ref.html
!= float-nowrap-3.html float-nowrap-4.html
== float-nowrap-4.html float-nowrap-4-ref.html
== float-nowrap-5.html float-nowrap-5-ref.html
== float-nowrap-6.html float-nowrap-5-ref.html
== float-nowrap-7.html float-nowrap-1.html
== float-nowrap-8.html float-nowrap-1.html
== float-nowrap-9.html float-nowrap-3-ref.html

This caused float-nowrap-1.html and float-nowrap-4.html to not appear as tests. To be precise, we get the following graph:

float-nowrap-9 html dot

Per the rules summarised in the RFC above, only the top nodes form tests in WPT (as they are the source nodes of the graph).

So, uh:

Did you expect this?
Is this surprising?
Would you rather float-nowrap-1.html was also a test (without changing anything else, if it failed so would -2, -7, and -8 by virtue of inclusion)?

emilio commented 5 years ago

I'm a bit surprised, yeah, though I get why. As a test author I think any test that has a <link rel="match"> or <link rel="mismatch"> should appear as its own test.

jgraham commented 5 years ago

My main objection to splitting the RFC is user fatigue if we make one change to the rules now and a second change in a month.

foolip commented 5 years ago

On test detection, @gsnedders suggests the simple rule that anything that's not determined to be a ref by filename convention is a test. I think that shouldn't require any more pages to be loaded to run the tests, and would only change how results are reported to include more tests. That sounds low risk and strictly like an improvement.

On reftest graph construction, I think the important goal is that it's possible to find all refs and construct the graph by starting at a test file and following links, possibly recursively. In other words, that incoming links don't matter. For ease-of-understanding it also matters what amount of complexity is possible in reftest graphs. However, that seems separable from how the graph is expressed in file(s).

@gsnedders do you think there's value in placing constraints on the reftest graph unless we go all the way to allowing only test-to-ref comparisons? Is there any upside to disallowing cycles if trails of ref-to-ref comparisons are still allowed?

gsnedders commented 5 years ago

I think that shouldn't require any more pages to be loaded to run the tests, and would only change how results are reported to include more tests. That sounds low risk and strictly like an improvement.

In the current wptrunner implementation, that relies on caching behaviour; we don't actually guarantee it (and notably we'll run float-nowrap-1.html and float-nowrap-2.html as separate tests, just hopefully reusing the cached screenshots).

do you think there's value in placing constraints on the reftest graph unless we go all the way to allowing only test-to-ref comparisons?

What sort of constraints? I think we need some to make it easier to explain failures; I'm not sure test-to-ref actually makes much difference here.

foolip commented 5 years ago

@gsnedders I mean is there any reason to try to avoid loops or anything else the reftest graphs as long as it's possible to discover the structure by starting at the test?

foolip commented 5 years ago

I asked @tabatkins about this and he said he's never written more complicated reftests and so doesn't have an opinion either way.

frivoal commented 5 years ago

I'm supportive of the test detection part
I don't think fancy graphs are useful, and even they are in some rare case, that's not enough to justify the system's complexity.
I do think ORs are useful (I've written test cases when allowing alternate behavior was needed)
I also think ANDs are nice to have. They can probably be rewritten differently, but that may make the test quite convoluted
the RFC preserves all that, so I'm ok with it
I do worry a little that the now-invalid graphs are non obvious, so the linter should warn against them, otherwise we may have some silent failures
I don't have a strong opinion about https://github.com/web-platform-tests/rfcs/pull/15#issuecomment-467064748 <meta name="ref-combine" content="AND">. Current syntax is lightweight, but apparently confusing. This is verbose, but presumably less confusing, at least if it is an error that the linter will warn about to have two or more refs without having a ref-combine meta. If the linter doesn't say anything, one of AND and OR is presumably the default, and that's just as confusing as the status quo. I'd probably lean towards the status quo, but that may just be due to familiarity. no strong objection as long as this is linter assisted.

Hexcles commented 5 years ago

Now that power users have expressed their opinions (which IMHO are more important), I just want to add in some perspectives as an implementer of the (supporting) infra (especially the wpt.fyi dashboard).

I'd strongly hope we can completely eliminate non-trivial graphs and no longer need to resort to a proper graph model to describe reftests. Concretely, I agree that:

Any file that has reference(s) and is not explicitly marked as a reference itself is a test (i.e. the "test detection" part of the RFC).
Any test has a flat list of references conceptually (*), all of which can be combined with AND or OR but not both (i.e. the "valid graphs" part).

(*) Although dashboards like wpt.fyi do not care about the syntax of the tests (we only care about the manifest and the JSON report), I personally as a user would prefer that the list of references to be syntactically flat as well (i.e. multiple <link> with a <meta>). The tradeoff is a bit more work (having to repeat frequently used groups), but I think that's less error-prone than having to trace across files.

gsnedders commented 5 years ago

One of the key things that @gsnedders has found is that the current setup is misused very often, presumably because people are bad at understanding what chain of refs is being built up across multiple files.

I wonder if we should just require <meta name="ref-combine" content="AND"> if there's more than one link rel=match/mismatch in a file? i.e., make it required to explicitly state whether AND/OR is intended.

Given there are various advantages of keeping the current ability to chain refs (making sure refs render reasonably, as @fantasai and @frivoal both say), I wonder if we should allow both? That said, that still means we only get the second assertion for one test running with that ref…

I don't know what the right solution here is. :/

gsnedders commented 5 years ago

OK, having given those above several months to respond, my proposal to move this forward:

We simplify the "are you a test or just a node in the graph" to "does the path look like a reference?", given nobody seems to disagree with this.
We require all files with more than one <link rel=match> to have a <meta name=ref-combine> to force users to actually do what they intend (i.e., we throw a linter error for every such case, but maintain the current default of OR).

I don't think we have a clear answer as to whether we want to disallow anything more complex (i.e., mixed AND/OR), but I'm also not so convinced that needs to be addressed now, given they scarcely exist. (In a sense, the majority of the issue caused by them is that they're hard to explain when tests fail, but we've never done that properly and we've seen very few, if any, complaints about that.)

If we move to relying on <meta name=ref-combine>, I start to feel more strongly that we should encode the pass condition in the manifest (maybe only if there's more than a single edge?).

jgraham commented 5 years ago

It seems like from an implementation point of view the ref-combine part of the proposal makes things even worse than now because you have all the same complexities of the current system plus the additional complexity of having to check whether you should be using AND or OR semantics for each file.

Implementation complexity is of course not the only relevant consideration, but there have already been complaints about the current code being over-complex.

foolip commented 5 years ago

@jgraham what are the redundant complexities one could perhaps simplify away? I know there are chains of refs that could represent the same thing as the AND behavior, but is there more?

jgraham commented 5 years ago

The proposal doesn't remove any of the existing features but introduces a new possibility which is that same-file references combine as AND rather than OR. This means that you need all the same logic as today plus logic for single-file AND. Therefore it's strictly more complex than what we already have, without actually allowing anything different. The main advantage is that it makes the same-file behaviour explicit rather than implicit, which should enable authors to more reliably write correct tests.

This tradeoff isn't necessarily a blocker; in general I think people prefer complex implementations that work better for authors than the reverse. But in this specific case people like @gsnedders and @Hexcles have previously complained about the difficulty of writing a correct implementation and validating the wptrunner code so I want to ensure we're all on the same page about what the tradeoffs in the proposal actually are.

In terms of alternatives there seem to be two main proposals:

Require each test define a list of references references combined as AND or OR but not both (i.e. disallow multi-file chains). This is a significant loss of power compared to today, and it's not immediately obvious why you should have to make that choice between AND and OR.
Put the full reftest graph as metadata in the test file. This has the same functionality as today, is faster to process, and simpler to implement and makes authoring fully explicit. The tradeoff is that if you have [A | A' | A''] == B == C then currently you get the full chain defined for each of A, A', and A'' simply by including a ref to B in each of those, and if you add a new requirement C != D all the tests are immediately updated. In this proposal you'd either need to copy the full graph into each file or just have one test include the B == C part and hope people realise that test failing invalidates the results of all the others.

jgraham commented 5 years ago

I guess we could also have an external file like <link rel="reftest-manifest" href="manifest"> where the manifest file would look like (imagine it's actually JSON or YAML or something, if you like)

AND:
  == B
  == D
  != E
OR:
  == F

which could then be reused between tests. In simple cases you could just link to the refs directly so the overhead would only come in some subset of complex situations.

Hexcles commented 5 years ago

I'm not sure I follow completely @gsnedders ' latest proposal:

We simplify the "are you a test or just a node in the graph" to "does the path look like a reference?", given nobody seems to disagree with this.

Example: test-001.html links to test-002.html with a match, which then links to test-ref.html with a match. I understand test-001.html and test-002.html are now both tests (which addresses one of my pain points), but what would test-001.html look like? Does the runner only compare it against test-002.html, or recursively (i.e. test-001.html == test-002.html == test-ref.html)?

If the runner still needs to recursively verify the whole (sub)graph. Then I agree with @jgraham that this actually adds new complexity.

I still prefer to always flatten all the references of a test, which can be combined using AND or OR depending on a meta tag, so that the runner no longer recursively finds references. For the example above, this would mean we have exactly two tests: test-001.html == test-002.html and test-002.html == test-ref.html. If the author wants the AND semantic, they'd have to explicitly link to test-ref.html from test-001.html.

frivoal commented 5 years ago

If we're changing how the tests and their refs must be set up to express different boolean expressions over matching and mismatching, I think we need to be careful about the negative (mistmatch) cases.

A must be different from B, and also must be different from C

is not the same as

A must be different from B which itself must be different from C

The first one is currenty achieved (I believe) by having two mismatch references in the same test, the second by having a chain of mismatch.

fantasai commented 5 years ago

@gsnedders

Given there are various advantages of keeping the current ability to chain refs (making sure refs render reasonably, as @fantasai and @frivoal both say), I wonder if we should allow both? That said, that still means we only get the second assertion for one test running with that ref…

I think it's fine if a reference file with a backup reference ends up getting indexed also as test, fwiw. As long as there's a failure noted somewhere, it can be investigated. There doesn't have to be an explicit chain that fails everything. In a Test == Ref == Ref2 chain, if Ref != Ref2, we don't actually know what's failing to render correctly: whether Test or Ref or Ref2 is failing to render correctly, just something is wrong. That's not too different from Test == Ref, where bad rendering of Ref2 can cause Test to fail even though it should pass. Chaining them up doesn't gain us much, imo. It's just a nice idea. What would be nicer is actual test dependencies: this series of tests is invalid (failing) if you don't pass test Q...

If we move to relying on <meta name=ref-combine>, I start to feel more strongly that we should encode the pass condition in the manifest (maybe only if there's more than a single edge?).

I don't understand this comment.

I would suggest to treat each file with a set of one or more ORed MATCH references and/or ANDed MISMATCH references as a test, which is roughly the semantics we have currently. Multiple MATCH references for a test can be handled through the transitive relationship of matching, but need not be tracked as an AND chain in the system, since this is causing complexity for both authors and tooling and, afaict, isn't needed.

A typical test will have one MATCH reference and zero or more MISMATCH references, all of which are required to pass. A test for behavior in which the spec grants various allowances is the only case where multiple MATCH references would exist. References can have backup references: they basically become tests themselves in this case.

gsnedders commented 5 years ago

Let's try and get a better idea of what we actually have in the graph currently, given some of the questions are about what authors expect. (Going to split this up into multiple comments so we actually get fragments for different sections.)

Currently, the reftest nodes with more than one match (all have no mismatch) reference are:

/css/CSS2/backgrounds/background-color-049.xht
/css/CSS2/backgrounds/background-color-054.xht
/css/CSS2/backgrounds/background-color-070.xht
/css/CSS2/backgrounds/background-color-075.xht
/css/CSS2/backgrounds/background-color-090.xht
/css/CSS2/backgrounds/background-color-095.xht
/css/CSS2/backgrounds/background-color-110.xht
/css/CSS2/backgrounds/background-color-115.xht
/css/CSS2/borders/border-bottom-color-030.xht
/css/CSS2/borders/border-left-color-030.xht
/css/CSS2/borders/border-right-color-030.xht
/css/CSS2/borders/border-top-color-030.xht
/css/CSS2/tables/table-anonymous-objects-115.xht
/css/CSS2/tables/table-anonymous-objects-116.xht
/css/CSS2/tables/table-anonymous-objects-117.xht
/css/CSS2/tables/table-anonymous-objects-118.xht
/css/CSS2/tables/table-anonymous-objects-119.xht
/css/CSS2/tables/table-anonymous-objects-120.xht
/css/CSS2/tables/table-anonymous-objects-121.xht
/css/CSS2/tables/table-anonymous-objects-122.xht
/css/CSS2/values/numbers-units-015.xht
/css/CSS2/zorder/z-index-020.xht
/css/css-fonts/font-default-04.html
/css/css-regions/transforms/regions-transforms-013.html
/css/css-text/hyphens/shy-styling-001.html
/css/css-text/white-space/trailing-ideographic-space-003.html
/css/css-text/white-space/trailing-ideographic-space-004.html
/infrastructure/reftest/reftest_or_0.html

Looking through these:

/css/CSS2/backgrounds/*: these are using an or condition to deal with rounding being undefined (though, uh, I think is relying on 24-bit colour).
/css/CSS2/borders/*: these are using an or condition to allow for support of CSS Color 4 v. CSS 2.1
/css/CSS2/tables/*: these seem bogus, and there should only be one of the two refs (and, uh, that's my fault, but see https://github.com/web-platform-tests/wpt/issues/16064)
- /css/CSS2/values/numbers-units-015.xht: this is using an or condition to test behaviour that depends on font support
- /css/CSS2/zorder/z-index-020.xht: this is using an or condition to deal with the "optionally" behaviour allowed in CSS 2.1 Appendix E.
- /css/css-fonts/font-default-04.html: this is using an or condition to deal with alternate glyphs in fonts
- /css/css-regions/transforms/regions-transforms-013.html: this is using an or condition to deal with, uh, something with regions and monolithic blocks
- /css/css-text/hyphens/shy-styling-001.html: this is using an or condition to deal with different allowed text rendering
- /css/css-text/white-space/*: this is using an or condition to deal with text rendering differences
- /infrastructure/reftest/reftest_or_0.html: we have a test to make sure we do what we claim! yay!

So all of these with the exception of /css/CSS2/tables/* are deliberately alternates.

gsnedders commented 5 years ago

The following have at least one match and at least one mismatch ref (this is almost certainly bogus!):

/css/CSS2/box/rtl-linebreak.xht
/css/CSS2/floats/floats-wrap-top-below-bfc-001l.xht
/css/CSS2/floats/floats-wrap-top-below-bfc-001r.xht
/css/CSS2/floats/floats-wrap-top-below-inline-001l.xht
/css/CSS2/floats/floats-wrap-top-below-inline-001r.xht
/css/CSS2/sec5/grouping-000.xht
/css/CSS2/sec5/grouping-001.xht
/css/CSS2/selectors/grouping-002.xht
/css/CSS2/selectors/universal-selector-001.xht
/css/CSS2/selectors/universal-selector-002.xht
/css/css-transforms/transform-background-006.html
/css/css-transforms/transform-compound-001.html
/css/css-transforms/transform-display-001.html
/css/css-transforms/transform-display-002.html
/css/css-transforms/transform-display-003.html
/css/css-transforms/transform-display-004.html
/css/css-transforms/transform-generated-001.html
/css/css-transforms/transform-generated-002.html
/css/css-transforms/transform-inline-001.html
/css/css-transforms/transform-matrix-005.html
/css/css-transforms/transform-matrix-006.html
/css/css-transforms/transform-matrix-008.html
/css/css-transforms/transform-origin-name-001.html
/css/css-transforms/transform-origin-name-002.html
/css/css-transforms/transform-origin-name-003.html
/css/css-transforms/transform-origin-name-004.html
/css/css-transforms/transform-origin-name-005.html
/css/css-transforms/transform-origin-name-006.html
/css/css-transforms/transform-origin-name-007.html
/css/css-transforms/transform-percent-001.html
/css/css-transforms/transform-percent-002.html
/css/css-transforms/transform-percent-003.html
/css/css-transforms/transform-percent-004.html
/css/css-transforms/transform-percent-005.html
/css/css-transforms/transform-percent-006.html
/css/css-transforms/transform-percent-007.html
/css/css-transforms/transform-percent-008.html
/css/css-transforms/transform-rotate-001.html
/css/css-transforms/transform-rotate-002.html
/css/css-transforms/transform-rotate-003.html
/css/css-transforms/transform-rotate-004.html
/css/css-transforms/transform-rotate-005.html
/css/css-transforms/transform-rotate-006.html
/css/css-transforms/transform-rotate-007.html
/css/css-transforms/transform-table-001.html
/css/css-transforms/transform-table-002.html
/css/css-transforms/transform-table-003.html
/css/css-transforms/transform-table-004.html
/css/css-transforms/transform-table-005.html
/css/css-transforms/transform-table-009.html
/css/css-transforms/transform-table-010.html
/css/css-transforms/transform-table-011.html
/css/css-transforms/transform3d-matrix3d-001.html
/css/css-transforms/transform3d-matrix3d-002.html
/css/css-transforms/transform3d-matrix3d-003.html
/css/css-transforms/transform3d-matrix3d-004.html
/css/css-transforms/transform3d-perspective-001.html
/css/css-transforms/transform3d-perspective-002.html
/css/css-transforms/transform3d-perspective-007.html
/css/css-transforms/transform3d-perspective-009.html
/css/css-transforms/transform3d-rotate3d-002.html
/css/css-transforms/transform3d-rotatex-perspective-003.html
/css/css-transforms/transform3d-rotatex-transformorigin-001.html
/css/css-transforms/transform3d-rotatey-001.html
/css/css-transforms/transform3d-scale-001.html
/css/css-transforms/transform3d-scale-002.html
/css/css-transforms/transform3d-scale-003.html
/css/css-transforms/transform3d-scale-007.html
/css/css-transforms/transform3d-translate3d-001.html
/css/css-transforms/transform3d-translatez-001.html
/css/css-writing-modes/text-combine-upright-value-all-002.html
/css/css-writing-modes/text-combine-upright-value-all-003.html
/css/css-writing-modes/text-combine-upright-value-digits2-002.html
/css/css-writing-modes/text-combine-upright-value-digits2-003.html
/css/css-writing-modes/text-combine-upright-value-digits3-001.html
/css/css-writing-modes/text-combine-upright-value-digits3-002.html
/css/css-writing-modes/text-combine-upright-value-digits3-003.html
/css/css-writing-modes/text-combine-upright-value-digits4-001.html
/css/css-writing-modes/text-combine-upright-value-digits4-002.html
/css/css-writing-modes/text-combine-upright-value-digits4-003.html
/css/css-writing-modes/text-combine-upright-value-none-001.html

I'm not going to go through all of these, but to take a few samples:

/css/CSS2/box/rtl-linebreak.xht, /css/CSS2/floats/*: these seems to have imported all of their conditions from the reftest.list file (where they're separate tests), so these should be and.
everything else in /css/CSS2/* should be and.
/css/css-transforms/*: Having looked through a lot of these, these all look like they should be and.
/css/css-writing-modes/*: Again, should be and.

Notably, we have only two tests that have more than one match and one mismatch:

/css/CSS2/box/rtl-linebreak.xht as mentioned above,
css/css-transforms/transform-compound-001.html has multiple mismatch refs that should be combined with and.

gsnedders commented 5 years ago

This leaves us with only one more category: multiple mismatches. These are:

/css/css-text/shaping/reference/shaping-023-ref.html
/css/css-text/shaping/reference/shaping-024-ref.html
/css/css-text/shaping/reference/shaping-025-ref.html
/css/css-writing-modes/full-width-001.html
/css/css-writing-modes/full-width-002.html
/css/css-writing-modes/full-width-003.html

All of those look like they should be and. (I'm not sure why we don't just have match refs for the writing modes ones, but that's another question!)

gsnedders commented 5 years ago

So it seems like the obvious move here is to make the pass function check that one-or-more match references pass, if there are any, and all mismatch references pass.

Of course, as @jgraham says that actually complicates the current situation (where they are currently all or), but does seem to match authorial intent.

At the moment, we don't seem to have any way to model:

A == B
A != C
A != D

I think A == B != C != A is as far as we can go, but we do have cases where the above is the intent.

So the question then becomes one of "do we allow chaining", given that goes from "one of these must pass and all of these must pass" to "uhhh".

I don't have a strong opinion about chaining, and AIUI nor does @jgraham and @fantasai thinks it doesn't gain us much.

For the curious, https://gist.github.com/gsnedders/93f7e5bbc32847e34bb0ae35b385662a gives all the current chains in WPT, grouped into weakly connected component graphs.

jgraham commented 5 years ago

If we don't allow chaining, it seems like everything gets much easier because suddenly we have a list rather than a graph. I think implcit rules about the way we combine match and mismatch seems bad, given that all the implicit rules we have at the moment are just causing confusion. So I refer back to my earlier strawman where we combine with AND by default and require people to specify OR explicitly like:

<link rel="match" href="ref1-1.html" title=group1>
<link rel="mismatch" href="ref1-2.html" title=group1>
<link rel="match" href="ref2-1.html" title=group2>

fantasai commented 5 years ago

So it seems like the obvious move here is to make the pass function check that one-or-more match references pass, if there are any, and all mismatch references pass.

This was documented in https://wiki.csswg.org/test/format#reference-links so it shouldn't be a new conclusion?

gsnedders commented 5 years ago

Then I guess this was got wrong when we first changed how reftests in WPT work, given the intent there was to match that.

foolip commented 5 years ago

@gsnedders what is the status of this? If this is not in a state where the core team can make a decision and that isn't going to change, can you close it and reopen at a later date? Or can we move this forward at TPAC?

jgraham commented 5 years ago

I think planning to discuss this at TPAC is a reasonable idea.

gsnedders commented 5 years ago

https://gist.github.com/gsnedders/2ee57070569e177d973a6736f7d278bb is a current list (as of a12f37b6a04bdbde0902b0fc36b6a46f4782bd06) of tests with and/or/both.

gsnedders commented 5 years ago

Based on that data, we have 118 tests we currently interpret as only having OR; of those, we have seven tests that have multiple mismatches, and 85 tests that have at least one match and at least one mismatch. These account for the majority of tests with OR currently.

To me, this suggests we should move the pass condition to what the CSS WG originally documented (one match must pass if there is any match, plus all mismatch must pass).

Whether we should drop chaining is a separate concern as to how we do grouping on a per-page basis, IMO; we can certainly add a data-ref-group attribute to the link element that alters the default grouping (so you can do AND matches in page), or have a match OR mismatch (which is impossible on the CSS WG model).

jgraham commented 5 years ago

Yeah, I agree that the current model doesn't match what people thought was happening, so there are some buggy tests as a result. One of the original concerns in this issue is that the implementation complexity is too high. Keeping the existing complexity, but adding new complexity where the tree construction depends on match vs mismatch is going to make that worse.

I think I'd rather move entirely to the model where everything is defined in a single file and you use titles or whatever for grouping rather than going via an intermediary step where we keep adding complexity to the existing model because we don't want to make a bigger decision.

gsnedders commented 5 years ago

I'm not opposed to dropping support for the chaining (given nobody seems overly fussed by dropping this, given we'll still have the failure for the ref not matching its ref, etc.), but it sounds like you're opinion is that changing this is something that should only be done in combination with dropping it?

jgraham commented 5 years ago

I would rather not spend time making the existing model more complex before then throwing out the entire model and replacing it with something different. The implementation is not particlarly easy to update; it requires changes in the manifest code, in wptrunner, and in marionette, in addition to changes to all the tests that are using the old syntax/semantics.

foolip commented 5 years ago

Is there anyone who would argue against removing chaining and doing "AND" by having multiple tests? If there aren't objections and it's just a matter of doing the extra work to convert a chunk of existing tests, I'd like that option very much.

gsnedders commented 5 years ago

OK, done a pretty major revision to the RFC after a bunch of discussion at TPAC; I think this covers what we spoke about and I think there's consensus.

cc/ @jgraham @fantasai @frivoal

I am curious as to whether we really need to support explicit grouping, though. I don't think I've seen any example where we actually need it?

gsnedders commented 4 years ago

@web-platform-tests/wpt-core-team should we, at this point, aim to resolve this within a week? We had "substantive disagreement" a long time ago, and I don't think any new evidence has been presented in a very long time, and I think the current draft RFC has broad consensus?

I know @simonsapin raised at TPAC in the SVG WG some question about the ability to do more complex and/or combinations, which are now in the future work section, but it also seems like (both from the above and from discussions at TPAC) than @fantasai and @frivoal are okay with dropping them until we have any actual evidence of it being needed?

gsnedders commented 4 years ago

@Ms2ger I volunteer as tribute.

gsnedders commented 4 years ago

@web-platform-tests/wpt-core-team should we, at this point, aim to resolve this within a week?

It has now been two weeks; there has been no dissent from anyone either in the core team or otherwise, so let's land this. 🎉

foolip commented 4 years ago

Sweet! Is a tracking to implement this also needed, or how many parts will there be to this?

web-platform-tests / rfcs

Reftest simplification #15