web-platform-tests / wpt


Testing test262 in different (web platform) agents #8308

Open annevk opened 7 years ago

annevk commented 7 years ago

Over in https://github.com/dpino/gecko-dev @dpino is working on ensuring that various SharedArrayBuffer tests from test262 run across the various agents defined by the web platform. A follow-up goal is to run all test262 tests across the various agents to ensure there are no weird bugs in the various JavaScript engines.

The idea is to host this "wrapper test suite" in web-platform-tests so all user agents can benefit.

If anyone has thoughts, ideas, or concerns that'd be great to hear.

cc @jgraham @domenic @foolip @ljharb @leobalter

(Corresponding issue: https://github.com/dpino/gecko-dev/issues/21.)

foolip commented 7 years ago

Is it important that everyone who uses web-platform-tests also gets test262 as part of it, or would it suffice if the tests are run on the same setup as for wpt.fyi and published either on wpt.fyi or a test262 results dashboard?

annevk commented 7 years ago

As I understand it, test262 attempts to be host-agnostic, just like ECMAScript itself. So while the web platform has many agents, other hosts might have just one. If we want to run those tests in a window, a worker, a shared worker, or a combination thereof (in the case of SharedArrayBuffer), I think that has to happen on the web-platform-tests side.

Various JavaScript engines can also run test262 directly, but that doesn't exercise quite the same code paths as running them through web platform agents.

foolip commented 7 years ago

Oh, so you're saying we'd run the tests at least in a window and worker context?

annevk commented 7 years ago

@foolip ideally all agents, including worklets (though only possible for audio worklets I think), service workers, and shared workers. That's the long term goal.

The short term goal is making sure SharedArrayBuffer tests are tested across all agent combinations, which similarly requires this kind of wrapper setup.
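To make the agent-combination requirement concrete, here is a minimal sketch (my illustration, not code from the thread) of a window and a dedicated worker sharing one SharedArrayBuffer; the worker file name and body are hypothetical, and in current browsers the page would also need to be cross-origin isolated:

```js
// A window and a dedicated worker observing the same shared memory.
const sab = new SharedArrayBuffer(4);
const view = new Int32Array(sab);

const worker = new Worker("sab-worker.js"); // hypothetical script, shown below
worker.postMessage(sab); // the memory is shared with the worker, not copied

worker.onmessage = () => {
  // Both agents now see the value the worker stored.
  console.assert(Atomics.load(view, 0) === 42);
};

// sab-worker.js (hypothetical):
//   onmessage = (e) => {
//     const view = new Int32Array(e.data);
//     Atomics.store(view, 0, 42);
//     postMessage("done");
//   };
```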

annevk commented 7 years ago

Perhaps reading https://gist.github.com/annevk/b15a0a9522d65c98b28fb8c6da9f0ae5 helps.

foolip commented 7 years ago

Thanks, that does help. It seems like a good start would be to pick a browser, write a wrapper for wpt, and run the tests against the similar-origin window agent using wpt run, then see if there are any differences from the results of the same tests run directly against the JS engine. Then also run against the other agents and see what other differences show up.

Most likely, new bugs will be revealed. Depending on how many bugs, the tradeoff between running the same tests many times vs. finding bugs might look different.

How long does it currently take to run all of the tests?

dpino commented 7 years ago

Hi @foolip. I made an attempt at what you suggested: https://github.com/dpino/gecko-dev/pull/2/commits/9641de06bae7bab0039223d2fd010e42c24ccb30

Basically it's a Perl script that prints out a WPT test with a customized list of test262 tests to run. In that commit I only support a DedicatedWorker, although it could be extended to other types of workers. The main issue with this approach was that it required writing wrappers for things that test262 uses (for instance, the assert commands are slightly different from what WPT supports), and more importantly, when I tried to build up a long list of tests to run, the whole test timed out.
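To illustrate the assert mismatch: a shim along these lines could map the test262 harness asserts onto testharness.js ones. The test262-side names (assert, assert.sameValue, assert.throws) are real, but the mapping below is a sketch of mine, not the script's actual code:

```js
// Hypothetical shim: implement the test262 harness asserts on top of
// testharness.js so an unmodified test262 test body can run inside test().
function assert(value, message) {
  assert_true(!!value, message);
}
assert.sameValue = (actual, expected, message) =>
  assert_equals(actual, expected, message); // assert_equals also uses SameValue
assert.notSameValue = (actual, expected, message) =>
  assert_not_equals(actual, expected, message);
assert.throws = (errorCtor, fn, message) => {
  try {
    fn();
  } catch (e) {
    assert_true(e instanceof errorCtor, message);
    return;
  }
  assert_unreached(message || "expected an exception");
};
```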

Although this approach can be interesting for trying out test262 in a browser, it does not sound like the right approach to me.

Unfortunately the laptop I was using for this work crashed today, so I cannot check how long it takes to run the whole test262 or wpt suite. I will post those numbers once my laptop gets fixed (hopefully in a day or two).

jgraham commented 7 years ago

I think this makes sense. I think WASM might do something similar. The details of how the integration should work are unclear to me; how will web-platform-tests be kept in sync with the test262 tests?

dpino commented 7 years ago

@foolip Running all of test262 on my laptop takes around 4 minutes. Not very useful information; I suppose you were asking about the time spent running all the tests as part of a CI infrastructure or similar.

$ ./tests/jstests.py build_OPT.OBJ/dist/bin/js test262
[27706|    0|    0| 1041] 100% ======================================>| 233.0s
PASS

I gave your suggestion a try (a wrapper that relies on wpt run to launch a test262 test in the browser). I pushed the changes to a remote branch: https://github.com/dpino/web-platform-tests/tree/test262-runner

I have several questions regarding web-platform-tests. Ideally, I think the test262 suite should be run by opening a browser and running all the tests in that same browser instance. With the wrapper above, each test launch opens and closes a new browser, so running the whole suite takes a very long time (even more so since the suite should run on different agents). Another approach could be to group several test262 tests together into a single WPT test. I don't know if it would be possible to have one single browser instance where every test is run, with the browser communicating the results back to the command shell.

jgraham commented 7 years ago

web-platform-tests generally work with one instance of the browser running multiple tests.

The most obvious way to do this integration would be to generate testharness.js wrappers for the test262 tests and check in the generated files. These would then run like any other testharness.js test. It looks like that's more or less what's on your branch, but you don't add all the generated files at once, and you call wpt run for every test rather than just once.
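To make the shape concrete, a generated wrapper might look roughly like this (a sketch; the file paths and test name are illustrative, and it relies on the fact that test262 asserts throw on failure, which test() reports as FAIL):

```html
<!-- Hypothetical output of the generator for one test262 test. -->
<!doctype html>
<meta charset="utf-8">
<title>test262: built-ins/Array/length/propertyDescriptor.js</title>
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
<script src="/test262/harness/assert.js"></script>
<script src="/test262/harness/sta.js"></script>
<script>
test(() => {
  // The body of the test262 test is inlined here by the generator;
  // its asserts throw on failure, which testharness.js reports as FAIL.
}, "built-ins/Array/length/propertyDescriptor.js (non-strict)");
</script>
```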

There are more complex solutions we could imagine in which the templates are baked into the server like with .worker.js files. I don't know if that's worthwhile.
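For comparison, a .worker.js test is, as I understand the convention, just a script for which the server generates the HTML wrapper automatically; something like:

```js
// example.worker.js: the wpt server serves a generated example.worker.html
// that loads this script via fetch_tests_from_worker(new Worker(...)).
importScripts("/resources/testharness.js");

test(() => {
  assert_equals(typeof SharedArrayBuffer, "function");
}, "SharedArrayBuffer exists in a dedicated worker");

done(); // required to signal completion in worker tests
```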

dpino commented 6 years ago

Thanks @jgraham for the clarification. Initially I thought web-platform-tests launched a new browser per test, but I was wrong.

I've updated the script quite a bit. Now I just use the script to generate the WPT wrappers from test262 test files and run them externally as normal WPT tests.

OTOH, some of the tests were failing or timing out. The issue was that some test262 tests modify built-in objects such as Array, and that had a side effect on the web-platform-tests harness code. So I actually need to parse the source of the test and add code to undo the change once the test is over. Anyway, still struggling with this.
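A sketch of that undo idea, assuming a hypothetical helper that executes the inlined test body: snapshot the property descriptor the harness depends on and restore it afterwards.

```js
// Illustrative only: protect a built-in the harness relies on from a
// test262 test that redefines it, then restore the original afterwards.
const saved = Object.getOwnPropertyDescriptor(Array.prototype, "push");
try {
  runTest262Body(); // hypothetical: executes the inlined test262 source
} finally {
  Object.defineProperty(Array.prototype, "push", saved);
}
```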

annevk commented 6 years ago

Perhaps an alternative approach is to load the test262 test in an <iframe> and then use onload to inspect the result? It might not be as nice though, and come to think of it, it would not work in a worker and such. It seems those kinds of tests would be rather hard to do properly with a harness.
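A rough sketch of what that could look like; the way the framed page exposes its outcome (a __test262Result global) is invented purely for illustration:

```js
// Run one test262 test in a fresh frame and inspect the outcome on load.
async_test((t) => {
  const iframe = document.createElement("iframe");
  iframe.src = "generated/some-test262-test.html"; // hypothetical page
  iframe.onload = t.step_func_done(() => {
    // Assume the generated page records its outcome on a global.
    const result = iframe.contentWindow.__test262Result;
    assert_true(result.passed, result.message);
  });
  document.body.appendChild(iframe);
}, "a test262 test run in an iframe");
```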

dpino commented 6 years ago

@annevk I can try running the test in an iframe, at least for the same-origin window, and see if I get more tests passing. Right now, running the test262/built-ins directory, which is the largest test262 directory, I get 1000 failing tests and 35 timeouts. Maybe some tests fail because they rely on a JS shell feature that's missing in the browser (not all of them are implemented yet). I would need to look more into the failing tests.

The good thing about running the tests in the browser as web-platform-tests is reusing all the infrastructure for running tests and retrieving reports. But everything that has to do with instrumentation (Selenium/Marionette) is actually not useful for this case, IMHO. @jugglinmike told me about https://github.com/bterlson/test262-harness, a Node.js tool for running test262 in the browser (there's also https://github.com/bakkot/test262-web-runner). So maybe a similar tool that uses a WebSocket to communicate the results from the browser to a server process could be another approach. I don't know, does it make sense? For the moment, I'm going to keep trying this approach.
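The reporting side of that idea might look something like this; the endpoint and message shape are invented, not part of test262-harness or wpt:

```js
// Sketch only: push one result from the browser to a hypothetical collector.
const ws = new WebSocket("ws://localhost:9999/report"); // hypothetical server
ws.onopen = () => {
  ws.send(JSON.stringify({
    test: "built-ins/Array/length.js",
    agent: "DedicatedWorker",
    status: "PASS",
  }));
};
```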

annevk commented 6 years ago

I'm not sure, I'm not familiar enough with all the harnesses. I'm curious if @bakkot has looked into running test262 in a worker environment.

dpino commented 6 years ago

I have a first version of the tests running. I reworked the script to run the tests inside an IFrame. Then I added support for other agents: child Window, DedicatedWorker, and SharedWorker. ServiceWorker is not supported yet; more on that later.

I used the results of Test262-Web-Runner as a baseline to compare my results against. I ran the tests on Firefox Nightly 59.0a1. First of all, here are the results for Test262-Web-Runner:

Test262-Web-Runner

| Test | Ran | Failed |
| --- | --- | --- |
| annexB | 977/1003 | 26 |
| built-ins | 12743/13446 (skipped 32) | 703 + 32 |
| harness | 94/94 | 0 |
| intl402 | 231/236 | 5 |
| language | 13917/14822 | 905 |

And here are the results of the web-platform-tests's wrappers for test262 (only IFrame in this benchmark):

| Test | Ran | Expected results | Failed |
| --- | --- | --- | --- |
| annexB | 2263 (1003 parents, 1260 subtests) | 2230 | 33 (FAIL: 33) |
| built-ins | 40188 (13478 parents, 26710 subtests) | 38748 | 1440 (FAIL: 1440) |
| harness | 275 (94 parents, 181 subtests) | 275 | 0 |
| intl402 | 708 (236 parents, 472 subtests) | 698 | 10 (FAIL: 10) |
| language | 43243 (14898 parents, 28345 subtests) | 41559 | 1684 (FAIL: 1684) |

This summary cannot be compared directly with the results of Test262-Web-Runner. By default, test262 tests are executed both in strict mode and non-strict mode, unless a flag (onlyStrict, noStrict) indicates otherwise. So for each test, the WPT wrapper normally runs two actual tests, and when a test fails it likely counts as two failing tests. A test that fails in Test262-Web-Runner, on the other hand, counts only once.
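For clarity, this is how the doubling comes about (per test262's run conventions, where the strict copy of a test gets "use strict"; prepended); the loader below is a stand-in, not the wrapper's real code:

```js
// One test262 source produces two testharness.js tests.
const testSource = "/* inlined body of one test262 test */";

test(() => (0, eval)(testSource), "some-test.js (non-strict)");
test(() => (0, eval)('"use strict";\n' + testSource), "some-test.js (strict)");
```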

So to actually compare the WPT results and Test262-Web-Runner I need to normalize the results using an expression like the following:

$ grep "FAIL IFrame" annexB.output | cut -d : -f 2 | sort -u | wc -l

Here are the normalized results for IFrame:

| Test | Ran | Failed |
| --- | --- | --- |
| annexB | 1003 | 26 |
| built-ins | 13446 | 720 |
| harness | 94 | 0 |
| intl402 | 236 | 5 |
| language | 14822 | 827 |

The results are almost the same as Test262-Web-Runner (I just noticed the results for 'language' are much worse, although I used to get better results in other runs; I will look into that **). Then I started to add support for the other agents. Here are the results for each type of agent:

** 08/01/2018: The values are updated now.

Window

| Test | Ran | Failed |
| --- | --- | --- |
| annexB | 1003 | 26 |
| built-ins | 13446 | 720 |
| harness | 94 | 0 |
| intl402 | 236 | 5 |
| language | 14822 | 833 |

Worker

| Test | Ran | Failed |
| --- | --- | --- |
| annexB | 1003 | 69 |
| built-ins | 13446 | 1043 |
| harness | 94 | 1 |
| intl402 | 236 | 6 |
| language | 14822 | 3827 |

SharedWorker

| Test | Ran | Failed |
| --- | --- | --- |
| annexB | 1003 | 69 |
| built-ins | 13446 | 1059 |
| harness | 94 | 2 |
| intl402 | 236 | 6 |
| language | 14822 | 3907 |

Regarding ServiceWorker, the reason I left it out for the moment is that for the currently supported agents I generate the tests on the fly (either an HTML page for IFrame and Window, or a JavaScript file for DedicatedWorker and SharedWorker) using a Blob object. However, it's not possible to create service workers on the fly this way for security reasons. One possible workaround would be to generate the ServiceWorker files for each test beforehand. The con is that it would double the total number of files, but I think it would work.
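The on-the-fly generation described above, in miniature (a sketch; the inlined body is a placeholder). Service workers cannot be registered from blob: URLs, which is why that agent needs pregenerated files:

```js
// Build the worker source as a string and load it through a Blob URL.
const harness = new URL("/resources/testharness.js", location.href).href;
const source = `
  importScripts(${JSON.stringify(harness)});
  test(() => { /* inlined test262 body */ }, "example");
  done();
`;
const blobURL = URL.createObjectURL(
  new Blob([source], { type: "text/javascript" })
);

fetch_tests_from_worker(new Worker(blobURL));          // DedicatedWorker
// fetch_tests_from_worker(new SharedWorker(blobURL)); // SharedWorker variant
```

fetch_tests_from_worker is testharness.js's existing mechanism for collecting results from worker-based tests, so the parent page needs no extra plumbing.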

dpino commented 6 years ago

I fixed the issue that affected the results of the 'language' test block. The values are updated now.

dpino commented 6 years ago

I have pushed a PR with the script to generate the WPT wrappers, as well as the harness code to run the tests. The PR is not ready to be merged yet, but I think it can be a starting point for getting feedback and discussing what's pending. PTAL: https://github.com/w3c/web-platform-tests/pull/8980