web-platform-tests / wpt

Test suites for Web platform specs — including WHATWG, W3C, and others
https://web-platform-tests.org/

Restarting after test failures #13013

Open jugglinmike opened 5 years ago

jugglinmike commented 5 years ago

By default, the WPT CLI responds to "unexpected" results by restarting the browser. In the absence of a test expectations file, this means restarting after every failing test. That extends the time required to run the tests, but it also improves the fidelity of the results.
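
To make the mechanism concrete, here is a toy model of that restart policy. It is only a sketch: `FakeBrowser` and `run_suite` are hypothetical names, not part of the WPT CLI, and statuses are simplified to PASS/FAIL with every test expected to pass (mirroring a run without an expectations file).

```python
class FakeBrowser:
    """Toy browser (hypothetical): a failing test leaves state behind."""

    def __init__(self):
        self.dirty = False
        self.restarts = 0

    def restart(self):
        # A restart discards whatever state earlier tests leaked.
        self.dirty = False
        self.restarts += 1

    def run(self, test):
        # Tests named "bad-*" always fail and pollute the browser state.
        if test.startswith("bad"):
            self.dirty = True
            return "FAIL"
        # Any other test passes only when the browser state is clean.
        return "FAIL" if self.dirty else "PASS"


def run_suite(tests, browser, expected=None, restart_on_unexpected=True):
    """Run tests in order, restarting after any unexpected result."""
    expected = expected or {}
    results = {}
    for test in tests:
        status = browser.run(test)
        results[test] = status
        # Without expectation data, every test is expected to pass.
        if restart_on_unexpected and status != expected.get(test, "PASS"):
            browser.restart()
    return results


tests = ["good-1", "bad-1", "good-2", "good-3"]
with_restart = run_suite(tests, FakeBrowser())
without_restart = run_suite(tests, FakeBrowser(), restart_on_unexpected=False)

print("with restarting:   ", with_restart)
print("without restarting:", without_restart)
```

In this model the tests that follow a failure only pass when the browser is restarted, which is the kind of discrepancy the rest of this report tries to measure.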

The results-collection project allows this behavior when testing Firefox and Chrome, but the TaskCluster configuration in WPT currently disables it. This is one source of discrepancies in results reported by the two systems, so I wanted to get a better understanding of the trade-offs.

For a more direct comparison, I updated TaskCluster to allow restarting on Bocoup's fork of WPT. I triggered two concurrent builds and collected the results (greatly simplified by @jgraham's work on ./wpt tc-download). As a control, I also collected the results from the master commit from which the branch diverged (974489d).

Surprise! It's slower (all durations in minutes)

. | without restarting | with restarting 1 | with restarting 2
--|---|---|---
average task duration | 14.1393108974 | 32.6964528846 | 32.7585753205
maximum task duration | 54.9900333333 | 71.11675 | 87.48785

The maximum is probably more interesting than the average because all the tasks usually run in parallel, and partial result sets aren't useful to us.
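
As a rough illustration of how those two summary rows relate to the raw per-task data, the sketch below converts a few of the durations listed under "Task duration data" into minutes and reports the mean and the maximum for each run. The `summarize` helper is hypothetical, and the raw values are assumed to be milliseconds (which is consistent with the maxima in the table). Only three tasks are sampled, so the means will not match the table, but the maxima do because the slowest task of each run is included.

```python
from statistics import mean

MS_PER_MINUTE = 60_000

# A small sample of rows from the task duration data (assumed to be
# milliseconds): (without restarting, with restarting 1, with restarting 2).
SAMPLE = {
    "chrome-dev-reftest-8": (266471, 790983, 840505),
    "firefox-nightly-testharness-1": (3299402, 4267005, 4704807),
    "firefox-nightly-testharness-4": (1615387, 4179672, 5249271),
}


def summarize(durations_ms):
    """Return (mean, max) of the given durations, converted to minutes."""
    minutes = [d / MS_PER_MINUTE for d in durations_ms]
    return mean(minutes), max(minutes)


# Transpose the rows so that each column (one per run) is summarized.
columns = list(zip(*SAMPLE.values()))
summaries = [summarize(column) for column in columns]

labels = ("without restarting", "with restarting 1", "with restarting 2")
for label, (avg, longest) in zip(labels, summaries):
    print(f"{label}: mean {avg:.2f} min, max {longest:.2f} min")
```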

In terms of correctness, I found 94 discrepancies in the results between the 3 jobs (listed below). However, even just comparing the two "with restarting" jobs (where the conditions were identical), I found 38 discrepancies. My guess is that those are simply unstable tests. Anecdotally, the one test I chose at random (/FileAPI/url/url-in-tags-revoke.window.html) was verified to be unstable via ./wpt run --verify chrome (though Chromium calls it "slow"). If that's right, then restarting currently affects (and presumably corrects) 56 results.

As of 974489d, there are 29804 tests in WPT. A 50% increase in time-to-results for a 0.2% improvement in correctness hardly seems worthwhile. I'm reluctant to be pragmatic about this, though, since the data is the reason we're here.
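
For reference, the arithmetic behind those percentages can be reproduced from the figures quoted above. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the slowdown is computed from the maximum task durations, which is just one reasonable way to define "time-to-results".

```python
# Figures quoted in this report.
total_tests = 29804
discrepancies_all_jobs = 94
discrepancies_between_restart_jobs = 38  # presumed unstable tests

# Results that restarting appears to change, after discounting instability.
affected = discrepancies_all_jobs - discrepancies_between_restart_jobs
affected_share = affected / total_tests

# Maximum task durations in minutes, taken from the summary table.
max_without = 54.9900333333
max_with = (71.11675, 87.48785)
slowdowns = [m / max_without - 1 for m in max_with]

print(f"affected results: {affected} ({affected_share:.2%} of all tests)")
for run, slowdown in enumerate(slowdowns, start=1):
    print(f"with restarting {run}: {slowdown:.0%} longer than without")
```

The two restart-enabled runs give noticeably different slowdowns, so any single figure for the cost is necessarily approximate.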

As a middle ground, we could design a separate job to identify these kinds of discrepancies and schedule it to run less frequently. While we should investigate all of the tests listed below, I think the general problem will recur, so a regular automated reporting process could be worth the initial set-up costs.

94 tests affected by restarting

- /2dcontext/imagebitmap/createImageBitmap-origin.sub.html
- /cookies/http-state/chromium-tests.html
- /cookies/http-state/comma-tests.html
- /cookies/http-state/domain-tests.html
- /cookies/http-state/general-tests.html
- /cookies/http-state/mozilla-tests.html
- /cookies/http-state/name-tests.html
- /cookies/http-state/path-tests.html
- /cookies/http-state/value-tests.html
- /css/css-transitions/properties-value-inherit-001.html
- /css/css-transitions/properties-value-inherit-002.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-001e.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-001o.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-002e.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-002o.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-001e.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-001o.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-002e.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-002o.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-003e.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-003o.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-004e.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-004o.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-002e.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-003e.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-003o.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-005e.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-005o.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-006o.html
- /editing/run/removeformat.html
- /fetch/api/redirect/redirect-count.any.html
- /fetch/api/redirect/redirect-count.any.worker.html
- /fetch/sec-metadata/window-open.tentative.https.sub.html
- /html/editing/editing-0/spelling-and-grammar-checking/spelling-markers-007.html
- /inert/inert-node-is-unfocusable.tentative.html
- /mediacapture-image/MediaStreamTrack-getConstraints-fast.html
- /mediacapture-streams/MediaDevices-getUserMedia.https.html
- /mediacapture-streams/MediaStream-audio-only.https.html
- /offscreen-canvas/the-offscreen-canvas/offscreencanvas.resize.html
- /orientation-sensor/AbsoluteOrientationSensor.https.html
- /orientation-sensor/RelativeOrientationSensor.https.html
- /preload/link-header-preload.html
- /preload/modulepreload.html
- /resource-timing/resource_initiator_types.html
- /resource-timing/single-entry-per-resource.html
- /service-workers/cache-storage/common.https.html
- /service-workers/service-worker/ServiceWorkerGlobalScope/update.https.html
- /service-workers/service-worker/clients-get-client-types.https.html
- /service-workers/service-worker/clients-get.https.html
- /service-workers/service-worker/clients-matchall-client-types.https.html
- /service-workers/service-worker/clients-matchall-exact-controller.https.html
- /service-workers/service-worker/clients-matchall-include-uncontrolled.https.html
- /service-workers/service-worker/clients-matchall-order.https.html
- /service-workers/service-worker/clients-matchall.https.html
- /service-workers/service-worker/controller-on-reload.https.html
- /service-workers/service-worker/installing.https.html
- /service-workers/service-worker/unregister.https.html
- /speech-api/idlharness.window.html
- /storage/persisted.https.any.html
- /storage/persisted.https.any.worker.html
- /uievents/order-of-events/focus-events/focus-automated-blink-webkit.html
- /web-locks/ifAvailable.tentative.https.any.html
- /webaudio/the-audio-api/the-audiocontext-interface/audiocontextoptions.html
- /webdriver/tests/dismiss_alert/dismiss.py
- /webdriver/tests/element_clear/clear.py
- /webdriver/tests/element_click/bubbling.py
- /webdriver/tests/element_click/click.py
- /webdriver/tests/element_click/click.py
- /webdriver/tests/element_click/file_upload.py
- /webdriver/tests/element_click/interactability.py
- /webdriver/tests/element_click/interactability.py
- /webdriver/tests/element_click/navigate.py
- /webdriver/tests/element_click/scroll_into_view.py
- /webdriver/tests/element_click/select.py
- /webdriver/tests/element_click/stale.py
- /webdriver/tests/element_send_keys/content_editable.py
- /webdriver/tests/element_send_keys/events.py
- /webdriver/tests/element_send_keys/file_upload.py
- /webdriver/tests/element_send_keys/form_controls.py
- /webdriver/tests/element_send_keys/interactability.py
- /webdriver/tests/element_send_keys/scroll_into_view.py
- /webdriver/tests/element_send_keys/send_keys.py
- /webdriver/tests/element_send_keys/send_keys.py
- /webdriver/tests/element_send_keys/user_prompts.py
- /webdriver/tests/fullscreen_window/fullscreen.py
- /webdriver/tests/get_timeouts/get.py
- /webdriver/tests/set_timeouts/set.py
- /webdriver/tests/set_timeouts/user_prompts.py
- /webdriver/tests/take_element_screenshot/screenshot.py
- /webrtc/RTCDTMFSender-ontonechange.https.html
- /webvtt/rendering/cues-with-video/processing-model/dom_override_remove_cue_while_paused.html
- /webvtt/rendering/cues-with-video/processing-model/selectors/cue/background_shorthand.html
- /webvtt/rendering/cues-with-video/processing-model/selectors/default_styles/italic_object_default_font-style.html
- /xhr/anonymous-mode-unsupported.htm

38 apparently-unstable tests

- /2dcontext/imagebitmap/createImageBitmap-origin.sub.html
- /FileAPI/url/url-in-tags-revoke.window.html
- /WebCryptoAPI/wrapKey_unwrapKey/wrapKey_unwrapKey.https.worker.html
- /css/css-backgrounds/background-334.html
- /css/css-fonts/matching/fixed-stretch-style-over-weight.html
- /css/css-fonts/matching/stretch-distance-over-weight-distance.html
- /css/css-fonts/matching/style-ranges-over-weight-direction.html
- /css/css-transitions/properties-value-inherit-001.html
- /css/css-transitions/properties-value-inherit-002.html
- /css/selectors/selection-image-001.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-contain-svg-005e.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-contain-svg-006o.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-001e.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-001e.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-001o.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-003e.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-004e.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-004o.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-002e.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-004o.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-005o.html
- /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-006e.html
- /fetch/api/redirect/redirect-count.any.worker.html
- /fetch/api/redirect/redirect-count.any.worker.html
- /fetch/sec-metadata/window-open.tentative.https.sub.html
- /html/rendering/replaced-elements/embedded-content/video-controls-vertical-writing-mode.html
- /html/semantics/embedded-content/media-elements/track/track-element/track-api-texttracks.html
- /media-source/mediasource-getvideoplaybackquality.html
- /navigation-timing/nav2_test_attributes_values.html
- /navigation-timing/nav2_test_document_open.html
- /navigation-timing/nav2_test_redirect_server.html
- /preload/link-header-preload-delay-onload.html
- /service-workers/service-worker/fetch-event-within-sw.https.html
- /service-workers/service-worker/postmessage.https.html
- /web-locks/ifAvailable.tentative.https.any.serviceworker.html
- /webmessaging/broadcastchannel/basics.html
- /webrtc/RTCDTMFSender-ontonechange.https.html
- /webvtt/rendering/cues-with-video/processing-model/selectors/default_styles/italic_object_default_font-style.html

Task duration data (in milliseconds)

task name | without restarting | with restarting 1 | with restarting 2
-------------------------------|---------|---------|---
chrome-dev-reftest-1 | 1786704 | 2628903 | 2944095
chrome-dev-reftest-10 | 1131525 | 1898447 | 1782306
chrome-dev-reftest-2 | 1716507 | 2027971 | 2067457
chrome-dev-reftest-3 | 967347 | 1201099 | 1456538
chrome-dev-reftest-4 | 1228244 | 1952471 | 1781734
chrome-dev-reftest-5 | 420745 | 926223 | 853232
chrome-dev-reftest-6 | 533119 | 1252728 | 1288871
chrome-dev-reftest-7 | 1706721 | 2809130 | 2615798
chrome-dev-reftest-8 | 266471 | 790983 | 840505
chrome-dev-reftest-9 | 1652846 | 2007525 | 2124644
chrome-dev-testharness-1 | 2404826 | 2816719 | 3090452
chrome-dev-testharness-10 | 528018 | 830114 | 829349
chrome-dev-testharness-11 | 1006241 | 1433973 | 1205218
chrome-dev-testharness-12 | 539427 | 886440 | 893287
chrome-dev-testharness-13 | 589197 | 997387 | 1006353
chrome-dev-testharness-14 | 415842 | 774303 | 725762
chrome-dev-testharness-15 | 662322 | 1071594 | 1031145
chrome-dev-testharness-2 | 937366 | 1172458 | 1077443
chrome-dev-testharness-3 | 450946 | 767380 | 850606
chrome-dev-testharness-4 | 1280366 | 1728224 | 1748515
chrome-dev-testharness-5 | 738790 | 966678 | 1017721
chrome-dev-testharness-6 | 524762 | 1050923 | 906709
chrome-dev-testharness-7 | 670950 | 1091197 | 1172657
chrome-dev-testharness-8 | 737848 | 889648 | 925472
chrome-dev-testharness-9 | 1134019 | 1519347 | 1501829
chrome-dev-wdspec-1 | 834389 | 1201624 | 859479
firefox-nightly-reftest-1 | 279736 | 1828148 | 1983629
firefox-nightly-reftest-10 | 192434 | 1625860 | 1186300
firefox-nightly-reftest-2 | 225482 | 3501732 | 3364911
firefox-nightly-reftest-3 | 128444 | 763538 | 1024125
firefox-nightly-reftest-4 | 300035 | 1625634 | 1487570
firefox-nightly-reftest-5 | 85062 | 580326 | 585439
firefox-nightly-reftest-6 | 135314 | 1642978 | 1888648
firefox-nightly-reftest-7 | 295885 | 2785383 | 2956771
firefox-nightly-reftest-8 | 63759 | 1043097 | 976147
firefox-nightly-reftest-9 | 132192 | 1457063 | 1790592
firefox-nightly-testharness-1 | 3299402 | 4267005 | 4704807
firefox-nightly-testharness-10 | 626811 | 3766958 | 2794224
firefox-nightly-testharness-11 | 1226797 | 2870844 | 3369923
firefox-nightly-testharness-12 | 550261 | 3395856 | 2788142
firefox-nightly-testharness-13 | 499227 | 1667823 | 1673383
firefox-nightly-testharness-14 | 481288 | 1926431 | 1558572
firefox-nightly-testharness-15 | 715341 | 4129812 | 3984467
firefox-nightly-testharness-2 | 1186135 | 2996759 | 2821232
firefox-nightly-testharness-3 | 688620 | 2083799 | 1727637
firefox-nightly-testharness-4 | 1615387 | 4179672 | 5249271
firefox-nightly-testharness-5 | 604619 | 3868432 | 4478481
firefox-nightly-testharness-6 | 669177 | 1995694 | 2431377
firefox-nightly-testharness-7 | 946115 | 3868758 | 3986359
firefox-nightly-testharness-8 | 996870 | 2173253 | 1849232
firefox-nightly-testharness-9 | 1388967 | 3065127 | 2843796
firefox-nightly-wdspec-1 | 1915752 | 2209462 | 2104543

Script used to parse durations

```python
import gzip
import json
import os
import re
import sys


def get_durations(directory):
    """Map each task name to its duration, read from gzipped JSON reports."""
    durations = {}

    for filename in os.listdir(directory):
        # Task names look like "chrome-dev-reftest-1".
        name = re.search(r'wpt-([a-z-]+-\d+)', filename).group(1)

        with gzip.open(os.path.join(directory, filename)) as handle:
            data = json.load(handle)

        durations[name] = data['time_end'] - data['time_start']

    return durations


def main(directories):
    durations = [get_durations(directory) for directory in directories]

    # One row per task; a task missing from a directory is reported as 0.
    return {
        name: [d.get(name, 0) for d in durations]
        for name in durations[0]
    }


if __name__ == '__main__':
    durations = main(sys.argv[1:])

    for name in sorted(durations):
        print('%s,%s' % (name, ','.join(str(d) for d in durations[name])))
```

Script used to identify discrepancies

```python
import gzip
import json
import os
import re
import sys


def all_equal(values):
    return all(value == values[0] for value in values[1:])


def group_files(directories):
    """Group report files from all directories by task name."""
    by_name = {}

    for directory in directories:
        for filename in os.listdir(directory):
            name = re.search(r'wpt-([a-z-]+-\d+)', filename).group(1)

            if name not in by_name:
                by_name[name] = []

            by_name[name].append(os.path.join(directory, filename))

    return by_name


def compare_results(results):
    """Return the normalized results if they differ, else an empty list."""
    generics = []

    for result in results:
        # Keep only the fields that matter for comparison.
        generic = {
            'test': result['test'],
            'status': result['status'],
            'subtests': [
                {'name': subtest['name'], 'status': subtest['status']}
                for subtest in result['subtests']
            ]
        }
        # Sort subtests by name so that ordering differences are ignored.
        generic['subtests'] = sorted(
            generic['subtests'], key=lambda subtest: subtest['name']
        )
        generics.append(generic)

    return [] if all_equal(generics) else generics


def compare_results_sets(filenames):
    results = {}

    for filename in filenames:
        with gzip.open(filename) as handle:
            results_list = json.load(handle)['results']

        for result in results_list:
            if result['test'] not in results:
                results[result['test']] = []
            results[result['test']].append(result)

    discrepancies = []
    for testname in results:
        c = compare_results(results[testname])
        if c:
            discrepancies.append(c)

    return discrepancies


if __name__ == '__main__':
    by_name = group_files(sys.argv[1:])
    discrepancies = []

    for name in by_name:
        discrepancies.extend(compare_results_sets(by_name[name]))

    for discrepancy in discrepancies:
        print(discrepancy[0]['test'])
```

foolip commented 5 years ago

For the 94 tests that were affected, did they tend to go from failing to passing, or was it a mixed bag?

jugglinmike commented 5 years ago

First off, I was a bit sloppy with my methods. The number "94" is the result of comparing the "no restarts" dataset with just one of the "with restarts" datasets. Comparing all three yields a total of 117 discrepancies, 81 of which were consistent between both experiments. Sorry about that!

To see if there was a trend for "improved" or "worsened" results, I assigned a score to each as follows:

79 results (68% of the 117 affected results) were "improved" by this metric. If we only consider the 81 tests that were consistent in both experiments, then 69 results (85% of those results) were improved.

It's very likely that some of those 81 "consistent" tests are actually flaky, and we just didn't observe the flakiness in the two experiments.

And one further wrinkle is that in some cases, a lower "score" may be an improvement; it's possible that restarting corrects type I errors.

Tests with scores and labels

task name | test | score without restarting | score with restarting 1 | score with restarting 2 | label | stability
--|---|---|---|---|---|--
firefox-nightly-testharness-14 | /cookies/http-state/chromium-tests.html | 0.00 | 0.04 | 0.04 | Improved
firefox-nightly-testharness-14 | /cookies/http-state/comma-tests.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-testharness-14 | /cookies/http-state/domain-tests.html | 0.00 | 0.41 | 0.41 | Improved
firefox-nightly-testharness-14 | /cookies/http-state/general-tests.html | 0.00 | 0.11 | 0.11 | Improved
firefox-nightly-testharness-14 | /cookies/http-state/mozilla-tests.html | 0.00 | 0.59 | 0.59 | Improved
firefox-nightly-testharness-14 | /cookies/http-state/name-tests.html | 0.00 | 0.03 | 0.03 | Improved
firefox-nightly-testharness-14 | /cookies/http-state/path-tests.html | 0.00 | 0.39 | 0.39 | Improved
firefox-nightly-testharness-14 | /cookies/http-state/value-tests.html | 0.00 | 0.83 | 0.83 | Improved
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-001o.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-002e.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-002o.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-002e.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-002o.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-003o.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-003e.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-003o.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-005e.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-006o.html | 0.00 | 1.00 | 1.00 | Improved
chrome-dev-testharness-3 | /editing/run/removeformat.html | -1.00 | 0.98 | 0.98 | Improved
chrome-dev-reftest-1 | /html/editing/editing-0/spelling-and-grammar-checking/spelling-markers-007.html | 0.00 | 1.00 | 1.00 | Improved
chrome-dev-testharness-13 | /mediacapture-image/MediaStreamTrack-getConstraints-fast.html | -1.00 | 1.00 | 1.00 | Improved
chrome-dev-testharness-12 | /offscreen-canvas/the-offscreen-canvas/offscreencanvas.resize.html | 0.80 | 1.00 | 1.00 | Improved
chrome-dev-testharness-6 | /orientation-sensor/AbsoluteOrientationSensor.https.html | 0.00 | 0.13 | 0.13 | Improved
chrome-dev-testharness-6 | /orientation-sensor/RelativeOrientationSensor.https.html | 0.00 | 0.13 | 0.13 | Improved
chrome-dev-testharness-8 | /preload/modulepreload.html | 0.89 | 1.00 | 1.00 | Improved
chrome-dev-testharness-2 | /resource-timing/single-entry-per-resource.html | 0.00 | 1.00 | 1.00 | Improved
chrome-dev-testharness-5 | /service-workers/cache-storage/common.https.html | -1.00 | 1.00 | 1.00 | Improved
firefox-nightly-testharness-8 | /service-workers/service-worker/clients-get-client-types.https.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-testharness-8 | /service-workers/service-worker/clients-get.https.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-testharness-8 | /service-workers/service-worker/clients-matchall-client-types.https.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-testharness-8 | /service-workers/service-worker/clients-matchall-exact-controller.https.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-testharness-8 | /service-workers/service-worker/clients-matchall-include-uncontrolled.https.html | -1.00 | 1.00 | 1.00 | Improved
firefox-nightly-testharness-8 | /service-workers/service-worker/clients-matchall-order.https.html | -1.00 | 1.00 | 1.00 | Improved
firefox-nightly-testharness-8 | /service-workers/service-worker/clients-matchall.https.html | 0.00 | 1.00 | 1.00 | Improved
chrome-dev-testharness-8 | /service-workers/service-worker/controller-on-reload.https.html | 0.00 | 1.00 | 1.00 | Improved
chrome-dev-testharness-8 | /service-workers/service-worker/installing.https.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-testharness-8 | /service-workers/service-worker/unregister.https.html | 0.50 | 1.00 | 1.00 | Improved
chrome-dev-testharness-7 | /storage/persisted.https.any.html | 0.50 | 1.00 | 1.00 | Improved
chrome-dev-testharness-7 | /storage/persisted.https.any.worker.html | 0.50 | 1.00 | 1.00 | Improved
firefox-nightly-testharness-8 | /uievents/order-of-events/focus-events/focus-automated-blink-webkit.html | -1.00 | 1.00 | 1.00 | Improved
chrome-dev-testharness-4 | /web-locks/ifAvailable.tentative.https.any.html | 0.90 | 1.00 | 1.00 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/dismiss_alert/dismiss.py | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_clear/clear.py | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_click/bubbling.py | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_click/click.py | 0.00 | 0.50 | 0.50 | Improved
chrome-dev-wdspec-1 | /webdriver/tests/element_click/click.py | 0.50 | 1.00 | 1.00 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_click/file_upload.py | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_click/interactability.py | 0.00 | 0.83 | 0.83 | Improved
chrome-dev-wdspec-1 | /webdriver/tests/element_click/interactability.py | 0.00 | 0.67 | 0.67 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_click/navigate.py | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_click/scroll_into_view.py | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_click/select.py | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_click/stale.py | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_send_keys/content_editable.py | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_send_keys/events.py | 0.00 | 0.80 | 0.80 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_send_keys/file_upload.py | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_send_keys/form_controls.py | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_send_keys/interactability.py | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_send_keys/scroll_into_view.py | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_send_keys/send_keys.py | 0.00 | 1.00 | 1.00 | Improved
chrome-dev-wdspec-1 | /webdriver/tests/element_send_keys/send_keys.py | 0.00 | 0.11 | 0.11 | Improved
firefox-nightly-wdspec-1 | /webdriver/tests/element_send_keys/user_prompts.py | 0.94 | 1.00 | 1.00 | Improved
chrome-dev-wdspec-1 | /webdriver/tests/fullscreen_window/fullscreen.py | 0.00 | 0.25 | 0.25 | Improved
chrome-dev-wdspec-1 | /webdriver/tests/get_timeouts/get.py | 0.67 | 1.00 | 1.00 | Improved
chrome-dev-wdspec-1 | /webdriver/tests/set_timeouts/set.py | 0.00 | 0.74 | 0.74 | Improved
chrome-dev-wdspec-1 | /webdriver/tests/set_timeouts/user_prompts.py | 0.00 | 1.00 | 1.00 | Improved
chrome-dev-wdspec-1 | /webdriver/tests/take_element_screenshot/screenshot.py | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-reftest-1 | /webvtt/rendering/cues-with-video/processing-model/dom_override_remove_cue_while_paused.html | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-testharness-9 | /xhr/anonymous-mode-unsupported.htm | 0.00 | 1.00 | 1.00 | Improved
firefox-nightly-testharness-2 | /fetch/api/redirect/redirect-count.any.html | 1.00 | -1.00 | -1.00 | Worsened
firefox-nightly-testharness-2 | /fetch/api/redirect/redirect-count.any.worker.html | 1.00 | -1.00 | -1.00 | Worsened
chrome-dev-testharness-7 | /FileAPI/url/url-in-tags-revoke.window.html | -1.00 | -1.00 | -1.00 | Worsened
firefox-nightly-testharness-1 | /inert/inert-node-is-unfocusable.tentative.html | 0.33 | 0.17 | 0.17 | Worsened
chrome-dev-testharness-10 | /mediacapture-streams/MediaDevices-getUserMedia.https.html | 1.00 | 0.33 | 0.33 | Worsened
chrome-dev-testharness-10 | /mediacapture-streams/MediaStream-audio-only.https.html | 0.00 | -1.00 | -1.00 | Worsened
firefox-nightly-testharness-8 | /preload/link-header-preload.html | 1.00 | 0.00 | 0.00 | Worsened
chrome-dev-testharness-2 | /resource-timing/resource_initiator_types.html | 1.00 | 1.00 | 1.00 | Worsened
firefox-nightly-testharness-9 | /service-workers/service-worker/ServiceWorkerGlobalScope/update.https.html | 1.00 | 0.00 | 0.00 | Worsened
firefox-nightly-testharness-3 | /speech-api/idlharness.window.html | 0.44 | 0.43 | 0.43 | Worsened
chrome-dev-testharness-6 | /webaudio/the-audio-api/the-audiocontext-interface/audiocontextoptions.html | 1.00 | 0.88 | 0.88 | Worsened
firefox-nightly-reftest-2 | /webvtt/rendering/cues-with-video/processing-model/selectors/cue/background_shorthand.html | 1.00 | 0.00 | 0.00 | Worsened
chrome-dev-testharness-5 | /css/css-transitions/properties-value-inherit-001.html | 0.93 | 0.98 | 0.93 | Improved | flaky
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-cover-svg-001e.html | 0.00 | 1.00 | 0.00 | Improved | flaky
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-001e.html | 0.00 | 1.00 | 0.00 | Improved | flaky
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-001o.html | 0.00 | 1.00 | 0.00 | Improved | flaky
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-003e.html | 0.00 | 1.00 | 0.00 | Improved | flaky
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-004e.html | 0.00 | 1.00 | 0.00 | Improved | flaky
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-none-svg-004o.html | 0.00 | 1.00 | 0.00 | Improved | flaky
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-002e.html | 0.00 | 1.00 | 0.00 | Improved | flaky
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-005o.html | 0.00 | 1.00 | 0.00 | Improved | flaky
firefox-nightly-testharness-4 | /webrtc/RTCDTMFSender-ontonechange.https.html | 0.23 | 0.85 | 0.23 | Improved | flaky
firefox-nightly-testharness-9 | /2dcontext/imagebitmap/createImageBitmap-origin.sub.html | 0.00 | -1.00 | 0.00 | Worsened | flaky
firefox-nightly-reftest-7 | /css/css-backgrounds/background-334.html | 0.00 | 0.00 | 1.00 | Worsened | flaky
chrome-dev-reftest-2 | /css/css-fonts/matching/fixed-stretch-style-over-weight.html | 1.00 | 1.00 | 0.00 | Worsened | flaky
chrome-dev-reftest-2 | /css/css-fonts/matching/stretch-distance-over-weight-distance.html | 1.00 | 1.00 | 0.00 | Worsened | flaky
chrome-dev-reftest-2 | /css/css-fonts/matching/style-ranges-over-weight-direction.html | 1.00 | 1.00 | 0.00 | Worsened | flaky
firefox-nightly-testharness-5 | /css/css-transitions/properties-value-inherit-002.html | 0.98 | 0.93 | 0.98 | Worsened | flaky
firefox-nightly-reftest-4 | /css/selectors/selection-image-001.html | 0.00 | 0.00 | 1.00 | Worsened | flaky
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-contain-svg-005e.html | 0.00 | 0.00 | 1.00 | Worsened | flaky
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-contain-svg-006o.html | 0.00 | 0.00 | 1.00 | Worsened | flaky
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-004o.html | 0.00 | 0.00 | 1.00 | Worsened | flaky
firefox-nightly-reftest-1 | /css/vendor-imports/mozilla/mozilla-central-reftests/images3/object-fit-scale-down-svg-006e.html | 0.00 | 0.00 | 1.00 | Worsened | flaky
chrome-dev-testharness-2 | /fetch/api/redirect/redirect-count.any.worker.html | 1.00 | 1.00 | -1.00 | Worsened | flaky
chrome-dev-testharness-6 | /fetch/sec-metadata/window-open.tentative.https.sub.html | 0.00 | -1.00 | 0.00 | Worsened | flaky
firefox-nightly-reftest-7 | /html/rendering/replaced-elements/embedded-content/video-controls-vertical-writing-mode.html | 0.00 | 0.00 | 1.00 | Worsened | flaky
firefox-nightly-testharness-15 | /html/semantics/embedded-content/media-elements/track/track-element/track-api-texttracks.html | 1.00 | 1.00 | -1.00 | Worsened | flaky
firefox-nightly-testharness-11 | /media-source/mediasource-getvideoplaybackquality.html | 0.50 | 0.50 | 0.00 | Worsened | flaky
firefox-nightly-testharness-1 | /navigation-timing/nav2_test_attributes_values.html | 1.00 | 1.00 | 0.00 | Worsened | flaky
firefox-nightly-testharness-1 | /navigation-timing/nav2_test_document_open.html | 0.00 | 0.00 | 1.00 | Worsened | flaky
firefox-nightly-testharness-1 | /navigation-timing/nav2_test_redirect_server.html | 0.00 | 0.00 | 1.00 | Worsened | flaky
firefox-nightly-testharness-8 | /preload/link-header-preload-delay-onload.html | 1.00 | 1.00 | 0.00 | Worsened | flaky
chrome-dev-testharness-8 | /service-workers/service-worker/fetch-event-within-sw.https.html | 1.00 | 1.00 | -1.00 | Worsened | flaky
firefox-nightly-testharness-8 | /service-workers/service-worker/postmessage.https.html | 0.75 | 0.75 | 0.50 | Worsened | flaky
chrome-dev-testharness-4 | /web-locks/ifAvailable.tentative.https.any.serviceworker.html | 1.00 | 1.00 | 0.90 | Worsened | flaky
chrome-dev-testharness-6 | /WebCryptoAPI/wrapKey_unwrapKey/wrapKey_unwrapKey.https.worker.html | 0.92 | 0.92 | 0.92 | Worsened | flaky
chrome-dev-testharness-12 | /webmessaging/broadcastchannel/basics.html | 1.00 | 1.00 | 0.80 | Worsened | flaky
firefox-nightly-reftest-6 | /webvtt/rendering/cues-with-video/processing-model/selectors/default_styles/italic_object_default_font-style.html | 1.00 | 0.00 | 1.00 | Worsened | flaky

foolip commented 5 years ago

Added the priority:backlog label as this is more of a report/discussion than a concrete suggestion at this point.

jugglinmike commented 5 years ago

After a little more investigation into differences between the two CI environments, I found another case where restarting improved result accuracy: cookies/http-state in Firefox.

On the one hand, these examples make restarting seem more appealing because they improve result accuracy. However, the practice doesn't actually fix problems--it just reduces their effect.

In this latest case, Firefox was leaking state due to an actual conformance issue. Even so, it's far better to assert those expectations explicitly and deterministically. That's what I'm arguing in my patch for the tests, anyway. Restarting between those cookies tests brought Firefox's pass rate up from 8% to 31%, but fixing the underlying problem led to a 73% pass rate (and without making the tests any slower). The only trouble is that finding and implementing the "real" solution will always take much more effort than simply removing --no-restart-after-failure.

foolip commented 5 years ago

If we invert the default, which seems reasonable at least in the absence of expectations data, how would we keep discovering these discrepancies? Maybe set up a weekly run that differs only in this flag?

jugglinmike commented 5 years ago

Yeah, that's what I was thinking, too (though I kind of hid it at the end of the initial report above).

foolip commented 5 years ago

That seems reasonable, certainly if the time saving from making the switch more than makes up for the cost of an extra run per week, which seems likely.
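
A quick way to sanity-check that trade-off is a break-even estimate. The sketch below is hypothetical: `runs_per_week` is a made-up parameter (this thread does not say how often results are collected), and the per-run costs reuse the maximum task durations reported earlier as a stand-in for wall-clock time.

```python
def weekly_saving(minutes_without, minutes_with, runs_per_week):
    """Net minutes saved per week by disabling restarts on regular runs
    while adding one extra restart-enabled run for comparison."""
    saved = (minutes_with - minutes_without) * runs_per_week
    extra_run_cost = minutes_with
    return saved - extra_run_cost


def break_even_runs(minutes_without, minutes_with):
    """Regular runs per week at which the extra run pays for itself."""
    return minutes_with / (minutes_with - minutes_without)


# Maximum task durations (minutes), rounded, from the first experiment.
WITHOUT_RESTART = 54.99
WITH_RESTART = 71.12

threshold = break_even_runs(WITHOUT_RESTART, WITH_RESTART)
print(f"break-even at about {threshold:.1f} regular runs per week")

# Hypothetical schedule: one regular run per day.
daily = weekly_saving(WITHOUT_RESTART, WITH_RESTART, runs_per_week=7)
print(f"net saving at 7 runs per week: {daily:.0f} minutes")
```

Under these assumptions, the extra weekly run pays for itself once regular runs happen more than a handful of times per week.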

The hard part in practice, I guess, would be to actually keep caring about discrepancies, since diagnosing/fixing them will be laborious, and sometimes shards cross directory boundaries so nobody will want to take the issue until it's provably their problem.