web-platform-tests / wpt

Test suites for Web platform specs — including WHATWG, W3C, and others
https://web-platform-tests.org/

Issue with WebKitGTK runs on Taskcluster around 12 September 2021 #30834

Closed clopez closed 1 year ago

clopez commented 3 years ago

Nightly runs of WebKitGTK, which happen daily on commits tagged with a tag starting with the prefix epochs/daily, started to fail around 12 September 2021.

Looking at some of the latest runs, I can see that the test run wpt-webkitgtk_minibrowser-nightly-testharness-7 fails because it reaches the hard limit of 2 hours per test run and the runner kills it.

[taskcluster:error] Task timeout after 7200 seconds. Force killing container.

I'm unsure at this point what is causing the time limit to be reached: whether it is some issue in WebKitGTK making it slower than usual, or something else.

It needs to be investigated.

Some runs:

https://community-tc.services.mozilla.com/tasks/Vro8Zw5ZQsOAEymVr2bN7A/runs/0/logs/public/logs/live.log
https://community-tc.services.mozilla.com/tasks/I8qVkINfSgyV8k16LO9DXg/runs/0/logs/public/logs/live.log

foolip commented 3 years ago

@clopez it does look like tests were still running when the timeout was reached. If you look at runs before the failures started, how long do the longest tasks take? If it was just below 2 hours, then presumably the test suite size has slowly grown until the threshold was crossed just now.

clopez commented 3 years ago

@foolip You are right. The previous run of wpt-webkitgtk_minibrowser-nightly-testharness-7, on 11 September, took 1.97 hours (7127.504 seconds).

So this seems to be an issue caused by the test suite having grown, which now makes this testharness-7 chunk take more than 2 h. In #19724 I raised the number of chunks to avoid this problem, and it looks like we need to raise it again.

clopez commented 3 years ago

Opened PR to raise the number of chunks to 17 in https://github.com/web-platform-tests/wpt/pull/30949

clopez commented 3 years ago

Unfortunately, raising the number of chunks didn't fix the problem.

After raising the number of chunks in 6f7c2fd96f4b5b68c77a2c4d6bdd5a997101348c the issue keeps happening, but now on chunk 1.

wpt-webkitgtk_minibrowser-nightly-testharness-1 times out while tests are still running:

118:45.66 INFO Starting runner
118:47.18 TEST_START: /websockets/Create-valid-url-protocol.any.html?wss
118:47.40 TEST_END: Test OK. Subtests passed 1/1. Unexpected 0
118:47.40 TEST_START: /websockets/Create-valid-url-protocol.any.worker.html?wss
mem avail: 13567 of 14810 MiB (91 %), swap free:    0 of    0 MiB ( 0 %)
118:47.61 TEST_END: Test OK. Subtests passed 1/1. Unexpected 0
118:47.61 TEST_START: /websockets/Create-valid-url.any.html?wpt_flags=h2
118:47.64 wptserve WARNING (f111ea58) Unexpected Error - 
Invalid HTTP/2 preamble.
[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
[taskcluster 2021-09-28 03:24:09.094Z] === Task Finished ===
[taskcluster 2021-09-28 03:24:09.095Z] Unsuccessful task run with exit code: -1 completed in 7228.794 seconds

I will try to debug this further to see if some recent change in WebKitGTK itself has caused an unexpected slowdown when running the tests.

clopez commented 3 years ago

I'm starting to suspect the issue is a change in WPT, not in WebKitGTK, because the runs for the stable and beta channels of WebKitGTK also stopped working around the same date (12th/13th of September), and it is very unlikely that a bug was introduced in all three channels at the same time (each channel is built from a different branch).

foolip commented 3 years ago

The change in WPT could be as simple as the addition of many new tests that all time out. Can you try increasing 7200 to 10800 here to see how long it actually takes?

https://github.com/web-platform-tests/wpt/blob/b0cb818f086983d4710f99052721feb1aa9f91ec/.taskcluster.yml#L61

If it still doesn't finish, then probably it's hanging and would never finish.

gsnedders commented 3 years ago

We seem to still be having problems. See https://github.com/web-platform-tests/wpt/runs/3739235388.

For that run, we're spending 101 minutes (of the 120-minute timeout) on the slowest 10% of the tests it manages to run, 65 minutes on the slowest 3%, and 45 minutes on the slowest 2%.

That slowest 2%:

/encrypted-media/clearkey-mp4-playback-temporary-encrypted-clear-sources.https.html
/encrypted-media/clearkey-mp4-playback-temporary-multikey-sequential-readyState.https.html
/encrypted-media/clearkey-mp4-playback-temporary-multikey-sequential.https.html
/encrypted-media/clearkey-mp4-playback-temporary-two-videos.https.html
/encrypted-media/clearkey-mp4-requestmediakeysystemaccess.https.html
/encrypted-media/drm-expiration.https.html
/encrypted-media/drm-keystatuses-multiple-sessions.https.html
/encrypted-media/drm-keystatuses.https.html
/encrypted-media/drm-mp4-playback-destroy-persistent-license.https.html
/encrypted-media/drm-mp4-playback-persistent-license.https.html
/encrypted-media/drm-mp4-playback-persistent-usage-record-events.https.html
/encrypted-media/drm-mp4-playback-persistent-usage-record.https.html
/encrypted-media/drm-mp4-playback-retrieve-destroy-persistent-license.https.html
/encrypted-media/drm-mp4-playback-retrieve-persistent-license.https.html
/encrypted-media/drm-mp4-playback-retrieve-persistent-usage-record.https.html
/encrypted-media/drm-mp4-playback-temporary-clear-encrypted.https.html
/encrypted-media/drm-mp4-playback-temporary-encrypted-clear-sources.https.html
/encrypted-media/drm-mp4-playback-temporary-encrypted-clear.https.html
/encrypted-media/drm-mp4-playback-temporary-expired.https.html
/encrypted-media/drm-mp4-playback-temporary-multikey-sequential-readyState.https.html
/encrypted-media/drm-mp4-playback-temporary-multikey.https.html
/encrypted-media/drm-mp4-playback-temporary-setMediaKeys-after-src.https.html
/encrypted-media/drm-mp4-playback-temporary-setMediaKeys-after-update.https.html
/encrypted-media/drm-mp4-playback-temporary-setMediaKeys-immediately.https.html
/encrypted-media/drm-mp4-playback-temporary-setMediaKeys-onencrypted.https.html
/encrypted-media/drm-mp4-playback-temporary-waitingforkey.https.html
/encrypted-media/drm-mp4-playback-temporary.https.html
/encrypted-media/drm-mp4-setmediakeys-again-after-playback.https.html
/encrypted-media/drm-mp4-setmediakeys-again-after-resetting-src.https.html
/encrypted-media/drm-mp4-setmediakeys-at-same-time.https.html
/encrypted-media/drm-mp4-setmediakeys-multiple-times-with-different-mediakeys.https.html
/encrypted-media/drm-mp4-setmediakeys-multiple-times-with-the-same-mediakeys.https.html
/encrypted-media/drm-mp4-setmediakeys-to-multiple-video-elements.https.html
/encrypted-media/drm-mp4-waiting-for-a-key.https.html
/html/browsers/windows/noreferrer-window-name.html
/html/infrastructure/urls/resolving-urls/query-encoding/utf-16be.html?include=nested-browsing
/html/infrastructure/urls/resolving-urls/query-encoding/utf-16le.html?include=loading
/html/infrastructure/urls/resolving-urls/query-encoding/utf-16le.html?include=nested-browsing
/html/infrastructure/urls/resolving-urls/query-encoding/utf-8.html?include=loading
/html/infrastructure/urls/resolving-urls/query-encoding/windows-1251.html?include=loading
/html/infrastructure/urls/resolving-urls/query-encoding/windows-1251.html?include=nested-browsing
/html/infrastructure/urls/resolving-urls/query-encoding/windows-1252.html?include=loading
/html/infrastructure/urls/resolving-urls/query-encoding/windows-1252.html?include=nested-browsing
/websockets/Create-extensions-empty.any.html?wpt_flags=h2
/websockets/Create-extensions-empty.any.worker.html?wpt_flags=h2
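
To illustrate how a long tail of timing-out tests dominates a run like this, here is a small sketch. The helper `slowest_share` and the sample numbers are hypothetical, chosen only to resemble the shape of this run, not taken from its actual report:

```python
def slowest_share(durations, fraction):
    """Minutes spent in the slowest `fraction` of tests."""
    ordered = sorted(durations, reverse=True)
    n = max(1, int(len(ordered) * fraction))
    return sum(ordered[:n]) / 60

# Hypothetical run: 5000 fast tests at ~1 s each plus 54 tests hitting
# a 60 s timeout. The slowest 2% then accounts for most of an hour.
durations = [1.0] * 5000 + [60.0] * 54
print(round(slowest_share(durations, 0.02), 1))  # → 54.8 minutes
```

The point of the sketch is that the tail is almost entirely the timeouts: removing the timing-out tests would shrink the run far more than redistributing them.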

gsnedders commented 3 years ago

AFAICT, we run using --chunk-type dir_hash (or rather, we just use the default) on TC, but use --chunk-type hash on Azure Pipelines.

Because we run using dir_hash on TC, we run the risk of globally timing out when a lot of tests within a single directory all time out, as they'll all end up in a single job.
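
The difference between the two chunking strategies can be sketched as follows. This is a simplified illustration of the idea, not wptrunner's actual implementation; the function `chunk_index` is hypothetical:

```python
import hashlib

def chunk_index(test_id: str, total_chunks: int, chunk_type: str = "hash") -> int:
    """Assign a test to a chunk (a sketch of the idea, not wptrunner's code)."""
    if chunk_type == "hash":
        # Hash the full test id: individual tests spread evenly over chunks.
        key = test_id
    elif chunk_type == "dir_hash":
        # Hash only the directory: every test in a directory shares a chunk,
        # so a directory full of timing-out tests all lands in one job.
        key = test_id.rsplit("/", 1)[0]
    else:
        raise ValueError(f"unknown chunk type: {chunk_type}")
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % total_chunks

# With dir_hash, these two tests are guaranteed to share a chunk.
a = chunk_index("/encrypted-media/drm-expiration.https.html", 16, "dir_hash")
b = chunk_index("/encrypted-media/drm-keystatuses.https.html", 16, "dir_hash")
print(a == b)  # True
```

Under `hash`, the same two tests may land in different chunks, which evens out runtimes but splits directories across jobs.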

foolip commented 3 years ago

Yep, I think I picked --chunk-type hash for the Azure Pipelines setup because it was the option that gave the most even runtimes. Azure Pipelines only has so much parallelism so that minimizes the overall latency, but IIRC @jgraham has said it's not really a problem in Taskcluster. But maybe this problem would be a reason to make Taskcluster use --chunk-type hash too?

jgraham commented 3 years ago

FWIW, if there are whole groups of tests that are timing out due to unimplemented features, my initial suggestion is to check in metadata files which disable those tests in WebKitGTK (or, if necessary, the entire test directories). Trying to adjust the chunking so that the guaranteed-to-timeout tests are spread out better seems worse than just saving the pointless compute time.

clopez commented 3 years ago

FWIW, if there are whole groups of tests that are timing out due to unimplemented features,

I'm not sure that is the cause.

In the last log many /encrypted-media tests are timing out, and each takes 60 seconds to time out.

It took 54 minutes in total to run these 54 tests, at 60 seconds each:
    /encrypted-media/clearkey-mp4-playback-temporary-encrypted-clear-sources.https.html
    /encrypted-media/clearkey-mp4-playback-temporary-multikey-sequential-readyState.https.html
    /encrypted-media/clearkey-mp4-playback-temporary-multikey-sequential.https.html
    /encrypted-media/clearkey-mp4-playback-temporary-two-videos.https.html
    /encrypted-media/clearkey-mp4-requestmediakeysystemaccess.https.html
    /encrypted-media/drm-events.https.html
    /encrypted-media/drm-expiration.https.html
    /encrypted-media/drm-keystatuses-multiple-sessions.https.html
    /encrypted-media/drm-keystatuses.https.html
    /encrypted-media/drm-mp4-playback-persistent-license-events.https.html
    /encrypted-media/drm-mp4-playback-persistent-license.https.html
    /encrypted-media/drm-mp4-playback-persistent-usage-record-events.https.html
    /encrypted-media/drm-mp4-playback-persistent-usage-record.https.html
    /encrypted-media/drm-mp4-playback-retrieve-destroy-persistent-license.https.html
    /encrypted-media/drm-mp4-playback-retrieve-persistent-license.https.html
    /encrypted-media/drm-mp4-playback-retrieve-persistent-usage-record.https.html
    /encrypted-media/drm-mp4-playback-temporary-clear-encrypted.https.html
    /encrypted-media/drm-mp4-playback-temporary-encrypted-clear-sources.https.html
    /encrypted-media/drm-mp4-playback-temporary-encrypted-clear.https.html
    /encrypted-media/drm-mp4-playback-temporary-events.https.html
    /encrypted-media/drm-mp4-playback-temporary-expired.https.html
    /encrypted-media/drm-mp4-playback-temporary-multikey-sequential-readyState.https.html
    /encrypted-media/drm-mp4-playback-temporary-multikey-sequential.https.html
    /encrypted-media/drm-mp4-playback-temporary-multikey.https.html
    /encrypted-media/drm-mp4-playback-temporary-multisession.https.html
    /encrypted-media/drm-mp4-playback-temporary-setMediaKeys-after-src.https.html
    /encrypted-media/drm-mp4-playback-temporary-setMediaKeys-after-update.https.html
    /encrypted-media/drm-mp4-playback-temporary-setMediaKeys-immediately.https.html
    /encrypted-media/drm-mp4-playback-temporary-setMediaKeys-onencrypted.https.html
    /encrypted-media/drm-mp4-playback-temporary-two-videos.https.html
    /encrypted-media/drm-mp4-playback-temporary-waitingforkey.https.html
    /encrypted-media/drm-mp4-playback-temporary.https.html
    /encrypted-media/drm-mp4-setmediakeys-again-after-playback.https.html
    /encrypted-media/drm-mp4-setmediakeys-again-after-resetting-src.https.html
    /encrypted-media/drm-mp4-setmediakeys-at-same-time.https.html
    /encrypted-media/drm-mp4-setmediakeys-multiple-times-with-different-mediakeys.https.html
    /encrypted-media/drm-mp4-setmediakeys-multiple-times-with-the-same-mediakeys.https.html
    /encrypted-media/drm-mp4-setmediakeys-to-multiple-video-elements.https.html
    /encrypted-media/drm-temporary-license-type.https.html
    /html/browsers/windows/noreferrer-window-name.html
    /html/infrastructure/urls/resolving-urls/query-encoding/utf-16be.html?include=loading
    /html/infrastructure/urls/resolving-urls/query-encoding/utf-16be.html?include=workers
    /html/infrastructure/urls/resolving-urls/query-encoding/utf-16be.html?include=nested-browsing
    /html/infrastructure/urls/resolving-urls/query-encoding/utf-16le.html?include=loading
    /html/infrastructure/urls/resolving-urls/query-encoding/utf-16le.html?include=workers
    /html/infrastructure/urls/resolving-urls/query-encoding/utf-16le.html?include=nested-browsing
    /html/infrastructure/urls/resolving-urls/query-encoding/utf-8.html?include=loading
    /html/infrastructure/urls/resolving-urls/query-encoding/utf-8.html?include=nested-browsing
    /html/infrastructure/urls/resolving-urls/query-encoding/windows-1251.html?include=nested-browsing
    /html/infrastructure/urls/resolving-urls/query-encoding/windows-1251.html?include=loading
    /html/infrastructure/urls/resolving-urls/query-encoding/windows-1252.html?include=loading
    /html/infrastructure/urls/resolving-urls/query-encoding/windows-1252.html?include=nested-browsing
    /html/semantics/embedded-content/the-iframe-element/iframe_sandbox_navigation_download_allow_downloads.sub.tentative.html
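
As a quick sanity check of the arithmetic above (assuming the per-test timeout is exactly 60 s):

```python
# 54 tests, each hitting a 60 s per-test timeout, burn 54 minutes,
# i.e. nearly half of the 120-minute (7200 s) task budget.
tests, per_test_timeout_s, task_limit_s = 54, 60, 7200
total_s = tests * per_test_timeout_s
print(total_s / 60)                       # 54.0 minutes
print(round(total_s / task_limit_s, 2))   # 0.45 of the task budget
```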

However, AFAIK, EME (encrypted media) is a supported feature in WebKitGTK.

I'm not sure why there is so much red in the wpt.fyi dashboard for WebKitGTK for these kinds of tests. I will try to check this with some colleagues.

gsnedders commented 3 years ago

I'm not sure that is the cause.

All (almost all?) the WebSockets failures are down to lack of h2 support.

In the last log many /encrypted-media tests are timing out, and each test is taking 60 seconds to time out

I spent about an hour looking through some of these yesterday (primarily those that also time out in Safari), and there are a lot of complex call chains, and plenty of functions run within each test that aren't rooted in t.step, so it's quite plausible we're throwing some exception and not noticing.

foolip commented 3 years ago

Are these tests written in a way that an unhandled exception won't cause a harness error? Or will it cause a harness error and keep waiting for pending tests until it's treated as an overall timeout instead?

gsnedders commented 3 years ago

Are these tests written in a way that an unhandled exception won't cause a harness error? Or will it cause a harness error and keep waiting for pending tests until it's treated as an overall timeout instead?

The encrypted media ones? Oh, that's true, a harness error would happen if something threw/rejected.

The super complicated code flow through them makes it a real pain to understand what's happening, though, and I don't understand well enough how EME interacts with MSE to really judge where the bugs are.

foolip commented 3 years ago

@jgraham said on Matrix about next steps here:

FWIW my "short term fix" proposal would be a) Try increasing the number of chunks again and see if the issue goes away, and if not, b) try the hash chunking just for WebKitGTK. Ultimately neither of those are especially satisfying solutions compared to anything that avoids each run having several hours unproductively waiting for a timeout.

I'm on macOS and guess I can't run WebKitGTK, but if I had it locally, the thing I'd try is taking one EME test and figuring out where it times out by adding assert_unreached at various places that should be reached, to find the first place that isn't reached and instead times out.

clopez commented 3 years ago

Instead of raising the number of chunks, it may be better to raise the 2 h time limit. I checked locally that the test run ends in approximately 2.2 hours, so raising the limit to 3 h is perhaps a good idea.

If we only raise the number of chunks, we still risk scheduling several tests that each take 1 minute to time out on the same chunk, like all the /encrypted-media ones. I'm investigating what causes the failures in these tests, because they seem to pass locally with WebKitTestRunner when run as LayoutTests in WebKit. But I still don't have any conclusion.
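
The trade-off can be put in rough numbers (illustrative only, using the figures mentioned in this thread):

```python
# A run observed to take ~2.2 h is killed at the old 2 h limit; raising
# the limit to 3 h leaves ~36% headroom. Adding one more chunk, by
# contrast, removes only ~1/17 of the load on average and offers no
# protection when many slow tests hash into the same chunk.
observed_h, old_limit_h, new_limit_h = 2.2, 2.0, 3.0
print(observed_h > old_limit_h)                           # True: task killed
print(round((new_limit_h - observed_h) / observed_h, 2))  # headroom ≈ 0.36
```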

jgraham commented 3 years ago

That sounds good, do you want to make the patch to raise the timeout?

clopez commented 3 years ago

That sounds good, do you want to make the patch to raise the timeout?

Sure! Here it is: https://github.com/web-platform-tests/wpt/pull/31094 (I'm also lowering the number of chunks back to 16, since raising them didn't seem to help).

jgraham commented 3 years ago

OK, I've merged that, let's see if it works.

clopez commented 3 years ago

It worked.

Some observations:

Let's keep this issue open for the moment. I plan to keep checking future runs, to see how the timing evolves and whether something can be done to make some of these tests run faster (or even to disable them if they test a feature that is not supported).

mrobinson commented 1 year ago

This is possibly related to #37883.

past commented 1 year ago

Things have improved after #37883, so I'm closing this. Feel free to file a new issue if it happens again.

clopez commented 1 year ago

The next episode of this is at #38723