web-platform-tests / wpt

Test suites for Web platform specs — including WHATWG, W3C, and others
https://web-platform-tests.org/

Safari runs missing for week of May 15th, 2023 #40085

Closed: DanielRyanSmith closed this issue 11 months ago

DanielRyanSmith commented 1 year ago

No stable Safari results have been available since May 15th. wpt.fyi run status is not showing any recent invalid Safari runs. Not quite sure of the reason here. Creating this issue for visibility.

CC @gsnedders

gsnedders commented 1 year ago

Azure Pipelines seems to have been having problems; the test jobs have often been failing, with the test agents going away or becoming unresponsive. Not much we can do here.

DanielRyanSmith commented 1 year ago

As a heads-up, this is still a problem: no stable aligned runs (test results on the same hash for Chrome, Edge, Firefox, and Safari) have been produced since May 13th.
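
For reference, wpt.fyi exposes run metadata through a public API; a minimal sketch of checking for recent aligned runs (assuming the documented /api/runs endpoint and its labels, aligned, and max-count query parameters) could look like:

```python
# Minimal sketch: query wpt.fyi's public runs API for recent aligned
# stable runs. Assumes the /api/runs endpoint and its "labels",
# "aligned", and "max-count" query parameters behave as documented.
import json
from urllib.request import urlopen

URL = "https://wpt.fyi/api/runs?labels=master,stable&aligned&max-count=20"

with urlopen(URL) as resp:
    runs = json.load(resp)

# Aligned runs share one revision across all browsers, so grouping by
# revision shows which hashes have a full set of results.
for run in runs:
    print(run["browser_name"], run["revision"], run["time_start"])
```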

foolip commented 1 year ago

The most recent run is now from May 23.

foolip commented 1 year ago

Edge is failing too; I've filed https://github.com/web-platform-tests/wpt/issues/40300. But Edge failing shouldn't affect Safari results being uploaded, and vice versa.

jgraham commented 1 year ago

Edge is fixed, but Safari is still broken to the point that it's effectively breaking Interop scoring (we're getting maybe one update a week). I've submitted https://github.com/web-platform-tests/wpt/pull/40362 to see if retries work/help, but in any case we need to address the underlying problem. Do we have contacts on the Azure side who could investigate?
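
The PR's actual change isn't reproduced in this thread; as a rough sketch of the general retry idea, a wrapper that re-runs a flaky job step a few times before giving up might look like this:

```python
# Illustrative only: the linked PR's actual change isn't shown here.
# A generic retry wrapper around a flaky command, of the kind a CI job
# might use, e.g. `python retry.py ./wpt run safari ...`.
import subprocess
import sys
import time

def run_with_retries(cmd, attempts=3, delay=30):
    """Run cmd, retrying on failure; return the last exit code."""
    for attempt in range(1, attempts + 1):
        code = subprocess.call(cmd)
        if code == 0:
            return 0
        print(f"attempt {attempt}/{attempts} failed (exit {code})",
              file=sys.stderr)
        time.sleep(delay)
    return code

if __name__ == "__main__":
    sys.exit(run_with_retries(sys.argv[1:]))
```

(Azure Pipelines also supports task-level retries natively via retryCountOnTaskFailure; whether the PR uses that or a script-level loop isn't shown here.)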

gsnedders commented 1 year ago

We're getting a lot of:

[error]We stopped hearing from agent Azure Pipelines 10. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

…and there's nothing to suggest that we're actually stopping the agent from running, or that Safari is spinning. In theory something could have changed such that we're now starving the agent process of CPU, but it seems unlikely that that would have suddenly started.
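
One way to test the CPU-starvation theory (purely illustrative; the thread doesn't say this was done) would be to log the load average alongside the test job:

```python
# Hypothetical diagnostic: periodically log the load average during a
# test job, to check whether the agent is being starved of CPU around
# the time the timeouts occur. POSIX only (includes macOS).
import os
import time

def log_load(interval=60):
    while True:
        one, five, fifteen = os.getloadavg()
        print(f"loadavg 1m={one:.2f} 5m={five:.2f} 15m={fifteen:.2f}",
              flush=True)
        time.sleep(interval)

if __name__ == "__main__":
    log_load()
```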

foolip commented 1 year ago

@mustjab can you help with a contact on the Azure Pipelines team if @gsnedders needs it for debugging this issue?

mustjab commented 1 year ago

The best way to start would be to file an issue with the azure-pipelines-agent team: https://github.com/microsoft/azure-pipelines-agent/issues/new/choose. I've looked through their open issues and didn't find any recent Mac issues, but this might be a related one: https://github.com/microsoft/azure-pipelines-agent/issues/3994

foolip commented 1 year ago

Thanks @mustjab!

jgraham commented 1 year ago

I've filed a new bug on Azure. Please let me know if I got any of the details wrong, or missed something important.

mustjab commented 1 year ago

@jgraham It looks like they've asked us to open issues on the agent team's tracker instead. Did you get a chance to file it?

https://github.com/microsoft/azure-pipelines-agent/issues/4313

jgraham commented 1 year ago

I filed https://github.com/actions/runner-images/issues/7754

past commented 1 year ago

@mustjab do you have any more context on the ongoing investigation that you can share?

mustjab commented 1 year ago

I checked with the agent team and they haven't looked at your issue yet, but they mentioned a similar report that they already resolved. Are you still seeing this happen in WPT runs?

past commented 1 year ago

It is still happening; here is one case from today.

gsnedders commented 1 year ago

I filed actions/runner-images#7754

According to that issue, the internal problem has been resolved, and things seem to have been much better over the last few days (comparable to where we were on the macos-12 images).

gsnedders commented 1 year ago

It's gone back to being less reliable, but as mentioned in the other issue:

"fix is to be delivered around mid-August (reason for being better right now is not very clear)"

past commented 1 year ago

The fix seems to be deployed and working reasonably well now, shall we close this?

gsnedders commented 1 year ago

Alas, I've still been kicking them manually quite a bit, so I'm not sure it's working all that well. I was planning to follow up sometime soon, though.

gsnedders commented 1 year ago

See this filtered view of builds—there's still a fair bit of red (and white!) there, even since the new images went live.

past commented 12 months ago

@mustjab any further updates on the effort to fix this? The link from Sam's comment above still shows frequent failures.

gsnedders commented 12 months ago

@past A fair percentage of the failures are Edge, or caused by macOS bugs. https://wpt.fyi/runs?label=master&label=experimental&aligned and https://wpt.fyi/runs?label=master&label=stable&aligned both show plenty of recent aligned runs, so I'm not too concerned at this point. I vote we close this?

past commented 11 months ago

Sounds good to me, we can open a new one if needed.