web-platform-tests / results-collection


Collection from Safari Technology Preview interrupted while enabling automation #624

Closed by jugglinmike 5 years ago

jugglinmike commented 5 years ago

We failed to collect results from Safari Technology Preview on Friday, 2018-11-02. The subsequent attempts over the weekend failed as well. The error occurs while enabling automation:

1673:1737: execution error: System Events got an error: Can’t get window "This Safari window is remotely controlled by an automated test." of process "Safari Technology Preview". (-1728)

My initial thought is that some previously-opened process has not been properly cleaned up. In that case, the short-term fix could be as simple as a system restart. Even then, I would like to understand the cause better so that we can avoid this interruption in the future.

foolip commented 5 years ago

Are you using Safari TP 68, or 67? I've had some trouble with 68 in https://github.com/web-platform-tests/wpt/issues/13800 which I've yet to diagnose.

However, it seems like this problem is likely caused by some of the AppleScript used to enable automation, and with Safari 12 and TP that is not necessary. Just getting rid of that in favor of sudo safaridriver --enable should do the trick. Please see the Azure Pipelines setup for what I have found to be the minimum workable enabling steps.

Assuming this is a quick fix, fixing it SGTM, but overall I'd say the lack of any results from Edge is a more pressing concern.

jugglinmike commented 5 years ago

This was indeed a quick fix.

Visually, this is the state of the machine while the STP collection attempts were failing:

[Screenshot: 2018-11-05-macos-interrupted]

Between that and the logs reported above, it was clear that a process from a previous collection attempt was unexpectedly persisting, and that process was interfering with subsequent attempts.

Manually closing the browser would fix the issue, but I wanted to avoid running into this again, so I looked a little further.

Our first "miss" was 2018-11-02. On that day, we were collecting results for WPT at revision 5d4871b4. Collection from STP for chunk 10 of 20 completed successfully, but collection for chunk 11 failed with the following logs:

2018-11-02 01:16:59,977 INFO validate-wpt-results wpt-run:stdout   ▶ FAIL [expected PASS] /css/css-text/white-space/control-chars-099.html
2018-11-02 01:16:59,977 INFO validate-wpt-results wpt-run:stdout   └   → /css/css-text/white-space/control-chars-099.html 1c093a60bdeeed3ead868af122fe53fc12b9877e
2018-11-02 01:16:59,977 INFO validate-wpt-results wpt-run:stdout /css/css-text/white-space/reference/control-chars-000-ref.html 1c093a60bdeeed3ead868af122fe53fc12b9877e
2018-11-02 01:16:59,978 INFO validate-wpt-results wpt-run:stdout Testing 1c093a60bdeeed3ead868af122fe53fc12b9877e != 1c093a60bdeeed3ead868af122fe53fc12b9877e
2018-11-02 01:16:59,978 INFO validate-wpt-results wpt-run:stdout 
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]

This was most likely caused by a transient network outage. That's beyond our control, but we can at least take steps to recover more gracefully. I've extended the script which enables automation to begin by terminating any existing browser processes.
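
For anyone reading along, here is a rough sketch of the "terminate first, then enable" idea. It is not the actual script from this repository. It assumes the process name matches the application name in the error message above, that `pkill` is available on the worker, and that `sudo safaridriver --enable` (as suggested earlier in this thread) is the enabling step. The function names are hypothetical.

```python
import subprocess
from typing import Callable, List

# Assumption: the process name matches the application name that appears
# in the AppleScript error message quoted at the top of this issue.
BROWSER_PROCESS = "Safari Technology Preview"


def build_commands(process_name: str = BROWSER_PROCESS) -> List[List[str]]:
    """Return the commands to run, in order (hypothetical helper)."""
    return [
        # Terminate any browser left over from a previous collection attempt.
        # -x asks pkill for an exact match on the process name.
        ["pkill", "-x", process_name],
        # Enable automation, as suggested earlier in this thread.
        ["sudo", "safaridriver", "--enable"],
    ]


def reset_and_enable(
    run: Callable[..., object] = subprocess.run,
    process_name: str = BROWSER_PROCESS,
) -> None:
    """Kill stale browser processes, then enable automation."""
    kill_cmd, enable_cmd = build_commands(process_name)
    # pkill exits non-zero when nothing matched, which is the normal case
    # on a clean machine, so its exit status is deliberately not checked.
    run(kill_cmd, check=False)
    # A failure to enable automation should abort the collection attempt.
    run(enable_cmd, check=True)
```

Calling `reset_and_enable()` with no arguments would run both commands through `subprocess.run`; the `run` parameter exists only so the ordering can be checked without touching a real browser.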

To be clear: network outages will continue to disrupt the current "chunk". This change isolates failures to that chunk. We'll use our established process (i.e. retrying individual chunks via the Buildbot web interface) to recover in those cases.

foolip commented 5 years ago

Neat screenshot, thanks @jugglinmike!