web-platform-tests / wpt

Test suites for Web platform specs — including WHATWG, W3C, and others
https://web-platform-tests.org/

TaskCluster sometimes fails due to network issues #21529

Closed stephenmcgruer closed 4 years ago

stephenmcgruer commented 4 years ago

There is an ongoing Mozilla infrastructure issue that is causing network failures in TaskCluster runs when attempting to fetch Firefox testing profiles:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='hg.mozilla.org', port=443): Max retries exceeded with url: /mozilla-central/archive/tip.zip/testing/profiles/ (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f5d88c94f50>: Failed to establish a new connection: [Errno 110] Connection timed out',))

This is just a tracking issue to link to from blocked PRs; I will attempt to post updates here, but see https://status.mozilla.org/ for the latest information on the outage.
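For transient failures like the timeout above, a generic retry wrapper with exponential backoff is one common mitigation. The following is only a minimal sketch, not wpt's or TaskCluster's actual code: `fetch_with_retries`, its parameters, and the backoff schedule are all hypothetical. Note that `requests.exceptions.ConnectionError` does not derive from Python's built-in `ConnectionError`, so it would need to be passed explicitly via `retry_on` when wrapping a `requests` call.

```python
import time


def fetch_with_retries(fetch, attempts=4, base_delay=1.0, sleep=time.sleep,
                       retry_on=(ConnectionError, TimeoutError)):
    """Call fetch() until it succeeds or the attempts run out.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    between tries and re-raises the last error once attempts are exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the last error
            sleep(base_delay * 2 ** attempt)
```

A caller would pass a zero-argument callable that performs the download, for example a `lambda` wrapping the HTTP request. Retrying only helps with intermittent failures; it does nothing during a sustained outage.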

imbstack commented 4 years ago

Tried this out on projects/taskcluster-imaging/global/images/docker-worker-gcp-community-googlecompute-2020-04-24t05-52-00z and it appears to work. I'll update this bug after we land the patch and have an official image with the fix (should happen first thing tomorrow), but if you want to try that image out, it should be functionally the same as the one we release.

Thanks again for all of the help!

imbstack commented 4 years ago

Ok, looks like this will be pushed out Monday morning instead. I'll update here when that happens.

imbstack commented 4 years ago

Images with this fix are deployed now. My testing of them so far in our workers seems to indicate the workaround fixes things. Please let me know if this either breaks something else or doesn't fix your initial issue!

foolip commented 4 years ago

Should the new images have led to a fix without any extra work on our part?

https://community-tc.services.mozilla.com/tasks/QpzzvdbIQlSWamTGOyX4kg is a recent failure due to network issues, 10 hours ago.

Hexcles commented 4 years ago

It should. That error came from a different place with a different exception (Connection broken) but it's possible that it was also a connection reset.

imbstack commented 4 years ago

Is the "Connection broken" issue happening as frequently as the reset issues from before? Also, did those errors seem to reduce in frequency this week? We don't have any empirical data to show success other than our direct testing before release.
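One rough way to gather the kind of frequency data asked about here is to bucket task log lines by error message and compare counts week over week. This is only a sketch under assumptions: `tally_network_errors` and the `PATTERNS` table are hypothetical, and apart from "Connection timed out" (which appears in the traceback quoted at the top of this issue) the substrings are guesses at what the logs contain.

```python
from collections import Counter

# Hypothetical label -> substring table; adjust to the messages actually
# seen in the task logs.
PATTERNS = {
    "timeout": "connection timed out",
    "reset": "connection reset",
    "broken": "connection broken",
}


def tally_network_errors(lines):
    """Count log lines per error category (case-insensitive substring match)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for label, needle in PATTERNS.items():
            if needle in lowered:
                counts[label] += 1
                break  # count each line at most once
    return counts
```

Running this over logs from before and after the new images were deployed would give a crude comparison, though it says nothing about root cause.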

Hexcles commented 4 years ago

Anecdotally, I think it's a lot better this week. We are trying to chase down the root cause internally, too. I'll follow up here once we have an update. Thanks again, everyone!

On Thu, Apr 30, 2020 at 4:34 PM Brian Stack wrote:

> Is the connection broken issue happening as frequently as the reset issues from before? Also did those errors seem to reduce in frequency this week? We don't have any empirical data to show any success other than our direct testing before release.

stephenmcgruer commented 4 years ago

Tentatively switching to roadmap since we seem to be in a better place re connection resets.

stephenmcgruer commented 4 years ago

@Hexcles can you summarize the outcome of the internal investigations (if appropriate), and then we can close this out?

Hexcles commented 4 years ago

Unfortunately, the internal investigation has somewhat stalled.

The current consensus is that something is wrong with Docker's iptables configuration w.r.t. NAT into the containers. The "workaround" we applied should have been the default configuration. This workaround is being applied on an ad-hoc basis in various places.

Hexcles commented 4 years ago

Since there is nothing actionable on our side, I'm closing this issue.