stephenmcgruer closed this issue 4 years ago
Tried this out on projects/taskcluster-imaging/global/images/docker-worker-gcp-community-googlecompute-2020-04-24t05-52-00z and it appears to work. I'll update this bug after we land the patch and have an official image with the fix (should happen first thing tomorrow). If you want to try that image out in the meantime, it should be functionally the same as the one we release.
Thanks again for all of the help!
Ok, looks like this will be pushed out Monday morning instead. I'll update here when that happens.
Images with this fix are deployed now. My testing of them so far in our workers seems to indicate the workaround fixes things. Please let me know if this either breaks something else or doesn't fix your initial issue!
Should the new images have led to a fix without any extra work on our part?
https://community-tc.services.mozilla.com/tasks/QpzzvdbIQlSWamTGOyX4kg is a recent failure due to network issues, 10 hours ago.
It should. That error came from a different place with a different exception (Connection broken), but it's possible that it was also a connection reset.
Is the connection broken issue happening as frequently as the reset issues from before? Also did those errors seem to reduce in frequency this week? We don't have any empirical data to show any success other than our direct testing before release.
Anecdotally, I think it's a lot better this week. We are trying to chase down the root cause internally, too. I'll follow up here once we have an update. Thanks again, everyone!
Tentatively switching to roadmap since we seem to be in a better place re connection resets.
@Hexcles can you summarize the outcome of the internal investigations (if appropriate), and then we can close this out?
Unfortunately, the internal investigation has somewhat stalled.
The current consensus is that something is wrong with Docker's iptables configuration w.r.t. NAT into the containers. The "workaround" we applied should have been the default configuration. This workaround is being applied on an ad-hoc basis in various places.
Since there is nothing actionable on our side, I'm closing this issue.
There is an ongoing Mozilla infrastructure issue that is causing network failures in TaskCluster runs when attempting to fetch Firefox testing profiles:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='hg.mozilla.org', port=443): Max retries exceeded with url: /mozilla-central/archive/tip.zip/testing/profiles/ (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f5d88c94f50>: Failed to establish a new connection: [Errno 110] Connection timed out',))
This is just a tracking issue to link to from blocked PRs; I will attempt to post updates here, but see https://status.mozilla.org/ for the latest information on the outage.