web-platform-tests / wpt

Test suites for Web platform specs — including WHATWG, W3C, and others
https://web-platform-tests.org/

TaskCluster sometimes fails due to network issues #21529

Closed stephenmcgruer closed 4 years ago

stephenmcgruer commented 4 years ago

There is an ongoing Mozilla infrastructure issue that is causing network failures in TaskCluster runs when attempting to fetch Firefox testing profiles:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='hg.mozilla.org', port=443): Max retries exceeded with url: /mozilla-central/archive/tip.zip/testing/profiles/ (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f5d88c94f50>: Failed to establish a new connection: [Errno 110] Connection timed out',))

This is just a tracking issue to link to from blocked PRs; I will attempt to post updates here, but see https://status.mozilla.org/ for the latest information on the outage.

stephenmcgruer commented 4 years ago

https://status.mozilla.org/ reports that everything is green now. If your PR failed due to this issue, you should be able to close and re-open the PR to trigger the checks to re-run.

Hexcles commented 4 years ago

I also saw some 5xx errors from this URL earlier this week. This has been happening relatively often.

I think the testing profile needs to be treated in the same way as #21194 (which is also more technically correct), with a retry mechanism. @jgraham

Hexcles commented 4 years ago

I think we have a more general problem with networking. Anecdotally, I'm seeing more failures due to network errors recently (though I don't have any real stats). I think it makes sense to implement a generic retry mechanism for all the downloads.

stephenmcgruer commented 4 years ago

^ there is some data, @Hexcles. I just triaged 11 blocked Chromium exports, and 8 of them had red tasks due to network failures while downloading Chrome. Will see how many of the 8 fail again on retry...

stephenmcgruer commented 4 years ago

(I'm going to stop linking PRs to this issue because it just creates useless spam, but generally note that this is causing big pain for me sheriffing Chromium exports atm).

For a retry mechanism, we already have download_url_to_descriptor in run_tc.py, which might just work for Chrome too.
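To illustrate the shape of such a helper (a minimal sketch, not the actual download_url_to_descriptor implementation; the function name, signature, and URL below are illustrative assumptions):

import time
import urllib2  # Python 2, matching the CI image


def download_to_descriptor(url, fd, retries=3, chunk_size=64 * 1024):
    """Download url into an already-open file object, retrying on transient errors."""
    for attempt in range(retries):
        # Discard any partial data from a previous failed attempt.
        fd.seek(0)
        fd.truncate()
        try:
            resp = urllib2.urlopen(url)
            while True:
                chunk = resp.read(chunk_size)
                if not chunk:
                    return True
                fd.write(chunk)
        except IOError as e:  # urllib2.URLError and socket.error both subclass IOError
            print("Download of %s failed (attempt %d/%d): %s" % (url, attempt + 1, retries, e))
            time.sleep(1)
    return False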

stephenmcgruer commented 4 years ago

For the record: I ran a urllib2.urlopen(...); resp.read() loop on my machine for a while and couldn't get dl.google.com to flake, so it seems related to the TaskCluster infrastructure somehow.
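(The loop was roughly along these lines -- the URL and iteration count here are placeholders, not the exact script:)

import urllib2  # Python 2, as in the task image

URL = "https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb"  # illustrative Chrome download URL

for i in range(100):
    resp = urllib2.urlopen(URL)
    data = resp.read()  # a mid-download reset would raise an error here
    print("attempt %d: read %d bytes" % (i, len(data)))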

stephenmcgruer commented 4 years ago

May not be relevant, but it was pointed out to me that the flake happens during SSL work:

Traceback (most recent call last):
  File "./tools/ci/run_tc.py", line 486, in <module>
    main()
  File "./tools/ci/run_tc.py", line 478, in main
    setup_environment(args)
  File "./tools/ci/run_tc.py", line 320, in setup_environment
    install_chrome(args.channel)
  File "./tools/ci/run_tc.py", line 155, in install_chrome
    f.write(resp.read())
  File "/usr/lib/python2.7/socket.py", line 355, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib/python2.7/httplib.py", line 607, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/socket.py", line 384, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/ssl.py", line 772, in recv
    return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 659, in read
    v = self._sslobj.read(len)
socket.error: [Errno 104] Connection reset by peer

Could there be something wrong with TaskCluster's cert or HTTPS stack?

Hexcles commented 4 years ago

For a retry mechanism, we already have download_url_to_descriptor in run_tc.py, that might just work for Chrome too.

Sweet!

Could there be something wrong with TaskCluster's cert or HTTPS stack?

I'd like to say that's unlikely, but the flake rate is really high. Note that these machines are on GCP, so they really shouldn't have a hard time reaching Google...

Hexcles commented 4 years ago

I just triaged 11 blocked Chromium exports, and 8 of them had red tasks due to network failure downloading Chrome. Will see how many of the 8 fail again on retry...

This is definitely higher than my anecdotal data. Is it getting worse somehow?

jgraham commented 4 years ago

@djmitche Do you have any idea what might be going on here?

djmitche commented 4 years ago

couldn't get dl.google.com to flake, so it seems related to the TaskCluster infrastructure somehow.

I think you are referring to the worker infrastructure here, as from what I can tell the issue in https://github.com/web-platform-tests/wpt/issues/22879 is downloading from dl.google.com. I can't see any problems after clicking a few other linked PRs above, and the comments earlier in this issue seem to be about something unrelated, so #22879 is all I've got to go on.

The proj-wpt/ci workers run in Google's cloud (GCE), so I'd be surprised to find there were any network issues between them and dl.google.com. Is it possible that there's some abuse-prevention mechanism involved?

Could there be something wrong with TaskCluster's cert or HTTPS stack?

This confuses me, because as far as I can see the issues aren't occurring in talking to Taskcluster.

Hexcles commented 4 years ago

@stephenmcgruer was most likely referring to our Docker image (e.g. having out-of-date root CAs or OpenSSL), but I don't think such issues would end up causing transient failures.

And yes we are talking about connecting to external network (dl.google.com in recent examples, and mozilla.org in earlier incidents) from within a Docker worker here.

stephenmcgruer commented 4 years ago

Edit: Robert summarized what I typed much more clearly in the comment above :)

This confuses me, because as far as I can see the issues aren't occurring in talking to Taskcluster.

I think some terminology is misaligned here; we tend to say 'TaskCluster' to refer to anything that happens in the TaskCluster checks, including any worker tasks they contain, etc.

I can't see any problems after clicking a few other linked PRs above

Sorry, I didn't capture specific links to the taskcluster groups and was aggressively retrying those PRs to unblock Chromium exports (you may notice some of them were closed/reopened - our retry mechanism - many times before they merged).

stephenmcgruer commented 4 years ago

We now have a 3-retry policy for downloading Chrome, and it is still failing for some PRs. See https://github.com/web-platform-tests/wpt/pull/22959#issuecomment-613942905

This is bad. I think we have two big action items here:

  1. Get a better handle on the scope of impact. We can use the GitHub API to scrape all PRs for all checks, but we would then need to dive into the TaskCluster results and pull logs to see what actually failed (rough sketch after this list).
  2. Figure out why this is failing. Unfortunately I have very little idea how to do that, but I guess I'll spend today poking at it...
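
For item 1, a rough sketch of what the scraping could look like (using the public GitHub REST API; the library choice and filtering here are assumptions, not an existing wpt tool):

import requests

API = "https://api.github.com/repos/web-platform-tests/wpt"
HEADERS = {"Accept": "application/vnd.github.v3+json"}  # add an Authorization token for higher rate limits


def check_conclusions(pr_number):
    """Map check-run name -> conclusion for the head commit of a PR."""
    pr = requests.get("%s/pulls/%d" % (API, pr_number), headers=HEADERS).json()
    sha = pr["head"]["sha"]
    runs = requests.get("%s/commits/%s/check-runs" % (API, sha), headers=HEADERS).json()
    return {run["name"]: run["conclusion"] for run in runs.get("check_runs", [])}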

djmitche commented 4 years ago

We ran into something vaguely like this with downloads from a Debian server (not one of the things automatically mirrored into EC2) for Firefox CI. Debian's server was blacklisting entire /24's of EC2's public address space when abused by a single /32, and that added up to some nontrivial portion of EC2 not allowed to access the server. If I recall, that abuse mechanism just blackholed traffic, so we got connection timeouts, but different tools do different things and resetting the connection after establishing it is not unthinkable.

If that's the case, then some caching on the workers might help. From https://github.com/web-platform-tests/wpt/blob/4de32a3ad4cc9b02c1cd07040a68956128a9b1f9/tools/ci/run_tc.py#L152-L156 it looks like the download is unconditional right now, and I don't see any caches defined in a random task I clicked on.

The cert angle is possible, too. I vaguely recall that sometimes cert validation errors show up with this error message, but that recollection needs verifying. I don't know what kind of CA bundle is in that docker image -- maybe it's old? If dl.google.com is hosted on multiple endpoints, it may be that some of those have a newer cert than others, and that newer cert fails validation. This idea, at least, should be relatively easy to verify, perhaps by zeroing out the CA store in a docker container using the same image as the tasks, and trying to connect to dl.google.com. If the error message is not "connection reset by peer", then this idea is no good.
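One cheap way to test the CA-store theory (a sketch, not tied to the actual task image): handshake against dl.google.com with an empty trust store and see what error comes back -- a validation problem should surface as a certificate error, not a connection reset.

import socket
import ssl

# Fresh context with no CA certificates loaded, so every server certificate fails validation.
ctx = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
ctx.verify_mode = ssl.CERT_REQUIRED

sock = socket.create_connection(("dl.google.com", 443))
try:
    ctx.wrap_socket(sock, server_hostname="dl.google.com")
except ssl.SSLError as e:
    # Expect CERTIFICATE_VERIFY_FAILED here, not "connection reset by peer".
    print("handshake failed: %s" % e)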

For the record: I ran a urllib2.urlopen(...); resp.read() loop on my machine for a while and couldn't get dl.google.com to flake, so it seems related to the TaskCluster infrastructure somehow.

It may be worth re-trying this using the docker image used in the tasks, to identify any issues particular to Python-2.7 or other properties of that image.

stephenmcgruer commented 4 years ago

I've reached out to the dl.google.com team internally to ask about any outages or if we might be being rate limited.

Hexcles commented 4 years ago

I did some local testing, and couldn't reproduce the failure. Although it still does not rule out the possibility in https://github.com/web-platform-tests/wpt/issues/21529#issuecomment-614037541, it's becoming increasingly likely that we are hitting some kind of rate limit. I'm going to add exponential backoff to the retry logic to make it more robust, as well as update the Docker image just in case.
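The backoff is roughly of this shape (a sketch; the names and parameter values are illustrative, not the exact wpt change):

import random
import time
import urllib2


def get_with_backoff(url, retries=5, initial_delay=1.0):
    """Fetch url, sleeping exponentially longer (plus jitter) after each failure."""
    delay = initial_delay
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url).read()
        except IOError as e:  # covers urllib2.URLError and socket.error on Python 2
            if attempt == retries - 1:
                raise
            print("attempt %d failed (%s); retrying in %.1fs" % (attempt + 1, e, delay))
            time.sleep(delay + random.uniform(0, 1))
            delay *= 2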

Hexcles commented 4 years ago

(looping back from an internal channel)

@djmitche does Taskcluster use Cloud NAT on GCE? If so, do we manually configure NAT addresses? https://cloud.google.com/nat/docs/overview#specs-addresses-ports Another possibility is that we might be running out of outbound NAT connections.

djmitche commented 4 years ago

@imbstack can you answer that ^^?

imbstack commented 4 years ago

It doesn't look like we're using Cloud NAT in the project with the wpt workers.

stephenmcgruer commented 4 years ago

In https://community-tc.services.mozilla.com/tasks/ObeRTrD-SQGhkdTeKpdE0A/runs/0/logs/https%3A%2F%2Fcommunity-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FObeRTrD-SQGhkdTeKpdE0A%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log we just saw a similar error, but when trying to install something via pip:

  File "/home/test/web-platform-tests/_venv2/local/lib/python2.7/site-packages/pip/_internal/legacy_resolve.py", line 282, in _get_abstract_dist_for
    abstract_dist = self.preparer.prepare_linked_requirement(req)
  File "/home/test/web-platform-tests/_venv2/local/lib/python2.7/site-packages/pip/_internal/operations/prepare.py", line 482, in prepare_linked_requirement
    hashes=hashes,
  File "/home/test/web-platform-tests/_venv2/local/lib/python2.7/site-packages/pip/_internal/operations/prepare.py", line 287, in unpack_url
    hashes=hashes,
  File "/home/test/web-platform-tests/_venv2/local/lib/python2.7/site-packages/pip/_internal/operations/prepare.py", line 159, in unpack_http_url
    link, downloader, temp_dir.path, hashes
  File "/home/test/web-platform-tests/_venv2/local/lib/python2.7/site-packages/pip/_internal/operations/prepare.py", line 303, in _download_http_url
    for chunk in download.chunks:
  File "/home/test/web-platform-tests/_venv2/local/lib/python2.7/site-packages/pip/_internal/utils/ui.py", line 160, in iter
    for x in it:
  File "/home/test/web-platform-tests/_venv2/local/lib/python2.7/site-packages/pip/_internal/network/utils.py", line 39, in response_chunks
    decode_content=False,
  File "/home/test/web-platform-tests/_venv2/local/lib/python2.7/site-packages/pip/_vendor/urllib3/response.py", line 564, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/home/test/web-platform-tests/_venv2/local/lib/python2.7/site-packages/pip/_vendor/urllib3/response.py", line 529, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/test/web-platform-tests/_venv2/local/lib/python2.7/site-packages/pip/_vendor/urllib3/response.py", line 443, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
ProtocolError: ("Connection broken: error(104, 'Connection reset by peer')", error(104, 'Connection reset by peer'))
CRITICAL:tools.wpt.utils:('/home/test/web-platform-tests/_venv2/bin/pip', 'install', '--prefer-binary', '-r', u'/home/test/web-platform-tests/tools/wpt/../wptrunner/requirements.txt') exited with return code 2
CRITICAL:tools.wpt.utils:Collecting html5lib==1.0.1
  Downloading html5lib-1.0.1-py2.py3-none-any.whl (117 kB)

Could be a red herring, of course.

djmitche commented 4 years ago

There's a high probability that's a red herring, it's true -- perhaps a PyPI node was rebooted while you were connected to it, etc. But that traceback is in the HTTP receive codepath, meaning that at least a few RTTs to the server had been successful (to set up TLS, at least, and depending on timing to send the HTTP request).

Would it be helpful to use urllib3 in the chrome downloads, in hopes of getting more informative errors?
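For example, urllib3 has a built-in Retry helper and raises fairly specific exception types (a sketch; the URL and parameters are illustrative, not a proposed patch):

import urllib3
from urllib3.util.retry import Retry

# Retry connection errors and common transient HTTP statuses, with exponential backoff.
http = urllib3.PoolManager(retries=Retry(total=5, backoff_factor=1,
                                         status_forcelist=[500, 502, 503, 504]))

resp = http.request("GET",
                    "https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb",
                    preload_content=False)
with open("google-chrome-stable_current_amd64.deb", "wb") as f:
    for chunk in resp.stream(64 * 1024):
        f.write(chunk)
resp.release_conn()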

Hexcles commented 4 years ago

Here's another recent example with a different symptom: https://community-tc.services.mozilla.com/tasks/Xz4dfEvHRsSsItvXi8dfPw/runs/0/logs/https%3A%2F%2Fcommunity-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FXz4dfEvHRsSsItvXi8dfPw%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log#L558

Traceback (most recent call last):
  File "./wpt", line 12, in <module>
    wpt.main()
  File "/home/test/web-platform-tests/tools/wpt/wpt.py", line 167, in main
    rv = script(*args, **kwargs)
  File "/home/test/web-platform-tests/tools/wpt/run.py", line 804, in run
    **kwargs)
  File "/home/test/web-platform-tests/tools/wpt/run.py", line 774, in setup_wptrunner
    setup_cls.setup(kwargs)
  File "/home/test/web-platform-tests/tools/wpt/run.py", line 184, in setup
    self.setup_kwargs(kwargs)
  File "/home/test/web-platform-tests/tools/wpt/run.py", line 346, in setup_kwargs
    browser_binary=kwargs["binary"],
  File "/home/test/web-platform-tests/tools/wpt/browser.py", line 644, in install_webdriver
    self.version(browser_binary), dest)
  File "/home/test/web-platform-tests/tools/wpt/browser.py", line 632, in install_webdriver_by_version
    unzip(get(url).raw, dest)
  File "/home/test/web-platform-tests/tools/wpt/utils.py", line 78, in unzip
    fileobj = seekable(fileobj)
  File "/home/test/web-platform-tests/tools/wpt/utils.py", line 62, in seekable
    return BytesIO(fileobj.read())
  File "/home/test/web-platform-tests/_venv2/lib/python2.7/site-packages/urllib3/response.py", line 529, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/test/web-platform-tests/_venv2/lib/python2.7/site-packages/urllib3/response.py", line 443, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(4463224 bytes read, 2339868 more expected)', IncompleteRead(4463224 bytes read, 2339868 more expected))

Hexcles commented 4 years ago

For the record (I forgot to link to the issue from the PR), I landed #22983 trying to make the retry more robust (ironically, the failure above happened right afterwards in a different network transaction).

djmitche commented 4 years ago

That looks like a partial download -- more than just headers at any rate. For that to occur, rather than a timeout, requires some positive traffic from the remote side (or from some intervening box on the network): either a RST or a FIN. But, that could be an unrelated error, too.

stephenmcgruer commented 4 years ago

https://community-tc.services.mozilla.com/tasks/JgGhdIl1Rvqo0VfhdPowZw/runs/0/logs/https%3A%2F%2Fcommunity-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FJgGhdIl1Rvqo0VfhdPowZw%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log contains another failed DL of webdriver; this one uses urllib3 apparently (based on the stack) but I don't see much more than another connection being reset.

djmitche commented 4 years ago

So for webdriver at least, the errors seem to be resetting during the connection, vs. some more permanent inability to connect.

stephenmcgruer commented 4 years ago

@imbstack - want to double-check something I'm not sure about. Above, @Hexcles asked about Cloud NAT and about running out of NAT connections. You replied saying you didn't think Cloud NAT was being used. Does that also imply that you think we are not running out of NAT connections?

Hexcles commented 4 years ago

Looking at worker logs, it seems that they all have public IPs (example) which implies no NAT.

imbstack commented 4 years ago

Yeah, I think we have the simplest networking setup in that project. I admit I'm not an expert in gcloud networking though. If there are other things you'd like me to check, I'm happy to help!

I didn't dig very deep into the precise meaning of these metrics but the network performance dashboard for packet loss looks pretty unremarkable for that project over the last 6 weeks.

[image: packet-loss view of the GCP network performance dashboard for the project]

Are there other sources of metrics at the vpc level that you'd like me to collect? One thing I was thinking is that we could make some subnets for your workers and then turn on flow logs for them. Would that be helpful?

stephenmcgruer commented 4 years ago

Keeping up the record from our side; on Friday afternoon, @Hexcles was able to SSH into a TaskCluster VM (well, into the Docker container running on the VM). They were able to easily reproduce the issue by pointing curl at dl.google.com, and have captured a tcpdump and some traceroutes for the Google-internal team to look at. Will update when we get a reply back from them.

Hexcles commented 4 years ago

So it turns out there's no RST at all in the captured packets.

Looking at the packet lengths, they are much bigger than MTU, which suggests TCP segmentation offloading is in effect. It's possible that this could contribute to the networking issues. @imbstack are you aware of any special NIC setup on the hosts?

To further investigate, we can try disabling offloading. However, we don't have the right permissions to do that inside Docker. We need to run sudo ethtool -K eth0 rx off gso off. I think this might require docker containers to have the SYS_ADMIN capability.

djmitche commented 4 years ago

@imbstack is going to look more into the machine configuration. Can you share the tcpdump with me?

imbstack commented 4 years ago

We don't explicitly set anything about networking in our images afaict. It is pretty much an image from gcloud with docker and our worker stuff on top.

So, with that, the settings on an instance created from one of these images are:

Features for ens4:
rx-checksumming: on [fixed]
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: on [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]

I'll try to reproduce this behavior shortly both on the host directly and in a container. I'd like to dig a bit deeper and show what exactly is wrong before we start changing configuration for all of our workers.

Hexcles commented 4 years ago

I'd like to dig a bit deeper and show what exactly is wrong before we start changing configuration for all of our workers.

Definitely! I was thinking about changing the setting on only the WPT workers temporarily (perhaps for a few hours) and collecting some before/after stats.

This is the script I used for reproducing the issue and capturing packets: https://gist.github.com/Hexcles/f957d7406036ff20e5397ab94fb5e70c (Note that for some reason I found it easier to reproduce the issue right after the container became interactive.)

And I pushed two pcaps to a repo... https://github.com/Hexcles/temp capture.pcap was from curl, which used HTTP/2 underneath, whereas capture.pcap2 was from the script above. Note that neither contains any RST packet, but the errors were both "connection reset by peer" from OpenSSL.

djmitche commented 4 years ago

Summarizing those:

capture (curl)

Starts off like a normal TLS connection: SYN/SYN+ACK/ACK, a few chunks of data back and forth to negotiate TLS, and then transmitting data:

22:02:00.113253 IP 172.17.0.3.43940 > 74.125.31.91.443: Flags [S], seq 3561708667, win 29200, options [mss 1460,sackOK,TS val 3467721199 ecr 0,nop,wscale 7], length 0                        
22:02:00.114736 IP 74.125.31.91.443 > 172.17.0.3.43940: Flags [S.], seq 1378248528, ack 3561708668, win 65535, options [mss 1420,sackOK,TS val 346350862 ecr 3467721199,nop,wscale 8], length 0
22:02:00.114767 IP 172.17.0.3.43940 > 74.125.31.91.443: Flags [.], ack 1, win 229, options [nop,nop,TS val 3467721201 ecr 346350862], length 0
22:02:00.123492 IP 172.17.0.3.43940 > 74.125.31.91.443: Flags [P.], seq 1:518, ack 1, win 229, options [nop,nop,TS val 3467721209 ecr 346350862], length 517
22:02:00.123772 IP 74.125.31.91.443 > 172.17.0.3.43940: Flags [.], ack 518, win 261, options [nop,nop,TS val 346350872 ecr 3467721209], length 0
22:02:00.124282 IP 74.125.31.91.443 > 172.17.0.3.43940: Flags [P.], seq 1:3797, ack 518, win 261, options [nop,nop,TS val 346350872 ecr 3467721209], length 3796
22:02:00.124304 IP 172.17.0.3.43940 > 74.125.31.91.443: Flags [.], ack 3797, win 288, options [nop,nop,TS val 3467721210 ecr 346350872], length 0
22:02:00.125373 IP 172.17.0.3.43940 > 74.125.31.91.443: Flags [P.], seq 518:598, ack 3797, win 288, options [nop,nop,TS val 3467721211 ecr 346350872], length 80

...and that connection is still going when the capture ends, with about half of the chrome download transmitted:

22:02:00.458801 IP 74.125.31.91.443 > 172.17.0.3.43940: Flags [.], seq 35818456:35819864, ack 860, win 261, options [nop,nop,TS val 346351203 ecr 3467721540], length 1408
22:02:00.458806 IP 172.17.0.3.43940 > 74.125.31.91.443: Flags [.], ack 36497112, win 3844, options [nop,nop,TS val 3467721545 ecr 346351207,nop,nop,sack 1 {35818456:35819864}], length 0
22:02:00.458803 IP 74.125.31.91.443 > 172.17.0.3.43940: Flags [.], seq 36497112:36553432, ack 860, win 261, options [nop,nop,TS val 346351207 ecr 3467721544], length 56320
22:02:00.458826 IP 172.17.0.3.43940 > 74.125.31.91.443: Flags [.], ack 36553432, win 3610, options [nop,nop,TS val 3467721545 ecr 346351207], length 0
22:02:00.458926 IP 74.125.31.91.443 > 172.17.0.3.43940: Flags [.], seq 36553432:36609752, ack 860, win 261, options [nop,nop,TS val 346351207 ecr 3467721544], length 56320
22:02:00.458958 IP 172.17.0.3.43940 > 74.125.31.91.443: Flags [.], ack 36609752, win 4069, options [nop,nop,TS val 3467721545 ecr 346351207], length 0

So, I don't see any error here at all. Did something go wrong in the capture script? Wrong capture file?

capture2 (download.py)

This seems to start off at the very end of a previous connection:

21:52:49.992448 IP 172.17.0.3.52476 > 140.82.112.3.443: Flags [P.], seq 1129557557:1129557588, ack 1599830371, win 2199, options [nop,nop,TS val 3092645239 ecr 2576707487], length 31
21:52:49.993243 IP 140.82.112.3.443 > 172.17.0.3.52476: Flags [P.], seq 1:32, ack 31, win 30, options [nop,nop,TS val 2576708588 ecr 3092645239], length 31
21:52:49.993263 IP 172.17.0.3.52476 > 140.82.112.3.443: Flags [.], ack 32, win 2199, options [nop,nop,TS val 3092645240 ecr 2576708588], length 0
21:52:49.993314 IP 140.82.112.3.443 > 172.17.0.3.52476: Flags [F.], seq 32, ack 31, win 30, options [nop,nop,TS val 2576708588 ecr 3092645239], length 0
21:52:49.998060 IP 172.17.0.3.52476 > 140.82.112.3.443: Flags [R.], seq 31, ack 33, win 2199, options [nop,nop,TS val 3092645244 ecr 2576708588], length 0

Since tcpdump didn't see the beginning of this connection, it doesn't know what the initial offset is, so we don't know how much data changed hands over the connection. seq/ack values after the first line are relative to the values in the first line. However, the first packet is a small PSH from the client to the server, which is certainly unexpected late in a download, but is smaller than . The server responds with 32 bytes, ACKing the 32 bytes from the local side. The local side ACKs those 32 bytes from the server. Then the server sends a FIN, and we reply with a RST. So that's a "Connection Reset by Peer" but the local side was the one to send the reset! This is an "abortive release". So, that's weird.

It then sets up a normal connection using a new source port:

21:52:52.028837 IP 172.17.0.3.45462 > 172.217.7.174.443: Flags [S], seq 3529545878, win 29200, options [mss 1460,sackOK,TS val 287187870 ecr 0,nop,wscale 7], length 0
21:52:52.029746 IP 172.217.7.174.443 > 172.17.0.3.45462: Flags [S.], seq 157312410, ack 3529545879, win 65535, options [mss 1420,sackOK,TS val 3776832354 ecr 287187870,nop,wscale 8], length 0
21:52:52.029768 IP 172.17.0.3.45462 > 172.217.7.174.443: Flags [.], ack 1, win 229, options [nop,nop,TS val 287187871 ecr 3776832354], length 0
21:52:52.030050 IP 172.17.0.3.45462 > 172.217.7.174.443: Flags [P.], seq 1:518, ack 1, win 229, options [nop,nop,TS val 287187871 ecr 3776832354], length 517

continues normally for 333428μs, then after 68554329 bytes on the TCP connection (where the file itself is currently 68352454 bytes, or 201875 fewer bytes -- this might be a different build than when you ran these tests, though!):

21:52:52.547995 IP 172.217.7.174.443 > 172.17.0.3.45462: Flags [FP.], seq 68539927:68554329, ack 789, win 265, options [nop,nop,TS val 3776832872 ecr 287188389], length 14402
21:52:52.548017 IP 172.17.0.3.45462 > 172.217.7.174.443: Flags [.], ack 68554330, win 2940, options [nop,nop,TS val 287188389 ecr 3776832872], length 0
21:52:52.550522 IP 172.17.0.3.45462 > 172.217.7.174.443: Flags [F.], seq 789, ack 68554330, win 7142, options [nop,nop,TS val 287188392 ecr 3776832872], length 0
21:52:52.550634 IP 172.217.7.174.443 > 172.17.0.3.45462: Flags [.], ack 790, win 265, options [nop,nop,TS val 3776832875 ecr 287188392], length 0

the server pushes a last data packet, with FIN set. Local ACKs the data, sends a FIN of its own, and the server ACKs that FIN. So, that all looks like it finished downloading the file and closed the connection. Then it starts right up with a new one:

21:52:53.369939 IP 172.17.0.3.45468 > 172.217.7.174.443: Flags [S], seq 3745701162, win 29200, options [mss 1460,sackOK,TS val 287189211 ecr 0,nop,wscale 7], length 0
21:52:53.370739 IP 172.217.7.174.443 > 172.17.0.3.45468: Flags [S.], seq 3755206089, ack 3745701163, win 65535, options [mss 1420,sackOK,TS val 860271704 ecr 287189211,nop,wscale 8], length 0
21:52:53.370765 IP 172.17.0.3.45468 > 172.217.7.174.443: Flags [.], ack 1, win 229, options [nop,nop,TS val 287189212 ecr 860271704], length 0
21:52:53.371155 IP 172.17.0.3.45468 > 172.217.7.174.443: Flags [P.], seq 1:518, ack 1, win 229, options [nop,nop,TS val 287189212 ecr 860271704], length 517
21:52:53.371299 IP 172.217.7.174.443 > 172.17.0.3.45468: Flags [.], ack 518, win 261, options [nop,nop,TS val 860271705 ecr 287189212], length 0
21:52:53.372599 IP 172.217.7.174.443 > 172.17.0.3.45468: Flags [P.], seq 1:3787, ack 518, win 261, options [nop,nop,TS val 860271706 ecr 287189212], length 3786
21:52:53.372612 IP 172.17.0.3.45468 > 172.217.7.174.443: Flags [.], ack 3787, win 288, options [nop,nop,TS val 287189214 ecr 860271706], length 0

and the capture ends while that is still in progress, about 13MiB into the file:

21:52:53.646658 IP 172.17.0.3.45468 > 172.217.7.174.443: Flags [.], ack 13648177, win 8736, options [nop,nop,TS val 287189488 ecr 860271980], length 0
21:52:53.646631 IP 172.217.7.174.443 > 172.17.0.3.45468: Flags [.], seq 12410545:12413361, ack 789, win 265, options [nop,nop,TS val 860271973 ecr 287189480], length 2816
21:52:53.646661 IP 172.17.0.3.45468 > 172.217.7.174.443: Flags [.], ack 13648177, win 8736, options [nop,nop,TS val 287189488 ecr 860271980,nop,nop,sack 1 {12410545:12413361}], length 0

local run (download.py, run from my dev host in ec2, also with python2)

This didn't end in an error and looked pretty normal; just for comparison...

18:30:41.979427 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [S], seq 159894160, win 62727, options [mss 8961,sackOK,TS val 4212013565 ecr 0,nop,wscale 7], length 0
18:30:41.985657 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [S.], seq 3672184286, ack 159894161, win 60192, options [mss 1380,sackOK,TS val 1892568680 ecr 4212013565,nop,wscale 8], length 0
18:30:41.985682 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 1, win 491, options [nop,nop,TS val 4212013571 ecr 1892568680], length 0
18:30:41.985914 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [P.], seq 1:518, ack 1, win 491, options [nop,nop,TS val 4212013572 ecr 1892568680], length 517
18:30:41.992153 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], ack 518, win 240, options [nop,nop,TS val 1892568686 ecr 4212013572], length 0
18:30:41.999243 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [P.], seq 1:3787, ack 518, win 240, options [nop,nop,TS val 1892568694 ecr 4212013572], length 3786
18:30:41.999266 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 3787, win 462, options [nop,nop,TS val 4212013585 ecr 1892568694], length 0
18:30:42.000200 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [P.], seq 518:598, ack 3787, win 462, options [nop,nop,TS val 4212013586 ecr 1892568694], length 80

...

18:30:42.992161 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [FP.], seq 68458028:68464940, ack 789, win 244, options [nop,nop,TS val 1892569686 ecr 4212014569], length 6912
18:30:42.992170 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 67608646, win 12311, options [nop,nop,TS val 4212014578 ecr 1892569679,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.992179 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 67608646:67612900, ack 789, win 244, options [nop,nop,TS val 1892569686 ecr 4212014571], length 4254
18:30:42.992188 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 67614318, win 12311, options [nop,nop,TS val 4212014578 ecr 1892569686,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.992195 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 67614318:67617154, ack 789, win 244, options [nop,nop,TS val 1892569686 ecr 4212014571], length 2836
18:30:42.992202 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 67798658, win 12311, options [nop,nop,TS val 4212014578 ecr 1892569686,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.993493 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 67798658:67800076, ack 789, win 244, options [nop,nop,TS val 1892569688 ecr 4212014573], length 1418
18:30:42.993503 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 67800076, win 12311, options [nop,nop,TS val 4212014579 ecr 1892569688,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.993532 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 67800076:67801494, ack 789, win 244, options [nop,nop,TS val 1892569688 ecr 4212014573], length 1418
18:30:42.993539 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 67802912, win 12311, options [nop,nop,TS val 4212014579 ecr 1892569688,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.993544 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 67802912:67804330, ack 789, win 244, options [nop,nop,TS val 1892569688 ecr 4212014573], length 1418
18:30:42.993549 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 67849706, win 12311, options [nop,nop,TS val 4212014579 ecr 1892569688,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.993771 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 67849706:67851124, ack 789, win 244, options [nop,nop,TS val 1892569688 ecr 4212014573], length 1418
18:30:42.993780 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 67852542, win 12311, options [nop,nop,TS val 4212014579 ecr 1892569688,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.993788 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 67852542:67853960, ack 789, win 244, options [nop,nop,TS val 1892569688 ecr 4212014573], length 1418
18:30:42.993793 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 67853960, win 12311, options [nop,nop,TS val 4212014579 ecr 1892569688,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.993826 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 67853960:67858214, ack 789, win 244, options [nop,nop,TS val 1892569688 ecr 4212014573], length 4254
18:30:42.993835 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 67859632, win 12311, options [nop,nop,TS val 4212014579 ecr 1892569688,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.993877 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 67859632:67863886, ack 789, win 244, options [nop,nop,TS val 1892569688 ecr 4212014573], length 4254
18:30:42.993886 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 67863886, win 12311, options [nop,nop,TS val 4212014580 ecr 1892569688,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.993892 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 67863886:67865304, ack 789, win 244, options [nop,nop,TS val 1892569688 ecr 4212014573], length 1418
18:30:42.993897 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 67917770, win 12311, options [nop,nop,TS val 4212014580 ecr 1892569688,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.994231 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 67917770:67920606, ack 789, win 244, options [nop,nop,TS val 1892569689 ecr 4212014574], length 2836
18:30:42.994240 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 67922024, win 12311, options [nop,nop,TS val 4212014580 ecr 1892569689,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.994273 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 67922024:67929114, ack 789, win 244, options [nop,nop,TS val 1892569689 ecr 4212014574], length 7090
18:30:42.994283 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 68046808, win 12311, options [nop,nop,TS val 4212014580 ecr 1892569689,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.995259 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 68046808:68049644, ack 789, win 244, options [nop,nop,TS val 1892569689 ecr 4212014575], length 2836
18:30:42.995277 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 68051062, win 12311, options [nop,nop,TS val 4212014581 ecr 1892569689,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.995289 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 68051062:68052480, ack 789, win 244, options [nop,nop,TS val 1892569690 ecr 4212014575], length 1418
18:30:42.995299 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 68052480, win 12311, options [nop,nop,TS val 4212014581 ecr 1892569690,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.995301 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 68052480:68053898, ack 789, win 244, options [nop,nop,TS val 1892569690 ecr 4212014575], length 1418
18:30:42.995307 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 68174428, win 12311, options [nop,nop,TS val 4212014581 ecr 1892569690,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.996251 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 68174428:68175846, ack 789, win 244, options [nop,nop,TS val 1892569691 ecr 4212014576], length 1418
18:30:42.996263 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 68175846, win 12311, options [nop,nop,TS val 4212014582 ecr 1892569691,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.996280 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 68175846:68178682, ack 789, win 244, options [nop,nop,TS val 1892569691 ecr 4212014576], length 2836
18:30:42.996290 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 68232566, win 12311, options [nop,nop,TS val 4212014582 ecr 1892569691,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.996649 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 68232566:68236820, ack 789, win 244, options [nop,nop,TS val 1892569691 ecr 4212014576], length 4254
18:30:42.996661 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 68238238, win 12311, options [nop,nop,TS val 4212014582 ecr 1892569691,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.996682 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 68238238:68241074, ack 789, win 244, options [nop,nop,TS val 1892569691 ecr 4212014576], length 2836
18:30:42.996690 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 68361604, win 12311, options [nop,nop,TS val 4212014582 ecr 1892569691,nop,nop,sack 3 {68428250:68464941}{68421160:68422578}{68364440:68419742}], length 0
18:30:42.997761 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 68361604:68364440, ack 789, win 244, options [nop,nop,TS val 1892569692 ecr 4212014577], length 2836
18:30:42.997772 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 68419742, win 12311, options [nop,nop,TS val 4212014583 ecr 1892569692,nop,nop,sack 2 {68428250:68464941}{68421160:68422578}], length 0
18:30:42.998149 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 68419742:68421160, ack 789, win 244, options [nop,nop,TS val 1892569692 ecr 4212014577], length 1418
18:30:42.998158 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 68422578, win 12311, options [nop,nop,TS val 4212014584 ecr 1892569692,nop,nop,sack 1 {68428250:68464941}], length 0
18:30:42.998207 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 68422578:68426832, ack 789, win 244, options [nop,nop,TS val 1892569692 ecr 4212014578], length 4254
18:30:42.998215 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 68426832, win 12311, options [nop,nop,TS val 4212014584 ecr 1892569692,nop,nop,sack 1 {68428250:68464941}], length 0
18:30:42.998226 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], seq 68426832:68428250, ack 789, win 244, options [nop,nop,TS val 1892569692 ecr 4212014578], length 1418
18:30:42.998236 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [.], ack 68464941, win 12311, options [nop,nop,TS val 4212014584 ecr 1892569692], length 0
18:30:43.053216 IP 172.31.37.205.60114 > 216.58.217.46.443: Flags [F.], seq 789, ack 68464941, win 12311, options [nop,nop,TS val 4212014639 ecr 1892569692], length 0
18:30:43.059506 IP 216.58.217.46.443 > 172.31.37.205.60114: Flags [.], ack 790, win 244, options [nop,nop,TS val 1892569754 ecr 4212014639], length 0

Sorry that's so long, but I wanted to capture the FINs in both directions. It looks like the server sends a FIN but then has to double back to re-send a bunch of earlier stuff. Only when the local end has received everything it wanted does it finally send its own FIN, which the server ACKs and everyone goes home happy.

Summary

I think the only evidence of a problem here is in the "weird" first few packets in capture2, which are part of a connection I can't see all of. In my local testing, I noticed that pkill tcpdump could easily kill tcpdump before it had dumped the last few packets, so perhaps that's why capture is missing anything interesting? Similarly, tcpdump & takes a while to start up, and thus misses some initial packets, too. Might I suggest changing download.py to just do a single download and using something like this (using the standard solution to any race condition: sleep!):

ok=0
while [ $ok = 0 ]; do
  sudo tcpdump -w capture.pcap ip host dl.google.com  & sleep 1; python2 download.py; ok=$?; sleep 1; sudo pkill tcpdump
done

Anecdotally, I saw a bit more packet loss between my EC2 host and dl.google.com, reflected in more SACK options, but it didn't seem enough to cause issues and is pretty much expected, as some kind of RED algorithm is probably at play.

Hexcles commented 4 years ago

Thanks for the detailed investigation, @djmitche !

Regarding capture.pcap, yes, I didn't see any error in the captured packets either. curl failed due to curl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104 and the download was only about half the size. This might be a red herring (the error might have occurred at a higher layer, e.g. TLS or HTTP/2). I suggest not investigating this one further and focusing on replicating the prod issue as closely as possible (i.e. HTTP/1.1 using urllib2).

And great point about the startup delay and forceful termination of tcpdump! Let me add some sleep and make sure the capture is complete. Also capture2.pcap was unnecessarily big and confusing because it actually contained a few successful downloads; let me move tcpdump into the while loop instead.

Hexcles commented 4 years ago

Alright, with the new script, I finally managed to capture some RST packets :) https://github.com/Hexcles/temp/blob/master/capture3.pcap

djmitche commented 4 years ago

Interesting! I turned off the relative sequence numbers here.

It starts off pretty normally:

21:01:17.641556 IP 172.17.0.3.47834 > 172.217.203.136.443: Flags [S], seq 2158082846, win 29200, options [mss 1460,sackOK,TS val 3535710005 ecr 0,nop,wscale 7], length 0
21:01:17.642454 IP 172.217.203.136.443 > 172.17.0.3.47834: Flags [S.], seq 525833901, ack 2158082847, win 65535, options [mss 1420,sackOK,TS val 3225189697 ecr 3535710005,nop,wscale 8], length 0
21:01:17.642490 IP 172.17.0.3.47834 > 172.217.203.136.443: Flags [.], ack 525833902, win 229, options [nop,nop,TS val 3535710006 ecr 3225189697], length 0
21:01:17.642876 IP 172.17.0.3.47834 > 172.217.203.136.443: Flags [P.], seq 2158082847:2158083364, ack 525833902, win 229, options [nop,nop,TS val 3535710007 ecr 3225189697], length 517

and things seem to go along swimmingly until about half of the data has been transferred, then:

21:01:17.917681 IP 172.217.203.136.443 > 172.17.0.3.47834: Flags [.], seq 555214821:555216229, ack 2158083635, win 265, options [nop,nop,TS val 3225189973 ecr 3535710281], length 1408
21:01:17.917734 IP 172.17.0.3.47834 > 172.217.203.136.443: Flags [.], ack 555216229, win 4130, options [nop,nop,TS val 3535710282 ecr 3225189973], length 0
21:01:17.917690 IP 172.217.203.136.443 > 172.17.0.3.47834: Flags [.], seq 555216229:555221861, ack 2158083635, win 265, options [nop,nop,TS val 3225189973 ecr 3535710281], length 5632
21:01:17.917742 IP 172.17.0.3.47834 > 172.217.203.136.443: Flags [.], ack 555221861, win 4087, options [nop,nop,TS val 3535710282 ecr 3225189973], length 0
21:01:17.917903 IP 172.217.203.136.443 > 172.17.0.3.47834: Flags [R], seq 555221861, win 0, length 0

It turns out the exception is telling the truth: the remote end is resetting the connection. So I guess the next step is to figure out where that's coming from. Is it from dl.google.com? If not, is it from some piece of (presumably Google) hardware in the middle? Or from something in the host's network configuration that is then emulating an RST?

I'll run that script from my dev host for a little while. Did you have an idea about how many times you ran it before it failed?

Hexcles commented 4 years ago

I'll run that script from my dev host for a little while. Did you have an idea about how many times you ran it before it failed?

On a freshly started interactive task (using the WPT Docker image), it usually fails within ~5 retries, based on my ~10 experiments with various different IPs. However, if I wait for say 10min and then run the script, I can't observe any failure even after dozens of retries.

djmitche commented 4 years ago

Thanks! I ran it a few hundred times on my ec2 host. I'm now trying to run it in a bare python:2 docker container on the same EC2 host.

imbstack commented 4 years ago

Really great work, @Hexcles. Thanks for making it way easier to track this down.

I was unable to get this to reproduce after running that script for quite a while on a bare host in gcloud based on the machine image that our workers run on.

However, after a mere two tries inside a docker container of ubuntu:latest, it failed with a nice [Errno 104] Connection reset by peer. So that nicely rules out both your wpt docker image and the rest of the host configuration. It points to something to do with our docker setup. We'll look into that shortly.

Before I saw this reproduce, I was looking at your download script and noticed that download_url_to_descriptor doesn't back off between retries, afaict. Maybe until we get this figured out you could decrease the impact on yourselves by adding a couple more retries and a backoff? Totally understand if that doesn't make sense on your end, but we do want to make sure we're out of your way as much as possible.

Hexcles commented 4 years ago

Woohoo! Progress!

Before I saw this reproduce I was looking at your download script I noticed that download_url_to_descriptor doesn't backoff between retries afaict.

I noticed that, too, and we do exponential backoff now (4e6974bfdfe6171d3f35c615ebc1a3c815b852ae). Unfortunately, it doesn't seem to help much.

I can try increasing the initial delay and the number of retries.

djmitche commented 4 years ago

Brian did some further investigation and was able to reproduce quite frequently -- https://community-tc.services.mozilla.com/tasks/groups/UyRo436cTXC-Zez_lMgzPQ (the green are the reproductions). So, that's pretty common!

Some random googling revealed this article. It's a slow start, but gets to some details about conntrack labeling packets as INVALID and not translating them, and that leading to RSTs. It doesn't quite match what we're seeing (the author saw RSTs from the client, while we see them from the server; and the author's network was set up to forward un-translated packets from client to server, whereas in our case such a packet would be dropped).

However,

bstack@wpt-testing:~$ sudo conntrack -S
cpu=0           found=0 invalid=338 ignore=152363 insert=0 insert_failed=0 drop=0 early_drop=0 error=2 search_restart=127 

so there are invalid packets. Those might be just random internet scanning noise, but they might be evidence of something real here.

A few thoughts on how to follow up:

imbstack commented 4 years ago

Just to update: we've reproduced on a host where we captured traces both inside the container and on the host. We're looking into them and will update more tomorrow. Sorry again that this is causing trouble. I hope the increase in retries and wait time will help for the time being!

djmitche commented 4 years ago

Well, there are a bunch of things wrong there. Top three:

First, toward the end of the download, the sequence numbers pushed from dl.google.com break into two streams: [screenshot from 2020-04-23 17-06-14: x axis is seconds, y axis is the sequence number of the first octet in each packet from dl.google.com]. Looking at the packets, each is answered with a DUP-ACK, which is an ACK containing a sack option for a sequence range earlier than the ack value. Basically it's a way of saying "thanks, but you sent me that a long time ago". So the packets in the lower line here were all duplicate segments. I'm guessing that a bunch of packets got delayed 2-3ms and retransmitted, and when the delayed packets finally arrived they were duplicates.

We actually got four copies of each packet, for the most part. For example, here's a data segment:

22:26:19.413470  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 40900: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 35299270:35340102, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320825], length 40832 
22:26:19.413483 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 40900: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35299270:35340102, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320825], length 40832 
22:26:19.413487 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 40900: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35299270:35340102, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320825], length 40832 
22:26:19.413490 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 40898: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35299270:35340102, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320825], length 40832 

(74.x is dl.google.com, 10.138.x is the host IP, and 172.17.x is the container IP). So from the first to the second packet, we see the SNAT reverse translation occurring. The remaining packets are all copies of one another as they pass through docker0, veth, etc. In the below I've separated distinct packets with a newline, and labeled each.

Second, the kernel failed to do its SNAT thing a few times at the very end, eventually killing the connection:

A26
22:26:19.414064  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 26820: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 35556934:35583686, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 26752
22:26:19.414073 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 26820: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35556934:35583686, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 26752
22:26:19.414075 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 26820: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35556934:35583686, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 26752
22:26:19.414076 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 26818: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35556934:35583686, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 26752

Low-seq segment L18 (not forwarded)
22:26:19.414077  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 2884: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 34355910:34358726, ack 789, win 265, options [nop,nop,TS val 2940395383 ecr 3057320821], length 2816

RST for that
22:26:19.414091 Out 42:01:0a:8a:00:0c ethertype IPv4 (0x0800), length 56: 10.138.0.12.60790 > 74.125.195.136.https: Flags [R], seq 821696165, win 0, length 0

A27
22:26:19.414157  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 53572: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 35583686:35637190, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 53504
22:26:19.414174 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 53572: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35583686:35637190, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 53504
22:26:19.414177 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 53572: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35583686:35637190, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 53504
22:26:19.414180 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 53570: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35583686:35637190, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 53504

Low-seq segment L19 (not forwarded)
22:26:19.414182  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 1476: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 34358726:34360134, ack 789, win 265, options [nop,nop,TS val 2940395383 ecr 3057320821], length 1408

RST for that
22:26:19.414186 Out 42:01:0a:8a:00:0c ethertype IPv4 (0x0800), length 56: 10.138.0.12.60790 > 74.125.195.136.https: Flags [R], seq 821696165, win 0, length 0

A28
22:26:19.414188  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 5700: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 35637190:35642822, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 5632
22:26:19.414192 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 5700: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35637190:35642822, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 5632
22:26:19.414193 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 5700: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35637190:35642822, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 5632
22:26:19.414194 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 5698: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35637190:35642822, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 5632

Low-seq segment L20 (not forwarded)
22:26:19.414194  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 1476: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 34360134:34361542, ack 789, win 265, options [nop,nop,TS val 2940395383 ecr 3057320821], length 1408

RST for that
22:26:19.414198 Out 42:01:0a:8a:00:0c ethertype IPv4 (0x0800), length 56: 10.138.0.12.60790 > 74.125.195.136.https: Flags [R], seq 821696165, win 0, length 0

A29
22:26:19.414248  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 21188: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 35642822:35663942, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 21120
22:26:19.414256 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 21188: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35642822:35663942, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 21120
22:26:19.414257 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 21188: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35642822:35663942, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 21120
22:26:19.414258 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 21186: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35642822:35663942, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 21120

A30 (only seen inside docker container)
22:26:19.414311 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 56386: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35663942:35720262, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 56320

A31 (only seen inside docker container)
22:26:19.414354 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 8514: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35720262:35728710, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 8448

A32 (only seen inside docker container)
22:26:19.414406 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 2882: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35728710:35731526, ack 789, win 265, options [nop,nop,TS val 2940395389 ecr 3057320826], length 2816

Low-seq segment L21 (not forwarded)
22:26:19.414565  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 5700: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 34361542:34367174, ack 789, win 265, options [nop,nop,TS val 2940395383 ecr 3057320821], length 5632

RST for that
22:26:19.414572 Out 42:01:0a:8a:00:0c ethertype IPv4 (0x0800), length 56: 10.138.0.12.60790 > 74.125.195.136.https: Flags [R], seq 821696165, win 0, length 0

Low-seq segment L22 (not forwarded)                                                                                                                                                                                                                                                                                            
22:26:19.414742  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 1476: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 34367174:34368582, ack 789, win 265, options [nop,nop,TS val 2940395383 ecr 3057320821], length 1408

RST for that
22:26:19.414746 Out 42:01:0a:8a:00:0c ethertype IPv4 (0x0800), length 56: 10.138.0.12.60790 > 74.125.195.136.https: Flags [R], seq 821696165, win 0, length 0

ACK for A32
22:26:19.416605 02:42:ac:11:00:02 > 02:42:4b:3a:62:d8, ethertype IPv4 (0x0800), length 66: 172.17.0.2.60790 > 74.125.195.136.https: Flags [.], ack 35731526, win 1408, options [nop,nop,TS val 3057320829 ecr 2940395388], length 0
22:26:19.416608  In 02:42:ac:11:00:02 ethertype IPv4 (0x0800), length 68: 172.17.0.2.60790 > 74.125.195.136.https: Flags [.], ack 35731526, win 1408, options [nop,nop,TS val 3057320829 ecr 2940395388], length 0
22:26:19.416608   P 02:42:ac:11:00:02 ethertype IPv4 (0x0800), length 68: 172.17.0.2.60790 > 74.125.195.136.https: Flags [.], ack 35731526, win 1408, options [nop,nop,TS val 3057320829 ecr 2940395388], length 0
22:26:19.416626 Out 42:01:0a:8a:00:0c ethertype IPv4 (0x0800), length 68: 10.138.0.12.60790 > 74.125.195.136.https: Flags [.], ack 35731526, win 1408, options [nop,nop,TS val 3057320829 ecr 2940395388], length 0

RST replying to ACK
22:26:19.416715  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 56: 74.125.195.136.https > 10.138.0.12.60790: Flags [R], seq 3900322453, win 0, length 0
22:26:19.416732 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 56: 74.125.195.136.https > 172.17.0.2.60790: Flags [R], seq 3900322453, win 0, length 0
22:26:19.416734 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 56: 74.125.195.136.https > 172.17.0.2.60790: Flags [R], seq 3900322453, win 0, length 0
22:26:19.416735 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 54: 74.125.195.136.https > 172.17.0.2.60790: Flags [R], seq 3900322453, win 0, length 0

Here the Axx packets have "normal" data segments and the Lxx packets have data segments with low sequence numbers -- Basically Axx is the high line of dots in the graph, and Lxx is the low line. Up until now, the kernel has been happily forwarding both kinds of packets to the container, so L17 was forwarded just fine. A26 is forwarded just fine. But L18 is not, and instead the kernel replies to dl.google.com with an RST. This repeats through L22, with A27-A29 being forwarded just fine in the interim. Finally, there's an ack for A32 sent to dl.google.com, and dl.google.com sensibly replies with an RST since we've already sent it a bunch of RSTs. This final RST from dl.google.com is what causes the ECONNRESET that we see in the socket API.

Third, a few packets arrived in the docker container without being captured by the tcpdump outside the container. Just before the RST was received, A30-A32 are only seen in the internal dump. So clearly something weird happened that caused pcap to miss those packets.

Long story short, there's a network hiccup that should be well within TCP's ability to recover, but the kernel doesn't handle it well and eventually gets confused and shuts things down.

Hexcles commented 4 years ago

So what's our takeaway here?

IIUC, this sounds somewhat similar to https://github.com/moby/libnetwork/issues/1090, where spurious retransmits outside the TCP window cannot be NAT'ed into the container, so the host considers them invalid and sends RSTs.

Shall we try one of the workarounds on the host?

imbstack commented 4 years ago

Hey @Hexcles,

Yeah, I think it is time to just try one of the workarounds you mentioned. I'm in the process of baking an image with the offloading settings you mentioned disabled now. If it helps in my testing, you can try using it asap.

p.s. for my own learning for next time, how did you figure out that TCP segmentation offloading was the thing to look at? Is there a troubleshooting doc you found it in?

imbstack commented 4 years ago

Ah, now that I read the bug you linked to tonight I see what they suggest to do there. I'll bake an image with that and send it your way tonight or tomorrow!

Hexcles commented 4 years ago

@imbstack oh I'm sorry that wasn't clear. I was referring to the workarounds in https://github.com/moby/libnetwork/issues/1090, which seem to directly tackle the symptom that we are seeing -- either accepting the spurious retransmits into the Docker container (hopefully it will handle them better) or dropping them completely on the host to stop the host from sending RSTs.