weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0
6.61k stars 668 forks source link

DNS resolution fails for some resolvers that issue A and AAAA queries in parallel #2288

Open timcharper opened 8 years ago

timcharper commented 8 years ago

In short, this feature request is motivated by certain DNS clients (IE Go DNS) appearing to be doing the wrong thing. When I run docker push, the Go DNS lib is sending AAAA and A in parallel. Even though Weave DNS (as of 1.5.1) is responding correctly with No answer, if AAAA comes back first, then docker push aborts with a DNS resolution failure and doesn't wait for the A response which comes literally nanoseconds later. I've confirmed this by using tcpdump, and have filed the following issue: https://github.com/docker/docker/issues/22673 . However, I'm unable to find anywhere in an RFC where Go's DNS client behavior is wrong.

The closest thing to an authoritative document that I could find WRT to DNS IP6/4 parallel query behavior is this memo https://www.ietf.org/rfc/rfc4472.txt, which says this:

First, let us consider generic implications of unnecessary queries for AAAA records: when looking up all the records in the DNS, AAAA records are typically tried first, and then A records. These are done in serial, and the A query is not performed until a response is received to the AAAA query. Considering the misbehavior of DNS servers and load-balancers, as described in Section 3.1, the lookup delay for AAAA may incur additional unnecessary latency, and introduce a component of unreliability.

One option here could be to do the queries partially in parallel; for example, if the final response to the AAAA query is not received in 0.5 seconds, start performing the A query while waiting for the result. (Immediate parallelism might not be optimal, at least without information-sharing between the lookup threads, as that would probably lead to duplicate non-cached delegation chain lookups.)

An additional concern is the address selection, which may, in some circumstances, prefer AAAA records over A records even when the node does not have any IPv6 connectivity [WIP-RDP2004]. In some cases, the implementation may attempt to connect or send a datagram on a physical link [WIP-R2006], incurring very long protocol time-outs, instead of quickly falling back to IPv4.

RFC 6147 lays out a list of DNS extensions and mentions the rules for synthesizing AAAA records when one is not available. It would be nice if Weave DNS implemented this and returned 6to4 response for AAAA queries for weave DNS entries; that would help resolve issues with seemingly poorly behaving DNS clients. It seems like Go's DNS resolution behavior, while it's causing me problems here, may not be wrong, and there are probably other reasons it was implemented as it was.

Another potential solution here would be to delay IPv6 no-answer responses. Or, ask the Go authors to please modify / fix their DNS client behavior.

rade commented 8 years ago

My reading of the Go DNS parallel query code is that if either the A or AAAA query fail or return NODATA then the answers from the the other should still be returned.

Perhaps the docker client ends up using the system resolver on your system for some reason, and that resolver is faulty?

timcharper commented 8 years ago

Huh, from what I've read, you can control the DNS resolver Go compiled programs will use with the environment variable GODEBUG=netdns=go or GODEBUG=netdns=cgo.

Here's the output from running the following command:

$ while sleep 1; do GODEBUG=netdns=go+1 docker push "registry.weave.local:5000/grafana:2.6.0"; done
... snip ...
go package net: GODEBUG setting forcing use of Go's resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: GODEBUG setting forcing use of Go's resolver
The push refers to a repository [registry.weave.local:5000/grafana]
Put http://registry.weave.local:5000/v1/repositories/grafana/: dial tcp: lookup registry.weave.local: no such host
go package net: GODEBUG setting forcing use of Go's resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: GODEBUG setting forcing use of Go's resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: GODEBUG setting forcing use of Go's resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: GODEBUG setting forcing use of Go's resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: GODEBUG setting forcing use of Go's resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: GODEBUG setting forcing use of Go's resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: GODEBUG setting forcing use of Go's resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Retrying in 2 seconds
713084fc5ed4: Retrying in 2 seconds
5f70bf18a086: Retrying in 2 seconds
917c0fc99b35: Retrying in 2 seconds

Note the sporadic Put http://registry.weave.local:5000/v1/repositories/grafana/: dial tcp: lookup registry.weave.local: no such host. That happens every 10-20 pushes.

I've run tcpdump and monitored it via Wireshark to watch the network traffic. There is absolutely zero packet drop. I can ping registry.weave.local all day long and there isn't a single packet drop. There is no evidence that we have any packet drops anywhere, except WRT to DNS resolution from docker related interactions. I have confirmed that the times that docker push fails, both the AAAA record and the A record were returned properly by Weave DNS, but AAAA comes first. Every time it succeeds, the A record comes first.

I get the same failure when I force the cgo resolver:

$ while sleep 1; do GODEBUG=netdns=cgo+1 docker push "registry.weave.local:5000/grafana:2.6.0"; done
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
Put http://registry.weave.local:5000/v1/repositories/grafana/: dial tcp: lookup registry.weave.local: no such host
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132

¯(ツ)

So weird.

timcharper commented 8 years ago

I'm running Weave 1.5.1 and it does seem to be responding to queries properly.

$ nslookup registry.weave.local
Server:     172.30.31.51
Address:    172.30.31.51#53

Name:   registry.weave.local
Address: 10.42.128.8

$ nslookup -query=AAAA registry.weave.local
Server:     172.30.31.51
Address:    172.30.31.51#53

*** Can't find registry.weave.local: No answer
timcharper commented 8 years ago

Something tremendously strange is happening here. I modified the hosts DNS settings such that it would no longer include Weave DNS, and it would still succeed at times.

$ while sleep 1; do GODEBUG=netdns=cgo+1 docker push "registry.weave.local:5000/grafana:2.6.0"; done
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
Head http://registry.weave.local:5000/v2/grafana/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4: dial tcp: lookup registry.weave.local: no such host
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
Put http://registry.weave.local:5000/v1/repositories/grafana/: dial tcp: lookup registry.weave.local: no such host
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
go package net: using cgo DNS resolver
The push refers to a repository [registry.weave.local:5000/grafana]
782a3ab2ff84: Layer already exists
713084fc5ed4: Layer already exists
5f70bf18a086: Layer already exists
917c0fc99b35: Layer already exists
2.6.0: digest: sha256:d1a24ffb6667d288af81ddafbc14681ea7119274a42b0d00842633851031a886 size: 1132
^C
$ nslookup -query=AAAA registry.weave.localServer:      172.30.11.41
Address:    172.30.11.41#53

** server can't find registry.weave.local: NXDOMAIN

$ nslookup -query=A registry.weave.local
Server:     172.30.11.41
Address:    172.30.11.41#53

** server can't find registry.weave.local: NXDOMAIN

$ ping registry.weave.local
ping: unknown host registry.weave.local
timcharper commented 8 years ago

I still get failures if I put the ip4 entry in /etc/hosts. This is most likely a docker problem and may not even be resolved by the solution I propose. Feel free to close.

timcharper commented 8 years ago

I've switched to use mesos-DNS instead of weave-DNS to point to the docker registry, and it seems to be working reliably. I'm going to run wireshark to see if I can figure out why; it may be that mesos-DNS delays AAAA records to reduce the race condition.

On my issue submitted to Docker, somebody reported seeing this AAAA before A record behavior with Dockerhub itself. So, evidence is shaping up.

awh commented 8 years ago

@timcharper thanks for keeping us apprised!

rade commented 8 years ago

@timcharper any news on this?

timcharper commented 8 years ago

Hi, @rade . I've switched to use mesos-DNS over weave-DNS and it seems to fail with much less frequency. I haven't run a wireshark to see exactly how mesos-DNS is responding. The Docker folks seem to agree that there is a bug in the resolver somewhere.

Weave-DNS is working without any issues for other processes. It's just docker push and docker pull, for some reason.

rade commented 8 years ago

@timcharper If you determine the difference in behaviour then we can conceivably adapt weaveDNS - it wouldn't be the first time that we work around a quirk in Docker.

timcharper commented 8 years ago

@rade I've attempted to determine the behavioral difference. Here's what I've learned:

At any rate, even still, when I push with registry.weave.local, I still get random DNS lookup failures from docker engine. When I push with registry.marathon.mesos, it works every time, without fail. This discrepancy is tremendously clear. The specific cause I cannot infer... perhaps the bug manifests with 1 second TTL?

bboreham commented 6 years ago

I wonder if this was caused by the kernel race condition discussed at #3287