weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

DNS lookup timeouts due to races in conntrack #3287

Open dcowden opened 6 years ago

dcowden commented 6 years ago

What happened?

We are experiencing random 5 second DNS timeouts in our kubernetes cluster.

How to reproduce it?

It is reproducible by requesting just about any in-cluster service and observing that periodically (in our case, 1 out of every 50 or 100 requests) we get a 5 second delay. It always happens during the DNS lookup.
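
For illustration, a loop like the following makes the pattern easy to see (the service name is just a placeholder); most lookups complete in milliseconds, while the affected ones stall for about 5 seconds:

# Hypothetical reproduction loop; any in-cluster service URL will do.
# time_namelookup isolates the DNS portion of each request.
for i in $(seq 1 100); do
  curl -s -o /dev/null -w '%{time_namelookup}\n' http://some-service.default.svc.cluster.local/
done | sort -n | tail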

Anything else we need to know?

We believe this is the result of a kernel-level SNAT race condition that is described quite well here:

https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

The problem also happens with non-weave CNI implementations, and is (ironically) not really a weave issue at all. However, it becomes a weave issue, because the fix is to set a flag on the masquerading rules that are created, and those rules are not in anyone's control except for weave.

What we need is the ability to apply the NF_NAT_RANGE_PROTO_RANDOM_FULLY flag on the masquerading rules that weave sets up. In the post above, Flannel was in use, and the fix was applied there instead.

We searched for this issue and didn't see that anyone had asked for it. We're also unaware of any setting that allows this flag to be applied today-- if that's possible, please let us know.
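
For illustration, this is roughly what we mean -- a sketch only, not weave's actual rule set (weave manages its own chains), requiring a recent enough iptables/kernel; 10.32.0.0/12 is weave's default allocation range:

# Hypothetical masquerade rule with full source-port randomization enabled
iptables -t nat -A POSTROUTING -s 10.32.0.0/12 ! -d 10.32.0.0/12 -j MASQUERADE --random-fully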

brb commented 6 years ago

@dcowden

Based on the provided traces (https://github.com/weaveworks/weave/files/1975806/foo.pcap.gz), the following is happening:

  1. The glibc resolver uses the same UDP socket for parallel queries (A and AAAA). As UDP is a connectionless protocol, calling connect(2) does not send any packet => no entry is created in the conntrack hash table.
  2. The kube-dns service is accessible via a VIP which is backed by iptables DNAT rules. The relevant ones from the nat table in your case:
-A KUBE-SERVICES -d 100.64.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU
<..>
-A KUBE-SVC-TCOU7JCQXEZGVUNU -m comment --comment "kube-system/kube-dns:dns" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-JILKODJ63HVFF6B2
-A KUBE-SVC-TCOU7JCQXEZGVUNU -m comment --comment "kube-system/kube-dns:dns" -j KUBE-SEP-LFXGESA25DLV4HVG
<..>
-A KUBE-SEP-JILKODJ63HVFF6B2 -p udp -m comment --comment "kube-system/kube-dns:dns" -m udp -j DNAT --to-destination 100.117.128.12:53
-A KUBE-SEP-LFXGESA25DLV4HVG -p udp -m comment --comment "kube-system/kube-dns:dns" -m udp -j DNAT --to-destination 100.99.128.9:53
  3. During DNAT translation, the kernel calls the relevant netfilter hooks in the following order:
    1. nf_conntrack_in: creates the conntrack hash object and adds it to the unconfirmed entries list.
    2. nf_nat_ipv4_fn: does the translation and updates the conntrack tuple.
    3. nf_conntrack_confirm: confirms the entry and adds it to the hash table.
  4. The two parallel UDP requests (518 and 524 in the pcap) race for the entry confirmation. Additionally, they end up using different DNS endpoints. 518 wins the race, while 524 loses. Due to the latter, the insert_failed counter is incremented (check with conntrack -S) and the request is dropped => you get the timeout.

As I mentioned above, the --random-fully flag does not help here, as it's only for SNAT, which is not the culprit in your case.

@Quentin-M

As you use the ipvs backend, I'm curious to see your iptables-save output.

brb commented 6 years ago

@dcowden

I suspect this triggers the DNAT issue.

Could you verify this by checking insert_failed counter value with conntrack -S?
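
For example, something like this should show the counter climbing while you reproduce the timeouts (run it on the node, e.g. inside the weave container):

# Sum insert_failed across CPUs; re-run while reproducing the timeouts.
conntrack -S | grep -o 'insert_failed=[0-9]*' | awk -F= '{sum+=$2} END {print sum}'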

dcowden commented 6 years ago

@brb Will do, but I won't be able to do it until later this week. That said, I can't imagine that your analysis is wrong-- what do you think the fix is, and/or can you suggest workarounds?

I'm honestly shocked that most of the internet isn't saying 'well, kubernetes is great, but you're going to have packet loss issues'.

We do not use UDP for much other than DNS, so one idea I've been thinking about is to somehow run kube-dns as a daemonset with hostNetwork=true, thus removing some of the NAT. But I think that'd be hard to do with kops, because kops bundles the kube-dns manifests and we'd have to override them.

And even so, that'd be a workaround (although it is very reasonable to assume that DNS is the only UDP protocol that would expose this race condition so frequently).

Another workaround, based on your analysis, would be to avoid using a VIP and instead point pods directly at the individual kube-dns pods via round-robin A records. I'm not sure if that configuration is possible.

bboreham commented 6 years ago

@brb huge kudos for figuring that out!

@dcowden when I used to do electronic trading for a living I would find that people live with the most egregious network problems and never think "this is really broken". A tiny minority of people care enough to look at what is really going on.

Also the issue is sensitive to what exact technologies you use - for instance at Weaveworks we write most services in Go so they don't use the glibc resolver.

run kube-dns as a daemonset

so it would be on every host, and could be addressed using the host's own IP? I've seen discussions along those lines; unfortunately changing resolv.conf to point at that IP requires a Kubernetes change.

instead configure pods to use the individual pods with a round robin cluster ip A records.

Not really following this suggestion. AFAIK resolv.conf has to have the IP addresses of servers, not DNS names. If we could get Kubernetes to keep the IP addresses of kube-dns pods static across restarts, that would be plausible, but not currently a feature.

dcowden commented 6 years ago

so it would be on every host, and could be addressed using the host's own IP? I've seen discussions along those lines; unfortunately changing resolv.conf to point at that IP requires a Kubernetes change.

We are already using a hack that updates resolv.conf on pod start in our container entrypoint to add options single-request-reopen. We would need to use that in combination with the downward API to inject the host IP. It stinks, but maybe it would work?
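
Something along these lines -- a sketch only, where NODE_IP is a hypothetical variable we would inject via the downward API (status.hostIP):

#!/bin/sh
# Entrypoint wrapper sketch: serialize A/AAAA lookups, and optionally point
# resolution at the node-local DNS if kube-dns were running with hostNetwork.
echo "options single-request-reopen" >> /etc/resolv.conf
if [ -n "$NODE_IP" ]; then
  # NODE_IP injected via the downward API (status.hostIP) -- hypothetical setup
  sed -i "s/^nameserver .*/nameserver $NODE_IP/" /etc/resolv.conf
fi
exec "$@"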

Not really following this suggestion. AFAIK resolv.conf has to have the IP addresses of servers, not DNS names. If we could get Kubernetes to keep the IP addresses of kube-dns pods static across restarts, that would be plausible, but not currently a feature.

Yeah, you're right, there would be no way to assign static IPs to the pods to make this work.

dcowden commented 6 years ago

@brb Yes, it appears to be the case. Below is the output on the same host on which the tests above ran.

/home/weave # conntrack -S
cpu=0       searched=630288 found=15093196 new=1346365 invalid=34 ignore=647629 delete=1408965 delete_list=1408867 insert=1344752 insert_failed=92 drop=0 early_drop=0 error=0 search_restart=0 
cpu=1       searched=846871 found=28126666 new=1919780 invalid=74 ignore=650870 delete=1855000 delete_list=1854877 insert=1921172 insert_failed=107 drop=0 early_drop=0 error=0 search_restart=0 
cpu=2       searched=0 found=0 new=0 invalid=0 ignore=0 delete=0 delete_list=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 
cpu=3       searched=0 found=0 new=0 invalid=0 ignore=0 delete=0 delete_list=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 
cpu=4       searched=0 found=0 new=0 invalid=0 ignore=0 delete=0 delete_list=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 
cpu=5       searched=0 found=0 new=0 invalid=0 ignore=0 delete=0 delete_list=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 
cpu=6       searched=0 found=0 new=0 invalid=0 ignore=0 delete=0 delete_list=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 
cpu=7       searched=0 found=0 new=0 invalid=0 ignore=0 delete=0 delete_list=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 
cpu=8       searched=0 found=0 new=0 invalid=0 ignore=0 delete=0 delete_list=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 
cpu=9       searched=0 found=0 new=0 invalid=0 ignore=0 delete=0 delete_list=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 
cpu=10      searched=0 found=0 new=0 invalid=0 ignore=0 delete=0 delete_list=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 
cpu=11      searched=0 found=0 new=0 invalid=0 ignore=0 delete=0 delete_list=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 
cpu=12      searched=0 found=0 new=0 invalid=0 ignore=0 delete=0 delete_list=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 
cpu=13      searched=0 found=0 new=0 invalid=0 ignore=0 delete=0 delete_list=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 
cpu=14      searched=0 found=0 new=0 invalid=0 ignore=0 delete=0 delete_list=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 
brb commented 6 years ago

@dcowden What is your CentOS and kernel version?

Quentin-M commented 6 years ago

For what it's worth, I use the latest Container Linux, and yes, insert_failed increases systematically by 1 every time I send a DNS request.

dcowden commented 6 years ago

@brb

[root@ip-172-25-83-254 ~]# more /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

[root@ip-172-25-83-254 ~]# uname -a
Linux ip-172-25-83-254.colinx.com 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Quentin-M commented 6 years ago

I would just like to add here that the single-request(-reopen) workaround does not work with Alpine-based containers, as musl does not support the option (see below). Unfortunately, Alpine Linux is the base image for 90% of our infrastructure.

src/network/resolvconf.c

                if (!strncmp(line, "options", 7) && isspace(line[7])) {
                        p = strstr(line, "ndots:");
                        if (p && isdigit(p[6])) {
                                p += 6;
                                unsigned long x = strtoul(p, &z, 10);
                                if (z != p) conf->ndots = x > 15 ? 15 : x;
                        }
                        p = strstr(line, "attempts:");
                        if (p && isdigit(p[9])) {
                                p += 9;
                                unsigned long x = strtoul(p, &z, 10);
                                if (z != p) conf->attempts = x > 10 ? 10 : x;
                        }
                        p = strstr(line, "timeout:");
                        if (p && (isdigit(p[8]) || p[8]=='.')) {
                                p += 8;
                                unsigned long x = strtoul(p, &z, 10);
                                if (z != p) conf->timeout = x > 60 ? 60 : x;
                        }
                        continue;
                }

src/network/lookup.h

struct resolvconf {
        struct address ns[MAXNS];
        unsigned nns, attempts, ndots;
        unsigned timeout;
};

I have reached out on freenode's #musl channel, but unfortunately it does not seem like there is much desire to add support for the option:

[16:19] <dalias> why not fix the bug causing it?
[16:20] <dalias> sprry
[16:20] <dalias> the option is not something that can be added, its contrary to the lookup architecture
[17:39] <dalias> quentinm, thanks for the report. i just don't know any good way to work around it on our side without nasty hacks
[17:40] <dalias> the architecture is not designed to support sequential queries
Quentin-M commented 6 years ago

@dcowden @bboreham @brb @xiaoxubeii

For what it's worth: I simply switched a two-node cluster that was broken (5s latency for every single curl, except when single-request was used) from the latest weave to calico 2.6, and the issue went away immediately. None of my pods experience the DNS issue where AAAA packets get dropped anymore.

I will be happy to grant access to a cluster where the issue is present if that means we will get some help 💯

dcowden commented 6 years ago

@Quentin-M thanks for the report. We'll try this next. For now we're working around it, but it's annoying to say the least! Our problem is that calico doesn't support encryption on the cluster overlay. Weave does this better than any of the others, so I hope we can keep using weave!

Quentin-M commented 6 years ago

@dcowden @bboreham @brb @xiaoxubeii

Another very interesting note: when FASTDP is disabled (but encryption is still on), the issue also disappears. I tested this on 4 clusters, with regular and jumbo MTUs.

bboreham commented 6 years ago

How exactly did you disable fastdp?

brb commented 6 years ago

Another very interesting note, when FASTDP is disabled

My guess is that due to the slower nature of the sleeve mode, the races are less likely to happen, but they are not completely eliminated.

brb commented 6 years ago

@Quentin-M

For what it's worth: I simply switched a two-nodes cluster that was broken (5s latency for every single curl, except when single-request was used), from the latest weave to calico 2.6, and the issue went away immediately.

That's interesting. Do you use the IP-in-IP tunneling with Calico?

Quentin-M commented 6 years ago

@brb @bboreham

How exactly did you disable fastdp?

Once, I simply dropped the following into Weave's manifest, ran a reset and let Kubernetes roll Weave out again. Later, I did the same thing but also killed all the pods. And another time, I edited the manifest, then killed all the nodes, letting new identical ones come back with fresh configuration/networking, re-scheduling all the pods. Every time, I verified using weave --local status connections.

        - name: WEAVE_NO_FASTDP
          value: "true"

That's interesting. Do you use the IP-in-IP tunneling with Calico?

Yes, IP-in-IP set to always. Happy to drop the manifest if necessary.

My guess is that due to slower nature of the sleeve mode races are less likely to happen, but not completely unavoidable.

That was one of my ideas too, yeah. Calico is supposedly "pretty fast" as well, even in IPIP (I believe it is done in the kernel too), but the timing might be just different enough to avoid it. Or the problem is different.

Thank you.

Quentin-M commented 6 years ago

When a single pod is used to wget/curl a target, a tc policy that delays every other DNS datagram by, say, 10ms seems to alleviate the issue entirely: netem gap 2 delay 10ms reorder 100%. However, this may not work as well when multiple pods are making requests, since the policy applies to the whole node and may therefore introduce the delay not between the two parallel A/AAAA datagrams coming out of a single pod, but between two A requests from different pods. This may actually not be true and it may work properly depending on how SNAT/DNAT/conntrack operates, but I am not expert enough.
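
For reference, the shaping I experimented with looks roughly like this -- a simplified sketch, assuming the tc policy is attached to the weave bridge (interface name and classification differ in the actual script):

# Sketch only: create a prio qdisc, attach the netem policy to one band,
# and steer UDP port 53 into that band with a u32 filter so that regular
# traffic stays in the other bands.
tc qdisc add dev weave root handle 1: prio
tc qdisc add dev weave parent 1:3 handle 30: netem gap 2 delay 10ms reorder 100%
tc filter add dev weave parent 1: protocol ip u32 \
    match ip protocol 17 0xff match ip dport 53 0xffff flowid 1:3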

Another interesting rule is to add a random delay to every single DNS datagram going out, but this does not work 100% of the time, even with a single pod making requests, as the two A/AAAA datagrams may be sent with delays that are close enough to each other that the race still happens. There might be a smart thing to do here to make it work reliably. Maybe rate control.

The traffic shaping may be applied to DNS requests only, using filters, but due to the low-level nature of the issue, the drops may also happen to any UDP traffic on the network. We are, for example, about to migrate major graphite/statsd clusters, which send a high volume of UDP datagrams, and I am worried the issue will also occur there and become much more problematic, especially as the datagrams would have to be shaped on the ingress side.

Quentin-M commented 6 years ago

Here is the workaround we are about to start using: https://github.com/Quentin-M/weave-tc/blob/master/weave-tc.sh, which seems to reduce the likelihood of the race significantly. Using it is as simple as adding the following container to the weave DaemonSet:

        - name: weave-tc
          image: 'qmachu/weave-tc:0.0.1'
          securityContext:
            privileged: true
          volumeMounts:
            - name: xtables-lock
              mountPath: /run/xtables.lock
            - name: lib-tc
              mountPath: /lib/tc
bboreham commented 6 years ago

Is there really nothing that the weave team can do

What we're doing is gathering data to understand the issue(s) and analyzing it. Sorry if this comes across as "nothing".

dcowden commented 6 years ago

@Quentin-M holy cow, man, we will try that solution out and see if it works for us. What side effects should we watch out for?

It's been a long time since I have read a shell script that was so far over my head... that's some highly impressive work!

thomaschaaf commented 6 years ago

@Quentin-M I am getting No distribution data for pareto (/lib/tc//pareto.dist: No such file or directory). Does the host need to have something installed as well? What should lib-tc point to on the host? Maybe you can provide your DaemonSet YAML for me to compare :)

Quentin-M commented 6 years ago

@thomaschaaf Absolutely!

I mount /run/xtables.lock and /lib/tc. The pareto distribution table should already be on the host; it is part of iproute2, which is essentially the same everywhere.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: weave-net
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: system:weave-net
  namespace: kube-system
rules:
  - apiGroups:
      - ''
    resources:
      - pods
      - namespaces
      - nodes
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - networking.k8s.io
    resources:
      - networkpolicies
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: system:weave-net
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:weave-net
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: ServiceAccount
    name: weave-net
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  name: system:weave-net
  namespace: kube-system
rules:
  - apiGroups:
      - ''
    resourceNames:
      - weave-net
    resources:
      - configmaps
    verbs:
      - get
      - update
  - apiGroups:
      - ''
    resources:
      - configmaps
    verbs:
      - create
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: system:weave-net
  namespace: kube-system
roleRef:
  kind: Role
  name: system:weave-net
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: ServiceAccount
    name: weave-net
    namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: weave-net
  namespace: kube-system
  labels:
    k8s-app: weave-net
spec:
  selector:
    matchLabels:
      k8s-app: weave-net
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        k8s-app: weave-net
    spec:
      containers:
        - name: weave
          command:
            - /home/weave/launch.sh
          env:
            - name: WEAVE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: weave-password
                  key: password
            - name: WEAVE_MTU
              value: '8912'
            - name: IPALLOC_RANGE
              value: '172.16.0.0/16'
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
          image: 'weaveworks/weave-kube:2.3.0'
          livenessProbe:
            httpGet:
              host: 127.0.0.1
              path: /status
              port: 6784
            initialDelaySeconds: 30
          securityContext:
            privileged: true
          volumeMounts:
            - name: weavedb
              mountPath: /weavedb
            - name: cni-bin
              mountPath: /host/opt
            - name: cni-bin2
              mountPath: /host/home
            - name: cni-conf
              mountPath: /host/etc
            - name: dbus
              mountPath: /host/var/lib/dbus
            - name: lib-modules
              mountPath: /lib/modules
            - name: xtables-lock
              mountPath: /run/xtables.lock
        - name: weave-npc
          args: ['--metrics-addr=0.0.0.0:6781']
          env:
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
          image: 'weaveworks/weave-npc:2.3.0'
          securityContext:
            privileged: true
          volumeMounts:
            - name: xtables-lock
              mountPath: /run/xtables.lock
        - name: weave-tc
          image: 'qmachu/weave-tc:0.0.1'
          securityContext:
            privileged: true
          volumeMounts:
            - name: xtables-lock
              mountPath: /run/xtables.lock
            - name: lib-tc
              mountPath: /lib/tc
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      securityContext:
        seLinuxOptions: {}
      serviceAccountName: weave-net
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
        - key: CriticalAddonsOnly
          operator: Exists
      volumes:
        - name: weavedb
          hostPath:
            path: /var/lib/weave
        - name: cni-bin
          hostPath:
            path: /opt
        - name: cni-bin2
          hostPath:
            path: /home
        - name: cni-conf
          hostPath:
            path: /etc
        - name: dbus
          hostPath:
            path: /var/lib/dbus
        - name: lib-modules
          hostPath:
            path: /lib/modules
        - name: xtables-lock
          hostPath:
            path: /run/xtables.lock
        - name: lib-tc
          hostPath:
            path: /lib/tc
---
apiVersion: v1
kind: Secret
metadata:
  name: weave-password
  namespace: kube-system
type: Opaque
data:
  password: {{ .weave.password }}
thomaschaaf commented 6 years ago

@Quentin-M For some reason /lib/tc does not exist on my nodes (Debian Jessie, installed with kops using k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08).

Quentin-M commented 6 years ago

@thomaschaaf According to https://packages.debian.org/jessie/amd64/iproute2/filelist, you would be using /usr/lib/tc/ instead (and pareto.dist is indeed in there).
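
A quick way to check where a given distro puts the distribution tables, and therefore what the lib-tc hostPath should point at:

# Locate the netem distribution tables shipped with iproute2 on the host
find / -name 'pareto.dist' 2>/dev/null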

dcowden commented 6 years ago

@bboreham do you have any more insights on this issue? It seems like every day I come across another thread talking about DNS timeouts here or there. It feels like a 'dirty little secret' at this point :)

bboreham commented 6 years ago

No, no particular insight. I'm trying to cross-fertilise the conversations in the hope someone shows up and says "this is all very clear to me".

dcowden commented 6 years ago

@bboreham I see, yes, that's the open source slogan, right? "Given enough eyes, every problem is trivial." Thanks for your continued work. Let me know if there's something I can test that would be helpful.

I'll try @Quentin-M 's fix and report back.

jsravn commented 6 years ago

so it would be on every host, and could be addressed using the host's own IP? I've seen discussions along those lines; unfortunately changing resolv.conf to point at that IP requires a Kubernetes change.

You can do this already with --resolv-conf passed to kubelet. Run a dnsmasq daemonset that proxies all DNS queries to kube-dns, using host networking and listening on all interfaces. This reduces the DNS problems substantially.
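
Roughly, and not our exact setup (addresses and flag values are illustrative), each node runs something like:

# Node-local DNS proxy sketch: listen on an address pods can reach
# (e.g. the docker bridge) and forward everything to the kube-dns VIP,
# keeping a small cache so most lookups never leave the node.
dnsmasq --keep-in-foreground \
        --listen-address=172.17.0.1 \
        --server=100.64.0.10 \
        --cache-size=1000

kubelet is then started with --resolv-conf pointing at a file whose nameserver line is 172.17.0.1 (the file path is up to you).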

bboreham commented 6 years ago

As I understand it, --resolv-conf is a single setting for all pods, thus removing the ability to find services in the same namespace as the current pod.

That is what I meant by "requires a Kubernetes change" - to change the DNS server address without giving up any other features. If you don't need those features it's an option.

klausenbusk commented 6 years ago

As I understand it, --resolv-conf is a single setting for all pods, thus removing the ability to find services in the same namespace as the current pod.

If you just need to change the DNS server IP you can use --cluster-dns.

jsravn commented 6 years ago

As I understand it, --resolv-conf is a single setting for all pods, thus removing the ability to find services in the same namespace as the current pod.

The generated search domains and options are preserved; kubelet only takes the nameservers from the --resolv-conf file, AFAIK. That's how we set it up.

bboreham commented 6 years ago

What DNS IP do you use that always resolves to the local host?

klausenbusk commented 6 years ago

What DNS IP do you use that always resolves to the local host?

You can probably use the local docker bridge IP (172.17.0.1).

jsravn commented 6 years ago

The address of the docker interface. This is probably setup-dependent. I think you could use any interface on the host that is routable from pods (so not the loopback).

dcowden commented 6 years ago

@jsravn I would like to learn more about your setup. Do you by chance use kops?

I would like to see your dnsmasq daemonset manifest if you are willing to post it. My understanding is that kops already runs dnsmasq as a container in its default kube-dns pod, so we would have to figure out how to disable that in a way that doesn't get undone when we use kops to update the cluster.

jsravn commented 6 years ago

@dcowden You wouldn't touch the kube-dns pod; it still runs dnsmasq. The local dnsmasq caches all local queries on the node. The benefits are that the cache is localised, you can bypass kube-dns completely for external queries if you want (we do this), and it's more resilient to outages. I can't give you the exact daemonset at the moment, but it shouldn't be so hard: you need to set up host networking and configure dnsmasq to listen on the local docker bridge. The trickier part is configuring kubelet with --resolv-conf, since that won't be easy in hosted solutions like GKE. In this case, it would be nice if k8s had a runtime API for configuring the DNS setup (which it doesn't, AFAIK). You could probably do it with a custom iptables rule to intercept DNS requests and transparently route them to your local dnsmasq via DNAT - this would be done as part of the daemonset. That is feeling pretty hacky though.
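
For what it's worth, the hacky variant would be roughly the following (addresses illustrative, and I haven't tested this exact form):

# Transparently redirect pod DNS traffic arriving on the node to the
# node-local dnsmasq. Locally generated traffic (the proxy's own upstream
# queries) goes through OUTPUT rather than PREROUTING, so it is not caught.
iptables -t nat -A PREROUTING -p udp --dport 53 -j DNAT --to-destination 172.17.0.1:53
iptables -t nat -A PREROUTING -p tcp --dport 53 -j DNAT --to-destination 172.17.0.1:53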

(Apologies if I've taken this issue off topic - feel free to contact me on kubernetes slack if you want to discuss further ideas)

dcowden commented 6 years ago

@jsravn thanks for this tip. I hadn't thought of this approach, but it has a number of benefits-- for example, it makes it much more straightforward to work in a split-DNS corporate environment.

jaredallard commented 6 years ago

So, as far as I can tell from this thread, there isn't really a solution yet aside from some of these workarounds is there?

bboreham commented 6 years ago

Not only is there not a solution, we don't know which of the various theories about the problem is most important in practice.

jaredallard commented 6 years ago

@bboreham Understandable. We've been migrating to Kubernetes and haven't really had any CNI work for us; every single one appears to either have high latency or kube-dns issues. It's just a bit frustrating since clearly other people are able to make Kubernetes work. Hopefully we're able to diagnose which theory is most "important" and/or what has been causing these issues.

dcowden commented 6 years ago

@jaredallard I agree with your assessment. For us, using standard networking doesn't work because we require encryption between nodes-- which is hard to set up on bare metal, versus weave, which 'just works'.

While technically a workaround, I believe the dnsmasq solution provided by @jsravn is actually the right answer. In our case, we have split DNS and all kinds of weird stuff. At some point, it's best to simply let the bare metal layer handle it. I think there's fairly decent evidence that people's SNAT/DNAT problems are pretty much all DNS, so I think running a dnsmasq process on each node makes sense, and should probably be the 'right way', as long as you're still using CNI.

Of course as you pointed out, I agree that if you can avoid CNI, that's probably the 'right choice'-- it removes a whole layer of stuff to deal with.

Quentin-M commented 6 years ago

@jaredallard My weave-tc workaround is simple enough to use and fixes the problem for us entirely.

jaredallard commented 6 years ago

@Quentin-M Does it solve just the latency, or issues with kube-dns as well? We've pretty much gotten rid of all latency issues on calico w/ ip-in-ip, but kube-dns doesn't work when it gets a lot of hits.

Quentin-M commented 6 years ago

This specifically solves the kernel race condition inside conntrack that drops parallel A/AAAA packets, leading to a flat 5s latency on each DNS query, regardless of coredns/kubedns/powerdns...

Quentin-M commented 6 years ago

Just posted a little write-up about our journey troubleshooting this issue here: https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/, including our workaround.

tj13 commented 6 years ago

@Quentin-M can it run on a non-Weave network? Our environment is OVS + OpenShift.

Quentin-M commented 6 years ago

It can, as long as the network interface in the script is set appropriately. It can't work on a network interface where the traffic is already encrypted; it has to be attached above that layer (e.g. eth0 is not OK for Weave, but the weave0 interface is OK).
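
To pick the right interface on a given setup, something like this helps (for Weave you want the unencrypted overlay side rather than the underlying NIC):

# List interfaces briefly and look for the overlay/bridge devices
ip -brief link show | grep -iE 'weave|datapath|tun'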

huyqut commented 6 years ago

@Quentin-M Hi, I have the same problem as @thomaschaaf :

No distribution data for pareto (/lib/tc//pareto.dist: No such file or directory)

However, I'm using CentOS 7 and there's no iproute2 package. What should I do in this case?

Edit: Found out it was in /usr/lib64/tc instead of /usr/lib/tc.

Quentin-M commented 6 years ago

Hi,

On CentOS, pareto.dist is in /usr/lib64/tc and provided by the iproute package. The mount needs to be adapted accordingly. Ref: https://centos.pkgs.org/7/centos-x86_64/iproute-4.11.0-14.el7.x86_64.rpm.html
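
To double-check on a given CentOS host:

# Confirm which package ships the distribution tables and where they live
rpm -qf /usr/lib64/tc/pareto.dist
rpm -ql iproute | grep '\.dist$'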