weaveworks / weave

Simple, resilient multi-host container networking and more.
https://www.weave.works
Apache License 2.0

DNS lookup timeouts due to races in conntrack #3287

Open dcowden opened 6 years ago

dcowden commented 6 years ago

What happened?

We are experiencing random 5-second DNS timeouts in our Kubernetes cluster.

How to reproduce it?

It is reproducible by requesting just about any in-cluster service and observing that periodically (in our case, 1 out of every 50 to 100 requests) we get a 5-second delay. It always happens during the DNS lookup.

Anything else we need to know?

We believe this is a result of a kernel level SNAT race condition that is described quite well here:

https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

The problem also happens with non-Weave CNI implementations, and is (ironically) not really a Weave issue at all. However, it becomes a Weave issue because the solution is to set a flag on the masquerading rules that are created, and those rules are under no one's control except Weave's.

What we need is the ability to apply the NF_NAT_RANGE_PROTO_RANDOM_FULLY flag to the masquerading rules that Weave sets up. In the post above, Flannel was in use, and the fix was applied there instead.

We searched for this issue and didn't see that anyone had asked for this. We're also unaware of any setting that allows this flag to be applied today -- if that's possible, please let us know.
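
For illustration, the kind of rule we mean would look roughly like the sketch below. The chain name and pod CIDR are assumptions for the example (Weave manages its own rules, so this is not something we can just add ourselves), and --random-fully requires a reasonably recent iptables and kernel.

    # Illustration only: a masquerade rule with full source-port randomisation.
    # The WEAVE chain and the 10.32.0.0/12 CIDR are assumptions for this sketch;
    # --random-fully corresponds to NF_NAT_RANGE_PROTO_RANDOM_FULLY and needs
    # iptables >= 1.6.2 plus kernel support.
    iptables -t nat -A WEAVE -s 10.32.0.0/12 ! -d 10.32.0.0/12 -j MASQUERADE --random-fully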

brycesteinhoff commented 6 years ago

We're experiencing this same issue, with failed inserts into the conntrack table increasing.

@Quentin-M, I've tried deploying your Docker container in our Weave pods, but the issue remains.

Quentin-M commented 6 years ago

Have you adapted the solution to the relevant ports / interfaces?

brycesteinhoff commented 6 years ago

@Quentin-M Thanks for the quick reply!

I just noticed after I posted that your shell script was marking traffic destined for port 5353. I've changed that to 53, as we're seeing problems with standard DNS, and will continue to monitor. So far it seems it may be better, but I still see some delay (~2.5s) on some requests.

Our interface is called "weave" also, so I left that the same.

I've not fully dived into your script to understand it yet; I need to familiarize myself with tc. Are there any other aspects I should consider adjusting?
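
In case it helps anyone else following along, my rough understanding of the general approach is sketched below. This is not the actual weave-tc script; the queue layout and timings are illustrative. The idea is to give each outgoing DNS packet a small, independently random delay so that parallel A/AAAA queries no longer reach conntrack at exactly the same instant.

    # Sketch only (not the weave-tc script): send UDP port-53 traffic on the weave
    # interface through a band that applies a tiny randomised netem delay.
    # The pareto distribution table is loaded from /usr/lib/tc (or /lib/tc).
    tc qdisc add dev weave root handle 1: prio bands 4
    tc qdisc add dev weave parent 1:4 handle 40: netem delay 1ms 2ms distribution pareto
    tc filter add dev weave parent 1: protocol ip prio 1 u32 \
        match ip protocol 17 0xff \
        match ip dport 53 0xffff \
        flowid 1:4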

brb commented 6 years ago

Just to update, I've submitted two patches to fix the conntrack races in the kernel - http://patchwork.ozlabs.org/patch/937963/ (accepted) and http://patchwork.ozlabs.org/patch/952939/ (waiting for a review).

If both are accepted, then the timeouts due to the races will be eliminated for those who run only one instance of a DNS server; for everyone else, the timeout hit rate should decrease.

Completely eliminating the timeouts when |DNS servers| > 1 is a non-trivial task and is still WIP.

bboreham commented 6 years ago

Do we envisage setting NF_NAT_RANGE_PROTO_RANDOM_FULLY inside Weave Net? If not I would re-title this issue to match the broader problem.

brb commented 6 years ago

We wrote a blog post describing the technical details of the problem and presenting the kernel fixes: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts.

jaygorrell commented 6 years ago

This thread was immensely helpful -- thanks to all who contributed. Simply adding the trailing . was the easiest fix for most of my cases and works great. The one thing I'm still not fully understanding is how internal (i.e. .default) DNS lookups would sometimes fail. I can try the trailing dot, but this is largely around external lookups that go through kube-dns, right?

I would have expected something like service.default. to fail without specifying the full FQDN, since this would skip the search domains, but it appears to be working fine -- though by working, I don't necessarily mean it avoids the timeout problem. If it can't be resolved externally, does it then fall back to the search list?

bboreham commented 6 years ago

Adding a trailing dot reduces the chance of failure since it reduces the number of lookups by (typically) 5x. It doesn't fix the underlying problem.

DNS resolvers vary, and they are linked into your client program, so I don't know which one you are using. However I would expect a fully-qualified name like service.default. to never hit the search list.
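
For context, a typical pod resolv.conf looks something like the sketch below (the nameserver IP and cluster domain are illustrative). With ndots:5, a name such as service.default is first tried against each entry in the search list, and each attempt is an A/AAAA pair; a trailing dot marks the name as fully qualified, so it is sent as-is.

    # Illustrative contents of /etc/resolv.conf inside a pod; values are assumptions.
    nameserver 10.96.0.10
    search default.svc.cluster.local svc.cluster.local cluster.local
    options ndots:5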

jaygorrell commented 6 years ago

That was just trying curl from the container. Good point though... it probably doesn't follow the same rules.

brb commented 5 years ago

The second kernel patch to mitigate the problem got accepted (context: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts) and it is out in Linux 5.0-rc6.

Please test it and report whether it has reduced the timeout hit rate. Thanks.

krzysztof-bronk commented 5 years ago

That's good news, thank you guys for all your investigative work so far.

I'm still a bit unclear as to which solution applies to which case or, more importantly, which cases do not have solutions yet. Let me (re)state some of the findings gathered from various blog posts and other GitHub issues -- please correct me if any of it is wrong given what we know today.

The issue exists for both SNAT and DNAT, and for both UDP and TCP. conntrack -S counts failed insertions for both UDP and TCP, so the packets counted there might mean a 5-second delay in the case of DNS, or a 1-second, 3-second, etc. delay for a TCP retransmission.
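
For anyone wanting to check a node, something like the command below shows the counter in question; a steadily climbing insert_failed count means packets are being lost to the race, whether they belong to UDP/DNS or TCP flows.

    # Per-CPU conntrack statistics; watch whether insert_failed keeps growing.
    conntrack -S | grep -o 'insert_failed=[0-9]*'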

To mitigate the issue, one can for example use single-request-reopen in resolv.conf, if the container image uses glibc (which rules out Alpine), or use weave-tc to introduce micro-delays for DNS packets. Disabling IPv6 or using FQDNs are quite niche solutions, so let's leave them aside for now.
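
For reference, the glibc option is applied as sketched below (illustrative only; in Kubernetes this is normally set via the pod's dnsConfig options rather than edited in place, and it does nothing for musl-based images). As I understand it, single-request-reopen makes the resolver close the shared socket and resend the unanswered query from a new socket (and thus a new source port) when the parallel requests over one socket are not handled correctly.

    # Illustrative only: append the glibc resolver option to a pod's resolv.conf.
    echo "options single-request-reopen" >> /etc/resolv.conf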

But both of those solutions are DNS-only (and weave-tc is UDP-only on top of that); external TCP connections will still have the problem. Admittedly, the DNS virtual IP is probably the most used "service" in the cluster -- and it is the topic of this issue.

The two fixes in the kernel solve the issue, but only if you run a single DNS pod (or one per node, with pods only connecting to the local one). I think weave-tc also does not guarantee 100% effectiveness in the multiple-pod case.

By the way, which kernel version contains the first fix? I understand the second is in 5.0+. And, more importantly, do those fixes apply to both SNAT and DNAT, and to both TCP and UDP?

In other words, given that moving to kernel 5.0+ is quite a leap for some, does it mean, in the simplest terms, that even with all the mentioned workarounds in place, but without those two kernel fixes, there is still a problem when 2+ containers connect to google.com at the same time?

(I'm excluding "workarounds" such as not using overlay networks at all, although as I understand it, that would actually work)

chris530 commented 5 years ago

Launched https://github.com/Quentin-M/weave-tc as a DaemonSet in Kubernetes, and it immediately fixed the issue.

bboreham commented 5 years ago

there is still a problem when 2+ containers connect to google.com at the same time?

[EDIT: I was confused so scoring out this part. See later comment too.] ~Those (TCP) connections are never a problem, because they will come from unique source ports.~

The problem [EDIT: in this specific GitHub issue] comes when certain DNS clients make two simultaneous UDP requests with identical source ports (and the destination port is always 53), so we get a race.

The best mitigation is a DNS service which does not go via NAT. This is being worked on in Kubernetes: basically one DNS instance per node, with NAT disabled for on-node connections.

krzysztof-bronk commented 5 years ago

But isn't there a race condition in that source-port uniqueness algorithm during SNAT, regardless of protocol, affecting different pods on the same host in the same way the DNS UDP client issue affects a single pod? Basically as in https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

bboreham commented 5 years ago

Sorry, yes, there is a different race condition to do with picking unique outgoing ports for SNAT.

If you are actually encountering this please open a new issue giving the details.

krzysztof-bronk commented 5 years ago

Thank you for the response. Indeed, I'm seeing insert_failed despite implementing several workarounds, and I'm not sure whether it's TCP, UDP, SNAT or DNAT. We can't bump the kernel yet.

If I understood correctly, the SNAT case should be mitigated by the "random fully" flag, but Weave never went on with it? I think kubelet and kube-proxy would need it as well anyway; I don't know where things stand there.

There is one more head-scratching case for me, which is how all those cases fare when one uses a NodePort. Isn't there a similar conntrack problem if a NodePort forwards to a cluster IP?

bboreham commented 5 years ago

the "random fully" flag, but Weave never went on with it?

We investigated the problem reported here, and developed fixes to that problem. If someone reports symptoms that are improved by "random fully" then we might add that. We have finite resources and have to concentrate on what is actually reported (and within that set, on paying customers).

Or, since it's Open Source, anyone else can do the investigation and contribute a PR.

krzysztof-bronk commented 5 years ago

I understand :) I was merely trying to comprehend where things stand with regard to the different races and available mitigations, since there are several blog posts and several GitHub issues with a massive number of comments to parse.

From my understanding of all of it, even with the two kernel fixes, the DNS workarounds, and the iptables flags, there is still an issue at least with multi-pod -> cluster-IP (multi-pod) connections; and without kernel 5.0 or "random fully", there is also an issue with simple multi-pod -> external-IP connections.

But yeah, I'll raise a new issue if that proves true and impactful enough for us in production. Thank you

Krishna1408 commented 5 years ago

@Quentin-M @brb We are using Weave as well for our CNI, and I tried the workaround mentioned by @Quentin-M, but I am getting an error:

No distribution data for pareto (/lib/tc//pareto.dist: No such file or directory)

I am using Debian: 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux

And I have mounted it at /usr/lib/tc:

Can you please point out where I am going wrong?

    spec:
      containers:
      - name: weave-tc
        image: 'qmachu/weave-tc:0.0.1'
        securityContext:
          privileged: true
        volumeMounts:
          - name: xtables-lock
            mountPath: /run/xtables.lock
          - name: usr-lib-tc
            mountPath: /usr/lib/tc

      volumes:
      - hostPath:
          path: /usr/lib/tc
          type: ""
        name: usr-lib-tc

Edit: In the container spec, the usr-lib-tc volumeMount needs an update: the mountPath should be /lib/tc instead of /usr/lib/tc.

hairyhenderson commented 5 years ago

@Krishna1408 If you change mountPath: /usr/lib/tc to mountPath: /lib/tc it should work. It needs to be mounted in /lib/tc inside the container, but it's (usually) /usr/lib/tc on the host.

Krishna1408 commented 5 years ago

Hi @hairyhenderson thanks a lot, it works for me :)

phlegx commented 4 years ago

@brb May I ask if the problem (the 5-second DNS delay) is solved with the 5.x kernel? Do you have some more details and feedback from people already?

brb commented 4 years ago

@phlegx It depends on which race condition you hit. The first two out of three got fixed in the kernel, and someone reported success (https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-463275851).

However, not much can be done from the kernel side about the third race condition. See my comments in the linked issue.

bboreham commented 4 years ago

I will repeat what a few others have said in this thread: the best way forward, if you have this problem, is “node-local dns”. Then there is no NAT on the DNS requests from pods and so no race condition.

Support for this configuration is slowly improving in Kubernetes and installers.

phlegx commented 4 years ago

We upgraded to Linux 5.x now and for now the "5 second" problem seems to be "solved". Need to check about this third race condition. Thanks for your replies!

insoz commented 4 years ago

We upgraded to Linux 5.x now and for now the "5 second" problem seems to be "solved". Need to check about this third race condition. Thanks for your replies!

By Linux 5.x, you mean kernel 5.x?

thockin commented 4 years ago

I just wanted to pop in and say thanks for this excellent and detailed explanation. Two years since it was filed and a year since it was fixed, some people still hit this issue, and frankly the DNAT part of it had me baffled.

It took a bit of reasoning, but as I understand it, the client sends multiple UDP requests with the same {source IP, source port, dest IP, dest port, protocol} 5-tuple and one just gets lost. Since clients are INTENTIONALLY sending them in parallel, the race is exacerbated.

DerGenaue commented 4 years ago

I was able to solve the issue by using the sessionAffinity feature of Kubernetes: changing service.spec.sessionAffinity on the kube-dns service in the kube-system namespace from None to ClientIP resolved it basically immediately on our cluster. I can't tell how long it will last, though; I expect the next Kubernetes upgrade to revert that setting. I'm pretty sure this shouldn't have any problematic side effects, but I can't tell for sure.

This solution makes all DNS request packets from one pod get delivered to the same kube-dns pod, thus eliminating the problem that the conntrack DNAT race condition causes (the race condition still exists; it just doesn't have any effect anymore).
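
Concretely, what I did was roughly the following (illustrative; adjust for how your kube-dns or CoreDNS Service is managed, since an addon manager or upgrade may well revert it):

    # Pin each client pod to a single kube-dns backend via ClientIP session affinity.
    kubectl -n kube-system patch service kube-dns \
      -p '{"spec":{"sessionAffinity":"ClientIP"}}'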

bboreham commented 4 years ago

@DerGenaue as far as I can tell sessionAffinity only works with proxy-mode userspace, which will slow down service traffic to an extent that some people will not tolerate.

thockin commented 4 years ago

Session affinity should work fine in iptables mode, but you still have the race the first time any pod starts sending DNS and any time the chosen backend dies, and (if you use a lot of DNS) you get no load balancing.

It's kind of hacky, but a fair mitigation for many people.


DerGenaue commented 4 years ago

I checked the kube-proxy code, and the iptables version generates the sessionAffinity rules just fine. I don't think any single pod will ever make so many DNS requests as to cause problems in this regard. Also, the way I understood it, the current plan for the future is to route all DNS requests to the pod running on the same node (i.e., only node-local DNS traffic), which would be very similar to this solution.

thockin commented 4 years ago

NodeLocal DNS avoids this problem, yes, by avoiding conntrack. But we have definitely seen a single pod issue two DNS queries in parallel (A and AAAA) and trigger this race.


elmiedo commented 4 years ago

Hi. Why not implement dnsmasq instead of working with the usual DNS clients? dnsmasq is able to send a DNS query to every DNS server in its config file simultaneously; you just receive the fastest reply.

bboreham commented 4 years ago

@elmiedo it is uncommon to have the opportunity to change the DNS client -- it's bound into each container image, in code from glibc or musl or similar. And the problem we are discussing hits between that client and the Linux kernel, so the server (such as dnsmasq) does not have a chance to affect things.

thockin commented 4 years ago

Again: The Kubernetes node-local DNS cache effort is trying to bypass these problems by using NOTRACK for connections from pods to the local cache, then using TCP exclusively from the local cache to upstream resolvers.
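
For the curious, the NOTRACK half of that looks roughly like the sketch below. The link-local address and rule placement are illustrative rather than the actual NodeLocal DNSCache rules; the point is that DNS traffic between pods and the node-local cache bypasses conntrack entirely, so there is no entry to race on.

    # Sketch: exempt pod <-> node-local DNS cache traffic from connection tracking.
    iptables -t raw -A PREROUTING -d 169.254.20.10/32 -p udp --dport 53 -j NOTRACK
    iptables -t raw -A PREROUTING -d 169.254.20.10/32 -p tcp --dport 53 -j NOTRACK
    iptables -t raw -A OUTPUT     -s 169.254.20.10/32 -p udp --sport 53 -j NOTRACK
    iptables -t raw -A OUTPUT     -s 169.254.20.10/32 -p tcp --sport 53 -j NOTRACK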


chengzhycn commented 2 years ago

We wrote a blog post describing the technical details of the problem and presenting the kernel fixes: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts.

@brb Thanks for your excellent explanation. But there is one small point that confuses me. I looked at the glibc source code: it uses send_dg to send the A and AAAA queries via UDP in parallel, but it just calls sendmmsg, which seems to send the two UDP packets from one thread (so it doesn't match the "different threads" condition). Am I misunderstanding something? Looking forward to your reply. :)

axot commented 2 years ago

https://elixir.bootlin.com/linux/v5.14.14/source/net/socket.c#L2548 Same question here: is it possible for the processing to run on a different CPU because of cond_resched()?