weaveworks / weave

Simple, resilient multi-host container networking and more.
https://www.weave.works
Apache License 2.0

weave network-wide arrest; TCP writes blocked forever when congestion control kicks in; lacking timeouts #3795

Open raulk opened 4 years ago

raulk commented 4 years ago

What you expected to happen?

I expect socket writes on the weave control plane to time out and fail (by setting https://golang.org/pkg/net/#TCPConn.SetWriteDeadline) instead of blocking forever when the send buffer is congested.
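For illustration, a minimal sketch of the pattern in Go (not Weave's actual code; the `writeWithDeadline` helper and the 30-second value are assumptions):

```go
package example

import (
	"net"
	"time"
)

// writeWithDeadline is a hypothetical helper: it arms a write deadline before
// each send so a congested connection surfaces a timeout error instead of
// blocking the calling goroutine forever. The 30-second value is arbitrary.
func writeWithDeadline(conn net.Conn, msg []byte) error {
	if err := conn.SetWriteDeadline(time.Now().Add(30 * time.Second)); err != nil {
		return err
	}
	// On expiry, Write returns a net.Error whose Timeout() method reports true.
	_, err := conn.Write(msg)
	return err
}
```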

What happened?

Weave's internal design revolves heavily around single-threaded event loops (called “actor loops”). Some of these actor loops are responsible for gossip; one thing they gossip about is IPAM allocations.
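For context, here is a minimal sketch of that pattern (illustrative only, not Weave's actual code): all work is serialized onto one goroutine, so a single blocking call inside the loop stalls every command queued behind it.

```go
package example

// actorLoop serializes all work onto a single goroutine. If one action blocks,
// for example on an unbounded TCP write, every later action (including status
// queries) waits behind it.
func actorLoop(actions <-chan func(), stop <-chan struct{}) {
	for {
		select {
		case act := <-actions:
			act() // a blocking network write here freezes the whole loop
		case <-stop:
			return
		}
	}
}
```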

In our setup at the Testground project, we have ~120 Kubernetes nodes, each running weave as a DaemonSet, and gossiping with others about IPAM allocations.

We deploy test workloads consisting of as many as 1,500 pods at once. As they come up, and weave allocates IPs, IPAM gossip spikes.

In some cases, a TCP send window fills up, and the Allocator actor loop blocks on a write forever. This happened to us here:

Goroutine stuck sending a ring update:

```
goroutine 146 [IO wait, 10 minutes]:
internal/poll.runtime_pollWait(0x7fd626e7eb68, 0x77, 0xffffffffffffffff)
	/usr/local/go/src/runtime/netpoll.go:184 +0x55
internal/poll.(*pollDesc).wait(0xc0003da618, 0x77, 0x14f00, 0x14fea, 0xffffffffffffffff)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitWrite(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:96
internal/poll.(*FD).Write(0xc0003da600, 0xc003190000, 0x21369, 0x21369, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:276 +0x2a9
net.(*netFD).Write(0xc0003da600, 0xc003190000, 0x21369, 0x21369, 0xc000aa5aa8, 0x447b7c, 0x22000)
	/usr/local/go/src/net/fd_unix.go:220 +0x4f
net.(*conn).Write(0xc001796360, 0xc003190000, 0x21369, 0x21369, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:196 +0x68
github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.(*lengthPrefixTCPSender).Send(0xc001c692d0, 0xc00314c000, 0x21365, 0x22000, 0x21365, 0xc00314c000)
	/go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/protocol_crypto.go:110 +0xe1
github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.(*LocalConnection).sendProtocolMsg(...)
	/go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/connection.go:410
github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.(*LocalConnection).SendProtocolMsg(0xc0017ec6e0, 0x1ca5005, 0xc0030fc000, 0x21364, 0x213c0, 0x0, 0x3)
	/go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/connection.go:127 +0xbe
github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.(*gossipChannel).relayUnicast(0xc000272700, 0xa0897fc131a, 0xc0030fc000, 0x21364, 0x213c0, 0x213c0, 0x447dae)
	/go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/gossip_channel.go:108 +0xce
github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.(*gossipChannel).GossipUnicast(0xc000272700, 0xa0897fc131a, 0xc003026000, 0x21336, 0x22000, 0xc003026000, 0x1)
	/go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/gossip_channel.go:77 +0x18f
github.com/weaveworks/weave/ipam.(*Allocator).sendRingUpdate(0xc00038a420, 0xa0897fc131a)
	/go/src/github.com/weaveworks/weave/ipam/allocator.go:814 +0xe8
github.com/weaveworks/weave/ipam.(*Allocator).donateSpace(0xc00038a420, 0x1ce900001ce80000, 0xa0897fc131a)
	/go/src/github.com/weaveworks/weave/ipam/allocator.go:918 +0x37d
github.com/weaveworks/weave/ipam.(*Allocator).OnGossipUnicast.func1()
	/go/src/github.com/weaveworks/weave/ipam/allocator.go:588 +0x306
github.com/weaveworks/weave/ipam.(*Allocator).actorLoop(0xc00038a420, 0xc000438120, 0xc0002f2060)
	/go/src/github.com/weaveworks/weave/ipam/allocator.go:698 +0x203
created by github.com/weaveworks/weave/ipam.(*Allocator).Start
	/go/src/github.com/weaveworks/weave/ipam/allocator.go:183 +0x18f
```

That stuck goroutine seizes up the entire actor loop, which has various nasty downstream effects:

  1. Status checks start failing, because the status check sends a command to the Allocator actor loop, which is blocked on the above write.

     Status check goroutine stuck:

     ```
     goroutine 5008 [chan send, 5 minutes]:
     github.com/weaveworks/weave/ipam.NewStatus(0xc00038a420, 0xc010000000, 0x4, 0xc0036c2848)
     	/go/src/github.com/weaveworks/weave/ipam/status.go:51 +0xe7
     main.HandleHTTP.func1(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
     	/go/src/github.com/weaveworks/weave/prog/weaver/http.go:314 +0x10f
     main.HandleHTTP.func4.1(0x1c95360, 0xc00063e8c0, 0xc00bc42300)
     	/go/src/github.com/weaveworks/weave/prog/weaver/http.go:359 +0x44
     net/http.HandlerFunc.ServeHTTP(0xc000607bc0, 0x1c95360, 0xc00063e8c0, 0xc00bc42300)
     	/usr/local/go/src/net/http/server.go:2007 +0x44
     github.com/weaveworks/weave/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc000225340, 0x1c95360, 0xc00063e8c0, 0xc00bc42300)
     	/go/src/github.com/weaveworks/weave/vendor/github.com/gorilla/mux/mux.go:159 +0x104
     github.com/weaveworks/weave/common.(*loggingHandler).ServeHTTP(0xc001a661f0, 0x1c95360, 0xc00063e8c0, 0xc000024900)
     	/go/src/github.com/weaveworks/weave/common/http.go:13 +0x139
     net/http.(*ServeMux).ServeHTTP(0x2a5ede0, 0x1c95360, 0xc00063e8c0, 0xc000024900)
     	/usr/local/go/src/net/http/server.go:2387 +0x1bd
     net/http.serverHandler.ServeHTTP(0xc001aaa000, 0x1c95360, 0xc00063e8c0, 0xc000024900)
     	/usr/local/go/src/net/http/server.go:2802 +0xa4
     net/http.(*conn).serve(0xc0026ac6e0, 0x1c9be60, 0xc006aeaf00)
     	/usr/local/go/src/net/http/server.go:1890 +0x875
     created by net/http.(*Server).Serve
     	/usr/local/go/src/net/http/server.go:2928 +0x384
     ```
  2. Incoming messages queue up in the TCP receive buffers, because the actor loop is not consuming them off the wire.
  3. Stemming from (2), peers sending us gossip eventually block on writing to the locked-up peer, and the same lock-up repeats in their weave processes.
  4. In a short amount of time, this triggers a domino effect across the entire weave network, where many peers end up blocked on the IPAM subsystem.

We have verified with `ss` on the host that the TCP send buffers were indeed exhausted.

Suggested fix

Allow network writes to time out by setting write deadlines on the connection.

TCP congestion control is one reason a write can stall forever unless a timeout is set, but there are other reasons too. Right now the system is not only extremely vulnerable to these stalls; they can trigger a network-wide arrest.
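As a rough sketch of what this could look like at the sender level (the `deadlineSender` type, its timeout field, and the close-on-timeout policy are assumptions for illustration, not mesh's actual API): arm a deadline per message, and if it expires, fail the send and drop the connection so the actor loop can keep making progress while the peer reconnects.

```go
package example

import (
	"net"
	"time"
)

// deadlineSender is a hypothetical wrapper around a mesh connection's writer.
type deadlineSender struct {
	conn    net.Conn
	timeout time.Duration
}

// Send arms a write deadline for each frame. If the peer stops draining its
// receive buffer and the deadline expires, the write fails instead of blocking
// forever, and the connection is closed so the actor loop can move on.
func (s *deadlineSender) Send(frame []byte) error {
	if err := s.conn.SetWriteDeadline(time.Now().Add(s.timeout)); err != nil {
		return err
	}
	if _, err := s.conn.Write(frame); err != nil {
		s.conn.Close() // tear down; the peer can reconnect and re-gossip
		return err
	}
	// Clear the deadline so it does not fire during idle periods between sends.
	return s.conn.SetWriteDeadline(time.Time{})
}
```

Closing on timeout trades an occasional reconnect for liveness of the actor loop; retrying in place would keep the loop blocked and reproduce the cascade described above.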

Goroutine dump

See attached. You will also notice other goroutines waiting for the actor loop. The IPAM subsystem is essentially frozen.

weave.goroutines.log

bboreham commented 4 years ago

Thanks for your report. Is it the same as https://github.com/weaveworks/mesh/issues/125 ?

raulk commented 4 years ago

@bboreham Looks like it! But this issue shows the nasty downstream effects that not setting write deadlines can have on the entire system :-( Bumping up the priority on this would be greatly appreciated -- it would make weave more resilient to intermittent network issues.

YanzheL commented 4 years ago

I have a similar issue but I'm unsure whether it is related to this.

I use BBR congestion control with the following sysctl settings on one of the Kubernetes nodes:

```
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
```

Several minutes after system boot, the node blocks all TCP connections (and even UDP DNS queries) between pods in the cluster.

For example, the cluster IP of the CoreDNS service is 10.0.0.10. Normally I can query an in-cluster DNS name by running `dig @10.0.0.10 some-service-name.default.svc.cluster.local` on the host, and it returns valid IPs.

But if I enable BBR congestion control with the fq qdisc, the DNS query blocks until it times out. TCP connections are blocked too and fail with `No route to host`. However, ICMP queries (pinging a cluster IP) still return valid responses.

If I reset the sysctl params to the following values and reboot, the issue disappears.

```
net.core.default_qdisc = fq_codel
net.ipv4.tcp_congestion_control = cubic
```

This behavior is reproducible; I tested it on two hosts with similar system environments.

I'm completely unfamiliar with TCP congestion control and weavenet. I just feel like it is related to the issue you posted here.

Environment Info:

OS: Ubuntu 18.04.4
Weavenet: 2.6.4
Kubernetes: 1.17.4
Docker: 19.03.11
kube-proxy is running in IPVS mode.