weaveworks / scope

Monitoring, visualisation & management for Docker & Kubernetes
https://www.weave.works/oss/scope/
Apache License 2.0
5.87k stars 714 forks source link

Pipe server goes nuts sometimes #3684

Closed bboreham closed 4 years ago

bboreham commented 5 years ago

This is observed in multi-tenant mode; all replicas are spinning with high CPU consumption in multitenant.(*consulPipeRouter).watchAll(). Consul is also showing high CPU.

Logs at the start of the high-cpu period:

<app> ERRO: 2019/09/16 11:03:52.726035 Error CASing pipes/15107-pipe-7240779112489848809: Error: Pipe pipes/15\
107-pipe-7240779112489848809 has existing connection to 10.244.202.251:4444
<app> ERRO: 2019/09/16 11:03:52.727727 Error CASing pipes/15107-pipe-7240779112489848809: Error: Pipe pipes/15\
107-pipe-7240779112489848809 has existing connection to 10.244.202.251:4444
<app> ERRO: 2019/09/16 11:03:52.730799 Error CASing pipes/15107-pipe-7240779112489848809: Error: Pipe pipes/15\
107-pipe-7240779112489848809 has existing connection to 10.244.202.251:4444
<app> ERRO: 2019/09/16 11:03:52.732553 Error CASing pipes/15107-pipe-7240779112489848809: Error: Pipe pipes/15\
107-pipe-7240779112489848809 has existing connection to 10.244.202.251:4444
<app> ERRO: 2019/09/16 11:03:52.734154 Error CASing pipes/15107-pipe-7240779112489848809: Error: Pipe pipes/15\
107-pipe-7240779112489848809 has existing connection to 10.244.202.251:4444
<app> INFO: 2019/09/16 11:03:54.047200 Get pipes/15428-pipe-8723926611138769060:probe
<app> INFO: 2019/09/16 11:03:54.683783 Creating pipe pipes/15428-pipe-8723926611138769060
<app> INFO: 2019/09/16 11:03:54.768161 Release pipes/15428-pipe-8723926611138769060:probe
<app> INFO: 2019/09/16 11:03:55.193087 Deleting pipe pipes/15428-pipe-8723926611138769060
<app> INFO: 2019/09/16 11:03:56.174861 Get pipes/15428-pipe-8723926611138769060:probe
<app> INFO: 2019/09/16 11:03:57.005606 Creating pipe pipes/15428-pipe-8723926611138769060
<app> INFO: 2019/09/16 11:03:57.080772 Release pipes/15428-pipe-8723926611138769060:probe
<app> INFO: 2019/09/16 11:03:57.314012 Get pipes/15428-pipe-8723926611138769060:probe
<app> INFO: 2019/09/16 11:03:57.413836 Release pipes/15428-pipe-8723926611138769060:probe
<app> INFO: 2019/09/16 11:03:57.631618 Deleting pipe pipes/15428-pipe-8723926611138769060

the creating/deleting cycle continues, with different tenant IDs appearing from time to time. The 'error CASing' message reappear every so often, sometimes up 2-3 minutes between batches.

According to monitoring data we have experienced periods of high CPU in the pipe service for months.

bboreham commented 4 years ago

Possibly related: today a single pipe server with 10,000 goroutines.

Top of goroutine dump:

goroutine profile: total 8989
8958 @ 0x433320 0x442e7b 0x4ad297 0x4adb5c 0xb86e87 0x4615c1
#   0x4ad296    io.(*pipe).Read+0xe6                                /usr/local/go/src/io/pipe.go:50
#   0x4adb5b    io.(*PipeReader).Read+0x4b                          /usr/local/go/src/io/pipe.go:127
#   0xb86e86    github.com/weaveworks/scope/common/xfer.(*pipe).CopyToWebsocket.func2+0x86  /go/src/github.com/weaveworks/scop\
e/common/xfer/pipes.go:141

5 @ 0x433320 0x442e7b 0xb84198 0x16f255f 0x16f2ab4 0x6ea7c4 0x7a99db 0x988b6c 0x6ea7c4 0x988cfe 0x6ea7c4 0x989189 0x6ea7c4 0x6\
edc14 0x6e95b5 0x4615c1
#   0xb84197    github.com/weaveworks/scope/common/xfer.(*pipe).CopyToWebsocket+0x297                   /go/src/github.com/wea\
veworks/scope/common/xfer/pipes.go:160
#   0x16f255e   github.com/weaveworks/scope/app.handlePipeWs.func1+0x50e                        /go/src/github.com/weaveworks/\
scope/app/pipes.go:69
#   0x16f2ab3   github.com/weaveworks/scope/app.requestContextDecorator.func1+0xb3                  /go/src/github.com/weavewo\
rks/scope/app/router.go:46
#   0x6ea7c3    net/http.HandlerFunc.ServeHTTP+0x43                                 /usr/local/go/src/net/http/server.go:2007
#   0x7a99da    github.com/weaveworks/scope/vendor/github.com/gorilla/mux.(*Router).ServeHTTP+0xda          /go/src/github.com\
/weaveworks/scope/vendor/github.com/gorilla/mux/mux.go:107
#   0x988b6b    github.com/weaveworks/scope/vendor/github.com/weaveworks/common/middleware.Tracer.Wrap.func1+0x8b   /go/src/gi\
thub.com/weaveworks/scope/vendor/github.com/weaveworks/common/middleware/http_tracing.go:25
#   0x6ea7c3    net/http.HandlerFunc.ServeHTTP+0x43                                 /usr/local/go/src/net/http/server.go:2007
#   0x988cfd    github.com/weaveworks/scope/vendor/github.com/weaveworks/common/middleware.Instrument.Wrap.func1+0x16d  /go/sr\
c/github.com/weaveworks/scope/vendor/github.com/weaveworks/common/middleware/instrument.go:49
#   0x6ea7c3    net/http.HandlerFunc.ServeHTTP+0x43                                 /usr/local/go/src/net/http/server.go:2007
#   0x989188    github.com/weaveworks/scope/vendor/github.com/weaveworks/common/middleware.Log.Wrap.func1+0x2a8     /go/src/gi\
thub.com/weaveworks/scope/vendor/github.com/weaveworks/common/middleware/logging.go:36
#   0x6ea7c3    net/http.HandlerFunc.ServeHTTP+0x43                                 /usr/local/go/src/net/http/server.go:2007
#   0x6edc13    net/http.serverHandler.ServeHTTP+0xa3                                   /usr/local/go/src/net/http/server.go:2\
802
#   0x6e95b4    net/http.(*conn).serve+0x874                                        /usr/local/go/src/net/http/server.go:1890

3 @ 0x433320 0x42e0ba 0x42d685 0x4ced65 0x4cfcdf 0x4cfcc1 0x5b334f 0x5c7578 0x565c2a 0xb7e968 0xb7eaba 0xb7f866 0xb7fd8f 0xb85\
cbf 0xb86ccd 0x4615c1
#   0x42d684    internal/poll.runtime_pollWait+0x54                             /usr/local/go/src/runtime/netpoll.go:184
#   0x4ced64    internal/poll.(*pollDesc).wait+0x44                             /usr/local/go/src/internal/poll/fd_poll_runtim\
e.go:87
#   0x4cfcde    internal/poll.(*pollDesc).waitRead+0x1ce                            /usr/local/go/src/internal/poll/fd_poll_ru\
ntime.go:92
#   0x4cfcc0    internal/poll.(*FD).Read+0x1b0                                  /usr/local/go/src/internal/poll/fd_unix.go:169
#   0x5b334e    net.(*netFD).Read+0x4e                                      /usr/local/go/src/net/fd_unix.go:202
#   0x5c7577    net.(*conn).Read+0x67                                       /usr/local/go/src/net/net.go:184
#   0x565c29    bufio.(*Reader).Read+0x269                                  /usr/local/go/src/bufio/bufio.go:226
#   0xb7e967    github.com/weaveworks/scope/vendor/github.com/gorilla/websocket.(*Conn).readFull+0x77       /go/src/github.com\
/weaveworks/scope/vendor/github.com/gorilla/websocket/conn.go:617
#   0xb7eab9    github.com/weaveworks/scope/vendor/github.com/gorilla/websocket.(*Conn).advanceFrame+0x89   /go/src/github.com\
/weaveworks/scope/vendor/github.com/gorilla/websocket/conn.go:641
#   0xb7f865    github.com/weaveworks/scope/vendor/github.com/gorilla/websocket.(*Conn).NextReader+0x55     /go/src/github.com\
/weaveworks/scope/vendor/github.com/gorilla/websocket/conn.go:780
#   0xb7fd8e    github.com/weaveworks/scope/vendor/github.com/gorilla/websocket.(*Conn).ReadMessage+0x2e    /go/src/github.com\
/weaveworks/scope/vendor/github.com/gorilla/websocket/conn.go:852
#   0xb85cbe    github.com/weaveworks/scope/common/xfer.(*pingingWebsocket).ReadMessage+0xae            /go/src/github.com/wea\
veworks/scope/common/xfer/websocket.go:109
#   0xb86ccc    github.com/weaveworks/scope/common/xfer.(*pipe).CopyToWebsocket.func1+0x3c          /go/src/github.com/weavewo\
rks/scope/common/xfer/pipes.go:120

Tail of logs:

<app> INFO: 2019/12/16 08:24:57.876959 Creating pipe pipes/14815-pipe-661242507181636894
<app> INFO: 2019/12/16 08:24:57.960942 Release pipes/14815-pipe-661242507181636894:probe
<app> INFO: 2019/12/16 08:24:58.200017 Get pipes/14815-pipe-661242507181636894:probe
<app> INFO: 2019/12/16 08:24:58.443435 Release pipes/14815-pipe-661242507181636894:probe
<app> INFO: 2019/12/16 08:24:58.700782 Get pipes/14815-pipe-661242507181636894:probe
<app> INFO: 2019/12/16 08:24:58.970222 Release pipes/14815-pipe-661242507181636894:probe
<app> INFO: 2019/12/16 08:24:59.286143 Deleting pipe pipes/14815-pipe-661242507181636894
<app> INFO: 2019/12/16 08:25:03.267505 Get pipes/14815-pipe-661242507181636894:probe
<app> INFO: 2019/12/16 08:25:04.841362 Creating pipe pipes/14815-pipe-661242507181636894
<app> INFO: 2019/12/16 08:25:04.924074 Release pipes/14815-pipe-661242507181636894:probe
<app> INFO: 2019/12/16 08:25:06.006844 Deleting pipe pipes/14815-pipe-661242507181636894
<app> INFO: 2019/12/16 08:25:08.756865 Get pipes/14815-pipe-661242507181636894:probe
<app> INFO: 2019/12/16 08:25:10.479201 Creating pipe pipes/14815-pipe-661242507181636894
<app> INFO: 2019/12/16 08:25:10.552417 Release pipes/14815-pipe-661242507181636894:probe
<app> INFO: 2019/12/16 08:25:11.241154 Deleting pipe pipes/14815-pipe-661242507181636894
<app> INFO: 2019/12/16 08:25:14.396069 Get pipes/14815-pipe-661242507181636894:probe
<app> INFO: 2019/12/16 08:25:14.866810 Creating pipe pipes/14815-pipe-661242507181636894
<app> INFO: 2019/12/16 08:25:14.949131 Release pipes/14815-pipe-661242507181636894:probe
<app> INFO: 2019/12/16 08:25:15.180797 Get pipes/14815-pipe-661242507181636894:probe
<app> INFO: 2019/12/16 08:25:15.272817 Release pipes/14815-pipe-661242507181636894:probe
<app> INFO: 2019/12/16 08:25:15.344041 Deleting pipe pipes/14815-pipe-661242507181636894
<app> INFO: 2019/12/16 08:25:16.470852 Get pipes/14815-pipe-661242507181636894:probe
<app> INFO: 2019/12/16 08:25:17.273131 Creating pipe pipes/14815-pipe-661242507181636894
<app> INFO: 2019/12/16 08:25:17.355450 Release pipes/14815-pipe-661242507181636894:probe
<app> INFO: 2019/12/16 08:25:17.854341 Deleting pipe pipes/14815-pipe-661242507181636894
bboreham commented 4 years ago

I tracked this down to the handling of EOF on the non-websocket side of the conversation in the probe.

Can reproduce easily by clicking on the 'describe' button on a Kubernetes object.