weaveworks / common

Libraries used in multiple Weave projects

Make gRPC connection timeout configurable #261

Closed liguozhong closed 1 year ago

liguozhong commented 1 year ago

In the Loki project (https://github.com/grafana/loki/), which uses this library, we encountered a gRPC timeout of 120s that could not be fixed from within Loki. We traced the 120s timeout to the following setting in common:

defaultServerOptions.connectionTimeout: 120 * time.Second,

    server:
      http_listen_port: 3100
      grpc_listen_port: 9095
      http_server_read_timeout: 600s
      http_server_write_timeout: 600s
      http_server_idle_timeout: 600s

This setting interrupts Loki's queries: Loki's other timeouts are set to 5 minutes, but the query is cancelled at 2 minutes.

We need to make this setting configurable so it can accommodate Loki's brute-force scan style of log querying. Example logs:


level=info ts=2022-10-17T06:28:23.653973829Z caller=handler.go:174 org_id=fake msg="slow query detected" method=GET host=loki-loki-distributed-query-frontend.loki-research-x4.svc.cluster.local:3100 path=/loki/api/v1/query_range time_taken=2m1.965813572s param_direction=backward param_end=1665987976954000000 param_limit=1000 param_query="{evtType=\"evt_ttsd\"} | json" param_start=1665966376954000000 param_step=10000ms
level=warn ts=2022-10-17T06:28:23.654024222Z caller=logging.go:86 traceID=55bbc9c1308e9842 orgID=fake msg="GET /loki/api/v1/query_range?direction=backward&end=1665987976954000000&limit=1000&query=%7BevtType%3D%22evt_ttsd%22%7D+%7C+json&start=1665966376954000000&step=10000ms (500) 2m1.9659356s Response: \"query index: rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing dial tcp 10.100.190.157:9095: i/o timeout\\\"\\n\" ws: false; Accept-Encoding: gzip; Connection: close; User-Agent: Grafana/9.1.5; "

level=error ts=2022-10-17T06:28:23.654071531Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="rpc error: code = Code(500) desc = query index: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.100.190.157:9095: i/o timeout\"\n"

The same timeout error keeps repeating.
liguozhong commented 1 year ago
var defaultServerOptions = serverOptions{
    maxReceiveMessageSize: defaultServerMaxReceiveMessageSize,
    maxSendMessageSize:    defaultServerMaxSendMessageSize,
    connectionTimeout:     120 * time.Second,
    writeBufferSize:       defaultWriteBufSize,
    readBufferSize:        defaultReadBufSize,
}

// ConnectionTimeout returns a ServerOption that sets the timeout for
// connection establishment (up to and including HTTP/2 handshaking) for all
// new connections.  If this is not set, the default is 120 seconds.  A zero or
// negative value will result in an immediate timeout.
//
// This API is EXPERIMENTAL.
func ConnectionTimeout(d time.Duration) ServerOption {
    return newFuncServerOption(func(o *serverOptions) {
        o.connectionTimeout = d
    })
}

// handleRawConn forks a goroutine to handle a just-accepted connection that
// has not had any I/O performed on it yet.
func (s *Server) handleRawConn(rawConn net.Conn) {
    if s.quit.HasFired() {
        rawConn.Close()
        return
    }
    rawConn.SetDeadline(time.Now().Add(s.opts.connectionTimeout))
    conn, authInfo, err := s.useTransportAuthenticator(rawConn)
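
For reference, a minimal sketch of how this option could be surfaced through a flag-driven server config. The Config field and flag name below are hypothetical, not the actual weaveworks/common API; grpc.ConnectionTimeout is the (experimental) ServerOption quoted above.

package server

import (
    "flag"
    "time"

    "google.golang.org/grpc"
)

// Config holds the hypothetical setting; in weaveworks/common it would sit
// alongside the existing gRPC server options.
type Config struct {
    GRPCServerConnTimeout time.Duration
}

func (c *Config) RegisterFlags(f *flag.FlagSet) {
    // Keep 120s as the default so existing behaviour is unchanged.
    f.DurationVar(&c.GRPCServerConnTimeout, "server.grpc-conn-timeout", 120*time.Second,
        "Maximum time allowed for a new gRPC connection to complete its TCP and HTTP/2 handshake.")
}

func newGRPCServer(cfg Config) *grpc.Server {
    return grpc.NewServer(
        grpc.ConnectionTimeout(cfg.GRPCServerConnTimeout),
    )
}
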
liguozhong commented 1 year ago

@bboreham Help review please, thanks. We need this PR for Loki.

bboreham commented 1 year ago

A connection timeout of 2 minutes is very unusual. Do you have more info why this is necessary?

bboreham commented 1 year ago

To expand,

"transport: Error while dialing dial tcp 10.100.190.157:9095: i/o timeout\"\n"

This means that the underlying TCP stack could not establish a connection in 2 minutes. Establishing a connection is just 3 packets - unless your endpoints are a very long way apart, this typically means that the far end has gone down or is so busy that it cannot action any more work. And in that situation, raising the timeout does not improve things.
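
One quick way to confirm that, independent of gRPC and Loki, is to check whether a plain TCP handshake to that port completes at all. A rough sketch, using the address from the log lines above:

package main

import (
    "fmt"
    "net"
    "time"
)

// If this handshake fails (or hangs) well under 2 minutes, the problem is
// reachability or an overloaded listener, not the value of the connection
// timeout.
func main() {
    conn, err := net.DialTimeout("tcp", "10.100.190.157:9095", 5*time.Second)
    if err != nil {
        fmt.Println("TCP handshake failed:", err)
        return
    }
    defer conn.Close()
    fmt.Println("TCP handshake completed:", conn.RemoteAddr())
}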

liguozhong commented 1 year ago

@bboreham Hi, thanks for your review.

Loki timeouts are generally 5 minutes or even 15 minutes, and we already have too many timeouts that need to be changed:

1. Grafana datasource proxy timeout
2. nginx timeout
3. PromQL engine timeout
4. querier timeout
5. gRPC timeout

And now we have also found this gRPC connection timeout in weaveworks/common/server.

more info here: https://grafana.slack.com/archives/CEPJRLQNL/p1665738667866879

liguozhong commented 1 year ago

A connection timeout of 2 minutes is very unusual. Do you have more info why this is necessary?

Hi, the Loki querier is really slow, and we always set a 5m~15m timeout.

liguozhong commented 1 year ago

raising the timeout

Hi, Loki users need to raise the timeout.

liguozhong commented 1 year ago

A connection timeout of 2 minutes is very unusual. Do you have more info why this is necessary?

Metric queries always return within seconds, but log queries are slow.

liguozhong commented 1 year ago

I am trying to help this user: https://grafana.slack.com/team/U03RN8FQH47

bboreham commented 1 year ago

None of your replies is persuading me you want a longer connection timeout.

The connection phase is before any work is done on the query. Even if it takes 10 minutes to execute the query, you want the connection to complete in milliseconds. If this is not happening, your server is likely overloaded, so it is just not accepting new connections.
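
To make the distinction concrete, here is a rough client-side sketch; the target address is a placeholder and a health-check RPC stands in for the real query RPC. The connection handshake gets a tight bound, while the slow query gets its own, much longer per-RPC deadline:

package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
    // Connection establishment should take milliseconds; bound it tightly.
    dialCtx, dialCancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer dialCancel()
    conn, err := grpc.DialContext(dialCtx, "index-gateway.example:9095",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithBlock(), // block so the 10s bound applies to the handshake itself
    )
    if err != nil {
        log.Fatalf("could not connect within 10s: %v", err)
    }
    defer conn.Close()

    // The slow part, the query itself, gets its own much longer deadline.
    rpcCtx, rpcCancel := context.WithTimeout(context.Background(), 10*time.Minute)
    defer rpcCancel()
    if _, err := grpc_health_v1.NewHealthClient(conn).Check(rpcCtx, &grpc_health_v1.HealthCheckRequest{}); err != nil {
        log.Printf("RPC failed: %v", err)
    }
}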

liguozhong commented 1 year ago

None of your replies is persuading me you want a longer connection timeout.

The connection phase is before any work is done on the query. Even if it takes 10 minutes to execute the query, you want the connection to complete in milliseconds. If this is not happening, your server is likely overloaded, so it is just not accepting new connections.

@bboreham Thank you so much for such a quick response 👍. In some cases, we want the server to keep waiting even when it is overloaded.

For example, a Loki querier pod sends 10 gRPC client requests to one index-gateway pod at the same time:

1. The Loki querier sets a timeout of 10m.
2. The index-gateway sets grpc_server_max_concurrent_streams: 1.
3. The index-gateway takes 1m to complete each request.

Since the querier's grpc_server_connection_timeout=2m is immutable and the querier initiates all requests at the same time, grpc_server_max_concurrent_streams=1 forces the requests to wait in line. At most 2 requests can complete within 2 minutes; by the time those 2 queries finish, the 2m mark has been reached, so the remaining 8 requests fail against grpc_server_connection_timeout=2m. Even though Loki sets a timeout of 10m, the failure occurs at 2m.

But if grpc_server_connection_timeout could be changed to 10m, all 10 querier requests could queue up and complete.

Of course, the example I gave is a bit extreme, but this is the problem currently hit by Loki Slack users (here: https://grafana.slack.com/archives/CEPJRLQNL/p1665738667866879).

I would like weaveworks/common to make the grpc_server_connection_timeout parameter configurable. I also think 2m is a great default, but in some cases it would help to leave some degree of freedom to projects like Loki. Thanks for your review.

bboreham commented 1 year ago

If you can show that this happens, I would be interested. gRPC multiplexes multiple requests over the same connection, so I don't think what you describe will happen.
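
To illustrate that multiplexing (a sketch only; the address is a placeholder and a health-check RPC stands in for the index-gateway query): all ten requests below share a single ClientConn, so the server-side connection timeout applies only while the one handshake is in flight, not to each queued request.

package main

import (
    "context"
    "log"
    "sync"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
    // One ClientConn means one TCP/HTTP2 connection and one handshake.
    conn, err := grpc.Dial("index-gateway.example:9095",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    client := grpc_health_v1.NewHealthClient(conn)

    // The ten RPCs are multiplexed as streams on that single connection; with
    // MaxConcurrentStreams=1 on the server they queue on the established
    // connection rather than opening new connections that could time out.
    var wg sync.WaitGroup
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
            defer cancel()
            if _, err := client.Check(ctx, &grpc_health_v1.HealthCheckRequest{}); err != nil {
                log.Printf("request %d failed: %v", i, err)
            }
        }(i)
    }
    wg.Wait()
}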

bboreham commented 1 year ago

Also, thanks for the link to the Slack conversation https://grafana.slack.com/archives/CEPJRLQNL/p1665738667866879. In that conversation:

the timeout error sometimes appears right after the query starts, just in a few seconds.

I cannot imagine how this can be related to a 2-minute timeout.

Surely something is going wrong, but it will be necessary to debug exactly what, and then fix that. Suggest you move discussion to wherever that issue is filed, and I will close this PR.

liguozhong commented 1 year ago

@bboreham Hi, thank you for your suggestion. I will first investigate and debug this in detail in the Loki project, confirm the problem clearly, and then report back.

You can close this issue first.

Thank you so much for this week's review and the detailed explanation of the gRPC internals.