zalando / skipper

An HTTP router and reverse proxy for service composition, including use cases like Kubernetes Ingress
https://opensource.zalando.com/skipper/

Unavailable Redis ring blocks requests on ratelimited route during Skipper startup #1690

Closed AlexanderYastrebov closed 3 years ago

AlexanderYastrebov commented 3 years ago

Describe the bug

An unavailable Redis ring blocks requests on a ratelimited route during Skipper startup for around 10-20s, which leads to client timeouts waiting for a response.

Skipper stores ratelimiters in a registry keyed by the settings tuple; a ratelimiter is lazily initialized on the first request: https://github.com/zalando/skipper/blob/727ad0b18f6c8299b1835c02e33fcc66691291e9/ratelimit/registry.go#L70-L81 https://github.com/zalando/skipper/blob/727ad0b18f6c8299b1835c02e33fcc66691291e9/ratelimit/ratelimit.go#L422-L440 https://github.com/zalando/skipper/blob/727ad0b18f6c8299b1835c02e33fcc66691291e9/ratelimit/cluster.go#L15-L27
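The registry pattern can be sketched as follows (a minimal, simplified sketch: `Settings` and the string-valued limiter are stand-ins for Skipper's real types, not its actual code). The key point is that whatever the constructor returns for a settings tuple is cached permanently, and the first request for a tuple blocks on the constructor:

```go
package main

import (
	"fmt"
	"sync"
)

// Settings is a simplified stand-in for Skipper's ratelimit settings tuple.
type Settings struct {
	Group  string
	Max    int
	Window string
}

// Registry caches one limiter per settings tuple, created lazily on first use.
// Whatever create returns (including a void limiter) is remembered forever.
type Registry struct {
	mu     sync.Mutex
	lookup map[Settings]string // limiter type simplified to string for the sketch
}

func NewRegistry() *Registry {
	return &Registry{lookup: map[Settings]string{}}
}

func (r *Registry) Get(s Settings, create func(Settings) string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if rl, ok := r.lookup[s]; ok {
		return rl
	}
	rl := create(s) // the first request blocks here if create is slow
	r.lookup[s] = rl
	return rl
}

func main() {
	r := NewRegistry()
	calls := 0
	create := func(s Settings) string {
		calls++
		return "limiter-" + s.Group
	}
	s := Settings{Group: "group", Max: 10, Window: "1m"}
	fmt.Println(r.Get(s, create)) // first call invokes create
	fmt.Println(r.Get(s, create)) // cached: create is not invoked again
	fmt.Println(calls)            // 1
}
```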

Redis ratelimiter constructor performs ring availability check with a backoff: https://github.com/zalando/skipper/blob/727ad0b18f6c8299b1835c02e33fcc66691291e9/ratelimit/redis.go#L146-L152

The availability check takes 10-20s, after which a nil ratelimiter is returned, which leads to voidRatelimit being used for the settings tuple. Since the void ratelimiter is remembered in the registry for the given settings tuple permanently, ratelimiting is also disabled permanently until Skipper restarts, even if the Redis ring later becomes available.

To Reproduce

  1. Start Skipper but do not start Redis

    $  bin/skipper -inline-routes='* -> clusterRatelimit("group", 10, "1m") -> status(204) -> <shunt>' -enable-ratelimits -enable-swarm -swarm-redis-urls=localhost:6379
    ...
    redis: 2021/01/15 22:15:47 ring.go:320: ring shard state changed: Redis<localhost:6379 db:0> is down
  2. Perform client requests

    $ echo "GET http://localhost:9090/" | vegeta attack -duration=10s -timeout=1s -rate=10/s | vegeta report
    Requests      [total, rate, throughput]         100, 10.10, 0.00
    Duration      [total, attack, wait]             10.9s, 9.9s, 1s
    Latencies     [min, mean, 50, 90, 95, 99, max]  1s, 1s, 1s, 1.001s, 1.001s, 1.001s, 1.001s
    Bytes In      [total, mean]                     0, 0.00
    Bytes Out     [total, mean]                     0, 0.00
    Success       [ratio]                           0.00%
    Status Codes  [code:count]                      0:100  
    Error Set:
    Get "http://localhost:9090/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Note the 100% request failure (all requests were performed within the 10s before the Redis ring availability check finished) due to the configured client timeout. Note the Redis ring availability check in the Skipper logs:

[APP]INFO[0088] Failed to ping redis, retry with backoff: redis: all ring shards are down 
[APP]INFO[0088] Failed to ping redis, retry with backoff: redis: all ring shards are down 
[APP]INFO[0089] Failed to ping redis, retry with backoff: redis: all ring shards are down 
[APP]INFO[0090] Failed to ping redis, retry with backoff: redis: all ring shards are down 
[APP]INFO[0091] Failed to ping redis, retry with backoff: redis: all ring shards are down 
[APP]INFO[0092] Failed to ping redis, retry with backoff: redis: all ring shards are down 
[APP]INFO[0095] Failed to ping redis, retry with backoff: redis: all ring shards are down 
[APP]INFO[0099] Failed to ping redis, retry with backoff: redis: all ring shards are down 
[APP]ERRO[0099] Failed to connect to redis: redis: all ring shards are down 

Note client timeouts in Skipper logs:

127.0.0.1 - - [15/Jan/2021:22:17:14 +0100] "GET / HTTP/1.1" 499 0 "-" "Go-http-client/1.1" 11769 localhost:9090 - -
[APP]INFO[0099] Client request: context canceled             
127.0.0.1 - - [15/Jan/2021:22:17:14 +0100] "GET / HTTP/1.1" 499 0 "-" "Go-http-client/1.1" 11670 localhost:9090 - -
[APP]INFO[0099] Client request: context canceled             
  3. Start Redis

    $ docker run --rm -p 6379:6379 redis

    Note Redis ring becomes available in Skipper logs

    redis: 2021/01/15 22:21:08 ring.go:320: ring shard state changed: Redis<localhost:6379 db:0> is up
  4. Perform client requests

    $ echo "GET http://localhost:9090/" | vegeta attack -duration=10s -timeout=1s -rate=10/s | vegeta report
    Requests      [total, rate, throughput]         100, 10.10, 10.10
    Duration      [total, attack, wait]             9.901s, 9.9s, 761.683µs
    Latencies     [min, mean, 50, 90, 95, 99, max]  580.956µs, 857.938µs, 829.573µs, 931.892µs, 983.863µs, 1.923ms, 2.606ms
    Bytes In      [total, mean]                     0, 0.00
    Bytes Out     [total, mean]                     0, 0.00
    Success       [ratio]                           100.00%
    Status Codes  [code:count]                      204:100  
    Error Set:

    Note that 100% of requests pass regardless of the configured ratelimit.

Expected behavior

  1. Requests are not blocked when Redis ring is unavailable
  2. When Redis ring becomes available ratelimits apply

Observed behavior

See reproduction above.

AlexanderYastrebov commented 3 years ago

The ring availability check in the constructor does not bring much value, as the ring may go down at any time later, and AllowContext allows the request in that case: https://github.com/zalando/skipper/blob/727ad0b18f6c8299b1835c02e33fcc66691291e9/ratelimit/redis.go#L223

I think the solution may be simply to remove the ring availability check (or to move it into the registry constructor, which would then delay or prevent instance startup when the ring is unavailable).

szuecs commented 3 years ago

Great finding, but the PR should rather add this to the possible status checks IMO.

AlexanderYastrebov commented 3 years ago

> pr should better add this to possible status checks

The problem is that I don't see a good place for the ring availability check. Delaying or preventing Skipper startup when the ring is not available does not feel like a good idea; it rather looks like a cascading failure.

szuecs commented 3 years ago

@AlexanderYastrebov then we could break the ratelimit feature while doing an update: the update won't be blocked and nobody will see it until it is too late.

szuecs commented 3 years ago

Let's merge the PR. I am also refactoring out the Redis client; then we can do a non-lazy init of the client and initialize connectivity at startup.