Closed AlexanderYastrebov closed 3 years ago
Ring availability check in constructor does not bring much value as ring may go down anytime later and AllowContext
allows request in such case:
https://github.com/zalando/skipper/blob/727ad0b18f6c8299b1835c02e33fcc66691291e9/ratelimit/redis.go#L223
I think the solution may be just to remove ring availability check (or move it into registry constructor which would then delay or prevent instance startup when ring is unavailable).
Great finding, but the pr should better add this to possible status checks IMO.
pr should better add this to possible status checks
The problem is I don't see a good place for ring availability check. Delaying or preventing Skipper startup when ring is not available does not feel like a good idea but rather a cascading failure.
@AlexanderYastrebov then we could break the ratelimit feature while doing and update, update won't be blocked and nobody will see it until it is too late.
Let's merge the PR. I am also refactoring out the redis client and then we can do a non lazy init of the client and init connectivity at startup.
Describe the bug
Unavailable Redis ring blocks requests on ratelimited route during Skipper startup for an interval around 10-20 s which leads to clients timeout waiting for a response.
Skipper stores ratelimiters in a registry using settings tuple as a key and lazily initialized ratelimiter on the first request: https://github.com/zalando/skipper/blob/727ad0b18f6c8299b1835c02e33fcc66691291e9/ratelimit/registry.go#L70-L81 https://github.com/zalando/skipper/blob/727ad0b18f6c8299b1835c02e33fcc66691291e9/ratelimit/ratelimit.go#L422-L440 https://github.com/zalando/skipper/blob/727ad0b18f6c8299b1835c02e33fcc66691291e9/ratelimit/cluster.go#L15-L27
Redis ratelimiter constructor performs ring availability check with a backoff: https://github.com/zalando/skipper/blob/727ad0b18f6c8299b1835c02e33fcc66691291e9/ratelimit/redis.go#L146-L152
The availability check takes 10-20s after that
nil
ratelimiter is returned which leads tovoidRatelimit
being used for the settings tuple. Since void ratelimiter is remembered in the registry for the given settings tuple permanently - ratelimiting is disabled permanently as well until Skipper restart, even if Redis ring later becomes available.To Reproduce
Start Skipper but do not start Redis
Perform client requests
Note 100% request failure (all requests performed in 10s before Redis ring availability check finishes) due to configured client timeout. Note Redis ring availability check in Skipper logs:
Note client timeouts in Skipper logs:
Start Redis
Note Redis ring becomes available in Skipper logs
Perform client requests
Note 100% requests pass regardless of ratelimit configured.
Expected behavior
Observed behavior
See reproduction above.