AlexanderYastrebov closed 2 years ago
To reduce the probability of hitting this race, we could defer closing the queue for one route update cycle or for a pre-configured delay. The proper fix would be to close the queue only once its in-flight requests have completed, which probably requires https://github.com/zalando/skipper/issues/202
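To illustrate the "close on completion of in-flight requests" idea, here is a minimal sketch that assumes a reference-counted wrapper around the queue. The names `drainingQueue`, `acquire`, `release`, and `retire` are hypothetical and are not part of Skipper's scheduler or jobqueue packages; the sketch only shows the bookkeeping, not how it would be wired into route updates.

```go
package main

import "sync"

// drainingQueue is a hypothetical wrapper (not Skipper's API) that delays
// closing until every request holding a reference has finished.
type drainingQueue struct {
	mu       sync.Mutex
	inFlight int  // requests that matched the route and may still use the queue
	retired  bool // set when a route update no longer references the queue
	closed   bool // set when the underlying queue would actually be closed
}

// acquire registers a request that holds a reference to the queue.
// It is assumed to be called only while the route still exists.
func (q *drainingQueue) acquire() {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.inFlight++
}

// release unregisters a request and closes the queue if it was retired
// and this was the last in-flight request.
func (q *drainingQueue) release() {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.inFlight--
	if q.retired && q.inFlight == 0 {
		q.closed = true
	}
}

// retire is what a route update would call instead of closing directly.
// An idle queue is closed immediately; a busy one is closed by release.
func (q *drainingQueue) retire() {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.retired = true
	if q.inFlight == 0 {
		q.closed = true
	}
}

// isClosed reports whether the queue has been closed.
func (q *drainingQueue) isClosed() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.closed
}
```

With this bookkeeping, a request that calls `acquire` before the route is deleted keeps the queue open until its matching `release`, so it would not observe a closed queue. The cost is that every request has to be tracked from route match to completion, which is presumably why the proper fix depends on the linked issue.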
A simpler reproducer:
cat /tmp/lifo.eskip
r1: * -> backendLatency("30s") -> lifo(10, 10, "10s") -> status(204) -> <shunt>;
bin/skipper -routes-file=/tmp/lifo.eskip
curl -v localhost:9090
truncate --size=0 /tmp/lifo.eskip
[APP]INFO[0000] Expose metrics in codahale format
[APP]INFO[0000] support listener on :9911
[APP]INFO[0000] proxy listener on :9090
[APP]INFO[0000] TLS settings not found, defaulting to HTTP
[APP]INFO[0000] route settings, reset, route: r1: * -> backendLatency("30s") -> lifo(10, 10, "10s") -> status(204) -> <shunt>
[APP]INFO[0000] route settings received
[APP]INFO[0000] route settings applied
[APP]INFO[0009] route settings, update, deleted id: r1
[APP]INFO[0009] route settings received
[APP]INFO[0009] route settings applied
[APP]ERRO[0034] Unknown error for route based LIFO: queue closed for host localhost:9090
127.0.0.1 - - [11/Feb/2022:15:54:25 +0100] "GET / HTTP/1.1" 500 0 "-" "curl/7.58.0" 30001 localhost:9090 - -
and curl output
< HTTP/1.1 500 Internal Server Error
< Server: Skipper
< Date: Fri, 11 Feb 2022 14:54:55 GMT
< Transfer-Encoding: chunked
<
* Connection #0 to host localhost left intact
Mitigated by #1953
Describe the bug
Users observe 500 responses and the error "Unknown error for route based LIFO: queue closed for host" in Skipper logs. The error response is returned from https://github.com/zalando/skipper/blob/6b089867410bf4b9f56b4686450cf95301ac02c7/filters/scheduler/lifo.go#L295 and https://github.com/zalando/skipper/blob/6b089867410bf4b9f56b4686450cf95301ac02c7/filters/scheduler/lifo.go#L310-L313, and it originates in the jobqueue.
It seems that the queue gets closed by the scheduler post-processor on a route update: https://github.com/zalando/skipper/blob/6b089867410bf4b9f56b4686450cf95301ac02c7/scheduler/scheduler.go#L311-L318 while in-flight requests still try to get a slot from the queue (i.e. there is a race condition between ongoing requests and the lifo scheduler post-processor).
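The suspected race can be sketched in isolation. The toy `queue` below is an assumption made for illustration and is not Skipper's jobqueue; `errQueueClosed`, `newQueue`, and `simulateRace` are hypothetical names. Channels force the ordering seen in the logs: the request matches the route first, the route update closes the queue, and only then does the request ask for a slot.

```go
package main

import (
	"errors"
	"sync"
)

// errQueueClosed is a hypothetical stand-in for the jobqueue's closed-queue error.
var errQueueClosed = errors.New("queue closed")

// queue is a toy bounded queue used only to illustrate the race.
type queue struct {
	mu     sync.Mutex
	closed bool
	slots  chan struct{}
}

func newQueue(maxConcurrency int) *queue {
	return &queue{slots: make(chan struct{}, maxConcurrency)}
}

// wait acquires a slot and returns a function that frees it,
// or fails if the queue has already been closed.
func (q *queue) wait() (func(), error) {
	q.mu.Lock()
	closed := q.closed
	q.mu.Unlock()
	if closed {
		return nil, errQueueClosed
	}
	q.slots <- struct{}{}
	return func() { <-q.slots }, nil
}

// close marks the queue as closed, as a route update would for a deleted route.
func (q *queue) close() {
	q.mu.Lock()
	q.closed = true
	q.mu.Unlock()
}

// simulateRace returns the error seen by a request that matched the route
// before the update but reached the queue only after it was closed.
func simulateRace() error {
	q := newQueue(10)
	routeMatched := make(chan struct{})
	routeDeleted := make(chan struct{})
	result := make(chan error, 1)

	go func() {
		close(routeMatched) // the request now holds a reference to q
		<-routeDeleted      // stands in for the delay before the lifo filter runs
		done, err := q.wait()
		if err == nil {
			done()
		}
		result <- err
	}()

	<-routeMatched
	q.close() // the route update removes the route and closes its queue
	close(routeDeleted)
	return <-result
}
```

Calling `simulateRace()` returns `errQueueClosed`, which corresponds to the request being answered with a 500 even though it was accepted while the route still existed.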
existingKeys contains lifo filter names that are derived from the route id: https://github.com/zalando/skipper/blob/6b089867410bf4b9f56b4686450cf95301ac02c7/scheduler/scheduler.go#L248-L249
To Reproduce
and 500s in vegeta report