zalando / skipper

An HTTP router and reverse proxy for service composition, including use cases like Kubernetes Ingress
https://opensource.zalando.com/skipper/
Other
3.11k stars 350 forks source link

Unknown error for route based LIFO: queue closed for host #1851

Closed AlexanderYastrebov closed 2 years ago

AlexanderYastrebov commented 3 years ago

Describe the bug

Users observe 500 response errors and Unknown error for route based LIFO: queue closed for host error in Skipper logs.

The error response is returned from https://github.com/zalando/skipper/blob/6b089867410bf4b9f56b4686450cf95301ac02c7/filters/scheduler/lifo.go#L295 https://github.com/zalando/skipper/blob/6b089867410bf4b9f56b4686450cf95301ac02c7/filters/scheduler/lifo.go#L310-L313 and originates in the jobqueue.

It seems that queue got closed by scheduler post-processor on route update: https://github.com/zalando/skipper/blob/6b089867410bf4b9f56b4686450cf95301ac02c7/scheduler/scheduler.go#L311-L318 while in-flight requests try to get slot from the queue (i.e. a race condition between ongoing requests and lifo scheduler post-processor).

existingKeys contains lifo filter names that are derived from route id: https://github.com/zalando/skipper/blob/6b089867410bf4b9f56b4686450cf95301ac02c7/scheduler/scheduler.go#L248-L249

To Reproduce

  1. Create a routes file and constantly update route id
    $ for i in {1..1000}; do echo 'route_id'$i': Path("/") -> lifo(1000, 1000, "10s") -> status(204) -> <shunt>' > /tmp/lifo.eskip; sleep 3; done
  2. Run Skipper
    bin/skipper -routes-file=/tmp/lifo.eskip -access-log-disabled
  3. Run vegeta
    echo "GET http://localhost:9090/" | vegeta attack -rate 1000/s | vegeta report -every 1s
  4. Observe occasional lifo errors in Skipper logs
    [APP]INFO[0117] route settings, update, route: route_id20: Path("/") -> lifo(1000, 1000, "10s") -> status(204) -> <shunt> 
    [APP]INFO[0117] route settings, update, deleted id: route_id19 
    [APP]INFO[0117] route settings received                      
    [APP]INFO[0117] route settings applied                       
    [APP]ERRO[0117] Unknown error for route based LIFO: queue closed for host localhost:9090 

    and 500s in vegeta report

    Requests      [total, rate, throughput]  13006, 1000.02, 999.90
    Duration      [total, attack, wait]      13.006326038s, 13.005685861s, 640.177µs
    Latencies     [mean, 50, 95, 99, max]    642.38µs, 563.278µs, 760.017µs, 1.932213ms, 20.776012ms
    Bytes In      [total, mean]              0, 0.00
    Bytes Out     [total, mean]              0, 0.00
    Success       [ratio]                    99.99%
    Status Codes  [code:count]               204:13005  500:1  
    Error Set:
    500 Internal Server Error
AlexanderYastrebov commented 2 years ago

To reduce probability we may defer queue closing for one route update cycle or a pre-configured delay. The proper fix would be to close queue on completion of in-flight requests which probably requires https://github.com/zalando/skipper/issues/202

AlexanderYastrebov commented 2 years ago

A simpler reproducer:

  1. create a route with an artificial delay before lifo filter:
    cat /tmp/lifo.eskip 
    r1: * -> backendLatency("30s") -> lifo(10, 10, "10s") -> status(204) -> <shunt>;
  2. run Skipper
    bin/skipper -routes-file=/tmp/lifo.eskip
  3. issues a request
    curl -v localhost:9090
  4. delete route
    truncate --size=0 /tmp/lifo.eskip
  5. Observe error in Skipper logs
    [APP]INFO[0000] Expose metrics in codahale format            
    [APP]INFO[0000] support listener on :9911                    
    [APP]INFO[0000] proxy listener on :9090                      
    [APP]INFO[0000] TLS settings not found, defaulting to HTTP   
    [APP]INFO[0000] route settings, reset, route: r1: * -> backendLatency("30s") -> lifo(10, 10, "10s") -> status(204) -> <shunt> 
    [APP]INFO[0000] route settings received                      
    [APP]INFO[0000] route settings applied                       
    [APP]INFO[0009] route settings, update, deleted id: r1       
    [APP]INFO[0009] route settings received                      
    [APP]INFO[0009] route settings applied                       
    [APP]ERRO[0034] Unknown error for route based LIFO: queue closed for host localhost:9090 
    127.0.0.1 - - [11/Feb/2022:15:54:25 +0100] "GET / HTTP/1.1" 500 0 "-" "curl/7.58.0" 30001 localhost:9090 - -

    and curl output

    < HTTP/1.1 500 Internal Server Error
    < Server: Skipper
    < Date: Fri, 11 Feb 2022 14:54:55 GMT
    < Transfer-Encoding: chunked
    < 
    * Connection #0 to host localhost left intact
AlexanderYastrebov commented 2 years ago

Mitigated by #1953