yesodweb / wai

Haskell Web Application Interface
MIT License
834 stars 263 forks source link

Gzip middleware leads to warp "Client closed connection prematurely" #997

Open pbrisbin opened 5 months ago

pbrisbin commented 5 months ago

We have an app using WAI, Warp, and hs-opetelemetry-instrumentation-wai and -yesod.

Ever since making this change:

Our open-telemetry tracing reports a high number of 200s with the error:

InvalidRequest: Warp: Client closed connection prematurely

new versions increasing error-rate

increase in "200" errors

As far as we can tell, the client of such requests receives a 200 (green in this trace) but the server continues to run beyond then (orange) and reports the error.

It's as if the response is returned, but then the request handling in the service doesn't end. It would make sense the LB (satisfied with the 200 it has already forwarded along) closes the connection, and I guess that would generate this error... but why? And why only once we started on-the-fly gzipping?

Since clients are getting 200s, there is no actual negative effects here, but the errors are confusing too see and contribute to our error-rate metrics causing false-positive alerts.

pbrisbin commented 4 months ago

I did a little more digging in the code. It seems the function in question is:

compressE
    :: Response
    -> (Response -> IO ResponseReceived)
    -> IO ResponseReceived
compressE res sendResponse =
    wb $ \body -> sendResponse $
        responseStream s (fixHeaders hs) $ \sendChunk flush -> do
            (blazeRecv, _) <- B.newBuilderRecv B.defaultStrategy
            deflate <- Z.initDeflate 1 (Z.WindowBits 31)
            let sendBuilder builder = do
                    popper <- blazeRecv builder
                    fix $ \loop -> do
                        bs <- popper
                        unless (S.null bs) $ do
                            sendBS bs
                            loop
                sendBS bs = Z.feedDeflate deflate bs >>= deflatePopper
                flushBuilder = do
                    sendBuilder Blaze.flush
                    deflatePopper $ Z.flushDeflate deflate
                    flush
                deflatePopper popper = fix $ \loop -> do
                    result <- popper
                    case result of
                        Z.PRDone -> return ()
                        Z.PRNext bs' -> do
                            sendChunk $ byteString bs'
                            loop
                        Z.PRError e -> throwIO e

            body sendBuilder flushBuilder
            sendBuilder Blaze.flush
            deflatePopper $ Z.finishDeflate deflate
  where
    (s, hs, wb) = responseToStream res

So it seems enabling compression changes the response to streaming-fashion. This could explain how the ALB is getting a complete 200 response before this code is finished whatever it's doing. If it decides to close the connection at that point, I think we have our answer. I don't really know how to fix it though.

pbrisbin commented 4 months ago

I wonder if using GzipCacheFolder (which is just GzipCompress but caching things in some folder) would effectively work around the issue. Using that construct won't switch the response to streaming and, once any given response has been compressed/cached, it'll just be serving the file.

pbrisbin commented 4 months ago

Update: GipCacheFolder does work much better. But it has introduced different errors that seem like a similar situation (200 response with an exception):

IOException: Network.SendFile.Linux.sendfileloop: resource vanished (Connection reset by peer)

They happen much less frequently though, and seem limited to specific kinds of requests (large responses, and from only frontend/JS clients).