pbrisbin opened 5 months ago
I did a little more digging in the code. It seems the function in question is:
```haskell
compressE :: Response
          -> (Response -> IO ResponseReceived)
          -> IO ResponseReceived
compressE res sendResponse =
    wb $ \body -> sendResponse $
        responseStream s (fixHeaders hs) $ \sendChunk flush -> do
            (blazeRecv, _) <- B.newBuilderRecv B.defaultStrategy
            deflate <- Z.initDeflate 1 (Z.WindowBits 31)
            let sendBuilder builder = do
                    popper <- blazeRecv builder
                    fix $ \loop -> do
                        bs <- popper
                        unless (S.null bs) $ do
                            sendBS bs
                            loop
                sendBS bs = Z.feedDeflate deflate bs >>= deflatePopper
                flushBuilder = do
                    sendBuilder Blaze.flush
                    deflatePopper $ Z.flushDeflate deflate
                    flush
                deflatePopper popper = fix $ \loop -> do
                    result <- popper
                    case result of
                        Z.PRDone -> return ()
                        Z.PRNext bs' -> do
                            sendChunk $ byteString bs'
                            loop
                        Z.PRError e -> throwIO e

            body sendBuilder flushBuilder
            sendBuilder Blaze.flush
            deflatePopper $ Z.finishDeflate deflate
  where
    (s, hs, wb) = responseToStream res
```
So it seems enabling compression changes the response into a streaming response. That could explain how the ALB is getting a complete 200 before this code has finished whatever it's doing: with a streaming response, the status and headers go out before the body is done. If the ALB decides to close the connection at that point, I think we have our answer. I don't really know how to fix it, though.
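To illustrate what I mean by streaming-fashion, here's a minimal standalone sketch of the kind of response the middleware ends up producing (not the middleware code itself; the handler and chunks are made up):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString.Builder (byteString)
import Network.HTTP.Types (status200)
import Network.Wai (Application, responseStream)

-- With responseStream there is no Content-Length: Warp sends the status
-- line and headers first, then streams the body chunk by chunk, so the
-- 200 reaches the ALB before the body (or any trailing work) is finished.
streamingApp :: Application
streamingApp _req respond =
    respond $ responseStream status200 [("Content-Type", "text/plain")] $
        \sendChunk flush -> do
            sendChunk (byteString "first chunk")
            flush
            sendChunk (byteString "second chunk")
```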
I wonder if using GzipCacheFolder (which is just GzipCompress, but caching things in some folder) would effectively work around the issue. Using that construct won't switch the response to streaming and, once any given response has been compressed/cached, it'll just be serving the file.
Update: GzipCacheFolder does work much better. But it has introduced different errors that seem like a similar situation (a 200 response paired with an exception):
```
IOException: Network.SendFile.Linux.sendfileloop: resource vanished (Connection reset by peer)
```
They happen much less frequently, though, and seem limited to specific kinds of requests (large responses, and only from frontend/JS clients).
We have an app using WAI, Warp, and hs-opentelemetry-instrumentation-wai and -yesod.
Ever since making this change:
Our open-telemetry tracing reports a high number of 200s with the error:
As far as we can tell, the client of such a request receives a 200 (green in this trace), but the server continues running past that point (orange) and reports the error.
It's as if the response is returned, but the request handling in the service doesn't end there. It would make sense that the LB (satisfied with the 200 it has already forwarded along) closes the connection, and I guess that would generate this error... but why? And why only once we started on-the-fly gzipping?
Since clients are getting 200s, there are no actual negative effects here, but the errors are confusing to see, and they contribute to our error-rate metrics, causing false-positive alerts.