Warp exits after accepting too many simultaneous connections on Linux

cuklev commented 3 years ago

Might be related to #603

Here is some sample code:

{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Control.Concurrent
import Data.IORef
import Network.HTTP.Types
import Network.Wai
import Network.Wai.Handler.Warp

main :: IO ()
main = do
  counter <- newIORef (0 :: Int) -- just for keeping count of accepted connections
  run 3003 $ \_ respond -> do
    print =<< atomicModifyIORef' counter (\x -> (x+1, x+1))
    threadDelay 3000000 -- simulate something that is slow to process
    print =<< atomicModifyIORef' counter (\x -> (x-1, x-1))
    respond $ responseLBS status200 [] "works\n"

When I run something like while :; do curl -s http://localhost:3003 > /dev/null & done, the Haskell program fails with Network.Socket.accept: resource exhausted (Too many open files) and then exits successfully once all connections have closed. For me it always happens after printing 1011. This is because each accepted connection is a new open file, and there is a per-process limit on open files. On my system this limit seems to be 1024 (it can be seen or changed with ulimit -Sn).

I am not sure how this should be solved. Should warp stop accepting connections when too many are already open? Should accept be allowed to fail and then be retried? Should the server respond with something like 429 Too Many Requests?

snoyberg commented 3 years ago

The server can't respond with a 429 in that case, since it cannot accept the new connection at all. I'd strongly advise bumping the FD limit; 1024 is far too low for a busy server.
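For reference, the limit can also be raised from inside the process at startup instead of via ulimit. This is only a rough sketch, assuming the unix package's System.Posix.Resource module; bumpSoft, raiseOpenFileLimit and the wanted target are hypothetical names for illustration, not anything warp provides.

```haskell
import System.Posix.Resource
  ( Resource (ResourceOpenFiles)
  , ResourceLimit (..)
  , ResourceLimits (..)
  , getResourceLimit
  , setResourceLimit
  )

-- | Hypothetical pure helper: compute new limits with the soft limit raised
-- towards 'wanted', capped by the hard limit, and never lowered.
bumpSoft :: Integer -> ResourceLimits -> ResourceLimits
bumpSoft wanted lims = lims { softLimit = newSoft }
  where
    -- An unprivileged process cannot exceed the hard limit.
    capped = case hardLimit lims of
      ResourceLimit h -> min wanted h
      _               -> wanted -- infinite or unknown hard limit
    newSoft = case softLimit lims of
      ResourceLimit s | s >= capped -> ResourceLimit s -- already high enough
      ResourceLimitInfinity         -> ResourceLimitInfinity
      _                             -> ResourceLimit capped

-- | Hypothetical startup action: raise the open-file soft limit and return
-- the limits that are in effect afterwards. setResourceLimit throws if the
-- kernel rejects the request.
raiseOpenFileLimit :: Integer -> IO ResourceLimits
raiseOpenFileLimit wanted = do
  current <- getResourceLimit ResourceOpenFiles
  setResourceLimit ResourceOpenFiles (bumpSoft wanted current)
  getResourceLimit ResourceOpenFiles
```

Calling something like raiseOpenFileLimit 65536 before run would be one way to use it. It only moves the ceiling; the process still fails the same way once the new limit is reached.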

cuklev commented 3 years ago

Well, it is not necessarily a busy server. It could be just someone trying to abuse it. In my case, I was surprised that my server process exited. I feel like bumping the FD limit is only a temporary solution.

swamp-agr commented 3 years ago

@cuklev Could you please provide client-side code you're invoking?

cuklev commented 3 years ago

while :; do curl -s http://localhost:3003 > /dev/null & done in bash.

swamp-agr commented 3 years ago

It seems that you're running out of sockets/FDs. ulimit -Sn shows the current soft limit on FDs.

The application cannot allocate more than ulimit -Sn sockets and simply stops responding, since you're forcing it to wait 3 seconds on every single request. Warp throws an error because it cannot allocate more.

I do not know what the best strategy for tolerating socket exhaustion is here. One option is to add an allocation counter with a threshold and/or a queue: once the threshold is reached, new responses are scheduled into the queue and processed separately.
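To make the counter-and-threshold idea concrete, here is a minimal sketch written as a WAI middleware. It is an assumption about how such a counter could look, not something warp does: admit and limitInFlight are hypothetical names, there is no queue, and it counts requests inside handlers rather than open sockets, so idle keep-alive connections still hold FDs. It also cannot help once accept itself has failed.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Exception (finally, mask)
import Data.IORef (atomicModifyIORef', newIORef)
import Network.HTTP.Types (status429)
import Network.Wai (Middleware, responseLBS)

-- | Hypothetical pure step: given the threshold and the current count,
-- return the new count and whether the request is admitted.
admit :: Int -> Int -> (Int, Bool)
admit maxInFlight n
  | n < maxInFlight = (n + 1, True)
  | otherwise       = (n, False)

-- | Hypothetical middleware: at most 'maxInFlight' requests run the wrapped
-- application at once; the rest get an immediate 429.
limitInFlight :: Int -> IO Middleware
limitInFlight maxInFlight = do
  inFlight <- newIORef (0 :: Int)
  let acquire = atomicModifyIORef' inFlight (admit maxInFlight)
      release = atomicModifyIORef' inFlight (\n -> (n - 1, ()))
  -- mask keeps an async exception from landing between acquire and the
  -- installation of the release handler.
  pure $ \app req respond -> mask $ \restore -> do
    admitted <- acquire
    if admitted
      then restore (app req respond) `finally` release
      else restore (respond (responseLBS status429 [] "too many requests\n"))
```

With the sample server above, one would build the middleware once (for example with a threshold somewhat below the FD limit) and pass the wrapped application to run. Rejected requests return quickly, so their sockets are freed instead of being held for 3 seconds.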

For now, you could set soft/hard limits per user/application at the system level, based on the RPS you expect from clients or the proxy.

cuklev commented 3 years ago

Yes, increasing the FD limit will improve the situation, but it will not solve it. Warp should definitely catch that error and not exit. I tested the same setup with nginx in the middle, using proxy_pass to the Haskell server. In that case, my application never crashes. Nginx responds with 500 for half of the requests.

swamp-agr commented 3 years ago

curl

Consider the curl case for simplicity.

So, the application is listening on port 3003 (1 FD).
It is trying to accept incoming connections from a lot of curl processes.
According to curl's defaults, each "client" will wait up to 60 seconds for the server to accept and up to 300 seconds to connect.
Once both have happened, it will wait indefinitely for a response from the application.
I.e. neither curl nor the application will close the socket until the response has been sent by the application and delivered to curl.
All 1023 available sockets/FDs will soon be exhausted.
In this case, the application will throw something like Network.Socket.accept: resource exhausted (No file descriptors available).

Given the current warp implementation, there should be a proper design fix for connections leaking while they are being accepted. I am currently investigating the leaking side of the story.

Let's return to nginx.

NGINX

With nginx there are a lot of variables that should be taken into account:

nginx soft/hard limits;
workers parameters;
different timeout parameters;
nginx server parameters;
(multiple) nginx site configuration(s);
application soft/hard limits;
sysctl TCP/IP/socket parameters.

NGINX + Warp + /etc/sysctl.conf should be configured extremely carefully: no combination of the parameters mentioned above should contradict another.

E.g. decreasing proxy_read_timeout and proxy_send_timeout on the NGINX side could fix warp availability in a particular use case. Another example is removing keepalive from your upstream configuration, which could help in a different use case.

Vlix commented 2 years ago

I think it should be possible to not let the application crash, and just print to stdout/stderr that no file descriptors were available, and just continue with the loop?

The Network.Socket error is just an IOError with OtherError and a string, so it should be easy, although pretty frail, so let's hope Network.Socket doesn't change its exception's syntax 🙃
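A rough sketch of the catch-and-continue idea, assuming a plain Network.Socket accept call rather than warp's actual accept loop. retryOnExhaustion and acceptRetrying are hypothetical helpers. Instead of matching on the message string, this version checks the error type with isFullError from System.IO.Error, on the assumption that the "resource exhausted" prefix in the reported message means the error is tagged ResourceExhausted; that is worth verifying before relying on it.

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (try)
import Network.Socket (SockAddr, Socket, accept)
import System.IO (hPutStrLn, stderr)
import System.IO.Error (isFullError)

-- | Hypothetical helper: run an action, and if it fails with a
-- "resource exhausted" IOError, log to stderr, wait, and try again.
-- Any other IOError is rethrown unchanged.
retryOnExhaustion :: Int -> IO a -> IO a
retryOnExhaustion delayMicros action = loop
  where
    loop = do
      result <- try action
      case result of
        Right x -> pure x
        Left e
          | isFullError e -> do
              hPutStrLn stderr ("resource exhausted, retrying: " ++ show e)
              threadDelay delayMicros -- give open connections time to close
              loop
          | otherwise -> ioError e

-- | Hypothetical wrapper around accept: keeps the loop alive when no file
-- descriptors are available. The 100 ms delay is an arbitrary choice.
acceptRetrying :: Socket -> IO (Socket, SockAddr)
acceptRetrying listener = retryOnExhaustion 100000 (accept listener)
```

While the loop is waiting, pending connections stay in the kernel's listen backlog, so clients see latency or timeouts rather than a dead server. The delay matters because retrying immediately would spin until a descriptor is freed.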

yesodweb / wai

Warp exits after accepting too many simultaneous connections on Linux #825