Closed by snormore 1 year ago
Another burst of `too many open files` during a deploy, without any `no such host` errors preceding it: logs
I guess one way this could be happening without a real fd leak is if pgdriver doesn't have a connection pool. During connectivity blips, like during deploys or yesterday when we were getting `no such host` from the DBs, clients hammer `/query` with retries, and each retry opens a new connection, pushing the number of open fds beyond the default limit of 1024. If that's the case, then we should (1) increase the fd limits, and (2) see if we can get a connection pool configured in pgdriver. But it still seems like the fds should have been released in that case, unless it all happens very quickly/concurrently 🤔
Worth noting too that it seems to only be happening on prod, and not dev
Found a metric for open fds in DD:
We can see it peaked at around 1k a couple of days ago, which is what we'd expect from the 1024 limit I guess, and it has been increasing gradually for the past month. It's been around 700-800 since then, but we've still seen "too many open files" errors in that period, so it's a little weird that the metric isn't closer to 1k there.
Increasing fd limit on the ECS containers seems like the next thing to do.
During incident-5, the node containers were erroring with `too many open files` after being unable to connect to the DBs for a while with `no such host` errors. We should figure out what's leaking file descriptors and under what conditions, and fix it.