xmtp / xmtp-node-go

Software for the nodes that currently form the XMTP network
MIT License

Intermittent too many open files errors from db requests #248

Closed snormore closed 1 year ago

snormore commented 1 year ago

During incident-5, the node containers were erroring with "too many open files" after being unable to connect to the DBs for a while with "no such host" errors. We should figure out what's leaking file descriptors, and under what conditions, and fix it.

snormore commented 1 year ago

Another burst of "too many open files" during a deploy, without any "no such host" errors preceding it: logs

snormore commented 1 year ago

I guess one way this could be happening without a real fd leak is if pgdriver doesn't have a connection pool. During connectivity blips/issues, like during deploys or yesterday when we were getting "no such host" from the DBs, clients would hammer /query with retries and push the number of open fds beyond the default limit of 1024. If that's the case, then we should (1) increase the fd limits, and (2) see if we can get a connection pool configured in pgdriver. But it still seems like the fds should have been released in that case, unless it all happens very quickly/concurrently 🤔

Worth noting too that it seems to only be happening on prod, and not dev

snormore commented 1 year ago

Another burst during the latest deploy: logs

snormore commented 1 year ago

Found a metric for open fds in DD:

[image: graph of the open fds metric in DD]

We can see it peaked at around 1k a couple of days ago, which is what we'd expect from the 1024 limit I guess, and it has been increasing gradually for the past month. It's been around 700-800 since then, but we've seen "too many open files" errors in that period, so it's a little weird that we're not seeing it closer to 1k there.

Increasing the fd limit on the ECS containers seems like the next thing to do.