xmtp / xmtp-node-go

Software for the nodes that currently form the XMTP network
MIT License

Intermittent i/o timeout errors from db requests #246

Closed snormore closed 1 year ago

snormore commented 1 year ago

We're seeing intermittent i/o timeout errors from interactions with the DB (logs): about 2k on Query in the past week and ~700 on storing messages over the same period, 600 of which occurred during a writer restart initiated by AWS last week.

We should figure out why these are happening and resolve them. For the writer restart scenario, we might want to consider adding some inline retries so that the whole API request doesn't fail.

neekolas commented 1 year ago

I wonder how much the inline retries will do if the writer instance is hard-down for a period of time. Maybe we wait 100ms and then retry, and hope that by then the load balancer is pointing at a different instance?

snormore commented 1 year ago

Yea, something like that could be reasonable. We'd probably want to stop retrying before the API request latency grows too much, maybe after a few seconds or so, and just let the client do their retries after that if needed. A few retries could help with the quick writer restart/failover scenario without failing the whole request, at the cost of a bit more request latency. We should see how that plays out on dev though, and how quickly the restarts or failovers happen.

neekolas commented 1 year ago

Worth noting that pointing the Query API at the read replicas would protect against the query failures, since there would always be 2 replicas to choose from.

snormore commented 1 year ago

We can try that to start and see how it goes. It would remove the atomicity of https://github.com/xmtp/xmtp-node-go/pull/243 but I guess it's probably inevitable anyway.

snormore commented 1 year ago

While setting up an alert in DD for i/o timeout errors, I noticed that we're also seeing them in the notifications-api (logs).

snormore commented 1 year ago

There's a reference to "i/o timeout errors after a connection reset" in the bun repo: https://github.com/uptrace/bun/issues/312. There isn't much useful discussion around it though, just "if you suspect pgdriver, try switching to pgx".

There's some discussion in a pgx issue too: https://github.com/jackc/pgx/issues/831

It's not obvious that this is directly related to DB-side issues like restarts or failovers, though.