Closed: snormore closed this issue 1 year ago
I wonder how much the inline retries will help if the writer instance is hard-down for a period of time. Maybe we wait 100ms and then retry, and hope that by then the load balancer is pointing at a different instance?
Yeah, something like that could be reasonable. We probably want to stop retrying before the API request latency grows too much, maybe after a few seconds at most, and let the client do its own retries after that if needed. A few inline retries could help with the quick writer restart/failover scenario without failing the whole request, at the cost of a slightly slower request. We should see how that plays out on dev though, i.e. how quickly the restarts or failovers actually happen.
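For concreteness, a minimal sketch of what the inline retries could look like. The `withRetries` wrapper, the backoff, and the 2s budget are all illustrative assumptions, not the actual store API:

```go
package store // illustrative placement

import (
	"context"
	"time"
)

// withRetries is a hypothetical wrapper: retry a DB operation a few times with
// a doubling backoff, capping the extra latency at roughly a couple of seconds
// before giving up and letting the client retry the whole request.
func withRetries(ctx context.Context, op func(ctx context.Context) error) error {
	backoff := 100 * time.Millisecond
	deadline := time.Now().Add(2 * time.Second)
	for {
		err := op(ctx)
		if err == nil {
			return nil
		}
		// Stop once the retry budget is exhausted or the request context is done.
		if time.Now().Add(backoff).After(deadline) || ctx.Err() != nil {
			return err
		}
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
	}
}
```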
Worth noting that pointing the Query API at the read replicas would protect against the query failures, since there would always be 2 replicas to choose from.
We can try that to start and see how it goes. It would remove the atomicity of https://github.com/xmtp/xmtp-node-go/pull/243, but I guess that's probably inevitable anyway.
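A rough sketch of what that routing could look like, assuming separate writer and reader DSNs (the env var names and `openDB` helper are made up) and the pgdriver connector we use today:

```go
package store // illustrative placement

import (
	"database/sql"
	"os"

	"github.com/uptrace/bun"
	"github.com/uptrace/bun/dialect/pgdialect"
	"github.com/uptrace/bun/driver/pgdriver"
)

// openDB is a hypothetical helper, not an existing function in the repo.
func openDB(dsn string) *bun.DB {
	sqldb := sql.OpenDB(pgdriver.NewConnector(pgdriver.WithDSN(dsn)))
	return bun.NewDB(sqldb, pgdialect.New())
}

// Writes keep going to the writer endpoint; Query API reads go to the cluster
// reader endpoint, which spreads load across the two replicas.
var (
	writerDB = openDB(os.Getenv("WRITER_DSN"))
	readerDB = openDB(os.Getenv("READER_DSN"))
)
```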
Was setting up an alert in DD for `i/o timeout` errors and noticed that we're also seeing them in the notifications-api (logs).
There's a reference to "i/o timeout errors after a connection reset" in the bun repo (https://github.com/uptrace/bun/issues/312), but not much useful discussion around it, just "if you suspect pgdriver, try switching to pgx".
There's some discussion in a pgx issue too: https://github.com/jackc/pgx/issues/831
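In case we want to test that suggestion, here's a minimal sketch of swapping pgdriver for pgx's database/sql driver under bun (the `openWithPgx` helper and `dsn` are placeholders, not code that exists in the repo):

```go
package store // illustrative placement

import (
	"database/sql"

	_ "github.com/jackc/pgx/v4/stdlib" // registers the "pgx" database/sql driver
	"github.com/uptrace/bun"
	"github.com/uptrace/bun/dialect/pgdialect"
)

// openWithPgx is a hypothetical helper showing the driver swap only.
func openWithPgx(dsn string) (*bun.DB, error) {
	sqldb, err := sql.Open("pgx", dsn)
	if err != nil {
		return nil, err
	}
	return bun.NewDB(sqldb, pgdialect.New()), nil
}
```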
It's not obvious that this is directly related to DB-side events like restarts or failovers, though.
We're seeing intermittent `i/o timeout` errors from interactions with the DB (logs): about 2k of them on `Query` in the past week, and ~700 of them on `storing message` over the same period, with 600 of those occurring during a writer restart initiated by AWS last week. We should figure out why these are happening and resolve them. For the writer restart scenario, we might want to consider adding some inline retries to avoid failing the whole API request.
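If we do add inline retries, a hedged sketch of how we might decide which errors are worth retrying, so we only retry transient network failures like these `i/o timeout`s rather than ordinary query errors (`isTransient` is a hypothetical helper):

```go
package store // illustrative placement

import (
	"database/sql/driver"
	"errors"
	"net"
)

// isTransient reports whether an error looks like a transient network or
// connection failure that a quick inline retry could plausibly fix.
func isTransient(err error) bool {
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true // e.g. "i/o timeout"
	}
	// The pool surfaces driver.ErrBadConn when a connection died underneath us.
	return errors.Is(err, driver.ErrBadConn)
}
```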