Closed — snormore closed this issue 1 year ago
How many IPs are these spikes coming from? If it's just a few, I like the idea of banning them for a day and seeing who pops their head up to complain.
But yes, we should also implement connection pooling to keep the number of connections from spiking with traffic. Given that we have a small number of nodes, I don't think PgBouncer is needed; client-side pooling should get the job done.
Yea, looks like it's probably from the same IP:
With the requests being mostly queries:
The API errors from that chart are all io timeouts against the DB, as you might've guessed:
rpc error: code = Internal desc = read tcp 10.0.163.161:43626->10.0.133.153:5432: i/o timeout
The IP breakdown screenshot I posted before was the e2e runner, and not LB logs. Here's what it looks like for the LB logs:
The top 3 are DigitalOcean-owned/managed IPs afaics
Interesting. So we have a few big consumers, not just one. We could block the top 3?
Seems to be happening every day at the same time:
The query response time seems to be gradually increasing too which is a bit concerning:
It's interesting that RDS commit throughput seems to spike at the same time too:
I don't see an equivalent spike in publishes at that time, so not sure what that commit throughput is from exactly. I wonder if it's a daily vacuum or something, but I wouldn't expect that to happen twice daily within the same 2 hour period.
It seems to start at 6am ET every day, and then another spike at 8am ET every day.
This is probably a red herring, but it seems the most CPU-intensive queries in these spikes have to do with wallet 0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020 (I checked the last three days). I wonder what wallet that is.
Hm, the plot thickens: it looks like the wallet has 2000 invites, and their timestamps seem to coincide with the CPU spikes. And many of them are self-invites, which is odd:
│ 1990 │ 2023-05-24T11:02:19.930Z │ '0xa35F…e020' │ '0xa35F…e020' │
│ 1991 │ 2023-05-24T11:02:21.277Z │ '0xa35F…e020' │ '0xa35F…e020' │
│ 1992 │ 2023-05-24T11:02:21.444Z │ '0xa35F…e020' │ '0xa35F…e020' │
│ 1993 │ 2023-05-24T11:02:24.415Z │ '0xa35F…e020' │ '0xa35F…e020' │
│ 1994 │ 2023-05-24T11:02:24.868Z │ '0xa35F…e020' │ '0xa35F…e020' │
│ 1995 │ 2023-05-24T11:02:26.298Z │ '0xa35F…e020' │ '0xa35F…e020' │
│ 1996 │ 2023-05-24T11:04:06.984Z │ '0xa35F…e020' │ '0xa35F…e020' │
│ 1997 │ 2023-05-24T12:18:21.064Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│ 1998 │ 2023-05-24T12:18:22.778Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│ 1999 │ 2023-05-24T12:18:22.858Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│ 2000 │ 2023-05-24T12:18:45.451Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│ 2001 │ 2023-05-24T12:19:00.964Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│ 2002 │ 2023-05-24T12:19:04.748Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│ 2003 │ 2023-05-24T12:45:40.479Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│ 2004 │ 2023-05-24T12:45:47.801Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│ 2005 │ 2023-05-24T12:45:51.484Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│ 2006 │ 2023-05-25T11:08:06.333Z │ '0xa35F…e020' │ '0xa35F…e020' │
│ 2007 │ 2023-05-25T11:08:06.350Z │ '0xa35F…e020' │ '0xa35F…e020' │
│ 2008 │ 2023-05-25T11:10:54.655Z │ '0xa35F…e020' │ '0xa35F…e020' │
│ 2009 │ 2023-05-25T11:34:51.110Z │ '0xa35F…e020' │ '0xa35F…e020' │
│ 2010 │ 2023-05-25T12:49:21.033Z │ '0xa35F…e020' │ '0xa35F…e020' │
Hm, looks like that wallet has an ENS record, serendipitix.eth. Does that ring a bell?
https://etherscan.io/name-lookup-search?id=0xa35f4af4ef419e2c7af96ea5af4954b32d30e020
Address doesn't ring a bell.
Does seem odd that just returning 2000 rows would be enough to lock up our DB, given we have other users in the same ballpark of size.
It's unlikely to be just that, but the invite pattern is odd; it doesn't seem like a regular user. I only see 2 addresses in the invites: this one and 0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b
Very odd
Also, the very first invites start just about when these issues started:
│ 1991 │ 2023-05-13T11:50:49.487Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│ 1992 │ 2023-05-13T11:50:48.897Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│ 1993 │ 2023-05-13T11:50:48.541Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│ 1994 │ 2023-05-13T11:50:48.352Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│ 1995 │ 2023-05-13T11:50:43.323Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│ 1996 │ 2023-05-13T11:50:43.090Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│ 1997 │ 2023-05-13T11:50:43.085Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│ 1998 │ 2023-05-13T11:50:42.982Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│ 1999 │ 2023-05-13T10:41:47.838Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│ 2000 │ 2023-05-13T10:41:47.688Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│ 2001 │ 2023-05-13T10:41:47.497Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│ 2002 │ 2023-05-13T10:41:47.496Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│ 2003 │ 2023-05-13T10:41:43.637Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│ 2004 │ 2023-05-13T10:41:43.560Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│ 2005 │ 2023-05-13T10:41:43.516Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│ 2006 │ 2023-05-13T10:41:43.456Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│ 2007 │ 2023-05-12T11:50:32.302Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│ 2008 │ 2023-05-12T11:00:04.824Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xE24DB2aC421e1645D8d1B4de295bb23b9A93Da6a' │
│ 2009 │ 2023-05-12T10:41:36.998Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│ 2010 │ 2023-05-12T10:36:19.173Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xf9a3BB070c1f9b3186A547DeD991BeD04a289C5B' │
OK, so it looks like the slow queries are coming from a paging query; we already have a note at that code path saying it could be slow: https://github.com/xmtp/xmtp-node-go/blob/main/pkg/store/query.go#L119-L132.
I'm going to add the paging capability to xmtp-debug and see if I can reproduce this.
One thing worth mentioning: this seems to be a rare query form; the usual ones sort in descending order. I wonder if there's a way to attach marginalia to the queries, so that we can search query logs based on the parameters of the API query.
Interesting look back at the API traffic growth trend over the past 8 months or so.
It seems some connection management will be needed here; the connection counts spike during these incidents. There's also a network traffic spike (GiB), and I'm not sure whether that means large results or internal instance chatter.
Added new query metrics in this dashboard: https://app.datadoghq.com/dashboard/63i-k7z-tj9/xmtp-network-grpc. Hopefully it will tell us something during the next spike.
Looks like the new metrics pushed us over the limit at DD and we lost a bunch the next day. Opened a case to deal with this https://help.datadoghq.com/hc/en-us/requests/1222297?page=1
Also started an incident for this https://app.incident.io/incidents/13?from=home
Now that we have metrics again, there was another spike just before 5am EDT this morning. The slow query logs show convincingly that it all revolves around wallet 0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020.
The contact queries seem to clock in at about 300 q/sec (so not all of them are slow).
The invite queries at about 500 q/sec. Note that these are the individual page queries as the client pages through the invite list.
It's not entirely clear what the ratio between slow and fast queries is during this episode; we have an exact number of slow queries from the logs, but it's not clear how much fudging is involved in the metric counts. The slow queries seem to peak around 200 q/sec, which seems somewhat lower than the total query counts. Also, slow queries are not the same as failing queries, which seem to peak at around 200 q/sec as well, although they are much more sparse than the slow queries.
However, it's not the query counts that peg the CPU; we handled much higher rates the day before, albeit those were mostly conversation queries.
The notable outlier is the cumulative number of results returned.
This could be what is pegging not just the DB CPU but also the node CPUs.
And there's a corresponding bump in network traffic on both the DB and nodes as well. Note that the majority of that volume seems to be between the nodes and the DB, as it shows up as transmit traffic on the DB side and receive traffic on the node side.
A somewhat surprising aspect is that the load seems to be generated by a distributed set of clients, as the ELB logs don't show a significant prevalence of particular client IPs during that time.
Another concerning aspect is the number of DB connections spiking hard during the episode.
So where to go from here? I'm thinking the following:
The DB error logs for the incident are filled with 500kB of these:
2023-06-07 08:47:38 UTC:10.0.134.6(42344):xmtp@waku:[27386]:LOG: could not receive data from client: Connection reset by peer
2023-06-07 08:47:38 UTC:10.0.134.6(42312):xmtp@waku:[27383]:LOG: could not receive data from client: Connection reset by peer
2023-06-07 08:47:38 UTC:10.0.134.6(42328):xmtp@waku:[27385]:LOG: could not receive data from client: Connection reset by peer
2023-06-07 08:47:38 UTC:10.0.134.6(42302):xmtp@waku:[27381]:LOG: could not receive data from client: Connection reset by peer
2023-06-07 08:47:38 UTC:10.0.134.6(42358):xmtp@waku:[27388]:LOG: could not receive data from client: Connection reset by peer
2023-06-07 08:47:38 UTC:10.0.134.6(42288):xmtp@waku:[27380]:LOG: could not receive data from client: Connection reset by peer
2023-06-07 08:47:38 UTC:10.0.134.6(42316):xmtp@waku:[27384]:LOG: could not receive data from client: Connection reset by peer
2023-06-07 08:47:38 UTC:10.0.134.6(42354):xmtp@waku:[27387]:LOG: could not receive data from client: Connection reset by peer
...
This morning's episode
So, if the user has 2076 invites in their invite topic, getting the full list should only take 21 API requests if they are using the default page size of 100.
The traffic levels for that topic indicate close to 3000 refreshes of their invite list...which feels like a lot.
Yup, somebody is load-testing our network or something. I think I'm getting close to reproducing this in dev.
Realized we're not closing the rows returned from db.Query in FindMessages. PR to fix: https://github.com/xmtp/xmtp-node-go/pull/276
Seems like a possible cause of the connection spikes, and maybe even the root cause of this issue.
Still seems to be happening with the connection limit in place. Connections spiked to only ~300 this time (80 max each across 4 nodes), but CPU still spiked to ~100%.
The EXPLAIN on the active queries during these periods doesn't look great, scanning > 100k rows each time:
Hmm, so if this is the query pattern:
SELECT
id, receivertimestamp, sendertimestamp, contenttopic, pubsubtopic, payload, version
FROM message
WHERE
contentTopic IN ('/xmtp/0/invite-0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020/proto') AND
(senderTimestamp > 1684147362154000000 OR
(senderTimestamp = 1684147362154000000 AND id > '\xefdc10babfe847adc3371dac793290a3411ff27923cad0ead254f45207dfa617'))
ORDER BY senderTimestamp asc, id asc, pubsubTopic asc, receiverTimestamp asc
LIMIT 100
And current indexes are:
Indexes:
"messageindex" PRIMARY KEY, btree (sendertimestamp, id, pubsubtopic)
"message_contenttopic_idx" btree (contenttopic)
"message_pubsubtopic_idx" btree (pubsubtopic)
"message_recvts_shouldexpiretrue_idx" btree (receivertimestamp) WHERE should_expire IS TRUE
"message_sendertimestamp_idx" btree (sendertimestamp)
"message_sort_idx" btree (sendertimestamp, id)
Why don't we add a (contentTopic, senderTimestamp, id) index, and ORDER BY senderTimestamp, id 🤔
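The DDL for that would be something along these lines (the index name is made up):

```sql
-- Hypothetical composite index matching the filter and sort: Postgres
-- can range-scan a single topic in (sendertimestamp, id) order, instead
-- of fetching everything for the topic and sorting.
CREATE INDEX CONCURRENTLY message_topic_sort_idx
    ON message (contenttopic, sendertimestamp, id);
```

`CONCURRENTLY` avoids blocking writes while the index builds, which matters on a table this hot.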
If we split up the query into 2 around the OR clause, then the EXPLAIN results look much better:
So maybe we can do this all in a single query with a UNION, or else with multiple queries in the node. I'm surprised the query planner doesn't do something like this automatically.
Something like this:
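A hedged sketch of what that UNION rewrite might look like, reusing the values from the query pattern above and assuming a (contenttopic, sendertimestamp, id) index; the actual rewrite may differ:

```sql
-- Each branch is a simple range the composite index can walk directly,
-- instead of one OR condition the planner handles poorly.
(SELECT id, receivertimestamp, sendertimestamp, contenttopic, pubsubtopic, payload, version
 FROM message
 WHERE contenttopic = '/xmtp/0/invite-0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020/proto'
   AND sendertimestamp = 1684147362154000000
   AND id > '\xefdc10babfe847adc3371dac793290a3411ff27923cad0ead254f45207dfa617'
 ORDER BY sendertimestamp ASC, id ASC
 LIMIT 100)
UNION ALL
(SELECT id, receivertimestamp, sendertimestamp, contenttopic, pubsubtopic, payload, version
 FROM message
 WHERE contenttopic = '/xmtp/0/invite-0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020/proto'
   AND sendertimestamp > 1684147362154000000
 ORDER BY sendertimestamp ASC, id ASC
 LIMIT 100)
ORDER BY sendertimestamp ASC, id ASC
LIMIT 100;
```

Each branch can satisfy its LIMIT with an index range scan, and the outer ORDER BY/LIMIT merges at most 200 rows.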
Looks like it got absorbed pretty well this morning with no disruptions after we upgraded to the 4 vCPU DB instance:
I think we can close this as resolved.
On reflection, I guess it's not all that surprising that a 2 vCPU DB would run into CPU contention issues 😅
Looks like we lost all the new query metrics on Thursday 😦. I was wondering whether the retry hypothesis can be validated: with the DB able to shoulder the load, there should be a lot fewer errors, i.e. fewer retries should be necessary. I'm still seeing a fair bit of "slow" queries logged for the offending wallet, although a lot fewer than before. Not quite sure if that's just because the DB is now faster or because there are fewer retries.
Anyway, if it's our SDK to blame for the thundering herd, it would probably be prudent to fix that as well.
Yea, we should definitely update our SDKs to back off properly :+1:
We disabled the new query metrics last week (https://github.com/xmtp/xmtp-node-go/pull/281) because they were taking out the rest of the node's DD metrics, so we were flying blind, which was pretty uncomfortable/unacceptable operationally. If we want to add them back, let's first figure out how to make sure we're not going to lose overall visibility, or alternatively just rely on metrics via logs emitted for the queries.
In xmtp-js the retry logic is pretty simple: retry up to 4 times with 100ms pauses between attempts. We could make the backoff exponential and/or allow fewer attempts to make the herd a little less thundering.
Oh it only retries up to 4 times? That's not too bad then, I thought it was endlessly retrying
OK, so call this done?
We're seeing intermittent spikes in CPU usage and connection counts on the production network DB:
This is causing intermittent E2E failures:
These seem to line up with spikes in traffic from xmtp-js with an unattributed app client:
It's possible that we need to configure pg connection pooling in the node, or something else, but it needs to be investigated and addressed, since it's causing degradation for users of the production network.