xmtp / xmtp-node-go

Software for the nodes that currently form the XMTP network

Intermittent DB resource contention and E2E test failures #269

Closed · snormore closed this 1 year ago

snormore commented 1 year ago

We're seeing intermittent spikes in CPU usage and connection counts on the production network DB:

Image

This is causing intermittent E2E failures:

Image

These seem to line up with spikes in traffic from xmtp-js with an unattributed app client:

Image

Image

It's possible that we need to configure pg connection pooling in the node, or something else, but it needs to be investigated and addressed, since it's causing degradation for users of the production network.

neekolas commented 1 year ago

How many IPs are these spikes coming from? If it's just a few, I like the idea of banning them for a day and seeing who pops their head up to complain.

But yes, we should also implement connection pooling to keep the number of connections from spiking with traffic. Given that we have a small number of nodes, I don't think PGBouncer is needed. Client-side pooling should get the job done.
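For reference, client-side pooling with Go's database/sql is mostly a matter of capping the pool when opening the DB. A minimal sketch, assuming the lib/pq driver; the limits here are placeholders, not our production config:

package store

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // Postgres driver (assumed here for illustration)
)

func openPool(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	// Cap connections per node so the total across all nodes stays bounded
	// even when request traffic spikes.
	db.SetMaxOpenConns(80)                  // hard ceiling per node (placeholder)
	db.SetMaxIdleConns(20)                  // keep some warm connections around
	db.SetConnMaxLifetime(30 * time.Minute) // recycle connections periodically
	return db, nil
}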

snormore commented 1 year ago

Yea, looks like it's probably from the same IP:

Screenshot 2023-05-23 at 10 54 11 AM

With the requests being mostly queries:

Screenshot 2023-05-23 at 10 55 58 AM

snormore commented 1 year ago

The API errors from that chart are all io timeouts against the DB, as you might've guessed:

rpc error: code = Internal desc = read tcp 10.0.163.161:43626->10.0.133.153:5432: i/o timeout

snormore commented 1 year ago

The IP breakdown screenshot I posted before was from the e2e runner, not the LB logs. Here's what it looks like for the LB logs:

Screenshot 2023-05-23 at 7 23 53 PM

snormore commented 1 year ago

The top 3 are DigitalOcean-owned/managed IPs afaics

neekolas commented 1 year ago

Interesting. So we have a few big consumers, not just one. We could block the top 3?

snormore commented 1 year ago

Seems to be happening every day at the same time:

Screenshot 2023-05-24 at 9 45 47 AM

The query response time seems to be gradually increasing too, which is a bit concerning:

Screenshot 2023-05-24 at 9 45 52 AM

It's interesting that RDS commit throughput seems to spike at the same time too:

Screenshot 2023-05-24 at 9 42 49 AM

I don't see an equivalent spike in publishes at that time, so I'm not sure what exactly that commit throughput is from. I wonder if it's a daily vacuum or something, but I wouldn't expect that to happen twice a day within the same 2-hour window.

It seems to start at 6am ET every day, with another spike at 8am ET.

mkobetic commented 1 year ago

This is probably a red herring, but it seems the most CPU-intensive queries in these spikes have to do with wallet 0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020 (I checked the last three days). Wonder what wallet that is.

Screenshot 2023-05-25 at 15 53 02
mkobetic commented 1 year ago

Hm, the plot thickens: it looks like the wallet has 2000 invites, and their timestamps seem to coincide with the CPU spikes. Many of them are self-invites, which is odd:

│  1990   │ 2023-05-24T11:02:19.930Z │ '0xa35F…e020' │ '0xa35F…e020' │
│  1991   │ 2023-05-24T11:02:21.277Z │ '0xa35F…e020' │ '0xa35F…e020' │
│  1992   │ 2023-05-24T11:02:21.444Z │ '0xa35F…e020' │ '0xa35F…e020' │
│  1993   │ 2023-05-24T11:02:24.415Z │ '0xa35F…e020' │ '0xa35F…e020' │
│  1994   │ 2023-05-24T11:02:24.868Z │ '0xa35F…e020' │ '0xa35F…e020' │
│  1995   │ 2023-05-24T11:02:26.298Z │ '0xa35F…e020' │ '0xa35F…e020' │
│  1996   │ 2023-05-24T11:04:06.984Z │ '0xa35F…e020' │ '0xa35F…e020' │
│  1997   │ 2023-05-24T12:18:21.064Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│  1998   │ 2023-05-24T12:18:22.778Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│  1999   │ 2023-05-24T12:18:22.858Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│  2000   │ 2023-05-24T12:18:45.451Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│  2001   │ 2023-05-24T12:19:00.964Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│  2002   │ 2023-05-24T12:19:04.748Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│  2003   │ 2023-05-24T12:45:40.479Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│  2004   │ 2023-05-24T12:45:47.801Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│  2005   │ 2023-05-24T12:45:51.484Z │ '0xa35F…e020' │ '0x06D7…C24b' │
│  2006   │ 2023-05-25T11:08:06.333Z │ '0xa35F…e020' │ '0xa35F…e020' │
│  2007   │ 2023-05-25T11:08:06.350Z │ '0xa35F…e020' │ '0xa35F…e020' │
│  2008   │ 2023-05-25T11:10:54.655Z │ '0xa35F…e020' │ '0xa35F…e020' │
│  2009   │ 2023-05-25T11:34:51.110Z │ '0xa35F…e020' │ '0xa35F…e020' │
│  2010   │ 2023-05-25T12:49:21.033Z │ '0xa35F…e020' │ '0xa35F…e020' │
mkobetic commented 1 year ago

Hm, looks like that wallet has an ENS record, serendipitix.eth. Does that ring a bell?

https://etherscan.io/name-lookup-search?id=0xa35f4af4ef419e2c7af96ea5af4954b32d30e020

neekolas commented 1 year ago

Address doesn't ring a bell.

Does seem odd that just returning 2000 rows would be enough to lock up our DB, given we have other users in the same ballpark of size.

mkobetic commented 1 year ago

It's unlikely to be just that, but the invite pattern is odd; it doesn't seem like a regular user. I only see 2 addresses in the invites: this one and 0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b.

neekolas commented 1 year ago

Very odd

mkobetic commented 1 year ago

Also, the very first invites start just about when these issues started:

│  1991   │ 2023-05-13T11:50:49.487Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│  1992   │ 2023-05-13T11:50:48.897Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│  1993   │ 2023-05-13T11:50:48.541Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│  1994   │ 2023-05-13T11:50:48.352Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│  1995   │ 2023-05-13T11:50:43.323Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│  1996   │ 2023-05-13T11:50:43.090Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│  1997   │ 2023-05-13T11:50:43.085Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│  1998   │ 2023-05-13T11:50:42.982Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│  1999   │ 2023-05-13T10:41:47.838Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│  2000   │ 2023-05-13T10:41:47.688Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│  2001   │ 2023-05-13T10:41:47.497Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│  2002   │ 2023-05-13T10:41:47.496Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│  2003   │ 2023-05-13T10:41:43.637Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│  2004   │ 2023-05-13T10:41:43.560Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│  2005   │ 2023-05-13T10:41:43.516Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│  2006   │ 2023-05-13T10:41:43.456Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│  2007   │ 2023-05-12T11:50:32.302Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0x06D7F2D453Eb87A3DE0F73C679598C68e02FC24b' │
│  2008   │ 2023-05-12T11:00:04.824Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xE24DB2aC421e1645D8d1B4de295bb23b9A93Da6a' │
│  2009   │ 2023-05-12T10:41:36.998Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │
│  2010   │ 2023-05-12T10:36:19.173Z │ '0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020' │ '0xf9a3BB070c1f9b3186A547DeD991BeD04a289C5B' │
mkobetic commented 1 year ago

OK, so it looks like the slow queries are coming from a paging query; we already have a note at that code path that it could be slow: https://github.com/xmtp/xmtp-node-go/blob/main/pkg/store/query.go#L119-L132.

I'm going to add paging capability to xmtp-debug and see if I can reproduce this.

mkobetic commented 1 year ago

One thing worth mentioning: this seems to be a rare query form; the usual ones sort in descending order. I wonder if there's a way to attach marginalia to the queries, so that we can search query logs based on the parameters of the API query.
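One cheap way to get marginalia (a sketch of the idea, not something we have today) would be to prepend a SQL comment carrying the API parameters before handing the query to the driver; Postgres keeps the comment in pg_stat_activity and in the slow-query log, so log entries can be matched back to request shapes:

package store

import "fmt"

// withMarginalia prepends a SQL comment describing the originating API call so
// that slow-query logs and pg_stat_activity can be correlated with request
// parameters. Hypothetical helper; values must not contain "*/" (sanitize or
// whitelist them) so they can't break out of the comment.
func withMarginalia(query, endpoint, sortDir string, pageSize int) string {
	return fmt.Sprintf("/* endpoint=%s sort=%s limit=%d */ %s",
		endpoint, sortDir, pageSize, query)
}

// Usage (illustrative):
//   q := withMarginalia("SELECT ... FROM message WHERE ...", "Query", "asc", 100)
//   rows, err := db.Query(q, args...)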

mkobetic commented 1 year ago

Interesting look back at the API traffic growth trend over the past 8 months or so.

Screenshot 2023-05-30 at 17 41 09
mkobetic commented 1 year ago

It seems some connection management will be needed here. The connection counts spike during these incidents. There's also a network traffic spike (in the GiB range), and I'm not sure whether it reflects large result sets or internal instance chatter.

Screenshot 2023-06-01 at 11 42 03 Screenshot 2023-06-01 at 11 43 07
mkobetic commented 1 year ago

Added new query metrics in this dashboard: https://app.datadoghq.com/dashboard/63i-k7z-tj9/xmtp-network-grpc. Hopefully it will tell us something during the next spike.

Screenshot 2023-06-01 at 16 58 40
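For context, emitting a per-query timing metric with the datadog-go statsd client looks roughly like this (a sketch; the metric and tag names are illustrative, not necessarily what the dashboard uses). Keeping tags low-cardinality matters, since wallet addresses or topics as tags would blow up the custom-metric count:

package metrics

import (
	"time"

	"github.com/DataDog/datadog-go/v5/statsd"
)

// recordQueryDuration emits a timing metric for a store query. Illustrative
// sketch only; the node's actual metrics are wired up differently.
func recordQueryDuration(client *statsd.Client, kind, sortDir string, start time.Time) {
	// Tags are limited to low-cardinality values (query kind, sort direction);
	// never raw wallet addresses or content topics.
	_ = client.Timing("xmtp.store.query.duration", time.Since(start),
		[]string{"kind:" + kind, "sort:" + sortDir}, 1)
}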
mkobetic commented 1 year ago

Looks like the new metrics pushed us over the limit at DD and we lost a bunch of metrics the next day. Opened a case to deal with this: https://help.datadoghq.com/hc/en-us/requests/1222297?page=1

mkobetic commented 1 year ago

Also started an incident for this: https://app.incident.io/incidents/13?from=home

mkobetic commented 1 year ago

Now that we have metrics again: there was another spike just before 5am EDT this morning. The slow-query logs show convincingly that it all revolves around wallet 0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020.

Screenshot 2023-06-07 at 07 47 49

The contact queries seem to clock in at about 300 q/sec (so not all of them are slow):

Screenshot 2023-06-07 at 09 59 17

The invite queries clock in at about 500 q/sec. Note that these are the individual page queries as the client pages through the invite list:

Screenshot 2023-06-07 at 09 59 37

It's not entirely clear what the ratio between slow and fast queries is during this episode: we have an exact count of slow queries from the logs, but it's not clear how much fudging is involved in the metric counts. The slow queries seem to peak at around 200 q/sec, which is somewhat lower than the total query counts. Also, the slow queries are not the failing queries, which also seem to peak at around 200 q/sec, although they are much more sparse than the slow queries.

Screenshot 2023-06-07 at 11 01 31

However, it's not the query counts that peg the CPU; we handled much higher rates the day before, albeit mostly conversation queries.

Screenshot 2023-06-07 at 10 19 09

The notable outlier is the cumulative number of results returned.

Screenshot 2023-06-07 at 10 00 57 Screenshot 2023-06-07 at 10 22 27

This could be what is pegging not just the DB CPU but also the node CPUs.

Screenshot 2023-06-07 at 10 24 34 Screenshot 2023-06-07 at 10 25 00

There's a corresponding bump in network traffic on both the DB and the nodes as well. Note that the majority of that volume seems to be between the nodes and the DB, since it shows up as transmit traffic on the DB side and receive traffic on the node side.

Screenshot 2023-06-07 at 10 28 42 Screenshot 2023-06-07 at 10 29 57

A somewhat surprising aspect is that the load seems to be generated by a distributed set of clients, as the ELB logs don't show a significant prevalence of particular client IPs during that time.

Screenshot 2023-06-07 at 10 40 42

Another concerning aspect is the number of DB connections spiking hard during the episode:

Screenshot 2023-06-07 at 10 42 46

So where to go from here? I'm thinking the following:

mkobetic commented 1 year ago

The DB error logs for the incident are filled with 500kB of these:

2023-06-07 08:47:38 UTC:10.0.134.6(42344):xmtp@waku:[27386]:LOG: could not receive data from client: Connection reset by peer
2023-06-07 08:47:38 UTC:10.0.134.6(42312):xmtp@waku:[27383]:LOG: could not receive data from client: Connection reset by peer
2023-06-07 08:47:38 UTC:10.0.134.6(42328):xmtp@waku:[27385]:LOG: could not receive data from client: Connection reset by peer
2023-06-07 08:47:38 UTC:10.0.134.6(42302):xmtp@waku:[27381]:LOG: could not receive data from client: Connection reset by peer
2023-06-07 08:47:38 UTC:10.0.134.6(42358):xmtp@waku:[27388]:LOG: could not receive data from client: Connection reset by peer
2023-06-07 08:47:38 UTC:10.0.134.6(42288):xmtp@waku:[27380]:LOG: could not receive data from client: Connection reset by peer
2023-06-07 08:47:38 UTC:10.0.134.6(42316):xmtp@waku:[27384]:LOG: could not receive data from client: Connection reset by peer
2023-06-07 08:47:38 UTC:10.0.134.6(42354):xmtp@waku:[27387]:LOG: could not receive data from client: Connection reset by peer
...
mkobetic commented 1 year ago

This morning's episode

Screenshot 2023-06-08 at 11 04 01
neekolas commented 1 year ago

So, if the user has 2076 invites in their invite topic, getting the full list should only be 21 API requests if they are using the default page size of 100.

The traffic levels for that topic indicate close to 3000 refreshes of their invite list...which feels like a lot.

mkobetic commented 1 year ago

Yup, somebody is loadtesting our network or something. I think I'm getting close to reproducing this in dev.

snormore commented 1 year ago

Realized we're not closing the rows returned from db.Query in FindMessages. PR to fix: https://github.com/xmtp/xmtp-node-go/pull/276

Seems like a possible cause of the connection spikes, and maybe even the root cause of this issue.
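For context on why that matters: rows returned by db.Query keep their underlying connection checked out until they're closed or fully iterated, so any early return leaves the connection pinned. The usual pattern (a sketch, not the actual FindMessages code) is:

package store

import "database/sql"

func findPayloads(db *sql.DB, query string, args ...interface{}) ([][]byte, error) {
	rows, err := db.Query(query, args...)
	if err != nil {
		return nil, err
	}
	// Without this, the connection is only returned to the pool once the
	// result set is fully iterated; an early return or scan error leaves it
	// checked out, which is consistent with the connection-count spikes.
	defer rows.Close()

	var payloads [][]byte
	for rows.Next() {
		var payload []byte
		if err := rows.Scan(&payload); err != nil {
			return nil, err
		}
		payloads = append(payloads, payload)
	}
	return payloads, rows.Err()
}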

snormore commented 1 year ago

Still seems to be happening even with the connection limit in place. Connections spiked to only ~300 this time (a max of 80 each across 4 nodes), but CPU still spiked to ~100%.

Screenshot 2023-06-13 at 6 29 47 AM

snormore commented 1 year ago

The EXPLAIN on the active queries during these periods doesn't look great; they're scanning >100k rows each time:

Screenshot 2023-06-13 at 6 37 44 AM

snormore commented 1 year ago

Hmm, so if this is the query pattern:

SELECT 
    id, receivertimestamp, sendertimestamp, contenttopic, pubsubtopic, payload, version 
FROM message 
WHERE 
    contentTopic IN ('/xmtp/0/invite-0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020/proto') AND 
    (senderTimestamp > 1684147362154000000 OR 
    (senderTimestamp = 1684147362154000000 AND id > '\xefdc10babfe847adc3371dac793290a3411ff27923cad0ead254f45207dfa617')) 
ORDER BY senderTimestamp asc, id asc, pubsubTopic asc, receiverTimestamp asc 
LIMIT 100  

And current indexes are:

Indexes:
    "messageindex" PRIMARY KEY, btree (sendertimestamp, id, pubsubtopic)
    "message_contenttopic_idx" btree (contenttopic)
    "message_pubsubtopic_idx" btree (pubsubtopic)
    "message_recvts_shouldexpiretrue_idx" btree (receivertimestamp) WHERE should_expire IS TRUE
    "message_sendertimestamp_idx" btree (sendertimestamp)
    "message_sort_idx" btree (sendertimestamp, id)

Why don't we:

  1. Add a (contentTopic, senderTimestamp, id) index, and
  2. Change the sort clause on that query to be just ORDER BY senderTimestamp, id

🤔
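Concretely, the index would look something like this (a sketch; the index name is made up, and CONCURRENTLY avoids blocking writes while it builds):

-- Hypothetical covering index for the paging predicate and sort order.
CREATE INDEX CONCURRENTLY IF NOT EXISTS message_contenttopic_sort_idx
    ON message (contenttopic, sendertimestamp, id);

-- With the sort clause trimmed to match the index prefix:
--   ORDER BY senderTimestamp ASC, id ASC
-- Postgres can walk the index for the topic and stop after LIMIT rows instead
-- of scanning and sorting >100k rows.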

snormore commented 1 year ago

If we split up the query into 2 around the OR clause, then the EXPLAIN results look much better:

Screenshot 2023-06-15 at 8 53 26 AM Screenshot 2023-06-15 at 8 53 15 AM

So maybe we can do this all in a single query with a UNION, or else multiple queries in the node. I'm surprised the query planner doesn't do something like this automatically.

Something like this:

Screenshot 2023-06-15 at 8 56 12 AM
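For posterity (in case the screenshot goes missing), the rewrite being described is roughly this shape: each branch of the OR becomes its own index-friendly query with its own LIMIT, and the outer query merges and re-limits. A sketch reusing the cursor values from the query above, not the exact text from the screenshot:

SELECT id, receivertimestamp, sendertimestamp, contenttopic, pubsubtopic, payload, version
FROM (
    (SELECT id, receivertimestamp, sendertimestamp, contenttopic, pubsubtopic, payload, version
     FROM message
     WHERE contenttopic IN ('/xmtp/0/invite-0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020/proto')
       AND sendertimestamp > 1684147362154000000
     ORDER BY sendertimestamp ASC, id ASC
     LIMIT 100)
    UNION ALL
    (SELECT id, receivertimestamp, sendertimestamp, contenttopic, pubsubtopic, payload, version
     FROM message
     WHERE contenttopic IN ('/xmtp/0/invite-0xa35F4aF4EF419E2c7AF96eA5aF4954b32D30e020/proto')
       AND sendertimestamp = 1684147362154000000
       AND id > '\xefdc10babfe847adc3371dac793290a3411ff27923cad0ead254f45207dfa617'
     ORDER BY sendertimestamp ASC, id ASC
     LIMIT 100)
) AS page
ORDER BY sendertimestamp ASC, id ASC
LIMIT 100;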

snormore commented 1 year ago

Looks like it got absorbed pretty well this morning with no disruptions after we upgraded to the 4 vCPU DB instance:

Screenshot 2023-06-17 at 8 48 21 AM

Screenshot 2023-06-17 at 8 48 31 AM

snormore commented 1 year ago

I think we can close this as resolved.

On reflection, I guess it's not all that surprising that a 2 vCPU DB would run into CPU contention issues 😅

mkobetic commented 1 year ago

Looks like we lost all the new query metrics on Thursday 😦. I was wondering whether the retry hypothesis can be validated: with the DB able to shoulder the load, there should be a lot fewer errors, i.e. fewer retries should be necessary. I'm still seeing a fair number of "slow" queries logged for the offending wallet, although a lot fewer than before. Not quite sure if that is just because the DB is now faster or because there are fewer retries.

Screenshot 2023-06-20 at 10 18 14

Anyway, if our SDK is to blame for the thundering herd, it would probably be prudent to fix that as well.

snormore commented 1 year ago

Yea, we should definitely update our SDKs to back off properly :+1:

We disabled the new query metrics last week (https://github.com/xmtp/xmtp-node-go/pull/281) because they were taking out the rest of the node's DD metrics, so we were flying blind, which was pretty uncomfortable and unacceptable operationally. If we want to add them back, let's first figure out how to make sure we won't lose overall visibility, or alternatively just rely on metrics derived from logs emitted for the queries.

neekolas commented 1 year ago

In xmtp-js the retry logic is pretty simple: retry up to 4 times with 100ms pauses between attempts. We could make the backoff exponential and/or allow fewer attempts to make the herd a little less thundering.
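The general shape of that change, sketched in Go here for brevity (the actual fix would be the TypeScript equivalent in xmtp-js; the attempt count and base delay are illustrative):

package retry

import (
	"context"
	"math/rand"
	"time"
)

// doWithBackoff retries fn with exponential backoff plus jitter, so a burst of
// failing clients spreads out instead of hammering the API in lockstep.
// Illustrative sketch; attempt count and base delay are placeholders.
func doWithBackoff(ctx context.Context, attempts int, base time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		// 100ms, 200ms, 400ms, ... plus up to 50% jitter.
		delay := base << uint(i)
		delay += time.Duration(rand.Int63n(int64(delay)/2 + 1))
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}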

snormore commented 1 year ago

Oh, it only retries up to 4 times? That's not too bad then; I thought it was retrying endlessly.

mkobetic commented 1 year ago

OK, so call this done?