yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB][Index backfill] Index backfill fails and throws Service unavailable (yb/rpc/yb_rpc.cc:167): Call rejected due to memory pressure. #15112

Open Arjun-yb opened 1 year ago

Arjun-yb commented 1 year ago

Jira Link: DB-4322

Description

Version: 2.16.0.0-b56

Steps:

  1. Create a universe with the GFlags below

    Master:
    {
    "tablet_split_high_phase_shard_count_per_node": 10000, 
    "tablet_split_high_phase_size_threshold_bytes": 10485760, 
    "tablet_split_low_phase_size_threshold_bytes": 2097152, 
    "tablet_split_low_phase_shard_count_per_node": 16
    }
  2. Create a database and 2 tables


create table table1_1(id int primary key, name text, age int, description text);
create table table1_2(id int, name text, age int, description text, primary key(id ASC));

  3. Start loading 500K rows into each table (a minimal SQL sketch of the load and tablet-count check is shown after the \d output below).
  4. Check the tablet count; wait until it increases (here > 50 tablets were observed for both tables).

  5. Create indexes on both tables and observe the behaviour.
    demo10=# create index idx1_1 on table1_1(description);
    ERROR:  Aborted: ERROR:  Remote error: Service unavailable (yb/rpc/yb_rpc.cc:167): Call rejected due to memory pressure: Call yb.tserver.PgClientService.Perform 10.9.196.54:48448 => 10.9.196.54:9100 (request call id 159)
    demo10=# \d table1_1;
                Table "public.table1_1"
    Column    |  Type   | Collation | Nullable | Default
    -------------+---------+-----------+----------+---------
    id          | integer |           | not null |
    name        | text    |           |          |
    age         | integer |           |          |
    description | text    |           |          |
    Indexes:
    "table1_1_pkey" PRIMARY KEY, lsm (id HASH)
    "idx1_1" lsm (description HASH) INVALID

    and

demo10=# create index idx1_2 on table1_2(description);
ERROR:  Aborted: ERROR:  Remote error: [Remote error (yb/rpc/outbound_call.cc:415): Service unavailable (yb/rpc/yb_rpc.cc:167): Call rejected due to memory pressure: Call yb.tserver.TabletServerService.Write 10.9.131.213:35699 => 10.9.122.148:9100 (request call id 10150475) (rpc error 5)]
demo10=# \d table1_2;
                Table "public.table1_2"
   Column    |  Type   | Collation | Nullable | Default
-------------+---------+-----------+----------+---------
 id          | integer |           | not null |
 name        | text    |           |          |
 age         | integer |           |          |
 description | text    |           |          |
Indexes:
    "table1_2_pkey" PRIMARY KEY, lsm (id ASC)
    "idx1_2" lsm (description HASH) INVALID

Observations:

  1. Creation of both indexes failed and the indexes are left in the INVALID state (a catalog query to spot such indexes is sketched after the note below).
  2. Data loading was still in progress, and the errors below were observed on the client side while the indexes were being created.
    ysqlsh:sql_data.sql:492095: ERROR:  Operation failed. Try again: Resource unavailable: null
    ysqlsh:sql_data.sql:492096: ERROR:  Operation failed. Try again: Resource unavailable: null
    ysqlsh:sql_data.sql:492097: ERROR:  Query error: schema version mismatch for table 0000402900003000800000000000402f: expected 2, got 1 (compt with prev: 0)

Note: The same error is thrown (with 800K+ rows) if we run this without automatic tablet splitting turned ON.
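
To confirm observation 1 from ysqlsh, the standard PostgreSQL catalog can be queried; this is a generic sketch (not a YugabyteDB-specific API), with the index names taken from the repro above.

-- List indexes left INVALID after the failed backfill (indisvalid = false).
SELECT c.relname AS index_name, i.indisvalid, i.indisready
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE c.relname IN ('idx1_1', 'idx1_2');

-- An index left INVALID typically has to be dropped before retrying:
-- DROP INDEX idx1_1;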

shamanthchandra-yb commented 1 year ago

@yusong-yan I have been hitting this issue aggressively in our CDC test case on a recent build. The reason we could identify is this: we give the workload a predefined amount of time and expect the data to load within it. Recently, the tablet-splitting cases failed to load in time, and when we looked into the workload log we saw the following.

5811 [Thread-5] ERROR com.yugabyte.sample.common.metrics.ExceptionsTracker  - Failed write with error: Batch entry 2 INSERT INTO test_cdc_3913fb (k, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13
java.sql.BatchUpdateException: Batch entry 2 INSERT INTO test_cdc_3913fb (k, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, v27, v28, v29, v30, v31
        at com.yugabyte.jdbc.BatchResultHandler.handleCompletion(BatchResultHandler.java:186)
        at com.yugabyte.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:564)
        at com.yugabyte.jdbc.PgStatement.internalExecuteBatch(PgStatement.java:881)
        at com.yugabyte.jdbc.PgStatement.executeBatch(PgStatement.java:904)
        at com.yugabyte.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1629)
        at com.yugabyte.sample.apps.SqlDataLoad.doWrite(SqlDataLoad.java:354)
        at com.yugabyte.sample.apps.AppBase.performWrite(AppBase.java:801)
        at com.yugabyte.sample.common.IOPSThread.run(IOPSThread.java:99)
Caused by: com.yugabyte.util.PSQLException: ERROR: Remote error: Service unavailable (yb/rpc/yb_rpc.cc:165): Call rejected due to memory pressure: Call yb.tserver.PgClientService.Perform 172.151.17.200:56516 => 172.151.17
        at com.yugabyte.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2679)
        at com.yugabyte.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2359)
        at com.yugabyte.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:553)
        ... 6 more

I am not sure why this is happening on the latest builds. Please check. cc: @kripasreenivasan @Arjun-yb

shamanthchandra-yb commented 1 year ago

Hi @arybochkin, we are hitting this issue constantly while running workloads, and it is affecting the test runs. Would it be possible to prioritise this ticket? cc: @Arjun-yb @adithya-yb @suranjan

arybochkin commented 1 year ago

@shamanthchandra-yb , sure, I'm focusing on this next week.

arybochkin commented 1 year ago

@shamanthchandra-yb would you please share the node details (node type, CPUs, RAM, etc.) or the universe where you observed the error? Was it the only test executing at that time, or were other workloads running as well?

shamanthchandra-yb commented 1 year ago

@arybochkin I am using c5.xlarge nodes. You can refer to this run, for example: http://stress.dev.yugabyte.com/stress_test/3bba2be0-9233-4db6-8875-1a3afc84d530

Basically, we load data and then verify the data on the target (CDC test case).

[Screenshot attached: 2023-02-02 at 9:09 PM]

We are running via the sample apps, with about 50 writer threads in all iterations, and only the writers are executing.

Reason to escalate: although these cases have been running for quite a long time, we have only started seeing this kind of issue recently, and mostly in test cases with tablet splitting enabled.

arybochkin commented 1 year ago

@shamanthchandra-yb, do you remember the build on which you started to see this kind of failure? Would it be possible for you to identify that build number?

shamanthchandra-yb commented 1 year ago

@arybochkin I believe I first saw it starting with 2.17.2.0-b6.

Here is the slack thread if needed: https://yugabyte.slack.com/archives/C03H4D4EVC6/p1673544424656019

rthallamko3 commented 1 year ago

@shamanthchandra-yb , based on recent findings on a customer cluster, we have come to realize that when CDC is enabled, transaction metadata is not reclaimed as quickly as expected - CDC cleanup currently holds on to 4 hours' worth of intents. We need to discuss this with the CDC team and debug whether that is what is causing the OOM on the workload. cc @suranjan

rthallamko3 commented 1 year ago

@shamanthchandra-yb , can you work with the CDC team to understand whether the workload is expected to run without memory issues, given that CDC holds up cleanup of intents for 4 hours? cc @suranjan

shamanthchandra-yb commented 1 year ago

@rthallamko3 we last discussed it here: slack_thread

@arybochkin had mentioned over chat that this could also be covered by https://github.com/yugabyte/yugabyte-db/issues/16091. Could you let me know whether a fix is planned for it?

We are still facing this issue, sometimes after as few as 3 iterations (<75000 rows with a 35MB split threshold). Profile (1) attached. cc: @suranjan. Let me know if we need to verify anything related to what @rthallamko3 mentioned above, i.e. whether the intent record count is what is causing this.

arybochkin commented 1 year ago

@shamanthchandra-yb to be precise, I mentioned that it could be related to #16091, and if it is related then it may be covered partially or completely by that fix.