**Open** — Arjun-yb opened this issue 1 year ago
@yusong-yan I have been hitting this issue frequently in our CDC testcases on recent builds. The cause we could identify: we allow a predefined amount of time and expect the workload to finish loading within it. Recently, tablet splitting cases failed to load, and when we looked into the workload log, we saw these errors:
```
5811 [Thread-5] ERROR com.yugabyte.sample.common.metrics.ExceptionsTracker - Failed write with error: Batch entry 2 INSERT INTO test_cdc_3913fb (k, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13
java.sql.BatchUpdateException: Batch entry 2 INSERT INTO test_cdc_3913fb (k, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, v27, v28, v29, v30, v31
	at com.yugabyte.jdbc.BatchResultHandler.handleCompletion(BatchResultHandler.java:186)
	at com.yugabyte.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:564)
	at com.yugabyte.jdbc.PgStatement.internalExecuteBatch(PgStatement.java:881)
	at com.yugabyte.jdbc.PgStatement.executeBatch(PgStatement.java:904)
	at com.yugabyte.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1629)
	at com.yugabyte.sample.apps.SqlDataLoad.doWrite(SqlDataLoad.java:354)
	at com.yugabyte.sample.apps.AppBase.performWrite(AppBase.java:801)
	at com.yugabyte.sample.common.IOPSThread.run(IOPSThread.java:99)
Caused by: com.yugabyte.util.PSQLException: ERROR: Remote error: Service unavailable (yb/rpc/yb_rpc.cc:165): Call rejected due to memory pressure: Call yb.tserver.PgClientService.Perform 172.151.17.200:56516 => 172.151.17
	at com.yugabyte.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2679)
	at com.yugabyte.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2359)
	at com.yugabyte.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:553)
	... 6 more
```
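The failure in the trace is a transient server-side rejection: the tserver refuses the `yb.tserver.PgClientService.Perform` RPC while it is under memory pressure. Independent of the root cause being investigated here, a workload can treat this as retryable rather than failing the run. Below is a minimal sketch of classifying the error message and computing an exponential backoff; the class and method names (`WriteRetryHelper`, `isMemoryPressureError`, `backoffMillis`) are hypothetical and not part of the sample apps.

```java
// Hypothetical helper for the sample-app write path; names are illustrative,
// not part of com.yugabyte.sample.
public class WriteRetryHelper {
    // Marker string from the tserver-side rejection seen in the PSQLException
    // message in the stack trace above.
    static final String MEMORY_PRESSURE_MARKER = "Call rejected due to memory pressure";

    /** True when a write failure looks like the transient memory-pressure rejection. */
    public static boolean isMemoryPressureError(String message) {
        return message != null && message.contains(MEMORY_PRESSURE_MARKER);
    }

    /** Exponential backoff: 100 ms, 200 ms, 400 ms, ..., capped at 5 s. */
    public static long backoffMillis(int attempt) {
        return Math.min(5000L, 100L << Math.min(attempt, 10));
    }
}
```

A writer thread would catch `BatchUpdateException`, walk its cause chain, and sleep `backoffMillis(attempt)` before re-submitting the batch when `isMemoryPressureError` matches; any other error would still fail fast.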
I am not sure why this is happening on the latest builds. Please check. cc: @kripasreenivasan @Arjun-yb
Hi @arybochkin, we are hitting this issue constantly while running workloads, and it is affecting the test runs. Would it be possible to prioritise this ticket? cc: @Arjun-yb @adithya-yb @suranjan
@shamanthchandra-yb , sure, I'm focusing on this next week.
@shamanthchandra-yb would you please share node details (instance types, CPUs, RAM, etc.) or the universe where you've observed the error? Was this the only test executing at the time, or were other workloads running as well?
@arybochkin I am using c5.xlarge nodes. You can refer to this run as an example: http://stress.dev.yugabyte.com/stress_test/3bba2be0-9233-4db6-8875-1a3afc84d530
Basically, we load data and then verify it on the target (CDC testcase).
We run via the sample apps, with about 50 writer threads in all iterations, and only those are executing.
Reason to escalate: although these cases have been running for quite a long time, we have only recently started seeing this kind of issue. Also, I mostly see it in testcases with tablet splitting enabled.
@shamanthchandra-yb, do you remember the build number where you started seeing this kind of failure? Would it be possible for you to identify that build number?
@arybochkin I believe I first saw it starting from 2.17.2.0-b6.
Here is the slack thread if needed: https://yugabyte.slack.com/archives/C03H4D4EVC6/p1673544424656019
@shamanthchandra-yb, based on recent findings on a customer cluster, we have come to realize that when CDC is enabled, transaction metadata is not reclaimed as quickly as expected: as of today, CDC cleanup holds on to 4 hours' worth of intents. We need to discuss this with the CDC team and debug whether that is what is causing the OOM in the workload. cc @suranjan
@shamanthchandra-yb, can you work with the CDC team to understand whether the workload is expected to run without memory issues, given that CDC holds up cleanup of intents for 4 hours? cc @suranjan
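A rough back-of-envelope helps frame that question: retained intent volume grows roughly with write rate × bytes per write × retention window. Under assumed (not measured) numbers of 1,000 writes/sec at ~512 bytes each, 4 hours of retained intents is about 7.4 GB, which is close to the entire 8 GiB of RAM on a c5.xlarge node. A minimal sketch of that arithmetic, with all inputs being illustrative assumptions rather than values from the failing universe:

```java
// Illustrative estimate only: the inputs below are assumptions, not
// measurements from the failing universe.
public class IntentRetentionEstimate {
    /**
     * Approximate bytes of intents retained if CDC holds intents for
     * retentionSecs at a steady write rate.
     */
    public static long retainedIntentBytes(long writesPerSec, long bytesPerWrite, long retentionSecs) {
        return writesPerSec * bytesPerWrite * retentionSecs;
    }
}
```

For example, `retainedIntentBytes(1000, 512, 4 * 3600)` yields 7,372,800,000 bytes (~7.4 GB), suggesting that even a modest sustained write rate could exhaust a c5.xlarge's memory before the 4-hour retention window expires.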
@rthallamko3 last time we were discussing it here: slack_thread
@arybochkin mentioned over chat that this could also be covered by https://github.com/yugabyte/yugabyte-db/issues/16091. Could you let me know whether a fix is planned for it?
We are still facing this issue, sometimes after as few as 3 iterations (<75,000 rows with a 35MB split threshold). cc: @suranjan. Let me know if we need to verify anything @rthallamko3 mentioned above about the intent record count causing this.
@shamanthchandra-yb to be precise, I mentioned that it could be related to #16091, and if it is related, it could be covered partially or completely.
Jira Link: DB-4322
Description
Version: 2.16.0.0-b56

Steps:
1. Create a universe with the below GFlags
2. Create a database and 2 tables:
```sql
create table table1_1(id int primary key, name text, age int, description text);
create table table1_2(id int, name text, age int, description text, primary key(id ASC));
```
and
Observations:
Note: the same error is thrown (with 800K+ rows) if we run this without automatic tablet splitting ON.