
[CDCSDK] WAL segments count doesn't go down after CDC streaming is done #17220


shamanthchandra-yb commented 1 year ago

Jira Link: DB-6471

Description

Testcase: test_cdc_tx_wal_verification

The table was created with a tablet count of 1.

Gflags used:

(gflags were listed in a screenshot attached to the original issue, dated 2023-05-08)

We check the number of WAL segments for the given tablet:

ls -a /mnt/d0/yb-data/tserver/wals/table-00004007000030008000000000004007/tablet-2924a79b0f5448719e8d73c95cdf5a40 | grep wal | wc -l

After the workload was run, the count was 6.

The sequence was: run workload -> flush all tablets -> verify CDC replication is complete -> verify WAL -> flush all tablets -> verify WAL.

Observation: the number of WAL segments does not go down; it stays at 6 in this case.
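
For reproduction, a minimal shell sketch of the flush-and-recount steps, assuming yb-admin is on the PATH; the master addresses, database/table names, and WAL path are the ones from this run:

MASTERS='172.151.22.111:7100,172.151.16.189:7100,172.151.27.33:7100'
WAL_DIR='/mnt/d0/yb-data/tserver/wals/table-00004007000030008000000000004007/tablet-2924a79b0f5448719e8d73c95cdf5a40'

# Flush memtables to SST files so older WAL segments become GC candidates.
yb-admin --master_addresses "$MASTERS" flush_table ysql.cdc_02d407 table_1

# After verifying CDC replication is complete, recount the WAL segments.
ls -a "$WAL_DIR" | grep wal | wc -l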

Source connector version

quay.io/yugabyte/debezium-connector:latest

1.9.5.y.21

Connector configuration

add connector connector_name='ybconnector_cdc_02d407_table_1' stream_id='2b8ca8d553d546f1b1b0b1eaa4b11fe6' db_name='cdc_02d407' connector_host='172.151.24.175' table_list=['table_1']

{
  'name': 'ybconnector_cdc_02d407_table_1',
  'config': {
    'connector.class': 'io.debezium.connector.yugabytedb.YugabyteDBConnector',
    'database.hostname': '172.151.16.189',
    'database.master.addresses': '172.151.22.111:7100,172.151.16.189:7100,172.151.27.33:7100',
    'database.port': 5433,
    'database.masterhost': '172.151.16.189',
    'database.masterport': '7100',
    'database.user': 'yugabyte',
    'database.password': 'yugabyte',
    'database.dbname': 'cdc_02d407',
    'database.server.name': 'db_cdc',
    'database.streamid': '2b8ca8d553d546f1b1b0b1eaa4b11fe6',
    'snapshot.mode': 'never',
    'admin.operation.timeout.ms': 600000,
    'socket.read.timeout.ms': 600000,
    'max.connector.retries': '10',
    'operation.timeout.ms': 600000,
    'topic.creation.default.compression.type': 'lz4',
    'topic.creation.default.cleanup.policy': 'delete',
    'topic.creation.default.partitions': 2,
    'topic.creation.default.replication.factor': '1',
    'tasks.max': '10',
    'table.include.list': 'public.table_1'
  }
}
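
For context, a config like this is normally posted to the Kafka Connect REST API; a minimal sketch, assuming a Connect worker on its default port 8083 (the host is an assumption, and the config is abbreviated to the CDC-relevant keys):

# Hypothetical registration call; host/port and the trimmed config are assumptions.
curl -s -X POST http://localhost:8083/connectors \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "ybconnector_cdc_02d407_table_1",
    "config": {
      "connector.class": "io.debezium.connector.yugabytedb.YugabyteDBConnector",
      "database.hostname": "172.151.16.189",
      "database.port": "5433",
      "database.user": "yugabyte",
      "database.password": "yugabyte",
      "database.dbname": "cdc_02d407",
      "database.server.name": "db_cdc",
      "database.master.addresses": "172.151.22.111:7100,172.151.16.189:7100,172.151.27.33:7100",
      "database.streamid": "2b8ca8d553d546f1b1b0b1eaa4b11fe6",
      "snapshot.mode": "never",
      "table.include.list": "public.table_1"
    }
  }'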

YugabyteDB version

2.18.0.0-b61


suranjan commented 1 year ago

The checkpoint matches the last OpId in the WAL.

The last entry in the WAL:

I0508 11:08:06.153332 20467 log_util.cc:792] Scanning /mnt/d0/yb-data/tserver/wals/table-00004007000030008000000000004007/tablet-2924a79b0f5448719e8d73c95cdf5a40/wal-000000007 for valid entry headers following offset 342...
I0508 11:08:06.154304 20467 log_util.cc:836] Found no log entry headers
I0508 11:08:06.154315 20467 log_util.cc:697] Ignoring partially flushed segment in write ahead log /mnt/d0/yb-data/tserver/wals/table-00004007000030008000000000004007/tablet-2924a79b0f5448719e8d73c95cdf5a40/wal-000000007 because there are no log entries following this one. The server probably crashed in the middle of writing an entry to the write-ahead log or downloaded an active log via remote bootstrap. Error detail: Corruption (yb/consensus/log_util.cc:907): Invalid checksum in log entry head header: found=0, computed=2351477386: Failed trying to read batch #3 at offset 342 for log segment /mnt/d0/yb-data/tserver/wals/table-00004007000030008000000000004007/tablet-2924a79b0f5448719e8d73c95cdf5a40/wal-000000007: Prior batch offsets: 248 313 342; Last log entries read: [REPLICATE (2.22903)]
replicate {
  id {
    term: 2
    index: 22903
  }
  hybrid_time: HT{ days: 19485 time: 10:39:56.330332 }
  op_type: NO_OP
  size: 30
  id { term: 2 index: 22903 } hybrid_time: 6895789655369039872 op_type: NO_OP committed_op_id { term: 1 index: 22902 } noop_request { }
}
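
For the record, a dump like the one above can be produced with the log-dump tool built from yb/consensus/log-dump.cc (whether the tool ships in release tarballs is an assumption; this is an invocation sketch only):

# Sketch: print the entries of the suspect segment (tool availability is an assumption).
log-dump /mnt/d0/yb-data/tserver/wals/table-00004007000030008000000000004007/tablet-2924a79b0f5448719e8d73c95cdf5a40/wal-000000007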

The entry in the cdc_state table:

2924a79b0f5448719e8d73c95cdf5a40 | 2b8ca8d553d546f1b1b0b1eaa4b11fe6 | 1.22902 | {'active_time': '1683540067641262', 'cdc_sdk_safe_time': '18446744073709551614'} | 2023-05-08 10:01:07.641000+0000
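
That row can be re-queried from YCQL's system.cdc_state table; a sketch, assuming ycqlsh can reach a tserver and that tablet_id/stream_id are the key columns in this version (IDs are the ones from this issue):

# Fetch the CDC checkpoint row for this tablet/stream pair.
ycqlsh 172.151.16.189 -e "SELECT * FROM system.cdc_state WHERE tablet_id = '2924a79b0f5448719e8d73c95cdf5a40' AND stream_id = '2b8ca8d553d546f1b1b0b1eaa4b11fe6';"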

Need to check what is blocking the WAL GC.
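
One thing worth ruling out first: the WAL retention gflags on the tserver hosting the tablet, since they can pin segments independently of the CDC checkpoint. A hedged check via the tserver web UI, assuming the default port 9000 and flag names from recent versions (the ?raw parameter may vary by build):

# Inspect CDC/WAL retention flags that can keep WAL segments from being GCed.
curl -s 'http://172.151.16.189:9000/varz?raw' | \
  grep -E 'cdc_wal_retention_time_secs|log_min_segments_to_retain|log_min_seconds_to_retain'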