yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[CDC] Data loss observed on master case with tablet splitting with yb_use_hash_splitting_by_default set to false #24992

Closed ShikharSahay closed 11 hours ago

ShikharSahay commented 2 days ago

Jira Link: DB-14134

Description

test_cdc_main_with_tablet_splitting failed on 2024.2.0.0-b116 due to data loss. The failure was seen in per parity runs with the following gflags:

For both tserver and master:
{
    "yb_enable_read_committed_isolation": "true",
    "ysql_enable_read_request_caching": "true",
    "ysql_pg_conf_csv": "yb_enable_base_scans_cost_model=true,yb_enable_optimizer_statistics=true,"
                        "yb_bnl_batch_size=1024,yb_fetch_row_limit=0,yb_fetch_row_limit=0,"
                        "yb_fetch_size_limit='1MB',yb_use_hash_splitting_by_default=false",
}

Source connector version

io.debezium.connector.yugabytedb.YugabyteDBgRPCConnector

Connector configuration

adding yb connector stream_id='229784f881861da3354c68966ad58f81' db_name='cdc_0492c1' connector_host='172.151.27.126' table_list=['test_cdc_9574e9']
2024-11-06 12:27:19,742:DEBUG: add connector connector_name='ybconnector_cdc_0492c1_test_cdc_9574e9' stream_id='229784f881861da3354c68966ad58f81' db_name='cdc_0492c1' connector_host='172.151.27.126' table_list=['test_cdc_9574e9'] {'name': 'ybconnector_cdc_0492c1_test_cdc_9574e9', 'config': {'database.master.addresses': '172.151.23.53:7100,172.151.26.203:7100,172.151.20.123:7100', 'database.hostname': '172.151.23.53:5433,172.151.26.203:5433,172.151.20.123:5433', 'database.port': 5433, 'database.masterhost': '172.151.20.123', 'database.masterport': '7100', 'database.user': 'yugabyte', 'database.password': 'yugabyte', 'database.dbname': 'cdc_0492c1', 'snapshot.mode': 'initial', 'admin.operation.timeout.ms': 600000, 'socket.read.timeout.ms': 300000, 'max.connector.retries': '10', 'operation.timeout.ms': 600000, 'topic.creation.default.compression.type': 'lz4', 'topic.creation.default.cleanup.policy': 'delete', 'topic.creation.default.partitions': 2, 'topic.creation.default.replication.factor': '1', 'tasks.max': '5', 'connector.class': 'io.debezium.connector.yugabytedb.YugabyteDBgRPCConnector', 'database.server.name': 'ybconnector_cdc_0492c1_test_cdc_9574e9', 'database.streamid': '229784f881861da3354c68966ad58f81', 'table.include.list': 'public.test_cdc_9574e9', 'database.sslrootcert': '/kafka/ca.crt'}}

YugabyteDB version

2024.2.0.0-b116

Issue Type

kind/bug


vaibhav-yb commented 9 hours ago

Update:

We have code that calculates the higher offsets for each partition based on the values received from Kafka:

Map<String, ?> lastOffset = entry.getValue().getOffset();
this.ybOffset = getHigherOffsets(lastOffset);

The above has a logical error. For example, suppose ybOffset currently has keys {a, b, c} and lastOffset has only {b, c}:

  1. getHigherOffsets(lastOffset) is written so that it returns a map containing only the keys {b, c}.
  2. Assigning that result to ybOffset overwrites the existing map, so the entry for a is dropped.
  3. Logically, we need to preserve the values in ybOffset that are not present in lastOffset, i.e. a in the above example.

This bug causes the offsets for partition a to be lost, so we end up skipping this partition (or tablet) while committing offsets, which can ultimately lead to data loss as reported in internal QA runs.
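
To illustrate the intended behaviour, here is a minimal sketch of a merge that keeps entries missing from the incoming map. It is not the connector's actual code: the class OffsetMergeSketch and the method mergePreservingMissing are hypothetical names, and offsets are modeled as plain Long values (the real offset type may differ), with "higher" assumed to mean the numerically larger value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not part of the Debezium YugabyteDB connector.
final class OffsetMergeSketch {

    // Returns a new map that starts from every entry in `current` and then
    // folds in `latest`. Keys present only in `current` are preserved; for
    // keys present in both, the larger offset wins (an assumption about what
    // "higher offset" means in this sketch).
    static Map<String, Long> mergePreservingMissing(Map<String, Long> current,
                                                    Map<String, Long> latest) {
        Map<String, Long> merged = new HashMap<>(current);
        for (Map.Entry<String, Long> entry : latest.entrySet()) {
            merged.merge(entry.getKey(), entry.getValue(), Long::max);
        }
        return merged;
    }
}
```

With current = {a, b, c} and latest = {b, c}, the result still contains a, so that partition is not skipped when offsets are committed.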