yugabyte / debezium-connector-yugabytedb

A Debezium CDC connector for the YugabyteDB database
https://docs.yugabyte.com/stable/explore/change-data-capture/using-logical-replication/yugabytedb-connector/
Apache License 2.0

[yugabyte/yugabyte-db#20636] Fetch for tablets from service in snapshot phase as well #320

Closed vaibhav-yb closed 9 months ago

vaibhav-yb commented 9 months ago

Problem

The connector currently works as follows:

  1. The connector obtains the tablets from the service.
  2. HashPartitions are created and the partition keys are passed down to the tasks.
  3. Each task receives an immutable list of tablets and starts processing it.
  4. In the snapshot phase, the task relies directly on the list passed down from the top-level connector and takes the snapshot. In the streaming phase, it validates the tablet list again by asking the service for the current tablets.
  5. If a tablet is split while the connector is in the streaming phase, the connector handles the split gracefully.
  6. If a task is restarted at this stage, the flow starts again from the snapshot phase, and the connector tries to get the checkpoint of each tablet in the task to decide whether to take a snapshot. The tablet list available at that point is stale, so fetching the checkpoint fails because the parent tablet has already been deleted.

Solution

This diff calls GetTabletListToPollForCDC in the snapshot phase as well, which ensures that the task only receives a valid set of tablets to poll.
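The difference between the two behaviors can be illustrated with a small in-memory model. This is only a sketch and not the connector's code: FakeCdcService, getTabletListToPoll, getCheckpoint, and both checkpointsFrom... helpers are hypothetical names invented for the example, and the model assumes for simplicity that child tablets inherit the parent's checkpoint after a split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: none of these types exist in the connector or the YugabyteDB client.
class TabletRefreshSketch {

    /** Hypothetical in-memory stand-in for the CDC service. */
    static class FakeCdcService {
        // Tablet id -> checkpoint, only for tablets that currently exist.
        private final Map<String, Long> liveTablets = new HashMap<>();

        void addTablet(String tabletId, long checkpoint) {
            liveTablets.put(tabletId, checkpoint);
        }

        // Models a split: the parent is deleted and two children replace it.
        // Assumption: children start from the parent's checkpoint.
        void split(String parentId, String leftChildId, String rightChildId) {
            Long checkpoint = liveTablets.remove(parentId);
            if (checkpoint == null) {
                throw new IllegalStateException("Tablet not found: " + parentId);
            }
            liveTablets.put(leftChildId, checkpoint);
            liveTablets.put(rightChildId, checkpoint);
        }

        // Stand-in for asking the service which tablets should be polled.
        List<String> getTabletListToPoll() {
            List<String> ids = new ArrayList<>(liveTablets.keySet());
            Collections.sort(ids);
            return ids;
        }

        // Fails for a tablet that no longer exists, like the deleted parent.
        long getCheckpoint(String tabletId) {
            Long checkpoint = liveTablets.get(tabletId);
            if (checkpoint == null) {
                throw new IllegalStateException("Tablet not found: " + tabletId);
            }
            return checkpoint;
        }
    }

    // Behavior before the fix: trust the tablet list handed to the task.
    static Map<String, Long> checkpointsFromTaskConfig(FakeCdcService service,
                                                       List<String> configuredTablets) {
        Map<String, Long> checkpoints = new TreeMap<>();
        for (String tabletId : configuredTablets) {
            checkpoints.put(tabletId, service.getCheckpoint(tabletId));
        }
        return checkpoints;
    }

    // Behavior after the fix: ask the service for the current tablets first.
    static Map<String, Long> checkpointsFromService(FakeCdcService service) {
        return checkpointsFromTaskConfig(service, service.getTabletListToPoll());
    }

    public static void main(String[] args) {
        FakeCdcService service = new FakeCdcService();
        service.addTablet("parent", 42L);
        List<String> taskTablets = Arrays.asList("parent"); // list given to the task at start
        service.split("parent", "child-1", "child-2");      // split during streaming

        try {
            checkpointsFromTaskConfig(service, taskTablets); // restart with the stale list
        } catch (IllegalStateException e) {
            System.out.println("Stale list failed: " + e.getMessage());
        }
        System.out.println("Refreshed list: " + checkpointsFromService(service));
    }
}
```

In this model, a restarted task that reuses the stale list hits the deleted parent tablet, while the refreshed list contains only the two children.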

Testing

Tested manually using the following steps:

  1. Start the connector in snapshot mode initial with a single tablet.
  2. Split the tablet once the connector reaches the streaming phase.
  3. Insert more data.
  4. Restart the task.
  5. Upon restart:
     i. Without the fix: the task fails while trying to get the checkpoint of the deleted parent tablet.
     ii. With the fix: the connector proceeds to the streaming phase without any error.

This fixes yugabyte/yugabyte-db#20636