yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB] XCluster: Replication can stall if there are multiple DDLs and the xCluster poller restarts #24990

Open lingamsandeep opened 5 days ago

lingamsandeep commented 5 days ago

Jira Link: DB-14131

Description

If the following sequence of events occurs, xCluster replication can stall.

T0 - The xCluster source checkpoints the OpId up to which the xCluster target is caught up.
T1 - DDL 1 alters the table, bumping the schema version to 1.
T2 - DDL 2 alters the table, bumping the schema version to 2.
T3 - DDL 3 alters the table, bumping the schema version to 3.
T4 - The xCluster poller on the target applies everything up to T3. In doing so, the xCluster target creates Source->Target schema version mappings for only the two most recent schema versions, so it holds [source schema version 2: target schema version x] and [source schema version 3: target schema version y].
T5 - The xCluster poller restarts, causing it to resume from the most recent checkpoint on the source.

If the xCluster source has not updated its checkpoint since T0, and it either goes through a restart or its in-memory checkpoint state expires, the source starts sending changes from T1 onwards. It therefore re-sends the ChangeMetadataOp associated with DDL 1.

The target ignores this ChangeMetadataOp because it already has mappings for the newer schema versions 2 and 3. However, when rows with schema version 1 are then received, the xCluster target cannot handle them because it no longer has a mapping for schema version 1, and replication fails.
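
The sequence above can be reproduced with a small toy model. This is only a sketch of the behavior described in this issue, not YugabyteDB code: `ToyTarget`, `SOURCE_LOG`, `replay`, and `MissingSchemaMapping` are hypothetical names, the initial mapping for schema version 0 and the identity Source->Target mapping are simplifying assumptions, and the retention limit of two versions is taken from the T4 step.

```python
from collections import OrderedDict

# Assumption from the issue text: the target keeps mappings only for the
# two most recent source schema versions.
MAX_RETAINED_VERSIONS = 2


class MissingSchemaMapping(Exception):
    """Raised when a row arrives with a source schema version the target cannot map."""


class ToyTarget:
    """Hypothetical stand-in for the xCluster target; not YugabyteDB code."""

    def __init__(self):
        # source schema version -> target schema version, oldest first.
        # Starting with a mapping for version 0 is an assumption of this sketch.
        self.mappings = OrderedDict({0: 0})
        self.applied_rows = []

    def apply_ddl(self, source_version):
        # A re-sent ChangeMetadataOp for an older schema version is ignored.
        if source_version <= max(self.mappings):
            return
        # Identity mapping for simplicity; real target versions may differ.
        self.mappings[source_version] = source_version
        while len(self.mappings) > MAX_RETAINED_VERSIONS:
            self.mappings.popitem(last=False)  # evict the oldest mapping

    def apply_row(self, source_version, row):
        if source_version not in self.mappings:
            raise MissingSchemaMapping(
                f"no mapping for source schema version {source_version}"
            )
        self.applied_rows.append((self.mappings[source_version], row))


# Everything the source produced after the T0 checkpoint (T1 through T3),
# with one row written under each schema version.
SOURCE_LOG = [
    ("ddl", 1), ("row", 1, "a"),
    ("ddl", 2), ("row", 2, "b"),
    ("ddl", 3), ("row", 3, "c"),
]


def replay(target, log):
    """Apply every record in the log to the target, in order."""
    for record in log:
        if record[0] == "ddl":
            target.apply_ddl(record[1])
        else:
            target.apply_row(record[1], record[2])
```

Calling `replay(target, SOURCE_LOG)` once models T4: it succeeds and leaves mappings for source schema versions 2 and 3 only. Calling it a second time on the same target models the restart at T5 with the checkpoint still at T0: the re-sent DDL 1 is ignored, and the first row with schema version 1 raises `MissingSchemaMapping`.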

Proposed solution:

Issue Type

kind/bug
