If the following sequence of events happens, xCluster replication can stall.
T0 - xCluster source checkpoints the OpId up to which the xCluster target has caught up.
T1 - DDL 1 alters the table, bumping the schema version to 1.
T2 - DDL 2 alters the table, bumping the schema version to 2.
T3 - DDL 3 alters the table, bumping the schema version to 3.
T4 - xCluster poller on the target applies everything up to T3. In doing so, the xCluster target creates a mapping of Source->Target schema versions for only the 2 most recent schema versions, so it holds [source schema version 2: target schema version x], [source schema version 3: target schema version y].
T5 - xCluster poller restarts, causing it to start from the most recent checkpoint on the source.
If the xCluster source has not updated its checkpoint since T0 and goes through a restart, or its in-memory checkpoint state expires, the xCluster source starts sending changes from T1 onwards. So it ends up re-sending the ChangeMetadataOp associated with DDL 1.
The target ignores this op because it already has mappings for the newer schema versions 2 and 3. However, when rows written with schema version 1 then arrive, the xCluster target does not know how to handle them, as it no longer has a mapping for schema version 1, causing replication to fail.
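The failure can be reproduced with a small toy model. This is only a sketch of the behaviour described in this issue, not the actual xCluster code: `TargetSchemaMap`, `apply_change_metadata`, `apply_row` and `replay_scenario` are hypothetical names, and the only facts taken from the issue are that the target keeps the 2 most recent mappings and ignores a re-sent older ChangeMetadataOp.

```python
MAX_MAPPINGS = 2  # per the issue: only the 2 most recent schema versions are kept


class TargetSchemaMap:
    """Hypothetical stand-in for the target's Source->Target schema version map."""

    def __init__(self):
        self.versions = {}  # source schema version -> target schema version

    def apply_change_metadata(self, source_version, target_version):
        # A re-sent DDL older than every tracked version is ignored.
        if self.versions and source_version < min(self.versions):
            return False
        self.versions[source_version] = target_version
        # Evict the oldest mappings beyond the retention limit.
        while len(self.versions) > MAX_MAPPINGS:
            del self.versions[min(self.versions)]
        return True

    def apply_row(self, source_version):
        if source_version not in self.versions:
            raise LookupError(
                f"no mapping for source schema version {source_version}")
        return self.versions[source_version]


def replay_scenario():
    target = TargetSchemaMap()
    # T1..T4: the target applies DDL 1, DDL 2 and DDL 3 in order.
    for version in (1, 2, 3):
        target.apply_change_metadata(version, f"target-v{version}")
    # T5: the source replays from the T0 checkpoint, re-sending DDL 1.
    accepted = target.apply_change_metadata(1, "target-v1")
    try:
        target.apply_row(1)  # a row written with schema version 1
        stalled = False
    except LookupError:
        stalled = True
    return sorted(target.versions), accepted, stalled
```

In this model `replay_scenario()` leaves the target holding mappings for versions 2 and 3 only, reports the re-sent DDL 1 as ignored, and fails on the first row that carries schema version 1.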
Proposed solution:
Force a checkpoint during GetChanges on the xCluster source whenever a CHANGE_METADATA_OP is part of the returned changes. This ensures that the checkpoint is always at most 1 DDL behind.
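The effect of the proposal can be sketched with a toy source tablet. Everything here is an assumption made for illustration, not the real GetChanges implementation: `SourceTablet`, `get_changes` and `ddls_resent_after_restart` are hypothetical names, the persisted checkpoint is taken to be the OpId the target reports as applied in its request, and each batch is assumed to end at the first CHANGE_METADATA_OP.

```python
from dataclasses import dataclass
from typing import List

DDL = "CHANGE_METADATA_OP"
WRITE = "WRITE_OP"


@dataclass(frozen=True)
class Op:
    op_id: int
    kind: str


class SourceTablet:
    """Hypothetical model of the source side of GetChanges."""

    def __init__(self, log: List[Op], force_checkpoint_on_ddl: bool):
        self.log = log
        self.force_checkpoint_on_ddl = force_checkpoint_on_ddl
        self.persisted_checkpoint = 0  # T0: OpId the target had applied

    def get_changes(self, from_op_id: int) -> List[Op]:
        batch = []
        for op in self.log:
            if op.op_id <= from_op_id:
                continue
            batch.append(op)
            if op.kind == DDL:
                break  # assumption: a batch ends at a schema change
        if self.force_checkpoint_on_ddl and any(op.kind == DDL for op in batch):
            # The target has applied everything up to from_op_id, so persist it.
            self.persisted_checkpoint = from_op_id
        return batch


def ddls_resent_after_restart(force_checkpoint_on_ddl: bool) -> List[int]:
    log = [Op(1, DDL), Op(2, WRITE), Op(3, DDL),
           Op(4, WRITE), Op(5, DDL), Op(6, WRITE)]
    source = SourceTablet(log, force_checkpoint_on_ddl)

    # T1..T4: the target polls until it has caught up.
    applied = 0
    while batch := source.get_changes(applied):
        applied = batch[-1].op_id

    # T5: the poller restarts from the checkpoint persisted on the source.
    resent = []
    position = source.persisted_checkpoint
    while batch := source.get_changes(position):
        resent += [op.op_id for op in batch if op.kind == DDL]
        position = batch[-1].op_id
    return resent
```

Under these assumptions, `ddls_resent_after_restart(False)` replays all three DDLs (ops 1, 3 and 5), while `ddls_resent_after_restart(True)` replays only the last one (op 5), so every replayed row falls within the 2 schema versions the target still maps.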
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
[X] I confirm this issue does not contain any sensitive information.
Jira Link: DB-14131