xCluster replication replicates data changes from a producer universe to a consumer universe.
It uses a variant of Change Data Capture (CDC) to stream changes captured from producer tablet Raft logs. These streams (one per producer tablet being replicated) have at least once semantics. That is, each change is guaranteed to be delivered at least once.
The catch is that the same change can be delivered more than once and the system needs to handle this.
In slightly more detail, there is a persistent checkpoint associated with each stream, which is periodically advanced. If the receiver crashes, processing restarts at the last checkpoint taken. This can cause the last bunch of changes to be repeated (in the same order). Thus the stream can occasionally stutter, repeating the last N changes.
Repeated schema changes
Today we do not replicate DML operations directly in the sense of applying the same DML operation on the consumer as the producer. What we do instead is send over the new schema; replication will be paused until the consumer schema matches the new schema from the producer. Admins are expected to apply DML operations one at a time, each first to the consumer side and then to the producer side.
It is clear that repeating the same schema change operation is idempotent. However, a sequence of schema change ops are not. For example, the sequence of schema ops X Y X Y can leave us stuck even if the consumer cluster upgrades first to X then to Y.
This issue
To deal with these problems, we want to have the consumer remember each schema change it processes from the producer (e.g., persist the monotonically increasing schema version number) and not process schema changes it has already seen.
That is we would process X then Y but when we see the second occurrence of X we would recognize it as a repeat and skip it; likewise the second occurrence of Y.
Warning: Please confirm that this issue does not contain any sensitive information
[X] I confirm this issue does not contain any sensitive information.
Jira Link: DB-6530
Description
Background
xCluster replication replicates data changes from a producer universe to a consumer universe.
It uses a variant of Change Data Capture (CDC) to stream changes captured from producer tablet Raft logs. These streams (one per producer tablet being replicated) have at least once semantics. That is, each change is guaranteed to be delivered at least once.
The catch is that the same change can be delivered more than once and the system needs to handle this.
In slightly more detail, there is a persistent checkpoint associated with each stream, which is periodically advanced. If the receiver crashes, processing restarts at the last checkpoint taken. This can cause the last bunch of changes to be repeated (in the same order). Thus the stream can occasionally stutter, repeating the last N changes.
Repeated schema changes
Today we do not replicate DML operations directly in the sense of applying the same DML operation on the consumer as the producer. What we do instead is send over the new schema; replication will be paused until the consumer schema matches the new schema from the producer. Admins are expected to apply DML operations one at a time, each first to the consumer side and then to the producer side.
It is clear that repeating the same schema change operation is idempotent. However, a sequence of schema change ops are not. For example, the sequence of schema ops X Y X Y can leave us stuck even if the consumer cluster upgrades first to X then to Y.
This issue
To deal with these problems, we want to have the consumer remember each schema change it processes from the producer (e.g., persist the monotonically increasing schema version number) and not process schema changes it has already seen.
That is we would process X then Y but when we see the second occurrence of X we would recognize it as a repeat and skip it; likewise the second occurrence of Y.
Warning: Please confirm that this issue does not contain any sensitive information