yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
9k stars 1.07k forks source link

[xCluster] Make xCluster replication ignore producer schema changes if the consumer has already seen them #17306

Open mdbridge opened 1 year ago

mdbridge commented 1 year ago

Jira Link: DB-6530

Description

Background

xCluster replication replicates data changes from a producer universe to a consumer universe.

It uses a variant of Change Data Capture (CDC) to stream changes captured from producer tablet Raft logs. These streams (one per producer tablet being replicated) have at least once semantics. That is, each change is guaranteed to be delivered at least once.

The catch is that the same change can be delivered more than once and the system needs to handle this.

In slightly more detail, there is a persistent checkpoint associated with each stream, which is periodically advanced. If the receiver crashes, processing restarts at the last checkpoint taken. This can cause the last bunch of changes to be repeated (in the same order). Thus the stream can occasionally stutter, repeating the last N changes.

Repeated schema changes

Today we do not replicate DML operations directly in the sense of applying the same DML operation on the consumer as the producer. What we do instead is send over the new schema; replication will be paused until the consumer schema matches the new schema from the producer. Admins are expected to apply DML operations one at a time, each first to the consumer side and then to the producer side.

It is clear that repeating the same schema change operation is idempotent. However, a sequence of schema change ops are not. For example, the sequence of schema ops X Y X Y can leave us stuck even if the consumer cluster upgrades first to X then to Y.

This issue

To deal with these problems, we want to have the consumer remember each schema change it processes from the producer (e.g., persist the monotonically increasing schema version number) and not process schema changes it has already seen.

That is we would process X then Y but when we see the second occurrence of X we would recognize it as a repeat and skip it; likewise the second occurrence of Y.

Warning: Please confirm that this issue does not contain any sensitive information

mdbridge commented 1 year ago

This issue was created as part of the investigation done in [xCluster] Handle any issues due to replication repeating changes #15547