The XClusterOutboundReplicationGroupParameterized.MasterRestartDuringCheckpoint test exposed a case in which checkpointing does not survive a master restart.
The failure is caused by the following error:
[P-m-1] W1022 09:10:57.290823 1869164544 xcluster_outbound_replication_group.cc:316] xClusterOutboundReplicationGroup rg1 :Failed to checkpoint namespace 00004000000030008000000000000000: Service unavailable (yb/master/catalog_manager.cc:2105): Catalog manager is shutting down. State: 3
This task is to fix semi-automatic mode so that it survives master restarts even in this case. In addition, this test (or a new one) should be changed so that it deterministically detects that the current code fails in this case.
Note that this test has been disabled in the meantime.
Also note that there is a separate task to make automatic mode survive master restarts:
[DocDB] Make xCluster automatic mode setup survive master restarts
Jira Link: DB-13826
Issue Type
kind/bug