[xCluster] Replication timeouts due to copy commands

Description

Observed this on: 2.23.0.0-b574

In a colocated database, after setting up replication and running the following COPY commands, the data is not replicated to the target. The GetChanges RPCs were timing out.

COPY employees_col TO '/tmp/data';
COPY employees_col_cpy FROM '/tmp/data';

Logs in Jira ticket

Steps

Summary Of Tests:
testdbdrscenarios-aws-rf3: Start
    (     0.417s) User Login : Success
    (     0.149s) Refresh YB Version : Success
    (    95.702s) Setup Provider : Success
    (     0.108s) Enable RBAC Flag : Success
    (     0.000s) Copy YBA CLI Package : Success
    (     0.094s) Updating Health Check Interval to 60000 sec : Success
    (   481.588s) Create universe sagr-isd12555-cb954b5d7c-20240710-103648-1 : Success
    (    18.383s) Updating Health Check Interval to 60000 sec : Success
    (   481.399s) Create universe sagr-isd12555-cb954b5d7c-20240710-103648-2 : Success
    (    43.778s) Create Secondary Index. : Success
    (     0.190s) Create Unique Index. : Success
    (     0.158s) Create Partial Index. : Success
    (     0.164s) Create Secondary Index. : Success
    (     0.210s) Create Unique Index. : Success
    (     0.190s) Create Partial Index. : Success
    (     2.414s) Create Secondary Index. : Success
    (     2.283s) Create Unique Index. : Success
    (     2.069s) Create Partial Index. : Success
    (    41.159s) Create Secondary Index. : Success
    (     0.174s) Create Unique Index. : Success
    (     0.229s) Create Partial Index. : Success
    (     0.200s) Create Secondary Index. : Success
    (     0.248s) Create Unique Index. : Success
    (     0.225s) Create Partial Index. : Success
    (     2.459s) Create Secondary Index. : Success
    (     1.978s) Create Unique Index. : Success
    (     2.295s) Create Partial Index. : Success
    (    68.359s) Setup DR replication and schema : Success
    (   318.430s) Perform copy cmds : >>> Integration Test Failed <<< 
Data validation got failed  for employees_col_cpy, 
Data length at source: 1000, 
Data length at target: 0

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

[X] I confirm this issue does not contain any sensitive information.

Why does this happen?

xCluster uses the RocksDB rate limiter to throttle its data transfer rate. There is one rate limiter on the source (to protect source tablets) and one on the target (to protect target tablets). The default rate limit is 100MBps (--xcluster_get_changes_max_send_rate_mbps). We check the rate on every batch we send, and batches are typically capped at 4MB (--consensus_max_batch_size_bytes).

COPY commands generate WAL ops that can get much larger, up to 255MB (--rpc_max_message_size). Since we cannot break apart WAL ops (each is an atomic commit batch), they are allowed to exceed the xCluster 4MB batch limit. The RocksDB rate limiter has a bug that causes it to block forever on any call larger than 10MB (100MBps * 100ms), so these GetChanges responses hang forever while holding the large memory they allocated.

There is a safety mechanism on the source that uses a semaphore (get_changes_rpcsem) to limit the number of in-flight GetChanges calls to 921 (FLAGS_rpc_workers_limit * (1 - FLAGS_cdc_get_changes_free_rpc_ratio)). Depending on the amount of available memory and threads, we either run out of one of those resources or hit the semaphore limit, after which GetChanges fails with a LeaderNotReadyToServe error from cdc_service.cc:1460.
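The arithmetic behind the 10MB threshold and the 921 in-flight limit can be sketched in a few lines of Python. This is only a simplified model, not the RocksDB or YugabyteDB code: the helper names are hypothetical, the 100ms refill period is taken from the "100MBps * 100ms" figure above, and the values 1024 and 0.10 for the two flags are assumptions chosen because they reproduce the 921 quoted in this issue.

```python
MB = 1024 * 1024


def bytes_per_refill(rate_mb_per_sec: int, refill_period_ms: int) -> int:
    """Bytes a limiter hands out in one refill period (rate * period)."""
    return rate_mb_per_sec * MB * refill_period_ms // 1000


def can_ever_be_granted(request_bytes: int,
                        rate_mb_per_sec: int = 100,
                        refill_period_ms: int = 100) -> bool:
    """Toy model of the reported bug: a single request larger than one
    refill period's worth of bytes is never satisfied, so the caller blocks."""
    return request_bytes <= bytes_per_refill(rate_mb_per_sec, refill_period_ms)


def max_in_flight_get_changes(rpc_workers_limit: int,
                              free_rpc_ratio: float) -> int:
    """Semaphore size: rpc_workers_limit * (1 - free_rpc_ratio), truncated."""
    return int(rpc_workers_limit * (1 - free_rpc_ratio))


if __name__ == "__main__":
    # 100MBps * 100ms = 10MB per refill period.
    print(bytes_per_refill(100, 100) // MB, "MB per refill")
    # A regular 4MB batch fits; a 255MB COPY-generated WAL op does not.
    print("4MB batch granted:", can_ever_be_granted(4 * MB))
    print("255MB WAL op granted:", can_ever_be_granted(255 * MB))
    # Assumed flag values (1024 workers, 0.10 free ratio), not confirmed here.
    print("in-flight limit:", max_in_flight_get_changes(1024, 0.10))
```

Under these assumptions, every GetChanges call that carries an oversized WAL op occupies one semaphore slot indefinitely, which is why the source eventually stops serving new calls.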
