yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB] [Scale out] YB is unable to perform RBS writes on a newly added node at the throughput set by the RBS rate limiter #24031

Open shantanugupta-yb opened 3 weeks ago

shantanugupta-yb commented 3 weeks ago

Jira Link: DB-12920

Description

Issue: With the RBS limit set to 256 MBps, the maximum RBS write throughput on the newly added node was 200 MBps (after setting remote_bootstrap_idle_timeout_ms to 5 minutes).

With the RBS limit set to 0, the RBS write throughput scales roughly linearly: 135 MBps, 270 MBps, and 420 MBps with 1, 2, and 3 concurrent RBS sessions respectively. Since the master's load balancer gflags are set so that they do not throttle RBS throughput, the expectation is that RBS writes on the newly added node should run at the rate set by the RBS rate limiter (remote_bootstrap_rate_limit_bytes_per_sec).
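As a quick sanity check on the scaling claim, the sketch below divides each observed throughput by its session count and estimates how many concurrent sessions would be needed to reach the 256 MBps limit. The helper names are hypothetical, and the estimate assumes perfectly linear scaling, which the three data points only roughly support.

```python
import math

# Observed RBS write throughput (MBps) with the rate limit set to 0,
# keyed by the number of concurrent RBS sessions (values from this report).
observed_mbps = {1: 135, 2: 270, 3: 420}

def per_session_mbps(samples):
    """Return the throughput per concurrent session for each sample."""
    return {n: total / n for n, total in samples.items()}

def sessions_needed(limit_mbps, single_session_mbps):
    """Smallest session count whose combined throughput reaches the limit.

    Assumption: throughput scales linearly with the number of sessions.
    """
    return math.ceil(limit_mbps / single_session_mbps)

per_session = per_session_mbps(observed_mbps)
print(per_session)
print(sessions_needed(256, observed_mbps[1]))
```

Under that assumption, two concurrent sessions should already be enough to saturate a 256 MBps limit, so the limiter rather than raw per-session capacity would be expected to set the ceiling.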

Details of the two tests showcasing the above issue:

With the RBS rate limiter (remote_bootstrap_rate_limit_bytes_per_sec) set to 256 MBps, the expected write throughput on the newly added node was ~256 MBps, but the observed RBS write throughput was 115 MBps for 55 min followed by 190 MBps for the final 15 min.

Query used: (sum by(exported_instance)(rate({ node_prefix="$dbcluster",saved_name=~"proxy_response_bytes_yb_tserver_RemoteBootstrapService_FetchData"}[60s])))


The cluster was scaled up from 3 to 4 nodes.
YB version: 2.23.1.0-b195
Disk throughput: 600 MBps

Yb logs at location

Updating the defect with the cause of the over-throttling that resulted in 115 MBps for 55 min + 190 MBps for 15 min:

After setting remote_bootstrap_idle_timeout_ms to 5 min on each node, the incoming RBS write throughput on the newly added node was ~200 MBps, and data load balancing completed in 48 min.

Reducing the RBS idle timeout appears to address the over-throttling caused by expired sessions. However, the RBS write throughput is still over-throttled by ~50 MBps in this case.

Observing ~200 MBps of disk reads and network received bytes on N4.
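The two runs can be compared with some back-of-the-envelope arithmetic. The sketch below treats "MBps" as megabytes per second and each phase as a constant rate, both of which are simplifying assumptions; the function names are made up for illustration.

```python
def transferred_mb(phases):
    """Total data moved in MB, given (rate_mbps, minutes) phases."""
    return sum(rate * minutes * 60 for rate, minutes in phases)

def average_mbps(phases):
    """Time-weighted average throughput across all phases."""
    total_seconds = sum(minutes for _, minutes in phases) * 60
    return transferred_mb(phases) / total_seconds

def ideal_minutes(total_mb, limit_mbps):
    """Time the transfer would take if it ran exactly at the limit."""
    return total_mb / limit_mbps / 60

# Rates and durations as reported in this issue.
default_timeout_run = [(115, 55), (190, 15)]  # first test
reduced_timeout_run = [(200, 48)]             # idle timeout set to 5 min
LIMIT_MBPS = 256

run1_mb = transferred_mb(default_timeout_run)
run2_mb = transferred_mb(reduced_timeout_run)
print(run1_mb, run2_mb)
print(round(average_mbps(default_timeout_run), 1))
print(ideal_minutes(run2_mb, LIMIT_MBPS))
```

Both runs come out at roughly the same total volume (about 550 to 576 GB), which suggests the reported rates are consistent with each other. At the full 256 MBps the same volume would take about 37.5 min, compared with the 48 min and 70 min observed.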


Issue Type

kind/bug


rthallamko3 commented 1 week ago

The main thing to address: the accounting should not count failed RBS sessions toward the throttling limit (#21563). cc @basavaraj29
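To illustrate why this accounting matters, here is a toy model, not YugabyteDB's actual implementation. It assumes the configured limit is split evenly across every session the limiter counts, including failed or expired sessions that no longer transfer data. The function and parameter names are hypothetical.

```python
def effective_throughput_mbps(limit_mbps, active_sessions, stale_sessions):
    """Aggregate throughput achieved by the active sessions.

    Assumption (hypothetical model): the limit is divided evenly across
    all counted sessions, so every stale session still holds a share of
    the budget that nothing uses.
    """
    counted = active_sessions + stale_sessions
    if counted == 0:
        return 0.0
    return limit_mbps * active_sessions / counted

# With no stale sessions, the active sessions can use the full limit.
print(effective_throughput_mbps(256, 3, 0))
# One stale session still counted means a quarter of the budget is wasted.
print(effective_throughput_mbps(256, 3, 1))
```

In this model the wasted share grows with the number of stale sessions, which fits the observation that shortening the idle timeout raised throughput. Whether it also explains the remaining gap is not established here.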