yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[Docs] Lack of documentation for safely performing cluster balancing activities #22980

Open shivangraina opened 1 month ago

shivangraina commented 1 month ago

Description

For performing cluster balancing activities on the YugabyteDB cluster such as:

The developer docs have little information on gracefully handling scenarios that involve data movement, such as replacing a failed node. For that case we should document a step to blacklist the node, so that its replicas are moved off gracefully before the node is removed. For other scenarios (e.g. node upgrade/patching), where the node is expected to come back quickly after maintenance, we should add a step to blacklist leaders on the node. This helps avoid failures for in-flight requests that the client does not retry during the activity.
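To make the two cases concrete, here is a minimal sketch that only builds the command lines and does not run them. It assumes the `yb-admin` subcommands `change_blacklist` (drain all replicas) and `change_leader_blacklist` (move only leaders) and the `-master_addresses` flag; the helper name `blacklist_cmd` and all addresses are hypothetical, and the exact syntax should be checked against the yb-admin reference for the version in use.

```python
from typing import List

# Assumed yb-admin subcommand names; verify them against the yb-admin
# reference for your YugabyteDB version before running anything.
DATA_BLACKLIST = "change_blacklist"           # drains all replicas off the node
LEADER_BLACKLIST = "change_leader_blacklist"  # moves only tablet leaders away


def blacklist_cmd(master_addresses: str, node: str, action: str = "ADD",
                  leaders_only: bool = False) -> List[str]:
    """Build (but do not run) a yb-admin blacklist command line."""
    if action not in ("ADD", "REMOVE"):
        raise ValueError(f"action must be ADD or REMOVE, got {action!r}")
    subcommand = LEADER_BLACKLIST if leaders_only else DATA_BLACKLIST
    return ["yb-admin", "-master_addresses", master_addresses,
            subcommand, action, node]


if __name__ == "__main__":
    masters = "10.0.0.1:7100,10.0.0.2:7100,10.0.0.3:7100"  # hypothetical
    node = "10.0.0.4:9100"                                 # hypothetical
    # Node replacement: drain every replica before removing the node.
    print(" ".join(blacklist_cmd(masters, node)))
    # Short maintenance: move leaders away, then undo after the restart.
    print(" ".join(blacklist_cmd(masters, node, leaders_only=True)))
    print(" ".join(blacklist_cmd(masters, node, "REMOVE", leaders_only=True)))
```

The leader blacklist is removed again after maintenance so the node can take leaders back; for a node replacement the data blacklist would stay in place until the drain has finished and the node is gone.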


ddorian commented 1 month ago

Hi @shivangraina

For now we've used the https://docs.yugabyte.com/preview/manage/change-cluster-config/ page as a way to include them all. This shows how to change every node of the cluster (masters & tservers).

There are also some pages on https://docs.yugabyte.com/preview/troubleshoot/cluster/

> The developer docs don't have much information for gracefully handling such scenarios that include the movement of data such as (replacing a failed node).

https://docs.yugabyte.com/preview/troubleshoot/cluster/replace_tserver/ & https://docs.yugabyte.com/preview/troubleshoot/cluster/replace_master/

> For other scenarios (ex: node upgrade/patching) where the expectation is that the node will come back again quickly after performing the maintenance, we should add a step to blacklist leaders on this node.

I assumed the restart is very fast and is best done in place. Existing connections will be lost, but the clients retry automatically and the retried requests succeed.
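For clients that do not retry on their own, the behaviour described above could be approximated with a small wrapper. This is a generic sketch, not YugabyteDB driver code: `retry_on_connection_loss` is a hypothetical helper, and Python's built-in `ConnectionError` stands in for whatever exception a real driver raises when its connection drops.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_connection_loss(op: Callable[[], T], attempts: int = 3,
                             base_delay: float = 0.1,
                             sleep: Callable[[float], None] = time.sleep) -> T:
    """Run op, retrying with exponential backoff when the connection drops."""
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)  # waits 0.1s, 0.2s, 0.4s, ...
    raise ValueError("attempts must be at least 1")
```

Blind retries like this are only safe for idempotent operations; a write that may already have been applied before the connection dropped needs application-level handling.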

cc @hari90 ?