YB Scale Up/Down and other lifecycle operations

vishal-biyani commented 4 years ago

This issue is related to https://github.com/yugabyte/yugabyte-db/issues/4047, https://github.com/yugabyte/yugabyte-db/issues/4037, and conversation on slack around scaling of YugaByte cluster. The issue summarises use cases, possible solutions, and draft design of potential change to the YB operator.

There are some issues in K8S community which describe scenarios for more hooks than just PreStop and PostStart, one of them which has links to some related discussions: https://github.com/kubernetes/kubernetes/issues/25275

The limitation with PreStop hook is that the hook will be executed even in case of liveness probe failure, preemption, and resource contention.

One of the other options discussed is about using VPA (Vertical Pod Autoscaler) - which re-creates the pods with lower resources assigned. This works with statefulset too but this is far more disruptive operation than scaling horizontally. Also, there are some details around StatefulSet and volumes provisioned in different AZs for HA which affect VPA operations. Reference. VPA is also a separate component to be run and managed within the cluster.

The recommended approach here is to enable this logic in the operator as every product has its way of dealing with it.

User Interface

When a user wants to scale up from 3 tablet servers to let's say 6 tablet servers, the specification of CR will change from source to target spec as shown below. The change can be done by a Git-based pipeline or by applying changed CR via Kubectl.

Source Spec:

...
  tserver:
    replicas: 3
    tserverUIPort: 9000
...

Target Spec:

...
  tserver:
    replicas: 6
    tserverUIPort: 9000
...

Changes to Operator

Currently, the update of a tablet server statefulset is applying the new specs but not doing anything additional to the database itself.

Reference: https://github.com/yugabyte/yugabyte-operator/blob/master/pkg/controller/ybcluster/ybcluster_update_controller.go#L63-L87

The overall update will involve a few phases - which later can be expanded to accommodate other use cases too.

Identify the current state vs. the desired state
Based on (1) there might be operations such as scale-up, scale down, backup data, etc.
For each of the operations, a phase will be called and executed

Specifically for Scale Down operation as an example:

Check server is in a healthy state
Determine the pods to be scaled down
Label the pods identified to be removed as "status:being_decomissioned" or something of that effect
Change blacklist to remove the pods from list of TServers through an API/CLI (https://docs.yugabyte.com/latest/admin/yb-admin/#change-blacklist)
Wait for the data move to complete (https://docs.yugabyte.com/latest/admin/yb-admin/#get-load-move-completion)
Once the data move is complete, remove specific nodes from Stateful set

Testing & Acceptance

User should be able to scale down T servers by changing spec and rest of the operation will be done by the operator
The event of each logical operation should be updated on the status field of CR.

Limitations/Future work

This approach works with specs being applied via Kubectl. The kubectl scale approach is neither recommended for StatefulSets nor will work in this scenario. In the future we can potentially have a Kubectl plugin for running ybdamin commands interactively. But the general practice is to apply such changes through a Git change so that the history is maintained and state of things in cluster and source code is consistent.

References:

Elastic Operator: https://github.com/elastic/cloud-on-k8s

bhavin192 commented 4 years ago

I'm proposing this way to scale down TServers, (initial implementation is in #17)

Solution without finalizers

User decreases the replica count (gets validated against RF)
We set the Status TargetedTServerReplicas: 3 (3 is the number to which user want to scale down to).
- This value is not updated if the Status condition is ScalingDownTServers: True.
Add Status condition ScalingDownTServers: True on ybcluster resource.
We select pods with higher index value for removal (we will select yb-tserver-3 and yb-tserver-4 if going from 5 TServers to 3)
1. Annotate these pods with yb.com/blacklist: true.
If the Status is ScalingDownTServers: True or MovingData: True
- We don't act on any new changes in the spec (scale up/scale down)
- Operation 3 will be skipped. Basically it won't update the LastTransitionTime of ScalingDownTServers: True condition.
- We will make sure that the intended number of pods are annotated with yb.com/blacklist: true i.e. operation 4.
  - This calculation will use the Status TargetedTServerReplicas: 3.
Blacklist sync will kick in and make sure that the pods with yb.com/blacklist: true are in YB-Master's blacklist
Update the STS only if (i.e. setting the replicas to 3 from 5)
- The status condition is ScalingDownTServers: True.
- Value of status condition is MovingData: False and it is updated (at least 5 minutes?) after the status condition ScalingDownTServers's update time. (This will make sure that we don't update the STS immediately and wait for the data move to start and it is reflected correctly on the ybcluster's Status)
- If above is not satisfied then requeue the request with exponential back-off.
Progress sync will keep checking the progress and make sure the status condition MovingData's value reflects that.
- It is True if progress is not 100%, otherwise False

Syncing the blacklisted pods with YB-Master's blacklist

We will add annotation on Pod yb.com/blacklist: true. If that is present, controller will make sure that this Pod's FQDN is in blacklist as well.
~~Once the sync with masters is completed we will add yb.com/synced: true on the Pod.~~
If the annotation if not present then it will be removed from blacklist
With this mechanism, even if we have some FQDN in blacklist and a new Pod having same FQDN (without annotation) comes up, it will get removed from blacklist

nodiex-cloud commented 3 years ago

I am presuming that this has not been implemented correct? What are the day2 actions that the operator currently facilitates? What is the intended roadmap of this operator vs the Rook Yugabyte functionality?

yugabyte / yugabyte-operator