Open l0kix2 opened 8 months ago
It is possible to store this flag in Cypress of the updated YT cluster, but I'm not 100% sure it wouldn't bite us at some point when the cluster is unavailable and we can't read or write it. But maybe if we implement the flow carefully, that would be a good solution.
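A rough sketch of the Cypress idea using the `yt` CLI (the attribute path and name are assumptions, not an existing convention):

```shell
# The operator could reset the flag in Cypress after a successful full update;
# a human would set it to %true to approve the next one.
# The attribute path //sys/@full_update_approved is hypothetical.
yt set "//sys/@full_update_approved" "%false"
yt get "//sys/@full_update_approved"
```

The obvious downside, as noted above, is that this state lives inside the cluster being updated, so it is unreachable exactly when the cluster is down.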
Maybe we can use the Ytsaurus resource in some other way: set a label, for example, or edit a condition/status (is that possible via kubectl?). Or maybe we can just let the operator set this field back to false by itself?
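To illustrate the label variant (the label key is hypothetical, not an existing API): a human approves the next full update by labeling the resource, and the operator removes the label once the update completes:

```shell
# Human approves a full update (label key is an assumption):
kubectl -n yt label ytsaurus yt ytsaurus.tech/full-update-approved=true

# After the full update finishes, the operator itself removes the label,
# i.e. the equivalent of:
kubectl -n yt label ytsaurus yt ytsaurus.tech/full-update-approved-
```

Unlike a spec field, a label can be added and removed without touching the manifest that is applied on deploy, which sidesteps the "forgot to flip it back" problem.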
Maybe we can have some fuse resource which is created by the operator on successful cluster update and is never deleted by the operator. A human could then delete that resource via kubectl, which would trigger a full update.
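A minimal sketch of the fuse idea, assuming a plain ConfigMap plays the role of the fuse (the name and convention are hypothetical):

```yaml
# Created by the operator after a successful full update; never deleted by it.
# A human deletes it to approve the next full update:
#   kubectl -n yt delete configmap yt-full-update-fuse
apiVersion: v1
kind: ConfigMap
metadata:
  name: yt-full-update-fuse
  namespace: yt
data:
  approvedAt: "<set by operator>"
```

The nice property is that the "armed" state is the absence of the object, so a fresh deploy can never accidentally leave the cluster approved for a full update.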
I could suggest two approaches for this, or a combination of them. The first is to use API aggregation, which allows implementing custom operations for specific tasks:
```shell
kubectl yt-upgrade -n yt ghcr.io/ytsaurus/ytsaurus
```
The second approach is to define separate CRDs for different tasks. For example:
```yaml
apiVersion: cluster.ytsaurus.tech/v1
kind: YtsaurusVersionUpgrade
metadata:
  name: yt
spec:
  coreImage: ghcr.io/ytsaurus/ytsaurus
```
Both approaches require making the Ytsaurus CRD read-only. The manifest should only be changed via action CRDs or API calls.
From my point of view, the second approach is preferable due to its simplicity. Distributing cluster settings across different CRDs is less error-prone for users. For instance, only `YtsaurusClusterCreate` would have the `cellTag` field, as it cannot be changed later. Another advantage is a predetermined procedure for each CRD: the operator would know that only a version upgrade should trigger a FullUpdate, using RollingUpdate for all other CRDs.
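The "predetermined procedure per CRD" idea boils down to a static mapping inside the operator. A minimal Go sketch (only `YtsaurusVersionUpgrade` comes from the example above; the other kind names are made up):

```go
package main

import "fmt"

// UpdateStrategy names the procedure the operator would run for an action CRD.
type UpdateStrategy string

const (
	FullUpdate    UpdateStrategy = "FullUpdate"
	RollingUpdate UpdateStrategy = "RollingUpdate"
)

// strategyFor maps an action CRD kind to its predetermined update procedure:
// only a version upgrade triggers a full update, everything else rolls.
func strategyFor(kind string) UpdateStrategy {
	if kind == "YtsaurusVersionUpgrade" {
		return FullUpdate
	}
	return RollingUpdate
}

func main() {
	// "YtsaurusConfigChange" is a hypothetical kind used for illustration.
	for _, k := range []string{"YtsaurusVersionUpgrade", "YtsaurusConfigChange"} {
		fmt.Printf("%s -> %s\n", k, strategyFor(k))
	}
}
```

Because the mapping is fixed per kind, there is never any ambiguity about which procedure a given manifest change implies.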
The current use of a single CRD is too difficult to maintain. It is often ambiguous which action the operator should execute when several fields in the manifest are changed at once.
Currently we have an `EnableFullUpdate` field in the main Ytsaurus CRD; if it is set to true, the operator considers itself allowed to recreate all the pods and fully update the YT cluster. The idea here is that a full update should be controllable by a human, but the problem is that on deploy we often forget to set `EnableFullUpdate=false` back. It would be better to replace it with something that can be changed back by the operator itself after a full update is triggered and approved by a human.
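One way to sketch the self-resetting behavior (type and function names here are hypothetical, not the operator's actual API): the operator clears the flag in the spec once the full update it authorized has finished:

```go
package main

import "fmt"

// YtsaurusSpec is a hypothetical slice of the CRD spec, just enough for the sketch.
type YtsaurusSpec struct {
	EnableFullUpdate bool
}

// reconcileFullUpdateFlag models the proposed behavior: after a full update
// completes, the operator itself flips EnableFullUpdate back to false, so a
// human never has to remember to do it. It returns true if the spec changed
// and needs to be written back to the API server.
func reconcileFullUpdateFlag(spec *YtsaurusSpec, fullUpdateCompleted bool) bool {
	if spec.EnableFullUpdate && fullUpdateCompleted {
		spec.EnableFullUpdate = false
		return true
	}
	return false
}

func main() {
	spec := &YtsaurusSpec{EnableFullUpdate: true}
	changed := reconcileFullUpdateFlag(spec, true)
	fmt.Println(changed, spec.EnableFullUpdate)
}
```

The human still arms the flag, but the window during which a stray redeploy could trigger an unwanted full update closes automatically.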
Ideas are appreciated.