Node maintenance mode - GitHub Issues

mvgorbunov commented 7 months ago

Support a single-node maintenance mode, with a command like ydbops node maintenance --host <node_fqdn> [--user USER --ttl REQUEST_TTL --prepare-node] ...

In general, maintenance mode means that the node will NOT be prepared (NO draining of tablets and sessions, NO moving out of bs-groups, etc.) unless --prepare-node is specified: the command only checks the current cluster state and locks the node in CMS. The --prepare option is supported starting from 24-1.

We need to warn users that this mode might be destructive and that they should use it for non-destructive operations (e.g. replacing a disk that has already failed, or rebooting a server).
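The difference between the two modes can be sketched as a list of steps. This is a toy model, not ydbops code: the function name `maintenance_steps` and the ordering of the steps are assumptions, and only the step names themselves come from the description above.

```python
def maintenance_steps(prepare_node: bool = False) -> list:
    """Return the steps a maintenance request would perform (illustrative only)."""
    steps = ["check current cluster state"]
    if prepare_node:
        # Assumed ordering: the issue only lists what preparation includes,
        # not when it happens relative to the lock.
        steps += ["drain tablets and sessions", "move out bs-groups"]
    steps.append("lock the node in CMS")
    return steps


# Without --prepare-node the node is only checked and locked.
default_steps = maintenance_steps()
prepared_steps = maintenance_steps(prepare_node=True)
```

Without preparation the node still carries tablets, sessions and bs-groups when it is locked, which is why the warning about destructive use matters.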

Jorres commented 6 months ago

Discussed with @pixcc and came to these conclusions:

1) Node maintenance (as opposed to host maintenance) is not really required.
2) ydbops maintenance host WILL give the caller back a task ID (a string identifier), because:
2.1) There is a priority mechanism in CMS. You can schedule a request to CMS with high priority, and even if the node from this request can be given away, it won't be given to anyone with lower priority.
2.2) With the --prepare functionality, the user WILL need to wait until the host has been de-populated before taking it out. So the task WILL have to exist in CMS for some time in a non-completed state.
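Point 2.1 can be illustrated with a toy model of priority arbitration. Everything here is hypothetical: the `Request` type, the `can_take` function, and the convention that a larger number means a higher priority. It is not the CMS implementation, only a sketch of why a pending high-priority request has to persist as an identifiable task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    task_id: str   # string identifier handed back to the caller
    node: str
    priority: int  # assumption: larger value = higher priority


def can_take(candidate, pending):
    """A request may take its node only if no pending request for the
    same node has a strictly higher priority."""
    return not any(
        other.node == candidate.node and other.priority > candidate.priority
        for other in pending
    )


high = Request("task-high", "node-1", priority=10)
low = Request("task-low", "node-1", priority=1)
elsewhere = Request("task-other", "node-2", priority=1)
pending = [high, low, elsewhere]
```

In this model `low` stays blocked for as long as `high` remains pending, so the pending request needs a stable identifier that the caller can later refresh, drop, or complete.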

Assuming that the initial ydbops maintenance host is invoked with --host-fqdn, we tried to think of a way to NOT give the user the task ID: maybe ydbops could determine the task from the same --host-fqdn later, when the user comes back with the next command, ydbops maintenance [refresh|drop|complete]. But a lot of problems quickly showed up.

1) What if multiple tasks request the same host? To which of them should the operation be applied? To the current one?
2) What if the current one was not created by you? Should you be allowed to modify other users' tasks? (Example users: walle, rolling restart, infra on call.)
3) Finally, K8s. It is impossible to find out which tasks were created for your --host-fqdn: even if you list all the tasks from CMS, they will have pod-internal FQDNs, not the external FQDN that the user gave you, and the mapping can be ambiguous.
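The first and third problems can be made concrete with a small sketch. The task records, owner names, and FQDNs below are made up, and `tasks_for_host` is a hypothetical helper, not part of ydbops or CMS.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    owner: str
    hosts: list


def tasks_for_host(tasks, host_fqdn):
    """Return every task that requested the given host."""
    return [t for t in tasks if host_fqdn in t.hosts]


tasks = [
    Task("task-1", "rolling-restart", ["host-1.example.net"]),
    Task("task-2", "infra-on-call", ["host-1.example.net", "host-2.example.net"]),
    # K8s case: CMS only knows the pod-internal name, not the external one.
    Task("task-3", "walle", ["pod-0.internal"]),
]

# Problem 1: two tasks match the same host, so the lookup is ambiguous.
matches = tasks_for_host(tasks, "host-1.example.net")

# Problem 3: the external FQDN the user typed matches nothing in CMS.
k8s_matches = tasks_for_host(tasks, "external-name.example.net")
```

A lookup keyed by host returns either too many tasks or none, whereas a lookup keyed by task ID always returns one.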

A quick schematic (basically just for me):

Jorres commented 6 months ago

Discussed with @mvgorbunov:

1) ydbops maintenance host -> ydbops maintenance create.
2) create and complete operate on different entities (host-fqdn and task-id respectively), but it is impossible to implement otherwise: complete --host-fqdn does not give enough information.
3) Even if the user didn't take note of their task-id when ydbops maintenance create was called, it is still possible to run ydbops maintenance list and try to pick out which tasks are theirs, at least based on the username.
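Point 3 amounts to filtering the task list by the user who created each task. A minimal sketch, assuming each listed task carries a user name and a task ID; the dictionary layout and the `my_tasks` helper are illustrative and do not reflect the actual output format of ydbops maintenance list.

```python
def my_tasks(listed_tasks, username):
    """Return the IDs of the tasks created by the given user."""
    return [t["task_id"] for t in listed_tasks if t["user"] == username]


# Made-up listing; real task IDs and user names will look different.
listed = [
    {"task_id": "task-a", "user": "alice"},
    {"task_id": "task-b", "user": "rolling-restart"},
    {"task_id": "task-c", "user": "alice"},
]

alice_tasks = my_tasks(listed, "alice")
```

The filter can still return several candidates, so the user may have to choose among their own tasks by hand.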

pixcc commented 6 months ago

Please update the docs when the feature is ready:

https://ydb.tech/docs/ru/devops/manual/maintenance-without-downtime#node-maintenance

https://ydb.tech/docs/en/devops/manual/maintenance-without-downtime#node-maintenance

Jorres commented 2 weeks ago

The feature has been ready for some time:

❯ ydbops maintenance --help
ydbops maintenance [command]:
    Manage host maintenance operations: request and return hosts
    with performed maintenance back to the cluster.

Usage: ydbops [global options...] maintenance [options] <subcommand> 

Subcommands:
maintenance            Request hosts from the Cluster Management System
├─ complete            Declare the maintenance task completed
├─ create              Create a maintenance task to obtain a set of hosts
├─ drop                Drop an existing maintenance task
├─ list                List all existing maintenance tasks
└─ refresh             Try to obtain previously reserved hosts

Global options: 
  {-e|--endpoint}, --grpc-timeout-seconds, --grpc-skip-verify, --ca-file, --user, --password-file, --no-password, --token-file, --sa-key-file, --iam-endpoint, --use-metadata-credentials, --profile, --profile-file, {-v|--verbose}
  To get full description of these options run 'ydbops --help'.

Use "ydbops maintenance [command] --help" for more information about a command.

TODO: documentation :)

pixcc commented 2 days ago

I added some docs to the article about CMS.

https://github.com/ydb-platform/ydb/pull/11793

ydb-platform / ydbops

Node maintenance mode #2