Open mvgorbunov opened 7 months ago
Discussed with @pixcc and came to these conclusions:
1) node maintenance (as opposed to host maintenance) is not really required
2) ydbops maintenance host WILL give the caller back a task ID (string identifier), because:
2.1) there is a priority mechanism in CMS. You can schedule a request to CMS with high priority, and even if the node from this request can be given away, it won't be given to anyone with lower priority.
2.2) with the --prepare functionality, the user WILL need to wait until the host has been de-populated before taking it out. So the task WILL have to exist in CMS for some time, in a non-completed state.
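The priority rule in 2.1 can be illustrated with a toy model. This is only a sketch under stated assumptions, not how CMS is implemented: the Request type, the canGrant helper, and the convention that a larger number means higher priority are all hypothetical.

```go
package main

import "fmt"

// Request is a hypothetical stand-in for a pending CMS maintenance request.
type Request struct {
	TaskID   string
	Host     string
	Priority int // assumption: larger value means higher priority
}

// canGrant reports whether req may take its host. It is blocked when another
// pending request for the same host has strictly higher priority.
func canGrant(req Request, pending []Request) bool {
	for _, other := range pending {
		if other.TaskID != req.TaskID && other.Host == req.Host && other.Priority > req.Priority {
			return false
		}
	}
	return true
}

func main() {
	pending := []Request{
		{TaskID: "task-a", Host: "host-1", Priority: 10},
		{TaskID: "task-b", Host: "host-1", Priority: 1},
	}
	for _, r := range pending {
		fmt.Printf("%s can be granted: %v\n", r.TaskID, canGrant(r, pending))
	}
}
```

Because the high-priority request has to stay pending until it is served, it needs a stable identifier the caller can refer to later, which is the argument for returning a task ID.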
Assuming that the initial ydbops maintenance host is invoked with --host-fqdn, we tried to think of a way to NOT give the user the task id: maybe ydbops will be able to determine the task from the same --host-fqdn later, when the user comes with the next command ydbops maintenance [refresh|drop|complete]. But a lot of problems quickly showed up.
1) what if there are multiple tasks which request the same host? to which of them should the operation be applied? to the current one?
2) what if the current one was not created by you? should you be allowed to modify other users' tasks? (example users: walle, rolling restart, infra on call)
3) finally, K8s. It is impossible to find out which tasks were created for your --host-fqdn, because even if you list all the tasks from CMS, they will have pod-internal fqdns, not the external fqdn that the user gave you, and the mapping can be ambiguous.
A quick schematic (basically just for me):
Discussed with @mvgorbunov:
1) ydbops maintenance host -> ydbops maintenance create
2) create and complete operate with different entities: host-fqdn and task-id respectively, but it is impossible to implement otherwise (complete --host-fqdn does not give enough information)
3) even if the user didn't take note of their task-id when ydbops maintenance create was called, it is still possible to ydbops maintenance list tasks and try to select which one was theirs, at least based on the username.
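The fallback in point 3 amounts to filtering the listed tasks by owner. A minimal sketch, assuming a hypothetical Task type with an Owner field (the real output of ydbops maintenance list may be shaped differently):

```go
package main

import "fmt"

// Task is a hypothetical, simplified record of a listed maintenance task.
type Task struct {
	ID    string
	Owner string
}

// tasksOwnedBy returns the tasks created by the given user, in listing order.
func tasksOwnedBy(tasks []Task, user string) []Task {
	var own []Task
	for _, t := range tasks {
		if t.Owner == user {
			own = append(own, t)
		}
	}
	return own
}

func main() {
	listed := []Task{
		{ID: "task-1", Owner: "walle"},
		{ID: "task-2", Owner: "alice"},
		{ID: "task-3", Owner: "alice"},
	}
	for _, t := range tasksOwnedBy(listed, "alice") {
		fmt.Println("candidate:", t.ID)
	}
}
```

The filter can still return more than one candidate, so the user has to pick the right task-id by hand; that is why noting the id at creation time remains the reliable path.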
Please update docs when feature is ready
https://ydb.tech/docs/ru/devops/manual/maintenance-without-downtime#node-maintenance
https://ydb.tech/docs/en/devops/manual/maintenance-without-downtime#node-maintenance
Feature has been ready for some time,
❯ ydbops maintenance --help
ydbops maintenance [command]:
Manage host maintenance operations: request and return hosts
with performed maintenance back to the cluster.
Usage: ydbops [global options...] maintenance [options] <subcommand>
Subcommands:
maintenance Request hosts from the Cluster Management System
├─ complete Declare the maintenance task completed
├─ create Create a maintenance task to obtain a set of hosts
├─ drop Drop an existing maintenance task
├─ list List all existing maintenance tasks
└─ refresh Try to obtain previously reserved hosts
Global options:
{-e|--endpoint}, --grpc-timeout-seconds, --grpc-skip-verify, --ca-file, --user, --password-file, --no-password, --token-file, --sa-key-file, --iam-endpoint, --use-metadata-credentials, --profile, --profile-file, {-v|--verbose}
To get full description of these options run 'ydbops --help'.
Use "ydbops maintenance [command] --help" for more information about a command.
TODO: documentation :)
I added some docs to the article about CMS.
Support single node maintenance mode - command like
ydbops node maintenance --host <node_fqdn> [--user USER --ttl REQUEST_TTL --prepare-node] ...
Maintenance mode in general means that the node will NOT be prepared (NO tablets and sessions drain, NO moving out bs-groups, etc) if --prepare-node is not specified - only check current cluster state and lock the node in CMS. The --prepare option is supported from 24-1. We need to warn users that using this mode might be destructive and they should use it for non-destructive operations (e.g. replacing a disk that has already failed or rebooting a server)