the scripts were run by Foreman just fine (see details below)
the checks for the files and the subsequent bi-restart of kubelet either (1) did not help to resolve the issue (as was the case for the *-controlplane-{woergl,vienna}* hosts) or (2) were not necessary at all since the files were already there (rest of the hosts below)
however, upgrading docker is super invasive; an excerpt by @flo-weber:
the asp-ibk cluster showed as "Cluster agent not connected" in the Rancher UI
NB: k8s-dev, which has seen no docker upgrade, was working just fine
NB2: the cattle-cluster-agent deployment only runs on controlplane nodes due to its tolerations (see the check sketched after this excerpt)
at the same time all monitoring went red, i.e., the actual workload was not working either
reboot of all controlplane nodes did not bring the cluster back in Rancher
what eventually helped to fix the cluster was:
shutting down all controlplane and worker nodes
starting them from scratch and waiting until the workload comes back up again (registry etc.)
none of the worker nodes were accessible via ssh since they had run out of memory (visible by attaching to the vSphere console, where OOM errors are displayed even for not-logged-in users)
theory: once pods get scheduled to a node they consume more and more memory as time goes by; since actively running workload is not re-scheduled but only OOM-killed, and a lot of the workload simply does not have proper resource limits set, eventually the whole worker's memory runs out and the kernel OOM killer kicks in; by that point it is usually too late, since essential OS services might already have started to fail (a quick check for pods without memory limits is sketched below)
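If the theory holds, the first thing to look at is workload without memory limits. A minimal sketch of such a check, assuming kubectl access to the affected cluster and jq on the machine running it (output format is illustrative):

```bash
# list pods that have at least one container without a memory limit,
# grouped by the node they are scheduled on
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(any(.spec.containers[]; .resources.limits.memory == null))
      | "\(.spec.nodeName)\t\(.metadata.namespace)/\(.metadata.name)"' \
  | sort
```

Anything listed here can grow unchecked until the node itself runs out of memory, which matches the failure mode described above.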
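Regarding NB2: the scheduling constraints of the agent can be inspected directly; a hedged example, assuming the usual cattle-system namespace of a Rancher-managed downstream cluster:

```bash
# show where cattle-cluster-agent is allowed / preferred to run
kubectl -n cattle-system get deployment cattle-cluster-agent \
  -o jsonpath='{.spec.template.spec.tolerations}{"\n"}{.spec.template.spec.affinity}{"\n"}'
```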
Discussion/Thoughts
the script in its current form does not bring any benefit, since a reboot is the way to go
we don't want to reboot if no packages have been updated
we want to reboot as fast as possible if any docker package got updated, since the likelihood that the workload is already failing is quite high -> no need to randomly sleep prior to the reboot (see the sketch after this list)
we are not restarting all controlplane nodes at once, since they are time-wise distributed (23:00 <-> 23:15 <-> 23:30)
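A minimal sketch of that reboot policy, assuming an RPM-based host and that it runs right after the Foreman-triggered update; the transaction parsing is illustrative and this is not the actual k8s-rke1-foreman.sh:

```bash
#!/usr/bin/env bash
set -euo pipefail

# packages touched by the most recent yum/dnf transaction (the Foreman-triggered update)
updated="$(dnf history info last | awk '/^ +(Install|Upgrade)/ {print $2}')"

# no packages updated -> no reboot
if [ -z "${updated}" ]; then
  echo "no packages updated, skipping reboot"
  exit 0
fi

# any docker package updated -> reboot immediately, without a random sleep:
# the workload is likely already failing, and the controlplane nodes are
# staggered by schedule (23:00 / 23:15 / 23:30) anyway
if grep -qi '^docker' <<< "${updated}"; then
  echo "docker package updated, rebooting now"
  systemctl reboot
fi
```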
k8s-rke1-foreman.sh
run 22.08.2024
The first run of the script was not a success:
asp-ibk
asp-ibk-controlplane-ibk1.wd.loc
asp-ibk-worker-ibk1.wd.loc
asp-ibk-worker-ibk2.wd.loc
asp-ibk-worker-ibk3.wd.loc
asp-ibk-worker-ibk4.wd.loc
asp-ibk-controlplane-vienna1.wd.loc
asp-ibk-controlplane-woergl1.wd.loc
asp-ibk-worker-woergl1.wd.loc
asp-ibk-worker-woergl2.wd.loc
asp-ibk-worker-woergl3.wd.loc
asp-ibk-worker-woergl4.wd.loc
[0]: They all got the following packages updated:
k8s-dev
k8s-dev-controlplane-ibk1.wd.loc
k8s-dev-worker-ibk1.wd.loc
k8s-dev-worker-ibk2.wd.loc
k8s-dev-worker-ibk3.wd.loc
k8s-dev-controlplane-vienna1.wd.loc
k8s-dev-controlplane-woergl1.wd.loc
k8s-dev-worker-woergl1.wd.loc
k8s-dev-worker-woergl2.wd.loc
k8s-dev-worker-woergl3.wd.loc
[1]: NO packages were updated (docker was updated a week before)