**Closed** — goku321 closed this 3 years ago
I've added a toggle for cordon/uncordon but I'm not sure how to test the controls.
> not sure how to test the controls

If you mean locally, to establish whether it does what you expect: check for the `SchedulingDisabled` status in `kubectl get nodes`.
I was able to run scope on a kubernetes cluster (also, changed the image tags inside `examples/k8s/*.yaml` to point to the one I built locally and pushed to Docker Hub):
Question: Where do I add cordon controls? Should I add a `hostTopology` method similar to `podTopology` and add the controls, or is it done inside the `Tag` method?
I mentioned at https://github.com/weaveworks/scope/pull/3822#pullrequestreview-506434787 how you can add controls in the Tagger in `probe/kubernetes/reporter.go`.

A `hostTopology` method in the Kubernetes part of the probe will likely create problems elsewhere, so I'd prefer you tried it via the Tagger.
Actually, I tried adding the controls inside Tagger but that didn't seem to work. I couldn't see the cordon toggle (refer to the screenshot in my previous comment).
Sorry, I missed that.
Looks like you created `CordonControl` but never used it anywhere. I think you can do that in the Tagger code too.

(`ActiveControls` is which buttons to show on any individual node; `Controls` is the higher-level description of how to draw each button.)
I am not sure if the below logic for the Tagger is working as expected:

```go
for id, n := range rpt.Host.Nodes {
	controls := append(n.ActiveControls(), CordonNode, UncordonNode)
	rpt.Host.Nodes[id] = n.WithLatestActiveControls(controls...)
}
```
I tried logging the host controls just after the above piece of code executes, and I only see one control, `host_exec`:
Also, I was able to bring up the cordon toggle but I see a message saying "control not recognized" upon clicking it. So, I put up a log here to print all registered controls and this is what I see:
I guess that controls are not being tagged to the host otherwise it would print the kubernetes controls as well. Not sure if I am even doing it correctly.
Lastly, here are the cordon controls (I know there should only be one; I still have to figure that out).
That's fantastic - I'm really pleased you got this far. If you've pushed all the changes you made, I'll download the code and see if I can debug it. Might not be today though.
Sure, would really appreciate that. I've pushed my changes.
By adding some more logging to `probe/controls/controls.go`:

```diff
@@ -118,6 +119,7 @@ func (r *HandlerRegistry) Batch(toRemove []string, toAdd map[string]xfer.Control
 func (r *HandlerRegistry) HandleControlRequest(req xfer.Request) xfer.Response {
 	h, ok := r.handler(req.Control)
 	if !ok {
+		log.Errorf("Control %q not recognised", req.Control)
 		return xfer.ResponseErrorf("Control %q not recognised", req.Control)
 	}
```
I could see that the command is being sent to the probe on the node, not the cluster-wide probe that talks to Kubernetes. The probe to talk to is selected by this field in the JSON sent to the browser: https://github.com/weaveworks/scope/blob/de60ebd37f251543b8f5ea99015ce26311e5321c/render/detailed/node.go#L43 which is coming from this field on the Host node: https://github.com/weaveworks/scope/blob/de60ebd37f251543b8f5ea99015ce26311e5321c/probe/host/reporter.go#L129
Stepping back slightly, we now have three controls on the host - Exec shell, Cordon and Uncordon - and we want two of them to go to a different probe than the other one. So we need to send some data from the cluster probe to merge into the Node object, and have the app know to look for a different probe ID.
I have an idea how to do that, but it's a bit involved.
I kind of understand the issue here. Would be more than happy to solve it and take it to conclusion.
I've implemented a way for one probe to override the destination of a Control: see latest commits on this copy of your branch: https://github.com/weaveworks/scope/commits/3810-cordon-control
However in my cluster I get an error because the hostname doesn't match Kubernetes' node name. I mentioned this at https://github.com/weaveworks/scope/pull/3822#pullrequestreview-506434787 - the bit about Node ID.
Hi @goku321 , wondering if you will come back to this. If not I can finish it off.
Hi @bboreham , thanks for your patience. I got busy with the other tickets and I was planning to work on this in a day or two if that's okay.
Sure, no urgency, just wanted to see it finished one way or another.
> However in my cluster I get an error because the hostname doesn't match Kubernetes' node name. I mentioned this at #3822 (review) - the bit about Node ID.

I didn't encounter any error regarding a hostname and Kubernetes node name mismatch, but I did get a permissions error. Adding more verbs inside `cluster-role.yaml` fixed the permission issues:
Finally, I was able to cordon a node by clicking on the cordon control. Output of `kubectl get nodes`:
Screenshot from GKE UI:
I am trying to display only one control (cordon or uncordon) at a time on the host, based on the `Unschedulable` property of a Kubernetes node. This is what I could think of: since only the report is accessible inside a `Tag` method, the status of a Kubernetes node should be a part of this report. For this to happen, there should be a node abstraction inside `probe/kubernetes` (similar to what we have for `deployment`, `daemonset`, etc.) so that we can generate a report about cluster nodes as well. Not sure if there are other ways.
@bboreham Need some directions on this one 😅
Good to hear you got it to work.
I think we can add the 'cordon' control, then work on getting it to toggle later. Adding a Node abstraction might be the way to go, or maybe we can have the Kubernetes reporter add data onto the existing Host node if the names are made to match.
I'd like to see the latest code that you are running; could you push the branch to GitHub please?
I've pushed the latest changes here
Sure. What should be used for the display button?
> I think we can add the 'cordon' control, then work on getting it to toggle later.

Display buttons seem fine, but we could consider "lock"/"lock-open", or "minus-circle" (like a stop sign).
It doesn't work when I try it due to the hostname being different to the Kubernetes node name. I guess that's just a difference in the environment where I'm running it. I can fix that up if you like; just not tonight as it's getting late here.
Back at this point:

> display only one control (cordon or uncordon) at a time on the host based on the Unschedulable property of a kubernetes node.

This can be done by calling `WithLatestActiveControls()` with just one button, but currently we're doing that in the `Tag()` method in the per-host probe, which doesn't connect to Kubernetes. That code would need to move to the "cluster" probe, plus something in `Report.Merge` to add controls from different probes.
> I guess that's just a difference in the environment where I'm running it. I can fix that up if you like; just not tonight as it's getting late here.

Sure, would appreciate the help.
> This can be done by calling WithLatestActiveControls() with just one button, but currently we're doing that in the Tag() method in the per-host probe which doesn't connect to Kubernetes. That code would need to move to the "cluster" probe, plus something in Report.Merge to add controls from different probes.

I get the idea here, will try implementing this.
> hostname being different to the Kubernetes node name

This one is easier than I thought: just add `SCOPE_HOSTNAME` to the DaemonSet: #3827

(Just a different spelling to what I said earlier; the line above what I pointed to already calls `hostname.Get()`, which checks this environment variable.)
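One plausible shape for that DaemonSet change, using the Kubernetes downward API to feed the node's name into the environment variable (the exact spelling in #3827 may differ; check the PR):

```yaml
env:
  - name: SCOPE_HOSTNAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
```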
Just a heads-up, will return to this in a week.
> This can be done by calling WithLatestActiveControls() with just one button, but currently we're doing that in the Tag() method in the per-host probe which doesn't connect to Kubernetes. That code would need to move to the "cluster" probe

I guess this would move to the `Reporter.Report` method inside `probe/kubernetes`. If yes, I'm not sure how to access host nodes, as the report available there is a clean report. Would DaemonSet be useful here? (Because all I need are physical node IDs to check if they are cordoned/uncordoned.)

> plus something in Report.Merge to add controls from different probes

Not sure about this either. Aren't we already merging reports from different probes?
> I guess this would move to the `Reporter.Report` method inside `probe/kubernetes`. If yes, I'm not sure how to access host nodes

I was thinking to generate a fresh set of (Scope-type) host nodes from the Kubernetes `Node` list. So long as they have the same ID, they will get merged in the Scope app.
> > plus something in Report.Merge to add controls from different probes
>
> Not sure about this either. Aren't we already merging reports from different probes?
Yes, but the low-level representation of "which controls are active" is a single string per (Scope) node. It's implemented here: https://github.com/weaveworks/scope/blob/69dd8853c4be559ce3b351acece97dd194724b17/report/node.go#L134-L144
When Scope merges two nodes it simply takes the latest version of that string, because it isn't expecting the same string to come from two different probes.
So I am suggesting that some code in `Node.Merge()` (not `Report.Merge()` - my bad) could look for the `NodeActiveControls` string on both nodes and merge that specially. This should be a rare case (typically one Host per hundreds of Process and Container nodes).
This works now, showing one control (cordon/uncordon) at a time. It would be nice to show a loader next to the control while cordon/uncordon is in progress, but I'm not sure how to do that (the loader does appear, but for a very short duration).

Regarding the node merge logic, I couldn't think of a way to avoid using control names (`host_exec` and the cordon controls) directly.
I tested this today, and it worked - great job!
I did get a couple of "probe xxx is not connected" errors at first, but after adding more logging to track them down, they did not re-occur. I'll assume that is unrelated to this PR.
The cordon/uncordon button takes a few seconds to update, which is slightly disconcerting, but probably not easy to fix. The panel it lives in is currently updated on a 5-second timer.
The `Node.Merge()` code does look a little specific to this change; I'd like to see if I can think of a more general way to address it.
I prefer something like this: https://github.com/weaveworks/scope/commit/c6def0a92babafc4b679a36cb120f5af317c727b, which should work unmodified if someone adds a third control.
(on my branch https://github.com/weaveworks/scope/commits/3810-cordon-control)
I considered using `StringSet` like you did, but I think it would end up a little too heavy with all the re-allocating it does.
I'm going to merge this with the two additional commits I proposed at https://github.com/weaveworks/scope/pull/3822#issuecomment-843207791.
Thanks for all the help. So happy to see this merged 😄
I actually used this feature for real yesterday! - got too many pods on one node, and quickly cordoned it and deleted the pods to move them elsewhere.
However, I made a slight mistake in my more-generic merge; fixed at #3859
This adds a control to cordon a node, which prevents any workload from being scheduled on the node.

Fixes #3810