weaveworks / scope

Monitoring, visualisation & management for Docker & Kubernetes
https://www.weave.works/oss/scope/
Apache License 2.0

probe: add control to cordon a node #3822

goku321 closed this 3 years ago

goku321 commented 3 years ago

This adds a control to cordon a node, which prevents any workload from being scheduled on it.

Fixes #3810

goku321 commented 3 years ago

I've added a toggle for cordon/uncordon, but I'm not sure how to test the controls.

bboreham commented 3 years ago

not sure how to test the controls

If you mean locally, to establish whether it does what you expect:

goku321 commented 3 years ago

I was able to run Scope on a Kubernetes cluster (I also changed the image tags inside examples/k8s/*.yaml to point to the image I built locally and pushed to Docker Hub):

(screenshot)

Question: where do I add the cordon controls? Should I add a hostTopology method similar to podTopology and add the controls there, or is it done inside the Tag method?

bboreham commented 3 years ago

I mentioned at https://github.com/weaveworks/scope/pull/3822#pullrequestreview-506434787 how you can add controls in the Tagger in probe/kubernetes/reporter.go.

A hostTopology method in the Kubernetes part of the probe will likely create problems elsewhere, so I'd prefer you tried it via the Tagger.

goku321 commented 3 years ago

Actually, I tried adding the controls inside the Tagger, but that didn't seem to work: I couldn't see the cordon toggle (see the screenshot in my previous comment).

bboreham commented 3 years ago

Sorry, I missed that.

Looks like you created CordonControl but never used it anywhere. I think you can do that in the Tagger code too. (ActiveControls is which buttons to show on any individual node; Controls is the higher-level description of how to draw each button)

goku321 commented 3 years ago

I am not sure if the below logic for Tagger is working as expected:

for id, n := range rpt.Host.Nodes {
	controls := append(n.ActiveControls(), CordonNode, UncordonNode)
	rpt.Host.Nodes[id] = n.WithLatestActiveControls(controls...)
}

I tried logging the host controls just after the above code executes, and I see only one control, host_exec:

(screenshot)

Also, I was able to bring up the cordon toggle, but I see a "control not recognised" message upon clicking it. So I added a log line to print all registered controls, and this is what I see:

(screenshot)

I guess the controls are not being tagged onto the host; otherwise it would print the Kubernetes controls as well. Not sure if I'm even doing this correctly.

Lastly, here are the cordon controls (I know there should only be one; I still have to figure that out).

(screenshot)
bboreham commented 3 years ago

That's fantastic - I'm really pleased you got this far. If you've pushed all the changes you made, I'll download the code and see if I can debug it. Might not be today though.

goku321 commented 3 years ago

Sure, would really appreciate that. I've pushed my changes.

bboreham commented 3 years ago

By adding some more logging here:

probe/controls/controls.go
@@ -118,6 +119,7 @@ func (r *HandlerRegistry) Batch(toRemove []string, toAdd map[string]xfer.Control
 func (r *HandlerRegistry) HandleControlRequest(req xfer.Request) xfer.Response {
        h, ok := r.handler(req.Control)
        if !ok {
+               log.Errorf("Control %q not recognised", req.Control)
                return xfer.ResponseErrorf("Control %q not recognised", req.Control)
        }

I could see that the command is being sent to the probe on the node, not the cluster-wide probe that talks to Kubernetes. The probe to talk to is selected by this field in the JSON sent to the browser: https://github.com/weaveworks/scope/blob/de60ebd37f251543b8f5ea99015ce26311e5321c/render/detailed/node.go#L43 which is coming from this field on the Host node: https://github.com/weaveworks/scope/blob/de60ebd37f251543b8f5ea99015ce26311e5321c/probe/host/reporter.go#L129

Stepping back slightly, we now have three controls on the host - Exec shell, Cordon and Uncordon - and we want two of them to go to a different probe than the other one. So we need to send some data from the cluster probe to merge into the Node object, and have the app know to look for a different probe ID.

I have an idea how to do that, but it's a bit involved.

goku321 commented 3 years ago

I kind of understand the issue here. I'd be more than happy to solve it and see it through to a conclusion.

bboreham commented 3 years ago

I've implemented a way for one probe to override the destination of a Control: see latest commits on this copy of your branch: https://github.com/weaveworks/scope/commits/3810-cordon-control

However in my cluster I get an error because the hostname doesn't match Kubernetes' node name. I mentioned this at https://github.com/weaveworks/scope/pull/3822#pullrequestreview-506434787 - the bit about Node ID.

bboreham commented 3 years ago

Hi @goku321 , wondering if you will come back to this. If not I can finish it off.

goku321 commented 3 years ago

Hi @bboreham , thanks for your patience. I got busy with the other tickets and I was planning to work on this in a day or two if that's okay.

bboreham commented 3 years ago

Sure, no urgency, just wanted to see it finished one way or another.

goku321 commented 3 years ago

However in my cluster I get an error because the hostname doesn't match Kubernetes' node name. I mentioned this at #3822 (review) - the bit about Node ID.

I didn't encounter any error about the hostname and Kubernetes node name mismatching, but I did get a permissions error. Adding more verbs inside cluster-role.yaml fixed it:

(screenshot)

Finally, I was able to cordon a node by clicking on the cordon control. Output of kubectl get nodes:

(screenshot)

Screenshot from GKE UI:

(screenshot)
goku321 commented 3 years ago

I am trying to display only one control (cordon or uncordon) at a time on the host, based on the Unschedulable property of the Kubernetes node. This is what I could think of: since only the report is accessible inside a Tag method, the status of a Kubernetes node would have to be part of that report. For that to happen, there should be a node abstraction inside probe/kubernetes (similar to what we have for Deployment, DaemonSet, etc.) so that we can generate reports about cluster nodes as well. Not sure if there are other ways.

goku321 commented 3 years ago

@bboreham Need some directions on this one 😅

bboreham commented 3 years ago

Good to hear you got it to work.

I think we can add the 'cordon' control, then work on getting it to toggle later. Adding a Node abstraction might be the way to go, or maybe we can have the Kubernetes reporter add data onto the existing Host node if the names are made to match.

I'd like to see the latest code that you are running; could you push the branch to GitHub please?

goku321 commented 3 years ago

I've pushed the latest changes here.

I think we can add the 'cordon' control, then work on getting it to toggle later.

Sure. What should be used for the display button?

bboreham commented 3 years ago

Display buttons seem fine, but we could consider "lock"/"lock-open", or "minus-circle" (like a stop sign).

It doesn't work when I try it, due to the hostname being different from the Kubernetes node name. I guess that's just a difference in the environment where I'm running it. I can fix that up if you like; just not tonight, as it's getting late here.

Back at this point:

display only one control(cordon or uncordon) at a time on the host based on Unschedulable property of a kubernetes node.

This can be done by calling WithLatestActiveControls() with just one button, but currently we're doing that in the Tag() method in the per-host probe which doesn't connect to Kubernetes. That code would need to move to the "cluster" probe, plus something in Report.Merge to add controls from different probes.

goku321 commented 3 years ago

I guess that's just a difference in the environment where I'm running it. I can fix that up if you like; just not tonight as it's getting late here.

Sure, would appreciate the help.

This can be done by calling WithLatestActiveControls() with just one button, but currently we're doing that in the Tag() method in the per-host probe which doesn't connect to Kubernetes. That code would need to move to the "cluster" probe, plus something in Report.Merge to add controls from different probes.

I get the idea here, will try implementing this.

bboreham commented 3 years ago

hostname being different to the Kubernetes node name

This one is easier than I thought: just add SCOPE_HOSTNAME to the DaemonSet: #3827

(Just a different spelling from what I said earlier; the line above the one I pointed to already calls hostname.Get(), which checks this environment variable.)

goku321 commented 3 years ago

Just a heads-up, will return to this in a week.

goku321 commented 3 years ago

This can be done by calling WithLatestActiveControls() with just one button, but currently we're doing that in the Tag() method in the per-host probe which doesn't connect to Kubernetes. That code would need to move to the "cluster" probe

I guess this would move to the Reporter.Report method inside probe/kubernetes. If so, I'm not sure how to access host nodes, as the report available there is a clean report. Would DaemonSet be useful here? (All I need are the physical node IDs, to check whether they are cordoned/uncordoned.)

plus something in Report.Merge to add controls from different probes

Not sure about this either. Aren't we already merging reports from different probes?

bboreham commented 3 years ago

I guess this would move to Reporter.Report method inside probe/kubernetes. If yes, I'm not sure how to access host nodes

I was thinking of generating a fresh set of (Scope-type) host nodes from the Kubernetes Node list. As long as they have the same ID, they will get merged in the Scope app.

plus something in Report.Merge to add controls from different probes

Not sure about this either. Aren't we already merging reports from different probes?

Yes, but the low-level representation of "which controls are active" is a single string per (Scope) node. It's implemented here: https://github.com/weaveworks/scope/blob/69dd8853c4be559ce3b351acece97dd194724b17/report/node.go#L134-L144

When Scope merges two nodes it simply takes the latest version of that string, because it isn't expecting the same string to come from two different probes. So I am suggesting that some code in Node.Merge() (not Report.Merge(), my mistake) could look for the NodeActiveControls string on both nodes and merge it specially. This should be a rare case (typically one Host node per hundreds of Process and Container nodes).

goku321 commented 3 years ago

This works now, showing one control (cordon/uncordon) at a time. It would be nice to show a loader next to the control while the cordon/uncordon is in progress, but I'm not sure how to do that (the loader does appear, but only for a very short duration).

Regarding the node-merge logic, I couldn't think of a way to avoid using the control names (host_exec and the cordon controls) directly.

bboreham commented 3 years ago

I tested this today, and it worked - great job!

I did get a couple of "probe xxx is not connected" errors at first, but after I added more logging to track them down, they did not recur. I'll assume that is unrelated to this PR.

The cordon/uncordon button takes a few seconds to update, which is slightly disconcerting, but probably not easy to fix. The panel it lives in is currently updated on a 5-second timer.

The Node.Merge() code does look a little specific to this change; I'd like to see if I can think of a more general way to address it.

bboreham commented 3 years ago

I prefer something like this: https://github.com/weaveworks/scope/commit/c6def0a92babafc4b679a36cb120f5af317c727b, which should work unmodified if someone adds a third control.

(on my branch https://github.com/weaveworks/scope/commits/3810-cordon-control)

I considered using StringSet like you did, but I think it would end up a little too heavy with all the re-allocating it does.

bboreham commented 3 years ago

I'm going to merge this with the two additional commits I proposed at https://github.com/weaveworks/scope/pull/3822#issuecomment-843207791.

goku321 commented 3 years ago

Thanks for all the help. So happy to see this merged 😄

bboreham commented 3 years ago

I actually used this feature for real yesterday: too many pods had landed on one node, so I quickly cordoned it and deleted the pods to move them elsewhere.

However, I made a slight mistake in my more-generic merge; fixed in #3859.