weaveworks / launcher

Weave Cloud Launcher

Convert agent to a CustomResource #245

Open lilic opened 6 years ago

lilic commented 6 years ago

By converting the agent part of the launcher to be deployed as a CustomResource, we can pass arguments and options via the CustomResourceDefinition and, with that, also solve https://github.com/weaveworks/launcher/issues/145.

Besides the configuration, we would also solve the problem of just applying the manifest file every X amount of time: instead we would watch and react to events, e.g. when a Secret gets deleted we would update the resources, similar to how we already do for cloudwatch in the agents.

We could still keep the bootstrap part of the launcher and have it generate the CRD manifest file needed to deploy the agent CR (weave-cloud-operator :) ). So the install part would be the same as now if we want, e.g. via the curl command, and the helm option could remain as well. But the weave-cloud-operator could also be installed and configured with a standalone manifest file in the user's cluster.

IMO https://github.com/operator-framework might be useful in this case to generate the things needed to convert the launcher/agent into an "operator".

cc @bricef @marccarre @dlespiau @leth

lilic commented 6 years ago

We could split the above into two separate tasks: first, creating just a CustomResource, used to easily configure the agent; second, creating a controller from the agent, e.g. replacing the kubectl apply part with creating each needed resource based on the passed configuration.
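As a rough sketch of the first task, the CustomResourceDefinition registering a WeaveCloud kind might look something like the following (the group, the resource names and the beta apiextensions API version here are assumptions, not a settled schema):

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  # must be <plural>.<group>
  name: weaveclouds.cloud.weave.works
spec:
  group: cloud.weave.works
  version: v1beta1
  scope: Namespaced
  names:
    kind: WeaveCloud
    plural: weaveclouds
    singular: weavecloud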

leth commented 6 years ago

Besides the configurations we would also solve the problem of just applying the manifest file every X amount of time

Don't forget we also want to be able to push agent changes out to people's clusters :)

lilic commented 6 years ago

Example manifest file for WeaveCloud CRD:

apiVersion: cloud.weave.works/v1beta1
kind: WeaveCloud
metadata:
  name: weave-cloud-operator
  namespace: weave
spec:
  general:
    environment: gke
    auto-update: true
    wc-hostname: cloud.weave.works
    wc-report-errors: false
    wc-poll-url: foo-bar.works
  flux:
    disable: false
    config:
      git-label: label
      git-url: url
      git-path: path
      git-branch: branch
  scope:
    disable: false
    read-only: true
    container-runtime-endpoint: /var/run/crio.sock
    kubernetes-probe: true
  prometheus:
    disable: false
    pod-scrape-policy: optIn

dlespiau commented 6 years ago

There are some suggestions from Brice in another issue: https://github.com/weaveworks/launcher/issues/145

stefanprodan commented 6 years ago

I would add kind: Alert to the launcher controller when Weave Cloud switches to the Alertmanager YAML format.

The reasoning behind the alerts as custom resources is here https://docs.google.com/document/d/1t_Ai_tXh7HlJxaxULuyev6KQZ4eWBt7twUw59sTYh1I/edit?usp=sharing

lilic commented 6 years ago

cc @squaremo @errordeveloper

leth commented 6 years ago

I would add kind: Alert to the launcher controller when Weave Cloud switches to the Alertmanager YAML format.

The reasoning behind the alerts as custom resources is here https://docs.google.com/document/d/1t_Ai_tXh7HlJxaxULuyev6KQZ4eWBt7twUw59sTYh1I/edit?usp=sharing

I think managing alerting config via CRDs is a great idea but we should not plan to include this right now; it's something that can be added to the code later, as a separate piece of work.

leth commented 6 years ago

Thanks for the example manifest @LiliC! So the idea is that every option we want to provide is explicitly coded, i.e. we have to know how to change the flux agent manifest to support git-url?

As a kludge, can we also include something like an "append-args" option to allow people to inject extra flags into an agent's command? My thinking is that it might be very useful in a pinch, when debugging something with a customer, or when trying something out.
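Something like the following, purely as an illustration; the append-args name, where it sits in the manifest, and the flags shown are all hypothetical:

spec:
  flux:
    disable: false
    config:
      git-url: url
    # hypothetical escape hatch: extra flags appended verbatim to the agent's command
    append-args:
      - --git-poll-interval=30s
      - --log-format=json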

dlespiau commented 6 years ago

We've talked with @LiliC and @squaremo at lunch about flux and the CRD integration. For the first CRD version we will not include anything flux-related except the disable toggle.

leth commented 6 years ago

What is the timing of "at some point."?

If this ships first we still need to support some flux args so that we have an automatic migration path for people who have configured a git repo, and so that we continue not to trample on people's running config when they're trying to set it up for the first time.

We cannot ship a version which breaks everyone's flux config, or makes it impossible to set up flux with a git repo.

squaremo commented 6 years ago

What is the timing of "at some point."?

Unknown, but it doesn't matter too much, because there will necessarily be an indefinite period during which fluxd supports both the command line arguments (if supplied) and the configmap (when mounted).

dlespiau commented 6 years ago

For the time being, launcher will still be doing what it does today: kubectl apply while preserving the flux arguments.

lilic commented 6 years ago

For the time being, launcher will still be doing what it does today: kubectl apply while preserving the flux arguments.

Yes, agreed. Maybe in the future we can convert it to do something more complex, like actually creating each individual resource, watching for any changes on those, and reconciling. But for now that is sort of too much to change all at once.

errordeveloper commented 6 years ago

cc @stefanprodan

squaremo commented 6 years ago

Perhaps this is a dim question, but is there a reason to prefer defining a custom resource over just supplying a config map? E.g., is the intention that a single agent will handle multiple installations of agents in a cluster?

dlespiau commented 6 years ago

We do have cases where we send data to two WC instances (weave cloud, socks-shop); I was envisioning two CRs for those clusters.

squaremo commented 6 years ago

We do have cases where we send data to two WC instances (weave cloud, socks-shop); I was envisioning two CRs for those clusters.

We're the only people that do that; two agents each with its own config would cover that case I think.

dlespiau commented 6 years ago

To really answer that question we'd need to collect the +/- of the two approaches :) e.g.:

squaremo commented 6 years ago

Operators are all going through CRDs, not configmaps. There must be some reasoning here :)

The idea of operators is that they manage arbitrary numbers of resources; e.g., you get one set of agents per weave cloud custom resource. This is where we probably want to end up, but I don't think all our agents are capable of behaving correctly in this scenario (flux isn't, quite), so it's generality we can't use at present.

What will the operator do in response to multiple weave cloud resources?

One plus that I've been looking at is the CRD validation layer so we can get k8s to reject invalid configuration before it reaches the agent

Yep, earlier validation is a point in favour of using a custom resource. It'd be trickier trying to validate a ConfigMap on creation.
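As a sketch of that, assuming the apiextensions v1beta1 openAPIV3Schema support, the CRD's spec could carry a validation block along these lines (field names follow the example manifest above; the enum values are an assumption):

validation:
  openAPIV3Schema:
    properties:
      spec:
        properties:
          general:
            properties:
              auto-update:
                type: boolean
              wc-hostname:
                type: string
          prometheus:
            properties:
              pod-scrape-policy:
                type: string
                # assumed set of allowed values
                enum:
                  - optIn
                  - optOut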

We don't need the file mounted on the fs, which is one of the configmap features.

I don't know if this is a point in favour of a CRD -- it's mechanically simpler to watch a file than to watch the kubernetes API.

lilic commented 6 years ago

@squaremo can you explain how the ConfigMap flow would work? E.g. how would we generate things, how would users deploy, update, etc.?

For example, with CRDs the user just needs one deployment manifest to deploy the "operator" and one manifest to specify the configuration, which they can update later on and the operator would pick up the updates. Like @dlespiau mentioned above, we get things like validation but also versioning of CRDs, meaning we know how to handle different versions of deployed agent operators.

squaremo commented 6 years ago

can you explain how the ConfigMap flow would work?

  1. Mount a configmap with the config file into the weave cloud agent pod
  2. The weave cloud agent should look at the config file and do what it says
  3. When the config file changes (as it will when the ConfigMap is updated), take compensating actions, e.g., delete the deployment for agent blahblah which has been switched off.

Pretty much the same thing you'd do with a custom resource, except it's in the filesystem rather than something you get through the Kubernetes API. Have I misunderstood something fundamental about what you want to do with the CRD?

For example for CRDs user just needs one deployment manifest to deploy the "operator" and one manifest to specify configurations, which they can update later on and the operator would pick up the updates.

And with a config file, one deployment and one configmap.
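Roughly, with purely illustrative names and paths:

apiVersion: v1
kind: ConfigMap
metadata:
  name: weave-cloud-agent-config
  namespace: weave
data:
  config.yaml: |
    flux:
      disable: false
    prometheus:
      disable: false

# ...plus, in the agent Deployment's pod spec, the corresponding mount:
      containers:
      - name: agent
        volumeMounts:
        - name: config
          mountPath: /etc/weave-cloud
      volumes:
      - name: config
        configMap:
          name: weave-cloud-agent-config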

we get things like validations but also versioning of CRDs, meaning we know how to handle different versions of deployed agent operators.

I don't think there's anything inherent to versioning of CRDs that makes it better than versioning a config file -- presumably you still have to write code to interpret old versions, for instance. But I haven't tried it, so I'm ignorant of any library support for versioning.

I can think of another +/- point:

I don't think using a custom resource is a bad design; it's more forward-looking than using a config file, in its favour. I was curious why you hadn't just gone for a config file though, so I wanted to see whether that was an adequate alternative.

lilic commented 6 years ago

I don't know if this is a point in favour of a CRD -- it's mechanically simpler to watch a file than to watch the kubernetes API.

Forgot to mention: IIRC, and nothing has changed since the last time I tried this, watching the mounted config map file for changes is actually not that simple, as it's a symlink. It would be easier to just watch the ConfigMap resource with client-go, but then it's no different from watching a CRD resource.

stefanprodan commented 6 years ago

@LiliC watching ConfigMaps or Secrets is easy; check this: https://github.com/stefanprodan/k8s-podinfo/blob/master/pkg/fscache/fscache.go#L59

The only downside is the delay: it takes at least 2 minutes for the kubelet to update the symlink, while a CRD event is almost instantaneous.