weaveworks / sa-demos

This repository hosts the Weave GitOps demos and demo environments. Please use it to file and update issues that you run into. You can also keep your demo guides and demo scripts here.

Make ELK available on demo environments #29

Closed: LutzLange closed this issue 1 year ago

LutzLange commented 2 years ago

Tony and Mostafa want to have ELK available on demo environments.

We do have an ELK stack in the profiles catalog.

I guess we just need to push it into weave-gitops-profile-examples.

The longer-term goal is to have ELK deployed in the management cluster alongside Prometheus and Grafana, with leaf clusters sending their metrics and logs to a central collection point.

LutzLange commented 2 years ago

I created a PR to pull in the elk-stack.

I'm waiting on a review so I can merge it: PR #68

LutzLange commented 2 years ago

We merged the PR, and an ELK profile is now available for installation.

This version of the ELK stack does not work, however: there is a clash between the way the Helm chart is built and our template engine. The chart uses an extraInitContainers block:

extraInitContainers:
  - name: setup-tls-cert
    image: "docker.elastic.co/elasticsearch/elasticsearch:7.16.3"
    command:
    - sh
    - -c
    - |
      #!/usr/bin/env bash
      set -euo pipefail
      elasticsearch-certutil cert \
        --name ${NODE_NAME} \
        --days 1000 \
        --ip ${POD_IP} \
        --dns ${NODE_NAME},${POD_SERVICE_NAME},${POD_SERVICE_NAME_HEADLESS},${NODE_NAME}.${POD_SERVICE_NAME},${NODE_NAME}.${POD_SERVICE_NAME_HEADLESS} \
...

The template engine tries to substitute these variables and fails because it can't find values for them (screenshot from 2022-09-01 11-07-18).

How could we hide these from the template engine?

LutzLange commented 2 years ago

Is there a way to exclude a string that merely looks like a variable from substitution? Something like escaping in bash?

darrylweaver commented 2 years ago

e.g. ${VARIABLE} becomes \$\{VARIABLE\}

darrylweaver commented 2 years ago

Or a way to print the string using the templating engine instead?

bigkevmcd commented 2 years ago

Can you link to the template?

You could change to using the Go templating engine quite easily, which would change your template params to {{ .params.NAME_OF_PARAM }} or alternatively, you could try escaping the values.

Underneath, it uses https://github.com/drone/envsubst, and looking at https://github.com/drone/envsubst/blob/master/parse/scan.go#L191-L212 I can see that you could use something like...

--name $${NODE_NAME} \
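If we go the escaping route, the values file could be pre-escaped in one pass before the engine sees it. A minimal sketch, assuming GNU sed and the $$ escape that drone/envsubst's scanner accepts:

```shell
# Escape every ${VAR}-style reference to $${VAR} so the substitution
# engine emits it literally instead of looking for a value.
printf '%s\n' '--name ${NODE_NAME} \' | sed 's/\${/$${/g'
# → --name $${NODE_NAME} \
```

Running this over the whole values.yaml would escape every reference, which is exactly why it hurts standalone use of the file.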
darrylweaver commented 2 years ago

https://github.com/weaveworks/profiles-catalog/blob/main/charts/elk-stack/values.yaml - this is the values file for the profile.

bigkevmcd commented 2 years ago

Right, so it's in the values...

I wouldn't escape it in this case, because I'd probably want the values.yaml to be usable standalone...

If Go templating isn't an option, we could possibly come up with another idea, something like the old vi modeline:

# template:no-render
auth:
  generateCerts: true
  username: elastic
  # password:

Where # template:no-render would indicate that we shouldn't render the values as a template?
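A rough sketch of how such a marker could gate the rendering step (hypothetical; the else-branch echo stands in for whatever envsubst pass the engine actually runs):

```shell
# Hypothetical opt-out marker: if the first line of the values file is
# "# template:no-render", pass the file through untouched; otherwise
# fall through to the normal rendering step.
values=$(mktemp)
printf '%s\n' '# template:no-render' 'auth:' '  generateCerts: true' > "$values"

if head -n1 "$values" | grep -q '^# template:no-render$'; then
  cat "$values"                 # no substitution at all
else
  echo "would render $values"   # placeholder for the real envsubst pass
fi
rm -f "$values"
```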

darrylweaver commented 2 years ago

I don't see how moving to Go templating would help, as this values.yaml doesn't need any vars replaced at all. I guess we could look at the template:no-render header to change the template engine's behaviour for certain values.yaml files; that would work. I think we need to look into this profile a little more to see if there is a better way to write it in the first place.

bigkevmcd commented 2 years ago

@darrylweaver the template parsing should be done by the same parser as is used for the outer template, so it would only recognise the Go templating params.

darrylweaver commented 2 years ago

After reviewing the values.yaml file, there is one place where we should be using a template to fill in a value: the domain name associated with the Ingress for the Kibana UI. So it would be a mixed values.yaml that needs one value replaced and the others left alone. I expect we will see further examples of this in future.

Looking at why we need this at all: it is only for mTLS between the Elasticsearch cluster nodes, because a cert is created for each pod. I think for the demo we could just strip that out and have no TLS between the nodes, to simplify the profile.

bigkevmcd commented 2 years ago

In that case, it depends... if you care about the values.yaml being usable standalone, which it sounds like it can't be because of that mTLS issue, then you probably need the Go templating mechanism.

Otherwise, you can use the escaping I outlined above.

darrylweaver commented 2 years ago

OK, we will try this out and update here the results.

darrylweaver commented 2 years ago

I tried out the double-dollar syntax and it works fine for this case, so the demos work for now. This is no longer blocking a demo, so I will resolve the issue.

LutzLange commented 2 years ago

Our ELK stack is available, but not functional.

That is why I reopened the issue. @darrylweaver please update and close once it is working.

bigkevmcd commented 2 years ago

@LutzLange Just to clarify, the current "not functional" state is unrelated to the templates?

darrylweaver commented 2 years ago

Unrelated to the templates, yes.

The issue we have now is that the elk-stack profile works on a Kind cluster with local-path-provisioner and plenty of resources, but not on EKS. There are 3 PVCs associated with Elasticsearch, all stuck in Pending state. Looking at them, I see an event which states: waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator

This is using the "in-tree" driver and not the CSI driver.

darrylweaver commented 2 years ago

I think I see the issue as the elasticsearch PVC shows:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com
    volume.kubernetes.io/selected-node: ip-10-0-161-7.eu-central-1.compute.internal
    volume.kubernetes.io/storage-provisioner: ebs.csi.aws.com
  creationTimestamp: "2022-09-02T12:58:19Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: elasticsearch-master
  name: elasticsearch-master-elasticsearch-master-0
  namespace: elk-stack
  resourceVersion: "2946"
  uid: 9c797125-3f29-42a3-b535-5ed389a01c98
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi
  storageClassName: gp2
  volumeMode: Filesystem
status:
  phase: Pending
darrylweaver commented 2 years ago

The storageclass shows:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"},"name":"gp2"},"parameters":{"fsType":"ext4","type":"gp2"},"provisioner":"kubernetes.io/aws-ebs","volumeBindingMode":"WaitForFirstConsumer"}
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2022-09-02T12:50:04Z"
  name: gp2
  resourceVersion: "307"
  uid: 3d8b5315-91aa-441c-acf4-320dffab9429
parameters:
  fsType: ext4
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

Looks like the provisioners do not match: the PVC annotations request ebs.csi.aws.com, while the gp2 StorageClass uses kubernetes.io/aws-ebs.
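For the record, the two provisioner strings from the manifests above side by side (on a live cluster the same fields could be read with kubectl -o jsonpath; the values below are copied from the output above):

```shell
# Values copied from the PVC annotation and the gp2 StorageClass above.
pvc_provisioner='ebs.csi.aws.com'        # volume.kubernetes.io/storage-provisioner
sc_provisioner='kubernetes.io/aws-ebs'   # StorageClass .provisioner field

if [ "$pvc_provisioner" = "$sc_provisioner" ]; then
  echo "provisioners match"
else
  echo "mismatch: PVC expects $pvc_provisioner, StorageClass offers $sc_provisioner"
fi
```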

darrylweaver commented 2 years ago

KUBERNETES_VERSION: v1.23.7

I do not know where this mismatch is coming from, but I suspect it might be because of the K8s version?

darrylweaver commented 2 years ago

https://github.com/elastic/helm-charts/blob/main/elasticsearch/values.yaml#L108 - the VolumeClaimTemplate does not enlighten me at all.

darrylweaver commented 2 years ago

Here is the cluster definition: https://github.com/weavegitops/demo2-repo/blob/main/clusters/management/clusters/default/eks-elk9.yaml

It used this template: https://github.com/weavegitops/demo2-repo/blob/main/weave-gitops-platform/capi-templates/eks-big-machines.yaml

Any insights would be helpful.

LutzLange commented 2 years ago
$ ku 10 describe pvc elasticsearch-master-elasticsearch-master-2 -n flux-system
...
Events:
  Type    Reason                Age                    From                         Message
  ----    ------                ----                   ----                         -------
  Normal  WaitForFirstConsumer  42m                    persistentvolume-controller  waiting for first consumer to be created before binding
  Normal  ExternalProvisioning  2m23s (x163 over 42m)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator

This has external provisioner "ebs.csi.aws.com", so it could be a mismatch in the requested provisioner: our default storage class has "provisioner":"kubernetes.io/aws-ebs". I did try to change the PVC annotation, but without any luck (tried on the CLI, and via the profile.yaml with changed values).

I'm also wondering where I could see the logs of the provisioner, and check if it is triggered and what the error is. It might be that we are missing permissions again.

I tried to include these permissions for LutzAdm, but failed: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/example-iam-policy.json

LutzLange commented 2 years ago

The old in-tree EBS provisioner implementation is no longer extended. It might just be that EKS uses the in-tree Kubernetes provisioner (kubernetes.io/aws-ebs) by default.

There is an aws-ebs-csi-driver addon to EKS. I'm testing this in the next step to see if that will change the storage class definition and the attached provisioner.

LutzLange commented 2 years ago

There is a bit more to it than just adding the addon.

See : https://docs.aws.amazon.com/eks/latest/userguide/managing-ebs-csi.html

  1. Update the template with the addon (done)
  2. Associate the OIDC provider (how can we do that with CAPI?): $ eksctl utils associate-iam-oidc-provider --cluster eks-elk11 --region eu-central-1 --approve
  3. Set up IAM permissions (can we do this manually?), see https://docs.aws.amazon.com/eks/latest/userguide/csi-iam-role.html
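For reference, the manual commands behind steps 2 and 3 would look roughly like this. This sketch only prints the commands (they need AWS credentials to actually run); the service-account name and managed-policy ARN are taken from the AWS docs linked above, not from our setup, so treat them as assumptions:

```shell
# Sketch: echo the eksctl commands rather than executing them.
CLUSTER=eks-elk11
REGION=eu-central-1
# Step 2: associate the OIDC provider with the cluster
echo "eksctl utils associate-iam-oidc-provider --cluster $CLUSTER --region $REGION --approve"
# Step 3: create the IAM role for the CSI controller's service account
echo "eksctl create iamserviceaccount --name ebs-csi-controller-sa --namespace kube-system --cluster $CLUSTER --region $REGION --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy --approve --role-only"
```

The open question of how to drive the same two steps from CAPI (rather than eksctl) remains.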

I'll reach out to Richard to get input on the above items.

darrylweaver commented 2 years ago

Next steps: see if the K8s version makes any difference, and add a configuration for the EKS CSI driver so we can use it as the default storage provider, since the 'in-tree' driver will be deprecated soon.

darrylweaver commented 2 years ago

Helpful docs: https://cluster-api-aws.sigs.k8s.io/topics/eks/addons.html - how to add addons to EKS clusters in CAPA. https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/install.md - Manual installation instructions from the source project.

As this is implemented by eksctl we may be able to re-use code from that project.

richardcase commented 2 years ago

I attached the policy from here to the role used by the nodes in the CX account.

LutzLange commented 2 years ago

I was able to install the addon:

$ clusterawsadm eks addons list-installed -n eks-elk13 -r eu-central-1
Installed addons for cluster eks-elk13:
NAME                 VERSION              STATUS   CREATED                             MODIFIED                            SA ARN   ISSUES
aws-ebs-csi-driver   v1.10.0-eksbuild.1   ACTIVE   2022-09-06 06:42:53.888 +0000 UTC   2022-09-06 06:57:55.643 +0000 UTC            0

The default storage class is not using the new provisioner. I thought a manual test should be done before proceeding with automation and changing the values of the elk-stack profile.

I deleted the default gp2 storage class.

I created a modified gp2 storage class that uses the provisioner requested by the standard elk-stack.

I created a new PVC to test that storage class.

It stayed in Pending and was not bound. Do I need a pod that wants to use it to trigger the provisioning?
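Since the gp2 class shown earlier uses volumeBindingMode: WaitForFirstConsumer, a claim does indeed stay Pending until some pod references it. A minimal consumer pod that would trigger provisioning, printed here as a sketch (pod and claim names are hypothetical; it assumes a PVC called test-pvc already exists):

```shell
# Print a minimal pod manifest that mounts the test PVC; applying it
# with kubectl should move a WaitForFirstConsumer claim out of Pending.
cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: storage-test
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
EOF
```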

LutzLange commented 2 years ago

See the storage issue #33 (https://github.com/weaveworks/sa-demos/issues/33): we are missing permissions.

LutzLange commented 2 years ago

The storage issue is solved. We still need to sort out Ingress and Authentication.

darrylweaver commented 1 year ago

Closing this issue, as we now have an ELK stack template that works.