Closed. LutzLange closed this issue 1 year ago.
I did create a PR to pull in the elk-stack.
I'm waiting on a review, so I can merge it: PR #68
We merged the PR and now have an ELK profile available for installation.
This version of the ELK stack does not work. There is a clash between the way the Helm chart is built and our template engine. The chart uses an extraInitContainers block:
extraInitContainers:
  - name: setup-tls-cert
    image: "docker.elastic.co/elasticsearch/elasticsearch:7.16.3"
    command:
      - sh
      - -c
      - |
        #!/usr/bin/env bash
        set -euo pipefail
        elasticsearch-certutil cert \
          --name ${NODE_NAME} \
          --days 1000 \
          --ip ${POD_IP} \
          --dns ${NODE_NAME},${POD_SERVICE_NAME},${POD_SERVICE_NAME_HEADLESS},${NODE_NAME}.${POD_SERVICE_NAME},${NODE_NAME}.${POD_SERVICE_NAME_HEADLESS} \
        ...
The template engine tries to replace these and can't find the values.
How could we hide these from the template engine?
Is there a way to exclude a string that looks like a var from substitution, like escaping in bash?
e.g. ${VARIABLE}
becomes \$\{VARIABLE\}
Or a way to print the string using the templating engine instead?
Can you link to the template?
You could change to using the Go templating engine quite easily, which would change your template params to {{ .params.NAME_OF_PARAM }}
or alternatively, you could try escaping the values.
Under the hood it uses https://github.com/drone/envsubst, and looking at https://github.com/drone/envsubst/blob/master/parse/scan.go#L191-L212 I can see that it could use something like...
--name $${NODE_NAME} \
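To illustrate what that escaping means, here is a minimal Python sketch of envsubst-style substitution where a doubled dollar acts as an escape. This only mimics the behaviour; it is not the drone/envsubst code, and the handling of unknown vars here is an assumption of the sketch:

```python
import re

def render(template, values):
    """Envsubst-style rendering sketch: ${NAME} is replaced from `values`,
    while $${NAME} is an escape that yields a literal ${NAME}."""
    def repl(m):
        if m.group(0).startswith("$$"):
            return m.group(0)[1:]                  # drop one '$': literal ${NAME} survives
        return values.get(m.group(1), m.group(0))  # unknown vars left as-is in this sketch
    return re.sub(r"\$?\$\{(\w+)\}", repl, template)
```

The idea would then be that writing $${NODE_NAME} in values.yaml should survive rendering as a literal ${NODE_NAME}.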
https://github.com/weaveworks/profiles-catalog/blob/main/charts/elk-stack/values.yaml - this is the values file for the profile.
Right, so it's in the values...
I wouldn't escape it in this case, because I'd probably want the values.yaml to be usable standalone...
If Go templating isn't an option, we could possibly come up with another idea, something like the old vi modeline:
# template:no-render
auth:
  generateCerts: true
  username: elastic
  # password:
where the # template:no-render header would indicate that we shouldn't render the values as a template?
I don't see how moving to go templating would work as this values.yaml doesn't need any vars replaced at all.
I guess we could look at the template:no-render header to change the template engine behaviour for some values.yaml files, as that would work. I think we need to look into this profile a little more to see if there is a better way to write the profile in the first place.
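A sketch of how such a header check might work (illustrative Python only; the marker string and function are hypothetical, not actual profiles code):

```python
NO_RENDER_HEADER = "# template:no-render"  # hypothetical opt-out marker

def should_render(values_yaml_text: str) -> bool:
    """Return False if the file opts out of template rendering via a
    leading '# template:no-render' comment line (illustrative sketch)."""
    lines = values_yaml_text.lstrip().splitlines()
    if not lines:
        return True
    return lines[0].strip() != NO_RENDER_HEADER
```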
@darrylweaver the template parsing should be done from the same parser as is used for the outer template, so, it would only recognise the go templating params.
After reviewing the values.yaml file, there is one place where we should be using a template to fill in the value i.e. the domain name associated with the Ingress for the Kibana UI. So, it would be a mixed values.yaml file that would need 1 value replaced and the others left alone. I would expect we could see further examples of this in future.
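For illustration, the mixed file could look roughly like this (hypothetical fragment using the {{ .params.* }} style mentioned above; the param and key names are made up):

```yaml
kibana:
  ingress:
    hosts:
      - "kibana.{{ .params.CLUSTER_DOMAIN }}"  # the one value the engine should fill in
auth:
  generateCerts: true
  username: elastic                            # everything else left untouched
```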
Looking at why we are needing this, it is only for mTLS between the elasticsearch cluster nodes, because a cert is created for each pod. I think for the demo we would be able to just strip that out and have no TLS between the nodes instead to simplify the profile.
In that case, it depends... if you care about the values.yaml being usable standalone, which it sounds like it can't be because of that mTLS issue, then you probably need the Go templating mechanism.
Otherwise, you can use the escaping I outlined above.
OK, we will try this out and update here the results.
Tried out the double-dollar syntax and it works fine for this case, so the demos work for now. This is no longer blocking a demo, so I will resolve this issue for now.
Our ELK stack is available, but not functional.
That is why I reopened the issue. @darrylweaver please update and close once it is working.
@LutzLange Just to clarify, the current "not functional" state is unrelated to the templates?
Unrelated to the templates, yes.
The issue we have now is that the elk-stack profile works on a Kind cluster with local-path-provisioner and plenty of resources, but not on EKS. There are 3 PVCs associated with Elasticsearch, all listed in Pending state. Looking at them I see an event which states: waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
This is using the "in-tree" driver and not the CSI driver.
I think I see the issue as the elasticsearch PVC shows:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com
    volume.kubernetes.io/selected-node: ip-10-0-161-7.eu-central-1.compute.internal
    volume.kubernetes.io/storage-provisioner: ebs.csi.aws.com
  creationTimestamp: "2022-09-02T12:58:19Z"
  finalizers:
    - kubernetes.io/pvc-protection
  labels:
    app: elasticsearch-master
  name: elasticsearch-master-elasticsearch-master-0
  namespace: elk-stack
  resourceVersion: "2946"
  uid: 9c797125-3f29-42a3-b535-5ed389a01c98
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi
  storageClassName: gp2
  volumeMode: Filesystem
status:
  phase: Pending
The StorageClass shows:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"},"name":"gp2"},"parameters":{"fsType":"ext4","type":"gp2"},"provisioner":"kubernetes.io/aws-ebs","volumeBindingMode":"WaitForFirstConsumer"}
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2022-09-02T12:50:04Z"
  name: gp2
  resourceVersion: "307"
  uid: 3d8b5315-91aa-441c-acf4-320dffab9429
parameters:
  fsType: ext4
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
Looks like the provisioner does not match.
KUBERNETES_VERSION: v1.23.7
I do not know where this mismatch is coming from, but I suspect this might be because of the K8s version?
https://github.com/elastic/helm-charts/blob/main/elasticsearch/values.yaml#L108 - the VolumeClaimTemplate does not enlighten me at all.
Here is the cluster definition: https://github.com/weavegitops/demo2-repo/blob/main/clusters/management/clusters/default/eks-elk9.yaml Used the template: https://github.com/weavegitops/demo2-repo/blob/main/weave-gitops-platform/capi-templates/eks-big-machines.yaml
Any insights would be helpful.
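For what it's worth, a StorageClass pointed at the CSI driver would look roughly like this (a sketch; the class name is made up, and the parameters simply mirror the existing gp2 class):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-csi                              # hypothetical name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com                 # CSI driver instead of kubernetes.io/aws-ebs
parameters:
  type: gp2
  fsType: ext4
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```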
$ ku 10 describe pvc elasticsearch-master-elasticsearch-master-2 -n flux-system
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal WaitForFirstConsumer 42m persistentvolume-controller waiting for first consumer to be created before binding
Normal ExternalProvisioning 2m23s (x163 over 42m) persistentvolume-controller waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
This has external provisioner "ebs.csi.aws.com", so it could be a mismatch in the requested provisioner. Our default storage class has "provisioner":"kubernetes.io/aws-ebs". I did try to change the PVC annotation, but without any luck (tried on the CLI and with the profile.yaml and a change of the values).
I'm also wondering where I could see the logs of the provisioner, and check if it is triggered and what the error is. It might be that we are missing permissions again.
I tried to include these for LutzAdm, but failed: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/example-iam-policy.json
The old in-tree EBS provisioner implementation is no longer being extended. It might just be that EKS defaults to the in-tree Kubernetes provisioner (kubernetes.io/aws-ebs).
There is an aws-ebs-csi-driver addon to EKS. I'm testing this in the next step to see if that will change the storage class definition and the attached provisioner.
There is a bit more to it than just adding the addon.
See : https://docs.aws.amazon.com/eks/latest/userguide/managing-ebs-csi.html
I'll reach out to Richard to get input on the above items.
Next steps: see if the K8s version makes any difference, and add a configuration for the EKS CSI driver so we can use that as the default storage provider, since the 'in-tree' driver will be deprecated soon.
Helpful docs: https://cluster-api-aws.sigs.k8s.io/topics/eks/addons.html - how to add addons to EKS clusters in CAPA. https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/install.md - Manual installation instructions from the source project.
As this is implemented by eksctl, we may be able to re-use code from that project.
I attached the policy from here to the role used by the nodes in the CX account.
I was able to install the addon :
clusterawsadm eks addons list-installed -n eks-elk13 -r eu-central-1
Installed addons for cluster eks-elk13:
NAME                 VERSION              STATUS   CREATED                             MODIFIED                            SA ARN   ISSUES
aws-ebs-csi-driver   v1.10.0-eksbuild.1   ACTIVE   2022-09-06 06:42:53.888 +0000 UTC   2022-09-06 06:57:55.643 +0000 UTC            0
The default storage class is not using the new provisioner. I thought a manual test should be done before proceeding with automation and changing the values of the elk stack profile.
I deleted the default gp2 storage class.
I created a modified gp2 storage class that uses the provisioner requested by the standard elk-stack.
I created a new pv to test that storage class.
It did stay in pending and was not bound. Do I need to have a pod that wants to use it to trigger the provisioning?
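On that last question: with volumeBindingMode: WaitForFirstConsumer the claim is expected to stay Pending until a pod actually mounts it, so yes, a consumer pod is needed to trigger provisioning. A minimal test could look like this (hypothetical names; the class name assumes the modified gp2 class):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-test-claim            # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp2           # the modified class under test
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: csi-test-pod              # hypothetical name
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - mountPath: /data
          name: test-vol
  volumes:
    - name: test-vol
      persistentVolumeClaim:
        claimName: csi-test-claim   # scheduling this pod should trigger provisioning
```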
See Storage Issue #33 (https://github.com/weaveworks/sa-demos/issues/33): we are missing permissions.
The storage issue is solved. We still need to sort out Ingress and Authentication.
Closing this issue as we have an ELK stack template that works.
Tony and Mostafa want to have ELK available on demo environments.
We do have an ELK stack in the profiles catalog.
I guess we just need to push it into the weave-gitops-profile-examples repo.
Longer term goal would be to have ELK available and deployed in the management cluster alongside prometheus & grafana. Leaf clusters should send their metrics and logs to a central collection.