Upgrade camunda platform helm chart

Zelldon commented 9 months ago

Description

Our dependency on the Camunda Platform Helm chart had been outdated for quite a while. This is problematic because we are no longer testing the features we have built, nor the setup of the Camunda Platform Helm chart itself.

Changes

Several things have changed: ES 8 is now supported, ES has migrated to a different Helm chart (Bitnami), the curator was replaced by ILM, labels have changed, and Camunda Platform no longer has sub-charts (only sub-folders in the templates as of now).

The PR includes several adjustments to the values files to cover these changes.

Closes https://github.com/zeebe-io/benchmark-helm/issues/127
Closes https://github.com/zeebe-io/benchmark-helm/issues/126
Closes https://github.com/zeebe-io/benchmark-helm/issues/125

Security context:

We need to make sure that we don't overwrite existing values in containerSecurityContext (the chart now ships defaults)
The init container for pyroscope needs to run as non-root as well and has to be configured accordingly

Elasticsearch:

Upgrade ES to a supported 8.9.x version
Enable the new ILM retention configuration and set a useful value for minimumAge
The ES configuration no longer worked; we had to migrate it to the Bitnami ES Helm chart's data structure
Prometheus Elasticsearch metrics: the ES URI has changed and had to be adjusted

Furthermore, due to the sub-chart to sub-folder migration in the Camunda Platform charts, several paths have changed, which we had to adjust in our golden tests.

Benchmark

Right now we have a benchmark running to verify that everything works, especially the ILM and Elasticsearch setup.

https://grafana.dev.zeebe.io/d/zeebe-dashboard/zeebe?orgId=1&var-DS_PROMETHEUS=prometheus&var-cluster=All&var-namespace=ck-test-helm&var-pod=All&var-partition=All&from=1705065208717&to=1705067183123

[Image: 2024-01-12_14-47]

[Image: 2024-01-12_14-47_1]

Elasticsearch metrics can be found here: https://grafana.dev.zeebe.io/d/elasticsearch/elasticsearch?orgId=1&var-datasource=prometheus&var-cluster=All&var-namespace=ck-test-helm&var-index=All

[Image: 2024-01-12_14-46]

Zelldon commented 9 months ago

I investigated the ILM deletion further. At first I thought it didn't work with the ILM settings, which is why I made it possible to create hourly indices: https://github.com/camunda/zeebe/pull/15953

It turned out I was wrong. I have a benchmark running with unchanged indices and the smaller disk size (16 gig), and it is running fine.

[Image: ck-helm-defaults-es-disk]

Based on the logs, we can see that the indices need to go through different phases before getting deleted (I selected just one index for simplicity):

# Creation
[2024-01-16T19:59:00,325][INFO ][o.e.x.i.IndexLifecycleTransition] [zeebe-benchmark-test-elasticsearch-master-1] moving index [zeebe-record_variable_8.5.0-snapshot_2024-01-16] from [null] to [{"phase":"new","action":"complete","name":"complete"}] in policy [zeebe-record-retention-policy]
[2024-01-16T19:59:00,783][INFO ][o.e.c.r.a.AllocationService] [zeebe-benchmark-test-elasticsearch-master-1] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[zeebe-record_variable_8.5.0-snapshot_2024-01-16][0]]])." previous.health="YELLOW" reason="shards started [[zeebe-record_variable_8.5.0-snapshot_2024-01-16][0]]"

# start of deletion
[2024-01-16T20:18:59,809][INFO ][o.e.x.i.IndexLifecycleTransition] [zeebe-benchmark-test-elasticsearch-master-1] moving index [zeebe-record_variable_8.5.0-snapshot_2024-01-16] from [{"phase":"new","action":"complete","name":"complete"}] to [{"phase":"delete","action":"delete","name":"wait-for-shard-history-leases"}] in policy [zeebe-record-retention-policy]

[2024-01-16T20:28:59,772][INFO ][o.e.x.i.IndexLifecycleTransition] [zeebe-benchmark-test-elasticsearch-master-1] moving index [zeebe-record_variable_8.5.0-snapshot_2024-01-16] from [{"phase":"delete","action":"delete","name":"wait-for-shard-history-leases"}] to [{"phase":"delete","action":"delete","name":"cleanup-snapshot"}] in policy [zeebe-record-retention-policy]

# Now it gets deleted 30 min after creation
[2024-01-16T20:28:59,797][INFO ][o.e.x.i.IndexLifecycleTransition] [zeebe-benchmark-test-elasticsearch-master-1] moving index [zeebe-record_variable_8.5.0-snapshot_2024-01-16] from [{"phase":"delete","action":"delete","name":"cleanup-snapshot"}] to [{"phase":"delete","action":"delete","name":"delete"}] in policy [zeebe-record-retention-policy]
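
To double-check the "30 min after creation" observation, the timestamps from the log excerpt above can be subtracted directly. This is only a small sketch: it assumes GNU `date` (for the `-d` flag), the helper names `to_epoch` and `elapsed_minutes` are made up for illustration, and milliseconds are dropped before the result is rounded to the nearest minute.

```shell
#!/usr/bin/env bash
# Sketch: minutes elapsed between two Elasticsearch log timestamps.
# Assumes GNU date; helper names are hypothetical, timestamps are copied from the logs above.
set -euo pipefail

# Convert e.g. "2024-01-16T19:59:00,325" to epoch seconds (milliseconds are dropped).
to_epoch() {
  local ts="${1%%,*}"            # strip the ",325" millisecond suffix
  date -u -d "${ts/T/ }" +%s     # parse "2024-01-16 19:59:00" as UTC
}

# Print the difference between two timestamps, rounded to the nearest minute.
elapsed_minutes() {
  local start end
  start="$(to_epoch "$1")"
  end="$(to_epoch "$2")"
  echo $(( (end - start + 30) / 60 ))
}

created="2024-01-16T19:59:00,325"       # [null] -> phase "new"
delete_phase="2024-01-16T20:18:59,809"  # "new" -> phase "delete" (wait-for-shard-history-leases)
delete_step="2024-01-16T20:28:59,797"   # cleanup-snapshot -> delete

echo "creation -> delete phase: $(elapsed_minutes "$created" "$delete_phase") min"
echo "creation -> delete step:  $(elapsed_minutes "$created" "$delete_step") min"
```

For the index shown, this prints 20 min and 30 min. The transitions land roughly 10 minutes apart, which would be consistent with Elasticsearch's default ILM poll interval of 10 minutes, although the interval used in this benchmark is not stated here.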

Zelldon commented 9 months ago

@npepinpe I think we can go ahead and merge this change. Tomorrow I will test how the charts behave when we upgrade from an older version of the chart to a newer one. Based on the results, we might want to pin the release charts to the older version for now.

Zelldon commented 9 months ago

I have a benchmark with the mixed setup (incl. Operate running) here

How to set up:

helm install zeebe-benchmark charts/zeebe-benchmark \
        --set starter.rate=5 \
        --set worker.replicas=1 \
        --set timer.replicas=1 \
        --set timer.rate=5 \
        --set publisher.replicas=1 \
        --set publisher.rate=5 \
        --set camunda-platform.operate.enabled=true \
        --set camunda-platform.operate.image.repository=camunda/operate \
        --set camunda-platform.operate.image.tag=SNAPSHOT \
        --set camunda-platform.elasticsearch.master.persistence.size=128Gi \
        --set camunda-platform.zeebe.retention.minimumAge=1d \
        --set camunda-platform.operate.retention.minimumAge=30m
