Closed Zelldon closed 9 months ago
I investigated the ILM deletion further. At first I thought it didn't work with our ILM settings, which is why I made it possible to create hourly indices: https://github.com/camunda/zeebe/pull/15953
It turned out I was wrong. I have a benchmark running with unchanged indices and the smaller disk size (16 gig), and it is running fine.
Based on the logs, we can see that an index needs to go through several phases before it gets deleted (I selected just one index for simplicity):
# Creation
[2024-01-16T19:59:00,325][INFO ][o.e.x.i.IndexLifecycleTransition] [zeebe-benchmark-test-elasticsearch-master-1] moving index [zeebe-record_variable_8.5.0-snapshot_2024-01-16] from [null] to [{"phase":"new","action":"complete","name":"complete"}] in policy [zeebe-record-retention-policy]
[2024-01-16T19:59:00,783][INFO ][o.e.c.r.a.AllocationService] [zeebe-benchmark-test-elasticsearch-master-1] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[zeebe-record_variable_8.5.0-snapshot_2024-01-16][0]]])." previous.health="YELLOW" reason="shards started [[zeebe-record_variable_8.5.0-snapshot_2024-01-16][0]]"
# start of deletion
[2024-01-16T20:18:59,809][INFO ][o.e.x.i.IndexLifecycleTransition] [zeebe-benchmark-test-elasticsearch-master-1] moving index [zeebe-record_variable_8.5.0-snapshot_2024-01-16] from [{"phase":"new","action":"complete","name":"complete"}] to [{"phase":"delete","action":"delete","name":"wait-for-shard-history-leases"}] in policy [zeebe-record-retention-policy]
[2024-01-16T20:28:59,772][INFO ][o.e.x.i.IndexLifecycleTransition] [zeebe-benchmark-test-elasticsearch-master-1] moving index [zeebe-record_variable_8.5.0-snapshot_2024-01-16] from [{"phase":"delete","action":"delete","name":"wait-for-shard-history-leases"}] to [{"phase":"delete","action":"delete","name":"cleanup-snapshot"}] in policy [zeebe-record-retention-policy]
# Now it gets deleted 30 min after creation
[2024-01-16T20:28:59,797][INFO ][o.e.x.i.IndexLifecycleTransition] [zeebe-benchmark-test-elasticsearch-master-1] moving index [zeebe-record_variable_8.5.0-snapshot_2024-01-16] from [{"phase":"delete","action":"delete","name":"cleanup-snapshot"}] to [{"phase":"delete","action":"delete","name":"delete"}] in policy [zeebe-record-retention-policy]
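To check the timing in these logs without reading the timestamps by hand, the transitions can be parsed with a small script. This is only a sketch: the regular expression and the `parse_transition` helper are my own illustration, written against the log format shown above, and are not part of Elasticsearch or Zeebe.

```python
import json
import re
from datetime import datetime

# Pattern for IndexLifecycleTransition lines in the format shown above.
# Assumption: the two step descriptions are either `null` or a JSON object.
TRANSITION = re.compile(
    r"^\[(?P<ts>[^\]]+)\].*moving index \[(?P<index>[^\]]+)\] "
    r"from \[(?P<src>.+?)\] to \[(?P<dst>.+?)\] in policy \[(?P<policy>[^\]]+)\]"
)


def parse_transition(line):
    """Return the transition described by one log line, or None if it is not one."""
    m = TRANSITION.search(line)
    if m is None:
        return None
    src = json.loads(m["src"])  # "null" becomes None for a freshly created index
    dst = json.loads(m["dst"])
    return {
        "time": datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S,%f"),
        "index": m["index"],
        "policy": m["policy"],
        "from_step": src["name"] if src else None,
        "to_phase": dst["phase"],
        "to_step": dst["name"],
    }


# First and last log lines from above, copied verbatim.
CREATED_LINE = (
    '[2024-01-16T19:59:00,325][INFO ][o.e.x.i.IndexLifecycleTransition] '
    '[zeebe-benchmark-test-elasticsearch-master-1] moving index '
    '[zeebe-record_variable_8.5.0-snapshot_2024-01-16] from [null] to '
    '[{"phase":"new","action":"complete","name":"complete"}] '
    'in policy [zeebe-record-retention-policy]'
)
DELETED_LINE = (
    '[2024-01-16T20:28:59,797][INFO ][o.e.x.i.IndexLifecycleTransition] '
    '[zeebe-benchmark-test-elasticsearch-master-1] moving index '
    '[zeebe-record_variable_8.5.0-snapshot_2024-01-16] from '
    '[{"phase":"delete","action":"delete","name":"cleanup-snapshot"}] to '
    '[{"phase":"delete","action":"delete","name":"delete"}] '
    'in policy [zeebe-record-retention-policy]'
)

created = parse_transition(CREATED_LINE)
deleted = parse_transition(DELETED_LINE)
elapsed = deleted["time"] - created["time"]
print(f"{created['index']}: {elapsed.total_seconds() / 60:.1f} min from creation to delete step")
```

Feeding the full Elasticsearch log through `parse_transition` and grouping by `index` would give the same view for every index, not just the one I picked.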
@npepinpe I think we can go ahead and merge this change. Tomorrow I will test how the chart behaves when we upgrade from an older version of the chart to a newer one. Based on the results, we might want to pin the release charts to the older version for now.
I have a benchmark with the mixed setup (incl. Operate running) here
How to set up:
helm install zeebe-benchmark charts/zeebe-benchmark \
--set starter.rate=5 \
--set worker.replicas=1 \
--set timer.replicas=1 \
--set timer.rate=5 \
--set publisher.replicas=1 \
--set publisher.rate=5 \
--set camunda-platform.operate.enabled=true \
--set camunda-platform.operate.image.repository=camunda/operate \
--set camunda-platform.operate.image.tag=SNAPSHOT \
--set camunda-platform.elasticsearch.master.persistence.size=128Gi \
--set camunda-platform.zeebe.retention.minimumAge=1d \
--set camunda-platform.operate.retention.minimumAge=30m
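For readers unfamiliar with ILM, here is a rough sketch of the kind of policy body such a retention setting could translate to. The request shape (`PUT _ilm/policy/<name>` with a `delete` phase and `min_age`) is the standard Elasticsearch ILM API; that `retention.minimumAge` ends up as `min_age` of the delete phase is my assumption here, and `retention_policy_body` is a hypothetical helper, not something the charts provide.

```python
import json


def retention_policy_body(minimum_age):
    """Build a body for PUT _ilm/policy/<name> that only has a delete phase.

    Assumption: minimum_age corresponds to a retention.minimumAge value
    such as "30m" or "1d" from the helm command above.
    """
    return {
        "policy": {
            "phases": {
                "delete": {
                    "min_age": minimum_age,   # index age before the delete phase starts
                    "actions": {"delete": {}},  # remove the index entirely
                }
            }
        }
    }


body = retention_policy_body("30m")
print(json.dumps(body, indent=2))
```

The policy actually installed in a cluster can be compared against this with `GET _ilm/policy/zeebe-record-retention-policy`, using the policy name from the logs.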
Description
Our dependency on the Camunda Platform helm chart has been outdated for quite a while. This is problematic because we are no longer testing the features we have built, nor the setup of the Camunda Platform helm chart itself.
Changes
Several things have changed: ES 8 is now supported, Elasticsearch has been migrated to a different helm chart (bitnami), the curator has been replaced by ILM, labels have changed, and Camunda Platform no longer uses sub-charts (only sub-folders in the templates, as of now).
The PR includes several adjustments to the values files to cover these changes.
Closes https://github.com/zeebe-io/benchmark-helm/issues/127
Closes https://github.com/zeebe-io/benchmark-helm/issues/126
Closes https://github.com/zeebe-io/benchmark-helm/issues/125
Security context:
Elasticsearch:
Furthermore, due to the sub-chart to sub-folder migration in the Camunda Platform charts, several paths have changed, which we had to adjust in our golden tests.
Benchmark
Right now we have a benchmark running to verify that everything works, especially ILM and the Elasticsearch setup.
https://grafana.dev.zeebe.io/d/zeebe-dashboard/zeebe?orgId=1&var-DS_PROMETHEUS=prometheus&var-cluster=All&var-namespace=ck-test-helm&var-pod=All&var-partition=All&from=1705065208717&to=1705067183123
Elasticsearch metrics can be found here https://grafana.dev.zeebe.io/d/elasticsearch/elasticsearch?orgId=1&var-datasource=prometheus&var-cluster=All&var-namespace=ck-test-helm&var-index=All