zeebe-io / zeebe-chaos

Contains everything related to chaos engineering in Zeebe, which means chaos experiments, hypothesis backlog etc.
https://zeebe-io.github.io/zeebe-chaos/
Apache License 2.0

Chaos: Terminating Gateway Stops Workflow Processing #336

Open shahamit opened 1 year ago

shahamit commented 1 year ago

Chaos Experiment

When running the terminate chaos experiment against a Zeebe cluster that was under load, we observed that the cluster stops processing any workflows thereafter.

Config: 6 brokers, 2 gateways, 6 partitions, replication factor 2.

Note that we don't have an ingress controller configured in front of the zeebe-gateway. Since our client (the benchmarking tool in this case) runs within the same cluster, this should be fine, given that the k8s service (zeebe-gateway) load-balances between the gateway instances (which isn't happening, but that's a separate issue).

We were hoping that, since the client (benchmarking tool) connects to the k8s zeebe-gateway service, terminating one of the gateway instances wouldn't have any impact on the client. I don't understand why we see errors on the client. Please share more insights.

Thanks.

Benchmarking tool logs: (screenshot ksnip_20230321-181948)

Terminate command output: (screenshot ksnip_20230321-181029)

Zelldon commented 1 year ago

Hey @shahamit, sorry for the late reply.

Regarding:

> When running the terminate chaos experiment against a Zeebe cluster that was under load, we observed that the cluster stops processing any workflows thereafter.

What do you mean by "stops"? Does it recover afterwards, eventually, after some time?

Zelldon commented 1 year ago

> We were hoping that since the client (benchmarking tool) connects to the k8s zeebe-gateway service, terminating one of the gateway instances shouldn't have any impact on the client.

The effect will never be zero, because some requests might fail or time out. But yes, I agree: after a retry it should work and go to the next gateway.
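To illustrate the retry behaviour described here, the following is a minimal sketch of client-side retries with capped exponential backoff and jitter. It is not the Zeebe client's API: `TransientError`, `call_with_retry`, and all parameter names are hypothetical, and `request` stands in for whatever call the benchmarking tool sends to the gateway.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (connection lost, timeout)."""


def call_with_retry(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 0.1,
    max_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying transient failures with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last failure to the caller
            # Double the upper bound on each attempt, capped at max_delay_s.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

With something like this in place, a request that hits the terminated gateway fails once and, assuming the retry opens a new connection through the k8s service, can land on the remaining gateway. The requests that were in flight at the moment of termination still fail or time out first, which is why the effect is never zero.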

shahamit commented 1 year ago

> What do you mean by "stops"? Does it recover afterwards, eventually, after some time?

Yes, after a few seconds the workflows did start getting processed again. In between, though, some workflows do fail (indicated by the backpressure % increasing). Given that the gateway replicas are behind a k8s service, shouldn't it automatically go to the next gateway instance instead of failing the workflows?

Zelldon commented 1 year ago

Do you have any metrics to show? What type of load are we speaking of? :thinking:

> Given that the gateway replicas are behind a k8s service, shouldn't it automatically go to the next gateway instance

Yes, if a new request comes in, I would expect something like that.

> failing the workflows?

Be aware that the process instances are not failing; they are just not continued, right?

shahamit commented 1 year ago

Sorry for the late reply, @Zelldon.

> Do you have any metrics to show? What type of load are we speaking of? :thinking:

We are running the benchmarking tool against a Zeebe cluster of 7 brokers and 2 gateways. We saw a throughput of 170 PI/s (process instances per second).

> Be aware that the process instances are not failing; they are just not continued, right?

This is hard to find out since the benchmarking tool starts around 170 process instances per second. If you can think of a way to find this out, please let me know.
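One possible approach, sketched under assumptions: record the gRPC status of every create-instance request on the client side and tally the outcomes, so that rejected or unreachable requests can be told apart from instances that were actually created. The function `tally_outcomes`, the category names, and the idea that the benchmarking tool can expose these status codes are all hypothetical. The mapping of `RESOURCE_EXHAUSTED` to broker backpressure is my understanding of how Zeebe reports it and should be verified against the Zeebe documentation.

```python
from collections import Counter
from typing import Iterable

# Standard gRPC status code names; the grouping below is an assumption for illustration.
REJECTED = {"RESOURCE_EXHAUSTED"}  # assumed: request rejected due to backpressure
UNREACHABLE = {"UNAVAILABLE", "DEADLINE_EXCEEDED"}  # gateway gone or request timed out


def tally_outcomes(status_codes: Iterable[str]) -> Counter:
    """Group the status of each create-instance request into coarse categories."""
    counts: Counter = Counter()
    for code in status_codes:
        if code == "OK":
            counts["created"] += 1
        elif code in REJECTED:
            counts["rejected_backpressure"] += 1
        elif code in UNREACHABLE:
            # A timed-out request is ambiguous: the instance may still have been created.
            counts["gateway_unreachable_or_timeout"] += 1
        else:
            counts["other"] += 1
    return counts
```

A tally like this only covers creation requests. It says nothing about whether instances that were already running continued afterwards; that would need data from the cluster side, such as metrics or exported records.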

Thanks