Open mikkeloscar opened 6 years ago
The main problem with this is that we have quite stable input from the ZMON scheduler into the queue. Once the queue reaches length 0, we must not scale down again, so that we keep up the current worker throughput.
What about exposing a value other than queue length? E.g. "scheduled checks per minute" or whatever makes sense for the workers; then you have a number that will not be 0.
Just an idea: if zmon-scheduler exposed a count of scheduled events in Prometheus format, then we could use a Prometheus query as the metric source for scaling, e.g. events per minute.
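A minimal stdlib-only sketch of what that could look like on the scheduler side: a counter bumped for every scheduled check, rendered in the Prometheus text exposition format. The metric name and the schedule_check/render_metrics functions are made up for illustration; the real zmon-scheduler does not expose this today.

```python
# Hypothetical counter incremented each time the scheduler enqueues a check.
scheduled_checks_total = 0

def schedule_check(check):
    """Pretend to enqueue a check, then count it (sketch only)."""
    global scheduled_checks_total
    # ... push the check onto the Redis queue here ...
    scheduled_checks_total += 1

def render_metrics():
    # Prometheus text exposition format: HELP/TYPE lines, then the sample.
    return (
        "# HELP zmon_scheduler_checks_scheduled_total Checks pushed to the queue.\n"
        "# TYPE zmon_scheduler_checks_scheduled_total counter\n"
        f"zmon_scheduler_checks_scheduled_total {scheduled_checks_total}\n"
    )
```

Prometheus could then derive events per minute from the counter with something like rate(zmon_scheduler_checks_scheduled_total[1m]) * 60, which never drops to a misleading 0 the way raw queue length does.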
Bumping this up, let's start discussing it again.
Maybe queue length aggregated over a specified time frame is good enough.
It could work like this:
sum(rate(queue_size{}[5m]))
This should work without exposing zmon-scheduler stats, because the rate of items entering the queue will match the scheduling rate and should not fluctuate too much.
@szuecs We wouldn't need the zmon check, we can simply have an HPA based on prometheus query: https://github.com/zalando-incubator/kube-metrics-adapter#example-external-metric
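A hedged sketch of what such an HPA could look like, following the external-metric annotation format from the linked kube-metrics-adapter README. The metric name in the query, the target value, and the replica bounds are placeholders for illustration, not tested configuration.

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: zmon-worker
  annotations:
    # Placeholder PromQL: scheduled events per second, averaged over 5 minutes.
    metric-config.external.prometheus-query.prometheus/scheduled-events: |
      sum(rate(zmon_scheduler_checks_scheduled_total[5m]))
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: zmon-worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: prometheus-query
        selector:
          matchLabels:
            query-name: scheduled-events
      target:
        type: AverageValue
        averageValue: "50"
```

With this approach the scaling logic lives entirely in the HPA and Prometheus, so no ZMON check is involved.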
True, but if you need more logic you can do this in a ZMON check.
We should scale the number of zmon-workers running in Kubernetes based on the Redis queue length. Since we now have custom metrics available in our Kubernetes setup, we can do it by exposing a metric from the pods and scaling based on that.
If each zmon-worker could expose the current Redis queue length in a JSON metrics endpoint, then we could use the Horizontal Pod Autoscaler configuration described here: https://github.com/zalando-incubator/kube-metrics-adapter#example to do the scaling. This would allow us to run with a baseline of 1 zmon-worker in each cluster and only scale up when needed.
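A small sketch of such a JSON metrics endpoint inside the worker, using only the Python standard library. The queue key, the JSON shape, and the redis attribute on the server are assumptions for illustration; the actual path/shape would have to match the json-path collector configuration from the kube-metrics-adapter README.

```python
import json
from http.server import BaseHTTPRequestHandler

def get_queue_length(redis_client, queue_key="zmon:queue:default"):
    # Assumption: the check queue is a Redis list, so LLEN gives the backlog.
    # The queue key name here is a placeholder.
    return redis_client.llen(queue_key)

def metrics_json(queue_length):
    # Shape the HPA's json-path collector would read, e.g. via $.queue.length.
    # The exact path is whatever the pod annotations configure; this is a sketch.
    return json.dumps({"queue": {"length": queue_length}})

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Hypothetical: the server object carries the worker's Redis client.
            body = metrics_json(get_queue_length(self.server.redis)).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()
```

Each worker pod would serve this endpoint, and the HPA would average the reported queue length across pods to decide on scaling.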
The alternative to the JSON metrics endpoint would be to scale on a ZMON check, but it would not make sense to depend on ZMON in order to scale... ZMON. :)