Open mikkeloscar opened 6 years ago
The main problem with this is that we have quite stable input from the ZMON scheduler into the queue. Once the queue reaches length 0, we must not scale down again, so that we keep up the current worker throughput.
What about exposing a value other than queue length? E.g. "scheduled checks per minute" or whatever makes sense for the workers; then you have a number that will not be 0.
Just an idea: if zmon-scheduler exposed a count of scheduled events in Prometheus format, then we could use a Prometheus query as the metric source for scaling, e.g. events per minute.
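A minimal stdlib-only sketch of what that could look like on the scheduler side: a counter bumped for every scheduled check, rendered in the Prometheus text exposition format. The metric name and the schedule_check/render_metrics functions are made up for illustration; the real zmon-scheduler does not expose this today.

```python
# Hypothetical counter incremented each time the scheduler enqueues a check.
scheduled_checks_total = 0

def schedule_check(check):
    """Pretend to enqueue a check, then count it (sketch only)."""
    global scheduled_checks_total
    # ... push the check onto the Redis queue here ...
    scheduled_checks_total += 1

def render_metrics():
    # Prometheus text exposition format: HELP/TYPE lines, then the sample.
    return (
        "# HELP zmon_scheduler_checks_scheduled_total Checks pushed to the queue.\n"
        "# TYPE zmon_scheduler_checks_scheduled_total counter\n"
        f"zmon_scheduler_checks_scheduled_total {scheduled_checks_total}\n"
    )
```

Prometheus could then derive events per minute from the counter with something like rate(zmon_scheduler_checks_scheduled_total[1m]) * 60, which never drops to a misleading 0 the way raw queue length does.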
Bumping this up, let's start discussing it again.
Maybe queue length aggregated over a specified time frame is good enough.
It could work like this:
sum(rate(queue_size{}[5m]))
This should work without exposing zmon-scheduler stats, because the rate of items entering the queue will match the scheduling rate and should not fluctuate too much.
@szuecs We wouldn't need the zmon check, we can simply have an HPA based on prometheus query: https://github.com/zalando-incubator/kube-metrics-adapter#example-external-metric
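A hedged sketch of what such an HPA could look like, following the external-metric annotation format from the linked kube-metrics-adapter README. The metric name in the query, the target value, and the replica bounds are placeholders for illustration, not tested configuration.

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: zmon-worker
  annotations:
    # Placeholder PromQL: scheduled events per second, averaged over 5 minutes.
    metric-config.external.prometheus-query.prometheus/scheduled-events: |
      sum(rate(zmon_scheduler_checks_scheduled_total[5m]))
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: zmon-worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: prometheus-query
        selector:
          matchLabels:
            query-name: scheduled-events
      target:
        type: AverageValue
        averageValue: "50"
```

With this approach the scaling logic lives entirely in the HPA and Prometheus, so no ZMON check is involved.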
True, but if you need more logic you can do this in a ZMON check.
We should scale the number of zmon-workers running in Kubernetes based on the Redis queue length. Since we now have custom metrics available in our Kubernetes setup, we can do it by exposing a metric from the pods and scaling based on that.
If each zmon-worker could expose the current Redis queue length in a JSON metrics endpoint, then we could use the Horizontal Pod Autoscaler configuration described here: https://github.com/zalando-incubator/kube-metrics-adapter#example to do the scaling. This would allow us to run with a baseline of 1 zmon-worker in each cluster and only scale up when needed.
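A small sketch of such a JSON metrics endpoint inside the worker, using only the Python standard library. The queue key, the JSON shape, and the redis attribute on the server are assumptions for illustration; the actual path/shape would have to match the json-path collector configuration from the kube-metrics-adapter README.

```python
import json
from http.server import BaseHTTPRequestHandler

def get_queue_length(redis_client, queue_key="zmon:queue:default"):
    # Assumption: the check queue is a Redis list, so LLEN gives the backlog.
    # The queue key name here is a placeholder.
    return redis_client.llen(queue_key)

def metrics_json(queue_length):
    # Shape the HPA's json-path collector would read, e.g. via $.queue.length.
    # The exact path is whatever the pod annotations configure; this is a sketch.
    return json.dumps({"queue": {"length": queue_length}})

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Hypothetical: the server object carries the worker's Redis client.
            body = metrics_json(get_queue_length(self.server.redis)).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()
```

Each worker pod would serve this endpoint, and the HPA would average the reported queue length across pods to decide on scaling.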
The alternative to the JSON metrics endpoint would be to scale on a ZMON check, but it would not make sense to depend on ZMON in order to scale... ZMON. :)