zalando-zmon / zmon-controller

ZMON UI and REST API
https://docs.zmon.io/

Schedule zmon check execution with some jitter / offset #663

Open otrosien opened 5 years ago

otrosien commented 5 years ago

If a check has many entities, executing it against all of them in parallel can effectively DoS the target service. zmon should offer a way to distribute the check executions (e.g. evenly) throughout the check interval.
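To illustrate what distributing executions over the interval could look like, here is a minimal sketch. It is not taken from zmon's scheduler; the function names `evenly_spaced_offsets` and `hashed_offset`, and the idea of deriving a per-entity start offset, are assumptions made for this example. The first function spaces the entities evenly, the second derives a stable pseudo-random jitter from the check and entity ids so that an entity keeps the same slot across runs.

```python
import hashlib


def evenly_spaced_offsets(entity_ids, interval_seconds):
    """Map each entity id to a start offset (in seconds) within one check interval.

    Hypothetical helper: entities are sorted so the assignment is stable,
    then spread at equal distances across the interval.
    """
    ordered = sorted(entity_ids)
    if not ordered:
        return {}
    step = interval_seconds / len(ordered)
    return {entity_id: index * step for index, entity_id in enumerate(ordered)}


def hashed_offset(check_id, entity_id, interval_seconds):
    """Return a stable pseudo-random offset in [0, interval_seconds).

    Hypothetical helper: the offset depends only on the check and entity ids,
    so it does not shift when other entities are added or removed.
    """
    digest = hashlib.sha256(f"{check_id}:{entity_id}".encode("utf-8")).digest()
    # 48 bits fit exactly into a float, so the fraction is always below 1.0.
    fraction = int.from_bytes(digest[:6], "big") / 2**48
    return fraction * interval_seconds
```

With four entities and a 60 second interval, `evenly_spaced_offsets` yields offsets of 0, 15, 30 and 45 seconds. The hashed variant trades perfectly even spacing for stability when the entity set changes.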

Currently zmon assumes that two entities are independent, and thus can be queried in parallel. But this assumption often does not hold.

Example 1: We have Elasticsearch data nodes as entities in zmon, and checks that pull local stats from them. If all data nodes are queried at the same time, this puts a lot of stress on the Elasticsearch cluster, which can lead to user-facing latency and GC pauses.

Example 2: Our neighbour team has a check that queries all main Zalando categories (as zmon entities) for the items currently returned on page 1. This check cannot be properly rate-limited in zmon and causes request spikes in our search cluster.

csenol commented 5 years ago

We have a workaround for Example 2 in the application layer. Maybe zmon could introduce a sleep in its utility functions.
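A rough sketch of what a sleep inside a utility function might look like, assuming a simple minimum-delay throttle. `Throttle` and `throttled` are hypothetical names and not part of zmon. The sketch only coordinates calls made from a single process and is not thread-safe, so it would not by itself limit checks that run on several workers in parallel.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls by sleeping.

    Hypothetical helper; clock and sleep are injectable to make it testable.
    """

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


def throttled(func, throttle):
    """Wrap func so that every call first waits on the given throttle."""
    def wrapper(*args, **kwargs):
        throttle.wait()
        return func(*args, **kwargs)
    return wrapper
```

A utility function wrapped as `throttled(fetch, Throttle(0.5))` would then issue at most one request every half second from that process, where `fetch` stands for whatever call the check makes.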