zalando-zmon / zmon-docs

ZMON Documentation
https://docs.zmon.io

Weird behaviour in cloud watch ELB metrics #62

Closed mo-gr closed 7 years ago

mo-gr commented 7 years ago

I'm not sure if this is a bug or just something that should be mentioned in the documentation as a warning, but we sometimes observe weird behaviour in our ELB-based metrics.

We have a CloudWatch check that fetches the SUM of requests per second (per minute) and filters by AvailabilityZone: NOT_SET. Normally we see the correct values, but there seems to be a race condition where the values are roughly halved. The values in CloudWatch itself are fine.

The root cause seems to be a timing issue in AWS itself, related to when it collects the per-AZ values to sum them. We suspect this because we can get rid of the "halving behaviour" by triggering a forced evaluation of the associated alert a few times, which changes the timing of when this check runs.

Sorry, I can't really give a better explanation of how to reproduce this. We only saw this with one check, but there it was very consistent: for multiple consecutive days, 4 out of 15 measurements showed about half the value we saw in CloudWatch directly. Triggering a forced evaluation of the pulling alert a few times fixed that behaviour.

Jan-M commented 7 years ago

Yes, this is kind of known and maybe indeed deserves a note on any CloudWatch metric where the underlying metric consists of multiple data points.

As you pointed out, it is a timing issue. You can also see this in the AWS UI: sometimes you only get 1 out of 2 data points, and on the next refresh things are fine. However, as ZMON has already stored the observed value, it will not fix itself.
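To make the "1 out of 2 data points" effect concrete, here is a minimal sketch in plain Python. It does not call AWS or ZMON; the AZ names, the request counts, and the `observed_sum` helper are made up for illustration. It only shows why a SUM over two similarly loaded AZs comes out at roughly half when one per-AZ data point has not arrived yet.

```python
def observed_sum(per_az_values, arrived):
    """Sum only the per-AZ data points that have already arrived."""
    return sum(value for az, value in per_az_values.items() if az in arrived)


# Hypothetical request counts for one minute, split across two AZs.
per_az = {"eu-west-1a": 510, "eu-west-1b": 490}

# Query issued after both data points are available.
complete = observed_sum(per_az, {"eu-west-1a", "eu-west-1b"})

# Query issued too early: only one AZ has reported so far.
early = observed_sum(per_az, {"eu-west-1a"})

print(complete, early)
```

If the check stores `early`, the stored value stays at about half even though a later look at the same minute would show the full sum.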

Triggering evaluate "solves" this by moving the check and the CloudWatch query around in time.
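A related idea, sketched here only as an assumption and not as something ZMON provides, is to move the query window deliberately instead of by chance: ask only for minutes that ended some time ago, so all per-AZ data points have had time to arrive. The helper name `settled_window` and the 120 second lag are hypothetical; a suitable lag would have to be found by observation.

```python
from datetime import datetime, timedelta, timezone


def settled_window(now, period_s=60, lag_s=120, points=1):
    """Return (start, end) of a query window that ends at least lag_s before now.

    The end is floored to a period boundary so that only whole periods are covered.
    """
    end_epoch = (int(now.timestamp()) - lag_s) // period_s * period_s
    end = datetime.fromtimestamp(end_epoch, tz=timezone.utc)
    start = end - timedelta(seconds=period_s * points)
    return start, end


# Example: a check running at 12:00:30 UTC queries the minute 11:57:00 to 11:58:00.
now = datetime(2017, 1, 1, 12, 0, 30, tzinfo=timezone.utc)
start, end = settled_window(now)
print(start.isoformat(), end.isoformat())
```

The resulting `start` and `end` could then be passed as the time range of whatever CloudWatch query the check performs. The trade-off is that the stored value lags behind real time by the chosen delay.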