Weird behaviour in CloudWatch ELB metrics

I'm not sure whether this is a bug or just something that should be mentioned in the documentation as a warning, but we sometimes observe weird behaviour in our ELB-based metrics.

We have a CloudWatch check that fetches the SUM of requests per second (per minute) and filters by AvailabilityZone: NOT_SET. Normally we see the correct values, but there seems to be a race condition in which the values are roughly halved. The values shown in CloudWatch directly are fine.
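For anyone who wants to cross-check the numbers outside of ZMON, here is a minimal sketch of the equivalent CloudWatch query. It only builds the parameters for the `GetMetricStatistics` API and assumes that `AvailabilityZone: NOT_SET` corresponds to leaving the AvailabilityZone dimension out, so that the load-balancer-wide aggregate is returned. The helper name `build_request_count_query`, the load balancer name `my-elb` and the timestamp are placeholders, not values from our setup.

```python
from datetime import datetime, timedelta, timezone


def build_request_count_query(lb_name, end_time, minutes=15):
    """Build keyword arguments for a CloudWatch GetMetricStatistics call."""
    return {
        "Namespace": "AWS/ELB",
        "MetricName": "RequestCount",
        # Only the load balancer dimension is given; AvailabilityZone is
        # deliberately left out (assumed equivalent of NOT_SET).
        "Dimensions": [{"Name": "LoadBalancerName", "Value": lb_name}],
        "StartTime": end_time - timedelta(minutes=minutes),
        "EndTime": end_time,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Sum"],
    }


# Placeholder load balancer name and end time.
params = build_request_count_query(
    "my-elb", datetime(2016, 1, 1, 12, 0, tzinfo=timezone.utc)
)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch").get_metric_statistics(**params)
```

Comparing the datapoints returned this way with the values the check reports for the same minutes should show whether the halved values come from CloudWatch or from the timing of the check.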

The root cause seems to be a timing issue in AWS itself, related to when it collects the per-AZ values in order to sum them. We suspect this because we can get rid of the "halving behaviour" by triggering a forced evaluation of the associated alert a few times, which changes the timing of when this check runs.
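To illustrate the suspected mechanism (this is only our hypothesis, not confirmed AWS behaviour): if the load balancer spans two availability zones with similar traffic and the sum is taken before one zone has reported, the result is roughly half of the real value. The zone names and request counts below are made up for the example.

```python
def sum_across_zones(per_zone_counts, zones_reported):
    """Sum request counts, but only for zones that have reported so far."""
    return sum(
        count
        for zone, count in per_zone_counts.items()
        if zone in zones_reported
    )


# Hypothetical per-minute request counts for a load balancer in two zones.
per_zone = {"eu-west-1a": 510, "eu-west-1b": 490}

# Both zones have reported: the full value.
complete = sum_across_zones(per_zone, {"eu-west-1a", "eu-west-1b"})
# Only one zone has reported yet: roughly half of the full value.
partial = sum_across_zones(per_zone, {"eu-west-1a"})

print(complete)  # 1000
print(partial)   # 510
```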

Sorry, I can't really give a better way to reproduce this. We only saw it with one check, but there it was very consistent: for multiple consecutive days, 4 out of 15 measurements showed about half the value we saw in CloudWatch directly. Triggering a forced evaluation of the pulling alert a few times fixed that behaviour.
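A small sketch of how the affected measurements could be picked out when comparing the check results with the values read from CloudWatch directly. The function name `find_halved`, the 15% tolerance and the sample numbers are assumptions for illustration, not our real data.

```python
def find_halved(check_values, reference_values, tolerance=0.15):
    """Return indices where the check value is roughly half the reference."""
    halved = []
    for i, (check, ref) in enumerate(zip(check_values, reference_values)):
        # Flag the measurement when check/ref is within `tolerance` of 0.5.
        if ref > 0 and abs(check / ref - 0.5) <= tolerance:
            halved.append(i)
    return halved


# Made-up sample: values seen in CloudWatch vs. values reported by the check.
reference = [1000, 1020, 980, 1010]
check = [1000, 505, 980, 520]

print(find_halved(check, reference))  # [1, 3]
```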
