zalando-zmon / zmon-worker

ZMON Python Worker
https://zmon.io/

zmon-worker connect to zmon-redis causes flaky uncatchable alert #276

Open szuecs opened 6 years ago

szuecs commented 6 years ago

During cluster updates, zmon-redis is down for some time, and zmon-worker does not tolerate this downtime. It triggers uncatchable exceptions, which provide no value to us. We would like these alerts not to be triggered.

I hope I provided enough information.

alert

def alert():
    return value.get("pods", 0) > 1000

check

def check():
    try:
        return {
            "pods": len(kubernetes(namespace=None).pods()),
            "_use_scheduled_time": True,
        }
    except Exception as e:
        return {"exception": str(e), "_use_scheduled_time": True}

history

2017-09-29 13:42:44 | ALERT_ENTITY_STARTED | "kube-cluster[aws:537814120105:eu-central-1]" | {"td":131.41097402572632,"worker":"plocal.zmon-worker-2295089407-ggqbz","ts":1.506685224806833E9,"value":"Error 110 connecting to zmon-redis:6379. Connection timed out.","exc":1}
-- | -- | -- | --

kubernetes pods

% kubectl get pods -n kube-system -l application=zmon-redis
NAME                          READY     STATUS    RESTARTS   AGE
zmon-redis-1546107048-9lzgg   1/1       Running   0          11m

% kubectl get pods -n kube-system -l application=zmon-worker
NAME                           READY     STATUS    RESTARTS   AGE
zmon-worker-2295089407-ggqbz   2/2       Running   0          46m
zmon-worker-2295089407-qp991   2/2       Running   0          7m

beverage commented 6 years ago

@szuecs Is this still an issue for you?

szuecs commented 6 years ago

@beverage did you fix it? If not, then sure, there is still a problem. @mohabusama might be the right person to answer this question.

mohabusama commented 6 years ago

This is still an issue.

szuecs commented 5 years ago

@mohabusama can't you just save the exception and store it in some value, for example "exception", to pass it to the alert function? Then nobody needs to wrap these check functions, and it can easily be handled by the alert.
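
A minimal sketch of the idea above, not the actual zmon-worker code. `run_check_safely`, `failing_check` and `healthy_check` are hypothetical names, and `alert` takes `value` as a parameter here only so the example is self-contained. It assumes the worker could catch the error around the check and hand a dict with an "exception" key to the alert:

```python
# Hypothetical sketch: none of these helper names come from zmon-worker.

def run_check_safely(check, use_scheduled_time=True):
    """Run a check and turn any raised exception into a plain dict value."""
    try:
        return check()
    except Exception as e:
        # Store the error under "exception" so the alert can inspect it.
        return {"exception": str(e), "_use_scheduled_time": use_scheduled_time}


def alert(value):
    # Results that only carry a stored exception do not trigger the alert.
    if "exception" in value:
        return False
    return value.get("pods", 0) > 1000


def failing_check():
    # Stand-in for a check that fails while zmon-redis is unreachable.
    raise ConnectionError(
        "Error 110 connecting to zmon-redis:6379. Connection timed out."
    )


def healthy_check():
    # Stand-in for a check that succeeds and reports a pod count.
    return {"pods": 1200, "_use_scheduled_time": True}
```

With this shape, the Redis timeout becomes an ordinary value that the alert can ignore or act on, instead of an exception that bypasses the alert function.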