zmon-redis downtime - Githubissues

zalando-incubator / kubernetes-on-aws

Deploying Kubernetes on AWS with CloudFormation and Ubuntu

https://kubernetes-on-aws.readthedocs.io/

MIT License


zmon-redis downtime #842

Closed: mohabusama closed this issue 6 years ago

mohabusama commented 6 years ago

Since zmon-redis runs as a single pod, the pod can get rescheduled (e.g. by autoscaling), which in turn leads to some unexpected behavior:

- Local alert state changes, which could lead to inconsistent paging
- No check execution during the rescheduling period

We have various alternatives afaik:

- Persist Redis data.
- Make sure Redis does not get rescheduled.
- Deploy an HA Redis.
- Switch to external Redis (e.g. ELC).
- ...?
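
For the "make sure Redis does not get rescheduled" option, one possible approach (a sketch, not something confirmed in this thread) is the pod annotation `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"`, which the Kubernetes cluster autoscaler checks before removing a node during scale-down. The Deployment below is hypothetical: its name, labels, and image are assumptions, not the real zmon-redis manifest. The annotation only covers autoscaler scale-down; node failures and rolling updates would still restart the pod.

```python
import json

# Annotation checked by the Kubernetes cluster autoscaler: a node running a pod
# annotated with the value "false" is not removed during scale-down.
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def pin_pod_template(deployment: dict) -> dict:
    """Return a copy of a Deployment manifest whose pod template opts out of
    autoscaler-driven eviction. The input manifest is left unchanged."""
    patched = json.loads(json.dumps(deployment))  # deep copy via JSON round trip
    metadata = patched["spec"]["template"].setdefault("metadata", {})
    metadata.setdefault("annotations", {})[SAFE_TO_EVICT] = "false"
    return patched


# Hypothetical minimal Deployment; name, labels and image are assumptions.
zmon_redis = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "zmon-redis"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"application": "zmon-redis"}},
        "template": {
            "metadata": {"labels": {"application": "zmon-redis"}},
            "spec": {"containers": [{"name": "redis", "image": "redis"}]},
        },
    },
}

if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(pin_pod_template(zmon_redis), indent=2))
```

The printed JSON could be piped into `kubectl apply -f -`.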

Jan-M commented 6 years ago

This is just another case of the auto-scaler not honoring and prioritizing nodes for termination.

If the cluster autoscaler, when scaling down, preferred nodes without StatefulSets or nodes whose removal does not impact pod disruption budgets, this could easily be prevented.
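
To make the pod disruption budget idea concrete, here is a minimal sketch that builds a PodDisruptionBudget manifest for a single-replica workload. As far as I know, the upstream cluster autoscaler will not drain a node if evicting one of its pods would violate a PDB, so `maxUnavailable: 0` should block voluntary evictions of the selected pod (manual node drains included). The helper name `pod_disruption_budget` and the label selector are assumptions for illustration; `policy/v1` is the current API group, while clusters from the time of this issue would have used `policy/v1beta1`.

```python
import json


def pod_disruption_budget(name: str, match_labels: dict, max_unavailable: int = 0) -> dict:
    """Build a PodDisruptionBudget manifest as a plain dict.

    With max_unavailable=0, no selected pod may be evicted voluntarily.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "maxUnavailable": max_unavailable,
            # Copy the labels so later changes by the caller do not leak in.
            "selector": {"matchLabels": dict(match_labels)},
        },
    }


if __name__ == "__main__":
    # Hypothetical name and labels; adjust to the real zmon-redis deployment.
    pdb = pod_disruption_budget("zmon-redis", {"application": "zmon-redis"})
    print(json.dumps(pdb, indent=2))
```

A PDB only guards against voluntary disruptions; it does not help when a node fails outright, which is where persistence or an HA setup would still be needed.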

szuecs commented 6 years ago

IMHO this is working as designed in Kubernetes: Pods can be terminated at any time. Systems that need a single write master and cannot run with more than 2 replicas, yet have no operator that takes ownership of failover to a replica (similar to https://github.com/zalando-incubator/postgres-operator), are a bug in themselves. Therefore closing this, because it has to be solved by the Redis application owner.