zegl / kube-score

Kubernetes object analysis with recommendations for improved reliability and security. kube-score actively prevents downtime and bugs in your Kubernetes YAML and Charts. Static code analysis for Kubernetes.
https://kube-score.com
MIT License

Improve Probe Checks? #285

Closed markuslackner closed 4 years ago

markuslackner commented 4 years ago

Currently kube-score reports a critical issue when the liveness and readiness probes are identical, that is, when they share the same protocol, endpoint and port.

Many application servers (e.g. Spring) have only a "/health" endpoint, which is often used for both checks. As outlined in https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html it is OK to use one endpoint for both checks IF the failureThreshold of the liveness probe is higher than the failureThreshold of the readiness probe.
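As a rough illustration of that condition (a sketch only, not kube-score's actual implementation), the comparison could look like this in Python. The probe dicts mirror the Kubernetes httpGet and failureThreshold fields; the function names are made up for this example.

```python
# Hypothetical sketch, not kube-score code. Probes are plain dicts shaped
# like the Kubernetes probe spec (httpGet.scheme/path/port, failureThreshold).

DEFAULT_FAILURE_THRESHOLD = 3  # Kubernetes default when the field is omitted


def probe_target(probe):
    """Return (scheme, path, port) for an httpGet probe."""
    http_get = probe.get("httpGet", {})
    return (
        http_get.get("scheme", "HTTP"),  # Kubernetes defaults the scheme to HTTP
        http_get.get("path"),
        http_get.get("port"),
    )


def shared_endpoint_is_acceptable(liveness, readiness):
    """True if the probes hit different targets, or if the liveness probe
    tolerates more consecutive failures than the readiness probe."""
    if probe_target(liveness) != probe_target(readiness):
        return True  # different endpoints: this rule does not apply
    live = liveness.get("failureThreshold", DEFAULT_FAILURE_THRESHOLD)
    ready = readiness.get("failureThreshold", DEFAULT_FAILURE_THRESHOLD)
    return live > ready
```

With both probes on "/health", a liveness failureThreshold of 6 against a readiness failureThreshold of 3 would pass this check, while two fully identical probes would not.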

What do you think about the following additional checks when using the same endpoint for liveness and readiness probes:

zegl commented 4 years ago

Hey Markus,

Since kube-score v1.7.0 (released in May this year), kube-score has adopted the approach that it's far more dangerous to configure a livenessProbe without fully understanding the consequences than to not configure one at all. This is described in the README_PROBES.md document.

I believe that if you're at a point where you need to have a livenessProbe, you'll also be able to set up a custom HTTP endpoint in your application server that is tailored exactly to the needs of a livenessProbe. It's just another endpoint after all.

Many application servers will check downstream services, such as the connection to a database, in the default healthcheck endpoint, which you absolutely do not want to do.
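For illustration, a dedicated liveness endpoint can be very small. The sketch below uses Python's standard http.server module; the "/livez" path, the state dict and the serve helper are made-up names for this example, and a real service would do the equivalent in its own framework.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process state. The application would set "wedged" to True
# if it detects an internal problem it cannot recover from by itself.
state = {"wedged": False}


def liveness_status():
    """Look only at in-process state; never call a database or other service."""
    if state["wedged"]:
        return 500, b"wedged"
    return 200, b"ok"


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/livez":  # hypothetical path for the liveness probe
            code, body = liveness_status()
        else:
            code, body = 404, b"not found"
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the logs


def serve(port=8080):
    """Run the probe server until interrupted (the port is an assumption)."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Calling serve() starts the server. The livenessProbe would then point at "/livez", and the readinessProbe can keep using whatever endpoint the framework already provides.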

So I think that it would be hard for me to sleep well while recommending reusing the same probe for different purposes. But maybe you can help me understand it.

What problem have you solved by reusing the same probe for both readiness and liveness, but with the livenessProbe's period*threshold longer than the readinessProbe's?
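To make the period*threshold comparison concrete: a probe needs roughly periodSeconds * failureThreshold seconds of consecutive failures before Kubernetes acts on it. Here is a small Python sketch, using the Kubernetes defaults (periodSeconds 10, failureThreshold 3) when a field is omitted. The function name and the example values are made up, and timeouts and initial delays are ignored.

```python
def failure_window_seconds(probe):
    """Approximate time of continuous failure before the probe trips."""
    period = probe.get("periodSeconds", 10)       # Kubernetes default: 10
    threshold = probe.get("failureThreshold", 3)  # Kubernetes default: 3
    return period * threshold


# Example values (assumptions): same endpoint, liveness is more tolerant.
readiness = {"periodSeconds": 10, "failureThreshold": 3}
liveness = {"periodSeconds": 10, "failureThreshold": 6}

print(failure_window_seconds(readiness))  # 30
print(failure_window_seconds(liveness))   # 60
```

With these numbers the pod would be taken out of the Service endpoints after about 30 seconds and restarted only if the failure lasts about 60 seconds.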

markuslackner commented 4 years ago

Hi!

In my case the problem arises when migrating from kube-score 1.5 to 1.7. Some services in my company have only one "/health" endpoint and have both probes configured to use it, especially those using older frameworks/application servers (newer releases often already have a "/ready" and a "/health" endpoint).

I agree with you that backend services must not be checked in a /health or /ready endpoint, but reusing a probe for both purposes is not that evil in my opinion (if you know what you are doing). Consider a /health endpoint that checks only application internals. I think there is nothing wrong with using that endpoint for both probes.

A problem arises if the liveness probe fails before the readiness probe, because the pod will be restarted before the pod/container is removed from the load balancer service. At bottom it is a logical problem: a container that is not live can never be ready. So I think if you use the same endpoint for both probes, logic dictates that the readiness probe must fail first. It's the same with different endpoints, but there the application (server) can manage it.

I see your point that it is often better not to use a liveness probe, because the container should exit if it encounters an unsolvable problem.

Thanks for your answer!! I will close that issue and take a look into the affected services.