rahanar opened this issue 4 years ago
Thanks for the detailed bug report. I would suggest we add the changes you propose first and then think about a more robust solution.
I'm surprised the HPA controller is asking for metrics of old pods; even if the metrics server is storing them, it seems like the wrong behavior to request them. I would have to check how the official "metrics-server" behaves for CPU and memory metrics, because I believe it would have similar TTLs.
Thanks @mikkeloscar for your reply! I will create a PR for this.
I looked into why HPA was getting old metrics. In my case, at least, it was querying using labels. So, kube-metrics-adapter was correctly returning objects that matched the labels. Whether they should be there or not is the issue here.
I have not looked into metrics-server, but will check it out after I prepare a PR.
Hello,
Just providing an update on the issue and the fix.
I am currently working on getting approval from my employer to contribute to this project. As soon as that's done, I will open a PR with the fix. If anyone else wants to do it, feel free to create a PR.
Thanks, Anar
Thanks for the update, did you have a chance to look at how the metrics-server handles this issue?
Hello!! Any idea when this problem will be solved? I am experiencing the same issue using the SQS queue message count metric.
Thanks, Lucas
@lsievert can you provide more data, please?
Following up on the original issue. The metrics-server keeps a list of active pods and only serves metrics for those pods: https://github.com/kubernetes-sigs/metrics-server/blob/2a1d1385123b9b7eed36f8b3efe8b78175db5b28/pkg/api/pod.go#L77-L124
We could do something similar to avoid this issue.
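For reference, here is a minimal sketch of what that could look like, assuming an informer-backed pod lister; the `PodMetric` type and `filterActivePods` function are hypothetical names, not the actual kube-metrics-adapter API:

```go
package provider

import (
	corelisters "k8s.io/client-go/listers/core/v1"
)

// PodMetric is a hypothetical cached metric value keyed by pod.
type PodMetric struct {
	Namespace string
	PodName   string
	Value     int64
}

// filterActivePods keeps only the metrics whose pod still exists (and is not
// terminating), mirroring the metrics-server behaviour linked above.
func filterActivePods(lister corelisters.PodLister, cached []PodMetric) []PodMetric {
	active := make([]PodMetric, 0, len(cached))
	for _, m := range cached {
		pod, err := lister.Pods(m.Namespace).Get(m.PodName)
		if err != nil || pod.DeletionTimestamp != nil {
			// Pod is gone or terminating: drop its stale metric.
			continue
		}
		active = append(active, m)
	}
	return active
}
```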
Hello, when do you plan to release these changes?
@ermakov-oleg Hi, we're just testing another feature for ScheduledScaling and will release it all together. We aim to make the release next week.
@ermakov-oleg as Mikkel said, we plan to release it soon. However, if you want, you can test the pre-release version with the same image version we are running in our clusters:
registry.opensource.zalan.do/teapot/kube-metrics-adapter:v0.1.10-56-g6f9aba8
Thank you!
I already looked at the diff with the latest version and used v0.1.10-68-g85f6dda; everything works fine on this one.
I am experiencing the same issue with the newest kube-metrics-adapter:v0.1.17-6-g9d8359b version. @jonathanbeber @mikkeloscar can you point me to the merged PRs with possible fixes? It seems like the only way to fix it is to not expose old metrics to the HPA controller.
@wildermesser I think the only thing we have here is this: https://github.com/zalando-incubator/kube-metrics-adapter/pull/309
@mikkeloscar thank you! I saw these arguments before, but I could not figure out their purpose. After reading this issue I was able to set proper values for them, and that resolves the issue for me. I think a short description of all arguments in the README would be useful for people like me :)
Some general guidelines for configuring these options:

- The default `--horizontal-pod-autoscaler-sync-period` is 15s, so without tweaking that value to improve decision time there isn't much point in collecting metrics more often than every 15s.
- The `--metrics-ttl` option MUST be greater than the `--garbage-collector-interval`, unless you rely on other means of keeping metrics alive, such as a more frequent collection interval.
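As a concrete illustration of that rule, here is a hedged sketch of how such duration flags could be wired and validated in Go; the flag names match the ones discussed above, but the defaults are example values, not the project's actual defaults:

```go
package main

import (
	"flag"
	"log"
	"time"
)

func main() {
	// Example defaults only; pick values that fit your collection interval.
	metricsTTL := flag.Duration("metrics-ttl", time.Minute, "how long collected metrics are kept")
	gcInterval := flag.Duration("garbage-collector-interval", 30*time.Second, "how often expired metrics are purged")
	flag.Parse()

	// A metric collected once must survive at least one GC pass, otherwise
	// the HPA may see gaps between collections.
	if *metricsTTL <= *gcInterval {
		log.Fatalf("--metrics-ttl (%s) must be greater than --garbage-collector-interval (%s)", *metricsTTL, *gcInterval)
	}
}
```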
Hello,
I have been testing kube-metrics-adapter (with the pod collector enabled) and noticed that it returns stale data under specific conditions, which results in the HPA misbehaving. The conditions occur when the HPA scales up a deployment due to high load, then the load subsides and it scales the deployment back down to the minimum replica count. However, if I then increase the load again in a way that should result in a scale-up, the HPA does not scale up because the data it is acting upon is stale. When it queries kube-metrics-adapter for custom metrics, the adapter returns old data that contains non-existent pods and old values. This results in incorrect averages for the HPA.
I tracked it down to the in-memory cache that kube-metrics-adapter uses to track pods and metrics. The metrics have a TTL of 15 minutes and the GC runs every 10 minutes. The issue described above disappears after the old metrics are cleaned up by the GC. To fix the issue permanently, I changed the TTL to 30s and the GC to run every minute. The HPA is then able to act on new conditions much faster. This should also cover other edge cases: for example, if the HPA is in the middle of a scale-down and there is new load, it should be able to react to it sooner.
I think this is a simple solution that addresses the issue for now, but there is a better way of dealing with it: the cache could be updated on each run, so the HPA has an up-to-date view of the cluster without a delay of up to 1 minute.
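To illustrate the mechanism described above, here is a minimal sketch of a TTL-plus-GC store; it is not the adapter's actual implementation, and the type and field names are made up. With a 15-minute TTL and a 10-minute GC interval, a value for a deleted pod can linger for many minutes and skew the HPA's averages:

```go
package provider

import (
	"sync"
	"time"
)

type entry struct {
	value     int64
	expiresAt time.Time
}

// MetricStore is a hypothetical in-memory store where each metric carries a
// TTL and a background loop purges expired entries.
type MetricStore struct {
	mu      sync.Mutex
	ttl     time.Duration
	metrics map[string]entry // keyed by "namespace/pod"
}

func NewMetricStore(ttl, gcInterval time.Duration) *MetricStore {
	s := &MetricStore{ttl: ttl, metrics: map[string]entry{}}
	go func() {
		for range time.Tick(gcInterval) {
			s.gc()
		}
	}()
	return s
}

// Insert records a value and stamps it with the configured TTL.
func (s *MetricStore) Insert(key string, value int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.metrics[key] = entry{value: value, expiresAt: time.Now().Add(s.ttl)}
}

// gc removes entries whose TTL has expired; anything younger than the TTL
// is still served, even if the pod it belongs to no longer exists.
func (s *MetricStore) gc() {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now()
	for k, e := range s.metrics {
		if now.After(e.expiresAt) {
			delete(s.metrics, k)
		}
	}
}
```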
Expected Behavior
kube-metrics-adapter doesn't return stale data to HPA that contains information about non-existent pods and their values.
Actual Behavior
kube-metrics-adapter returns stale data to the HPA due to the 15-minute TTL and 10-minute garbage collection interval. This causes the HPA to calculate wrong averages based on custom metrics.
Steps to Reproduce the Problem
Specifications
I have a fix for this, but would like to discuss whether it is appropriate. I propose to make the TTL and GC interval configurable and set the default values to 30s and 1m respectively. A better long-term solution could be to keep the cache up-to-date on each run.