zalando-incubator / kube-metrics-adapter

General purpose metrics adapter for Kubernetes HPA metrics
MIT License

currentMetrics is null for external metrics hpa #724

Closed · johnzheng1975 closed this 5 months ago

johnzheng1975 commented 6 months ago

Expected Behavior

Works well, like another HPA of mine in the same environment:

$ k get hpa -n test -oyaml
apiVersion: v1
items:
- apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    annotations:
      metric-config.external.istio-requests-total.prometheus/prometheus-server: http://prometheus-server.infra.svc
      metric-config.external.istio-requests-total.prometheus/query: |
        sum(
            rate(
                istio_requests_total{
                  destination_workload="podinfo",
                  destination_workload_namespace="test",
                  reporter="destination"
                }[2m]
            )
        ) /
        count(
          count(
            container_memory_usage_bytes{
              namespace="test",
            pod=~"podinfo.*"
            }
          ) by (pod)
        )
    creationTimestamp: "2024-06-06T11:06:40Z"
    name: podinfo
    namespace: test
    resourceVersion: "68922388"
    uid: 440d0a48-1e20-4eb3-b802-7a5df5274481
  spec:
    maxReplicas: 10
    metrics:
    - external:
        metric:
          name: istio-requests-total
          selector:
            matchLabels:
              type: prometheus
        target:
          averageValue: "10"
          type: AverageValue
      type: External
    minReplicas: 1
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: podinfo
  status:
    conditions:
    - lastTransitionTime: "2024-06-06T11:06:48Z"
      message: recommended size matches current size
      reason: ReadyForNewScale
      status: "True"
      type: AbleToScale
    - lastTransitionTime: "2024-06-06T11:29:33Z"
      message: 'the HPA was able to successfully calculate a replica count from external
        metric istio-requests-total(&LabelSelector{MatchLabels:map[string]string{type:
        prometheus,},MatchExpressions:[]LabelSelectorRequirement{},})'
      reason: ValidMetricFound
      status: "True"
      type: ScalingActive
    - lastTransitionTime: "2024-06-06T12:37:07Z"
      message: the desired replica count is less than the minimum replica count
      reason: TooFewReplicas
      status: "True"
      type: ScalingLimited
    currentMetrics:
    - external:
        current:
          averageValue: "0"
        metric:
          name: istio-requests-total
          selector:
            matchLabels:
              type: prometheus
      type: External
    currentReplicas: 1
    desiredReplicas: 1
    lastScaleTime: "2024-06-06T12:29:37Z"
kind: List
metadata:
  resourceVersion: ""

Actual Behavior

currentMetrics is null; the HPA does not work.

$ k get hpa -n zone-dev aiservice  -oyaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/prometheus-server: http://prometheus-server.infra.svc
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/query: |
      avg(
        avg_over_time(
          DCGM_FI_DEV_GPU_UTIL{
            app="nvidia-dcgm-exporter",
            container="service",
            exported_namespace="zone-dev",
            pod=~"aiservice-.*",
            service="nvidia-dcgm-exporter"
          }[1m]
        )
      )
  creationTimestamp: "2024-06-06T12:37:33Z"
  name: aiservice
  namespace: zone-dev
  resourceVersion: "68926327"
  uid: f0e5f9cf-cc9e-4f60-b97f-0ad8a0727cfd
spec:
  maxReplicas: 5
  metrics:
  - external:
      metric:
        name: dcgm-fi-dev-gpu-util
        selector:
          matchLabels:
            type: prometheus
      target:
        averageValue: "50"
        type: AverageValue
    type: External
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aiservice
status:
  conditions:
  - lastTransitionTime: "2024-06-06T12:37:45Z"
    message: the HPA controller was able to get the target's current scale
    reason: SucceededGetScale
    status: "True"
    type: AbleToScale
  - lastTransitionTime: "2024-06-06T12:37:45Z"
    message: scaling is disabled since the replica count of the target is zero
    reason: ScalingDisabled
    status: "False"
    type: ScalingActive
  currentMetrics: null
  desiredReplicas: 0

Steps to Reproduce the Problem

  1. Install kube-metrics-adapter, with cd .\docs; kubectl apply -f .
  2. Create an HPA for podinfo with the external metric "istio-requests-total"; it works (details above).
  3. Create an HPA for aiservice with the external metric "dcgm-fi-dev-gpu-util"; the HPA does not work, currentMetrics is null (detailed configuration above; a way to query the adapter directly is sketched below).
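
One way to see what the adapter is serving (a debugging sketch, not from the original report; it assumes the metric is registered under the standard external.metrics.k8s.io group, and jq is optional):

$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/zone-dev/dcgm-fi-dev-gpu-util?labelSelector=type%3Dprometheus" | jq .

This should return an ExternalMetricValueList containing the value the adapter last collected.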

Here are the logs; it seems the metric was already collected:

$ k logs -n ks kube-metrics-adapter-7fd88d677c-fgr4t  -f | grep cgm-fi-dev-gpu-util

time="2024-06-06T11:37:29Z" level=info msg="Collected new external metric 'zone-dev/dcgm-fi-dev-gpu-util' (0) [type=prometheus]" provider=hpa
time="2024-06-06T11:38:29Z" level=info msg="Collected new external metric 'zone-dev/dcgm-fi-dev-gpu-util' (64) [type=prometheus]" provider=hpa
time="2024-06-06T11:39:29Z" level=info msg="Collected new external metric 'zone-dev/dcgm-fi-dev-gpu-util' (0) [type=prometheus]" provider=hpa
time="2024-06-06T11:40:29Z" level=info msg="Collected new external metric 'zone-dev/dcgm-fi-dev-gpu-util' (0) [type=prometheus]" provider=hpa
time="2024-06-06T11:41:29Z" level=info msg="Collected new external metric 'zone-dev/dcgm-fi-dev-gpu-util' (0) [type=prometheus]" provider=hpa
time="2024-06-06T11:42:29Z" level=info msg="Collected new external metric 'zone-dev/dcgm-fi-dev-gpu-util' (59) [type=prometheus]" provider=hpa
time="2024-06-06T11:43:29Z" level=info msg="Collected new external metric 'zone-dev/dcgm-fi-dev-gpu-util' (58) [type=prometheus]" provider=hpa

Here is Prometheus; you can see that the metric is present.

[Prometheus screenshots showing the DCGM_FI_DEV_GPU_UTIL query results]

Specifications

johnzheng1975 commented 6 months ago

Please advise how to make it work, thanks.

johnzheng1975 commented 6 months ago

@mikkeloscar could you take a look? Is this a defect? Thanks.

johnzheng1975 commented 6 months ago

kubectl scale deployment aiservice --replicas=1 -n zone-dev works, but it triggers one more ReplicaSet. Not sure this is the correct way to do it.

johnzheng1975 commented 5 months ago

The thing is: we are using Argo Rollouts, so the Deployment's replicas will be 0. The real replica count is 1 (and will change with the HPA). So I think the HPA should not show the metric as null; it should show it as long as it can be queried from Prometheus. Is this a defect of kube-metrics-adapter or of the HPA? Thanks.

Or can you provide me with some workaround? Thanks.
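
For context, with Argo Rollouts the Deployment is often only a pod-template source: a Rollout references it via workloadRef while the Rollout owns the real replicas. A minimal sketch of that setup, assuming the workloadRef pattern (names are taken from this thread for illustration):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: aiservice
  namespace: zone-dev
spec:
  replicas: 1            # the Rollout, not the Deployment, owns the real replicas
  workloadRef:           # borrow the pod template from the Deployment,
    apiVersion: apps/v1  # which is kept at 0 replicas, as in this thread
    kind: Deployment
    name: aiservice
  strategy:
    canary: {}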

johnzheng1975 commented 5 months ago

Here is the k8s code for this: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/horizontal.go#L821 (when the scale target's replica count is zero and minReplicas is not zero, the controller disables scaling and never evaluates the metrics, which is why currentMetrics stays null).

Note that CPU/memory metrics will not raise this issue. The workaround is: add a combination of metrics; then desired replicas and current replicas will be 1, and currentMetrics will not be null.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aiservice
  namespace: zone-dev
  annotations:
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/prometheus-server: http://prometheus-server.infra.svc/
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/query: |
     avg(
       avg_over_time(
         DCGM_FI_DEV_GPU_UTIL{
           app="nvidia-dcgm-exporter",
           container="service",
           exported_namespace="zone-dev",
           pod=~"aiservice-.*",
           service="nvidia-dcgm-exporter"
         }[1m]
       )
     )
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: aiservice
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: External
    external:
      metric:
        name: dcgm-fi-dev-gpu-util
        selector:
          matchLabels:
            type: prometheus
      target:
        type: AverageValue
        averageValue: "50"
  - resource:
      name: memory
      target:
        averageUtilization: 95
        type: Utilization
    type: Resource

johnzheng1975 commented 5 months ago

I think kube-metrics-adapter needs an improvement for this issue. FYI.

szuecs commented 5 months ago

Can you show the table view of the query?

        DCGM_FI_DEV_GPU_UTIL{
            app="nvidia-dcgm-exporter",
            container="service",
            exported_namespace="zone-dev",
            pod=~"aiservice-.*",
            service="nvidia-dcgm-exporter"
        }
I checked your pictures, and to me it looks like the labels are not matching.

Unrelated to the issue: one other small thing you likely want to change is memory averageUtilization: 95. With the HPA's default 10% tolerance it scales out only at around 105% of the requested memory, which is likely already OOM.

johnzheng1975 commented 5 months ago

Thanks for your answer, @szuecs. Here is the result:

DCGM_FI_DEV_GPU_UTIL{DCGM_FI_DRIVER_VERSION="535.161.08", Hostname="ip-10-200-181-23.us-west-2.compute.internal", UUID="GPU-e1a61ba4-0fff-2b29-744f-110f9ca929cf", app="nvidia-dcgm-exporter", container="service", device="nvidia0", exported_namespace="zone-dev", gpu="0", instance="10.200.164.17:9400", job="kubernetes-service-endpoints", modelName="Tesla T4", namespace="infra", node="ip-10-200-181-23.us-west-2.compute.internal", pod="aiservice-84f444c7df-pw2jk", service="nvidia-dcgm-exporter"}

Value: 65

johnzheng1975 commented 5 months ago

Thanks for your reminder.

Unrelated to the issue: one other small thing you likely want to change is memory averageUtilization: 95. With the HPA's default 10% tolerance it scales out only at around 105% of the requested memory, which is likely already OOM.

Since averageUtilization: 95 is based on the memory request, it will not OOM if the memory limit is higher than the memory request, am I right? Thanks.
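
(A worked example, not from the thread, assuming the default 10% HPA tolerance: with a request of 1000Mi and a limit of 1500Mi, a 95% target aims at about 950Mi average usage and triggers a scale-out near 1045Mi. That is above the request but still below the limit, so pods are only OOM-killed if actual usage exceeds the 1500Mi limit.)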

szuecs commented 5 months ago

OK, thanks, the data looks good. Now I wonder a bit whether I understand the following correctly:

The thing is: we are using Argo Rollouts, so the Deployment's replicas will be 0. The real replica count is 1 (and will change with the HPA).

So I think the HPA should not show the metric as null; it should show it as long as it can be queried from Prometheus.

Is this a defect of kube-metrics-adapter or of the HPA? Thanks.

Or can you provide me with some workaround? Thanks.

So if replicas are more than 0, everything works: the Prometheus query returns data and kube-metrics-adapter provides the data for the HPA, right? However, the Argo rollout will set the replicas to 0 and then it breaks, right? And your expectation is that we would provide the last non-zero data. Do I understand this correctly?

johnzheng1975 commented 5 months ago

Thanks. @szuecs

So if replicas are more than 0, everything works: the Prometheus query returns data and kube-metrics-adapter provides the data for the HPA, right? Answer: Yes, if the Deployment's replicas > 0, everything is fine.

However, the Argo rollout will set the replicas to 0 and then it breaks, right? Answer: Because of Argo Rollouts, we have to set the Deployment's replicas to 0.

And your expectation is that we would provide the last non-zero data. Do I understand this correctly? Answer: I expect:

  1. No error message (message: the HPA controller was able to get the target's current scale)

  2. currentMetrics is not null

  3. It works like the other case in the same environment above: istio-requests-total

  4. Or it works like the combination metrics in https://github.com/zalando-incubator/kube-metrics-adapter/issues/724#issuecomment-2154390999

Note that for the same deploy whose replicas is 0,

mikkeloscar commented 5 months ago

@johnzheng1975 What is the output if you describe the HPA?

kubectl --namespace zone-dev describe hpa aiservice

The events at the bottom are the most interesting part of that output.

johnzheng1975 commented 5 months ago

@mikkeloscar, please see the image above.

szuecs commented 5 months ago

From what I understand, the Istio query will return no data if you scale down to zero. CPU and memory are not Prometheus queries but Kubernetes-internal metrics-server lookups that could return data from a cache. I wonder a bit why, at 0 replicas, it returns non-zero CPU/memory, but that seems to be a side effect that makes ArgoCD work.

From my side it sounds like a bug in ArgoCD, to be honest. I personally would not like this controller to cache data, and null/nil seems to be the right value for a query with no data.

mikkeloscar commented 5 months ago

@johnzheng1975 I wanted to see the events; you shared the get hpa output, but I want to see describe hpa.

johnzheng1975 commented 5 months ago

@mikkeloscar @szuecs I found the root cause now. This is not a defect of kube-metrics-adapter; it is caused by an incorrect configuration. Sorry for the confusion I caused.

The wrong configuration: scaleTargetRef points to the Deployment, whose replicas is 0. That causes the issue above, "scaling is disabled since the replica count of the target is zero":

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/prometheus-server: http://prometheus-server.infra.svc
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/query: |
      avg(
        avg_over_time(
          DCGM_FI_DEV_GPU_UTIL{
            app="nvidia-dcgm-exporter",
            container="service",
            exported_namespace="zone-dev",
            pod=~"aiservice-.*",
            service="nvidia-dcgm-exporter"
          }[1m]
        )
      )
  creationTimestamp: "2024-06-06T12:37:33Z"
  name: aiservice
  namespace: zone-dev
  resourceVersion: "68926327"
  uid: f0e5f9cf-cc9e-4f60-b97f-0ad8a0727cfd
spec:
  maxReplicas: 5
  metrics:
  - external:
      metric:
        name: dcgm-fi-dev-gpu-util
        selector:
          matchLabels:
            type: prometheus
      target:
        averageValue: "50"
        type: AverageValue
    type: External
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aiservice

The right configuration changes scaleTargetRef from Deployment to Rollout, whose replicas is 1. It works perfectly.

Complete correct configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/prometheus-server: http://prometheus-server.infra.svc
    metric-config.external.dcgm-fi-dev-gpu-util.prometheus/query: |
      avg(
        avg_over_time(
          DCGM_FI_DEV_GPU_UTIL{
            app="nvidia-dcgm-exporter",
            container="service",
            exported_namespace="zone-dev",
            pod=~"aiservice-.*",
            service="nvidia-dcgm-exporter"
          }[1m]
        )
      )
  creationTimestamp: "2024-06-06T12:37:33Z"
  name: aiservice
  namespace: zone-dev
  resourceVersion: "68926327"
  uid: f0e5f9cf-cc9e-4f60-b97f-0ad8a0727cfd
spec:
  maxReplicas: 5
  metrics:
  - external:
      metric:
        name: dcgm-fi-dev-gpu-util
        selector:
          matchLabels:
            type: prometheus
      target:
        averageValue: "50"
        type: AverageValue
    type: External
  minReplicas: 1
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: aiservice
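
A quick way to verify the fix (a sketch, not from the original thread; the reported value depends on the current GPU load):

$ kubectl --namespace zone-dev get hpa aiservice
$ kubectl --namespace zone-dev describe hpa aiservice

The TARGETS column should now show the collected metric instead of <unknown>, and status.currentMetrics should be populated.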

johnzheng1975 commented 5 months ago

Let me close this ticket. Thanks for your excellent support, @szuecs @mikkeloscar.