webdevops / azure-metrics-exporter

Azure Monitor metrics exporter for Prometheus with dimension support, template engine and ServiceDiscovery
MIT License

Metric gaps #47

Open AM-Dani opened 1 year ago

AM-Dani commented 1 year ago

Hello Team,

We are trying to extract metrics from various resources (Service Bus, storage accounts, and VMs), and we are seeing gaps several times a day in all of them. For some of these gaps we can clearly see that the problem is on Azure's side (QueryThrottledException), but for others we don't see anything in the log entries or in the exporter's own metrics. This is an example from today, for the Service Bus:

[screenshot: gaps in the Service Bus metrics]

azurerm_api_request_count (rate): [screenshot]

azurerm_api_request_bucket (30s): [screenshot]

With the following configuration:

```yaml
endpoints:
  - interval: 1m
    path: /probe/metrics/resourcegraph
    port: metrics
    scrapeTimeout: 55s
    params:
      name:
        - 'azure-metric'
      template:
        - '{name}_{metric}_{aggregation}_{unit}'
      subscription:
        - '***************'
      resourceType:
        - 'Microsoft.ServiceBus/Namespaces'
      metric:
        - 'ActiveMessages'
        - 'DeadletteredMessages'
        - 'ScheduledMessages'
        - 'IncomingMessages'
        - 'OutgoingMessages'
      interval:
        - 'PT1M'
      timespan:
        - 'PT1M'
      aggregation:
        - 'average'
        - 'total'
      metricFilter:
        - EntityName eq '*'
      metricTop:
        - '500'
```
We find no failures in the exporter logs, and we see the same problem when using '/probe/metrics/list' for other resources.

Can you please help me with this?

mblaschke commented 1 year ago

Was the exporter restarted during that time?

AM-Dani commented 1 year ago

No, it is very stable. The last restart was to test version 22.12.0-beta0; we wanted to check whether that version would fix the gaps. It didn't, but we kept using it.

jangaraj commented 1 year ago

Do you see metric gaps also in the Azure console?

cdavid commented 1 year ago

I hit something similar in my use of this exporter: sometimes metrics are missing. I believe our Prometheus scrape timeout (20 seconds) might be too short in cases where service discovery is needed.
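
For context, the kind of adjustment I have in mind is simply giving the scrape more headroom. A minimal sketch of a ServiceMonitor endpoint, modeled on the configuration earlier in this thread (values are illustrative, not a recommendation):

```yaml
# Sketch only: a ServiceMonitor endpoint with a longer scrape timeout,
# modeled on the configuration earlier in this thread. Values are
# illustrative; scrapeTimeout must not exceed the scrape interval.
endpoints:
  - interval: 1m
    path: /probe/metrics/list
    port: metrics
    scrapeTimeout: 55s   # instead of our current 20s
```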

@mblaschke - I was considering contributing some extra logging and/or other ways of understanding what happens under the hood (is service discovery slow, is the metric fetching slow, etc. - maybe restricted to runs with --log.debug?). Before I do anything, do you have any thoughts, guidelines, or ideas regarding this area?

Thanks!

mblaschke commented 1 year ago

@cdavid if the scrape exceeds the timeout duration, you can look up the scrape_duration_seconds metric in Prometheus. If it is at your limit, the scrape took too long.
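
One quick way to keep an eye on this is an alerting rule on that metric. This is only a sketch: the job label is an assumption and needs to match your scrape configuration, and the 50s threshold is derived from the 55s scrapeTimeout in the configuration above.

```yaml
# Sketch: alert when the exporter's scrape duration gets close to the
# scrape timeout (55s in the configuration above).
# The job label is an assumption - adjust it to your scrape config.
groups:
  - name: azure-metrics-exporter
    rules:
      - alert: AzureMetricsScrapeNearTimeout
        expr: scrape_duration_seconds{job="azure-metrics-exporter"} > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Scrape of {{ $labels.instance }} is close to the scrape timeout"
```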

With the latest version you can now switch to subscription-scoped metrics (path /probe/metrics), which requests all metrics for the subscription instead of for each individual resource. This doesn't cover all use cases, but it reduces the number of API calls and is much faster.

So I suggest trying the subscription-scoped metrics first. If that's not enough, you can still increase the concurrency so that more requests are triggered at the same time.
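
A rough sketch of the endpoint configuration from this issue, switched to the subscription-scoped path. Which query parameters this endpoint actually supports is an assumption here, so please verify against the exporter documentation:

```yaml
# Sketch only: the ServiceMonitor endpoint from above, pointed at the
# subscription-scoped path. The supported query parameters are an
# assumption - verify them against the exporter's documentation.
endpoints:
  - interval: 1m
    path: /probe/metrics
    port: metrics
    scrapeTimeout: 55s
    params:
      name:
        - 'azure-metric'
      template:
        - '{name}_{metric}_{aggregation}_{unit}'
      subscription:
        - '***************'
      resourceType:
        - 'Microsoft.ServiceBus/Namespaces'
      metric:
        - 'ActiveMessages'
        - 'DeadletteredMessages'
        - 'IncomingMessages'
        - 'OutgoingMessages'
      interval:
        - 'PT1M'
      timespan:
        - 'PT1M'
      aggregation:
        - 'average'
        - 'total'
```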