webdevops / azure-metrics-exporter

Azure Monitor metrics exporter for Prometheus with dimension support, template engine and ServiceDiscovery
MIT License
124 stars · 25 forks

Querying multi dimensions in a single job #21

Closed · manvitha9347 closed this issue 1 year ago

manvitha9347 commented 2 years ago

Hi, I am working on a config to get metrics with multiple dimensions from a single resource type. My config looks like this:

```yaml
- job_name: azure-metrics-example-dimensions
  scrape_interval: 1m
  scrape_timeout: 1m
  metrics_path: /probe/metrics/list
  params:
    template:
      - 'azuremetricsexplist_{metric}_{aggregation}_{unit}'
    cache:
      - 5s
    subscription:
      - xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    resourceType:
      - Microsoft.DocumentDB/databaseAccounts
    metric:
      - ServerSideLatency
    interval: ["PT1M"]
    timespan: ["PT1M"]
    aggregation:
      - average
    metricFilter:
      - DatabaseName eq 'mydb' and CollectionName eq '*' and OperationType eq '*' and ConnectionMode eq '*'
  static_configs:
    - targets: ["172.17.0.2:8080"]
```

When I use this config, there is a mismatch between the data in Azure and the data from the exporter. Is there a way to specify all dimensions (CollectionName, OperationType, ConnectionMode) of a particular metric (ServerSideLatency) in a single job? Can you help me with this? @mblaschke

manvitha9347 commented 2 years ago

Also, for the same resourceType, I was not able to access Azure at the resourceGroup scope via the /probe/metrics/resource endpoint. Any idea why this happens?

```yaml
- job_name: azure-metrics-example-mongodb
  scrape_interval: 1m
  scrape_timeout: 1m
  metrics_path: /probe/metrics/resource
  params:
    template:
      - 'azuremetricsexp_{metric}_{aggregation}_{unit}'
    cache:
      - 5s
    subscription:
      - xxxxxxxxxxxxxxxxxxxxxxxxxxxx
    target:
      - /subscriptions/xxxxxxxxxxxxxxxxxxxxxxxxxxxx/resourceGroups/oscocosmosxxxxxx
    resourceType:
      - Microsoft.DocumentDB/databaseAccounts
    metric:
      - ServerSideLatency
    interval: ["PT1M"]
    timespan: ["PT1M"]
    aggregation:
      - average
    metricFilter:
      - DatabaseName eq 'osco' and CollectionName eq '*'
  static_configs:
    - targets: ["172.17.0.2:8080"]
```
mblaschke commented 2 years ago

what mismatch do you get? can you give me more details?

(added code blocks to your posts)

manvitha9347 commented 2 years ago

I need to configure multiple dimensions for multiple metrics in a single job. For example, the dimensions for the ServerSideLatency metric are DatabaseName, CollectionName, ConnectionMode and OperationType, while the dimensions for NormalizedRUConsumption are DatabaseName, CollectionName and PartitionKey.

Everything needs to be configured under a single Prometheus job. I tried this:

```yaml
- job_name: azure-metrics-example-dimensions
  scrape_interval: 1m
  scrape_timeout: 1m
  metrics_path: /probe/metrics/list
  params:
    template:
      - 'azuremetricsexplist_{metric}_{aggregation}_{unit}'
    cache:
      - 5s
    subscription:
      - xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    resourceType:
      - Microsoft.DocumentDB/databaseAccounts
    metric:
      - ServerSideLatency
      - NormalizedRUConsumption
    interval: ["PT1M"]
    timespan: ["PT1M"]
    aggregation:
      - average
    metricFilter:
      - DatabaseName eq 'mydb' and CollectionName eq '*' and OperationType eq '*' and ConnectionMode eq '*' and PartitionKey eq '*'
  static_configs:
    - targets: ["172.17.0.2:8080"]
```
This is not working (I get no data). Also, if I configure it for only one metric (as in the previous comment), it shows wrong data (higher values due to duplication). Is there a way to configure this? @mblaschke
mblaschke commented 2 years ago

which version are you using?

You should get a warning message in the console/container logs saying that the metric query wasn't possible.

For ServerSideLatency:

```
Metric: ServerSideLatency does not support requested dimension combination:
databasename,collectionname,operationtype,connectionmode,partitionkey,
supported ones are:
DatabaseName,CollectionName,Region,ConnectionMode,OperationType,PublicAPIType
```

For NormalizedRUConsumption:

```
Metric: NormalizedRUConsumption does not support requested dimension combination:
databasename,collectionname,operationtype,connectionmode,partitionkey,
supported ones are:
CollectionName,DatabaseName,Region,PartitionKeyRangeId,CollectionRid
```
mblaschke commented 2 years ago

As a hint: you can try executing queries with http://azure-metrics-exporter-url/query in your browser if you enable --development.webui (it will always be on with the next update).

manvitha9347 commented 2 years ago

Yes, the combinations were not possible, as the metric does not support the requested dimension combination. To make it work I am using a different job for each metric and each combination (2 metrics, 2 jobs).

If the combination is different, I need to write a new job. For example, for metric ServerSideLatency I need 2 combinations:

1. DatabaseName eq 'osco' and CollectionName eq '*'
2. DatabaseName eq 'osco' and ConnectionMode eq '*'

and for metric NormalizedRUConsumption I need 2 combinations:

1. DatabaseName eq 'osco' and ConnectionMode eq '*'
2. DatabaseName eq 'osco' and PartitionKey eq '*'

For these combinations I need to write 4 different jobs; only then does the data match. Is there a way to configure this in a single job, or do we need to use 4 different jobs? @mblaschke

manvitha9347 commented 2 years ago

The latest docker image is used.

mblaschke commented 2 years ago

you need at least two jobs because the dimensions are different

Suggestion: for metric ServerSideLatency, use

```
DatabaseName eq 'osco' and CollectionName eq '*' and ConnectionMode eq '*'
```

and for metric NormalizedRUConsumption, use

```
DatabaseName eq 'osco' and ConnectionMode eq '*' and PartitionKey eq '*'
```

On the Prometheus side you can combine the samples using sum() and avg() or other functions.

azure-metrics-exporter itself is just a client to Azure Monitor API and doesn't do any additional transformations. It only fetches the metrics and provides them for Prometheus. So you can transform/combine them with PromQL.
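Such a combination could be sketched in PromQL like this. The metric and label names below are assumptions: the actual metric name depends on the `template` parameter and the metric's unit, and the dimension label name depends on how the exporter labels dimensions in your setup.

```promql
# Hypothetical metric/label names; adjust to what your template produces.
# Average the per-collection latency series back into one value per database,
# roughly matching the portal's single-dimension view:
avg by (databaseName) (
  azuremetricsexp_serversidelatency_average_milliseconds{databaseName="osco"}
)
```

Whether sum() or avg() is the right combinator depends on the metric's semantics: counts and request units are usually summed, while latencies are averaged.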

manvitha9347 commented 2 years ago

```
DatabaseName eq 'osco' and CollectionName eq '*' and ConnectionMode eq '*'
```

This is actually giving wrong data.

If I instead use two separate filters,

```
DatabaseName eq 'osco' and CollectionName eq '*'
DatabaseName eq 'osco' and ConnectionMode eq '*'
```

the data is correct when compared to Azure. @mblaschke

mblaschke commented 2 years ago

Are you checking the metrics in Prometheus? And you don't get the combined metrics when you sum() the averages together so it matches the Metrics in Azure?

manvitha9347 commented 2 years ago

Hi @mblaschke, I am using the following Prometheus config.

I also see a lot of gaps in the metrics when viewed in Grafana. I tried decreasing the scrape interval to 1m, but my jobs go down very quickly with "context deadline exceeded"; they work better with a 5m scrape interval, which can't be changed.

How can I resolve this?

mblaschke commented 2 years ago

The following error message is coming from the Azure API, not from the exporter itself. The Azure API is failing here, so you might want to approach your Azure support.

```
Error: autorest/azure: Service returned an error. Status=529 Code="Unknown" Message="Unknown service error" Details=[{"cost":0,"interval":"PT5M","namespace":"Microsoft.DocumentDb/databaseAccounts","resourceregion":"westus","timespan":"2022-06-14T06:57:12Z/2022-06-14T07:02:12Z","value":[{"displayDescription":"Server Side Latency","errorCode":"Throttled","errorMessage":"Query was throttled with reason: ServerBusy. Requested Metric:CosmosDBCustomer|AzureMonitor|ServerSideLatency. Output Dimensions: collectionname,connectionmode,databasename,operationtype. Dimension Filters: . FirstOutputSamplingType: NullableAverage. Start time: 6/14/2022 6:57:12 AM End time: 6/14/2022 7:01:12 AM. Resolution: 00:05:00, Last Value Mode: False.
```

Something in Azure is broken; it's not the exporter. The exporter cannot fix anything if the Azure API is down or not responding (the error message itself is also produced by autorest/azure, which is the azure-sdk-for-go).

For the gaps:

If, for any reason, the Azure API is failing (see the error message), the exporter cannot do anything and gaps will appear, because the Azure API is not responding. Normally this does not happen often, but the Azure API can also be down; for outages check https://status.azure.com/en-us/status

For caching: if caching is enabled in the exporter (e.g. via env var ENABLE_CACHING=1) you should set

```yaml
scrape_interval: 1m
scrape_timeout: 1m
```

and set the cache to the same time as the interval:

```yaml
cache: ["5m"]
```

Then the exporter will be queried every minute but will deliver the same metrics until the cache invalidates (after 5 minutes).
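Put together, a cached job might look like this sketch. The job name, subscription ID, and target address are placeholders; the parameter names follow the configs earlier in this thread.

```yaml
- job_name: azure-metrics-cached
  scrape_interval: 1m       # Prometheus scrapes the exporter every minute...
  scrape_timeout: 1m
  metrics_path: /probe/metrics/list
  params:
    cache: ["5m"]           # ...but the exporter re-queries Azure only every 5 minutes
    subscription: ["00000000-0000-0000-0000-000000000000"]
    resourceType: ["Microsoft.DocumentDB/databaseAccounts"]
    metric: ["ServerSideLatency"]
    interval: ["PT5M"]
    timespan: ["PT5M"]
    aggregation: ["average"]
  static_configs:
    - targets: ["172.17.0.2:8080"]
```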

manvitha9347 commented 2 years ago

Hi @mblaschke, thank you for the consistent responses.

For this dimension filter:

```
DatabaseName eq 'osco' or DatabaseName eq 'orderff' or DatabaseName eq 'auth' and CollectionName eq '*' and ConnectionMode eq '*' and OperationType eq '*'
```

I am seeing data for osco but very little data for orderff or auth; I get only 1-2 scrapes in a 1-hour interval. What could be the reason for this? I can see ample data in Azure.

Also, can you help me understand the difference between these key-value pairs: `interval: ["PT5M"]` and `timespan: ["PT5M"]`? My understanding: the interval is the period of time between the gathering of two metric values, and the timespan is the aggregation span (e.g. if it is 1m, there is one aggregation per minute and that data is sent). Is this understanding correct?

mblaschke commented 2 years ago

If you use dimensions you get the top N results from the API; see https://docs.microsoft.com/en-us/rest/api/monitor/metrics/list (azure-metrics-exporter is just an Azure Monitor Metrics API client).

If you don't specify metricTop (e.g. `metricTop: [10]`), you get the top 10 results from the Azure Monitor API.

For interval and timespan, also see https://docs.microsoft.com/en-us/rest/api/monitor/metrics/list
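So when a dimension has more than 10 values (e.g. many databases), raising that limit might help. A sketch of the relevant probe params, using the metricTop parameter named above (the value 50 is an arbitrary placeholder):

```yaml
params:
  metric: ["ServerSideLatency"]
  metricFilter:
    - DatabaseName eq '*' and CollectionName eq '*'
  # Ask the Azure Monitor API for up to 50 dimension value
  # combinations instead of the default 10:
  metricTop: ["50"]
```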

mblaschke commented 1 year ago

closed due to inactivity