webdevops / azure-resourcemanager-exporter

Prometheus exporter for Azure ResourceManager informations (infos, quotas, limits, usages, public IPs, portscanner)
MIT License

Improve cost metrics (was: Container crashing due to data collection error) #20

Closed jorgesouzattech closed 1 year ago

jorgesouzattech commented 1 year ago

Hello everyone,

We're using your GitHub project "webdevops/azure-resourcemanager-exporter" to collect cost data from Azure Cost Management. We previously used an older version of this project, where the Go binary was not shipped as a Docker container. Now, running the Go exporter as a container, we're receiving the following error, specifically during cost data collection: "RESPONSE 429: 429 Too Many Requests\nERROR CODE: 429". We already increased the interval at which this data is collected to 5 minutes, but the problem persists. The container has stopped crashing, but the "panic" during collection still occurs, and when it happens it creates a gap in our dashboards. We'd like your help with this issue. Can we count on you?

mblaschke commented 1 year ago

Please try Docker tag 22.12.0-beta4 and set the cost scrape time to hours instead of minutes. Azure limits requests to the costs/consumption API with a strict rate limit that is set at tenant level, not at subscription level.

22.12.0-beta4 also implements a caching mechanism to prevent refetching the values when the exporter restarts. It restores the metrics from the cache and sets the next scrape time to the expiry date of the cached metrics to avoid unwanted queries. You can also set COSTS_REQUEST_DELAY to delay every request by e.g. 10s, or maybe 30s for bigger tenants, to relieve pressure on this API.
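The restart behaviour described above can be pictured with a small sketch. This is not the exporter's code (the exporter is written in Go); the function name `next_scrape_time` and its parameters are hypothetical and only illustrate the idea of waiting for the cached metrics to expire instead of querying the API again right after a restart.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def next_scrape_time(now: datetime, cache_expiry: Optional[datetime]) -> datetime:
    """Return when the next cost query should run after a (re)start.

    If cached metrics exist and have not expired yet, wait until they
    expire; otherwise there is nothing to restore, so query right away.
    """
    if cache_expiry is not None and cache_expiry > now:
        return cache_expiry
    return now


# Illustrative usage with made-up timestamps.
now = datetime(2023, 3, 28, 9, 0, tzinfo=timezone.utc)
print(next_scrape_time(now, now + timedelta(minutes=47)))
```

The point of the design is that a crash loop or a redeploy does not translate into extra calls against the tenant-wide rate limit.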

jorgesouzattech commented 1 year ago

We are using exporter version 22.11 and increased the request interval to Azure to +5 min, which solved the problem: the container stays up and we no longer find "panic" errors in our logs.

Thanks for the support.

mblaschke commented 1 year ago

I recommend fetching the metrics every 12 hours and set the COSTS_REQUEST_DELAY to 30s and enable caching.

I will check an Azure StorageAccount solution for cache backend which would make it easier to store data, see #21
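To get a feeling for why the scrape interval matters, here is a rough back-of-the-envelope helper. It assumes one API request per configured query per subscription per scrape, which the log lines in this thread suggest but do not prove; both function names are hypothetical.

```python
def cost_requests_per_day(queries: int, subscriptions: int,
                          scrape_interval_hours: float) -> float:
    """Estimated cost API calls per day (assumption: one call per query and subscription per scrape)."""
    scrapes_per_day = 24 / scrape_interval_hours
    return queries * subscriptions * scrapes_per_day


def scrape_duration_seconds(queries: int, subscriptions: int,
                            request_delay_seconds: float) -> float:
    """Minimum time one scrape takes when every request is delayed by a fixed amount (API latency ignored)."""
    return queries * subscriptions * request_delay_seconds


# Eight queries on one subscription: hourly vs. every 12 hours.
print(cost_requests_per_day(8, 1, 1))    # 192.0
print(cost_requests_per_day(8, 1, 12))   # 16.0
print(scrape_duration_seconds(8, 1, 30)) # 240
```

Under these assumptions, moving from hourly to 12-hourly scraping cuts the daily request count by a factor of twelve, while a 30s request delay only spreads the same requests over a longer window.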

aloysioc commented 1 year ago

Hi Markus. The customer uses specific tags to distinguish environments in its structure (e.g. "Prod" and "Dev"). I'm trying to get this information through the azure.resource.tag parameter, but it's not working. How should I configure this parameter? The tag name is "environment" and the values are "Prod", "Dev" and "Backup". I'm setting this parameter as '--azure.resource.tag="environment"' in the run command. Is that correct?

Thanks in advance.

aloysioc commented 1 year ago

I recommend fetching the metrics every 12 hours and set the COSTS_REQUEST_DELAY to 30s and enable caching.

I will check an Azure StorageAccount solution for cache backend which would make it easier to store data, see #21

How do we enable the cache? After four days of working well, the problem reappeared. :( These settings do not cause gaps in the Grafana dashboards, correct?

mblaschke commented 1 year ago

as mentioned above caching is implemented starting container tag :2022.12.0-beta4

I've just implemented a way to write/restore cache data to Azure StorageAccounts starting with :2022.12.0-beta6, see #21

If you specify a file path, cache data will be stored on local disk, so you have to make sure this storage is still available after a restart (in Kubernetes you have to attach a Persistent Volume).
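As an illustration of what a file-based cache with an expiry could look like, here is a minimal sketch. The JSON layout, the field names `expires_at` and `metrics`, and the functions `save_cache` / `load_cache` are assumptions made for this example; the exporter's real cache format is not described in this thread.

```python
import json
import time
from pathlib import Path
from typing import Optional


def save_cache(path: Path, metrics: dict, ttl_seconds: float,
               now: Optional[float] = None) -> None:
    """Write metrics plus an absolute expiry timestamp to a JSON file."""
    now = time.time() if now is None else now
    payload = {"expires_at": now + ttl_seconds, "metrics": metrics}
    path.write_text(json.dumps(payload))


def load_cache(path: Path, now: Optional[float] = None) -> Optional[dict]:
    """Return cached metrics, or None if the file is missing or expired."""
    now = time.time() if now is None else now
    if not path.exists():
        return None
    payload = json.loads(path.read_text())
    if payload["expires_at"] <= now:
        return None
    return payload["metrics"]
```

If the path does not survive a restart (for example a container filesystem without a volume), `load_cache` simply returns `None` and the first scrape hits the API again, which is why the persistent volume matters.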

aloysioc commented 1 year ago

Hi Markus. Is it possible to extract metrics from Azure through the exporter such as meter, meter category and meter subcategory? For example: We need to show which is the most expensive artifact in the month.

Thanks in advance

mblaschke commented 1 year ago

can you show me a screenshot of an example in Azure portal?

aloysioc commented 1 year ago

image image image

mblaschke commented 1 year ago

have to check how to get the values, thanks.

aloysioc commented 1 year ago

Ok. I'll appreciate that.

Kind regards.

aloysioc commented 1 year ago

Hi Markus. Is there any news about this issue?

Kind regards,

mblaschke commented 1 year ago

still working on it and thinking about how to integrate more complex queries

mblaschke commented 1 year ago

you could try COSTS_DIMENSION=Meter to get the Meter values by resourcegroup

aloysioc commented 1 year ago

I didn't find this "metric".

aloysioc commented 1 year ago

Hi Markus.

When I put only one metric in the "cost.dimension" parameter (eg --cost.dimension='Meter'), it worked fine. But when I tried to put another metric with space as delimiter, I got the following error:

"Invalid query definition: Invalid dataset grouping: 'Meter MeterCategory ResourceId ConsumedService'; valid values: 'ResourceGroup','ResourceGroupName','ResourceLocation','ConsumedService','ResourceType','ResourceId','MeterId','BillingMonth','MeterCategory','MeterSubcategory','Meter','AccountName','DepartmentName','SubscriptionId','SubscriptionName','ServiceName','ServiceTier','EnrollmentAccountName','BillingAccountId','ResourceGuid','BillingPeriod','InvoiceNumber','ChargeType','PublisherType','ReservationId','ReservationName','Frequency','PartNumber','CostAllocationRuleName','MarkupRuleName','PricingModel','BenefitId','BenefitName',''.

Can you help us?

Thanks in advance.

aloysioc commented 1 year ago

We achieved this using several "--cost.dimension" parameters, one for each desired metric. Command line got pretty big, but no problem.

Thank you very much.

mblaschke commented 1 year ago

you can use space separation when using environment variables

for command line you have to specify them multiple times
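For anyone generating the command line from a script, the difference between the two forms can be bridged with a tiny helper. The function name `dimension_flags` is hypothetical; the flag `--cost.dimension` and the space-separated environment value are the ones mentioned in this thread.

```python
from typing import List


def dimension_flags(env_value: str, flag: str = "--cost.dimension") -> List[str]:
    """Turn a space-separated env-style value into one CLI flag per dimension."""
    return [f"{flag}={name}" for name in env_value.split()]


print(dimension_flags("Meter MeterCategory ResourceId ConsumedService"))
```

Passing the whole space-separated string to a single flag is what produced the "Invalid dataset grouping" error quoted earlier, since the API then sees one dimension with spaces in its name.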

mblaschke commented 1 year ago

Next version (available with 23.0.0-beta2) will have a different approach for cost reporting by offering "query" support:

queries can be defined by using env vars:

COSTS_QUERY_by_resourcegroup="ResourceGroupName"
COSTS_QUERY_by_meter_and_resourcegroup="Meter,ResourceGroupName"

metric name will be azurerm_costs_{queryName}:

COSTS_QUERY_by_resourcegroup --> azurerm_costs_by_resourcegroup
COSTS_QUERY_by_meter_and_resourcegroup --> azurerm_costs_by_meter_and_resourcegroup
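
The naming rule `azurerm_costs_{queryName}` can be sketched as follows. This is an illustration only, not the exporter's parser: the function `parse_cost_queries` is hypothetical, and details such as case handling of the query name are not covered by the description above, so the name is kept as written.

```python
from typing import Dict, List, Mapping

QUERY_PREFIX = "COSTS_QUERY_"


def parse_cost_queries(env: Mapping[str, str]) -> Dict[str, List[str]]:
    """Map metric names to grouping dimensions based on COSTS_QUERY_* variables."""
    queries: Dict[str, List[str]] = {}
    for key, value in env.items():
        if not key.startswith(QUERY_PREFIX):
            continue  # unrelated setting, e.g. COSTS_REQUEST_DELAY
        query_name = key[len(QUERY_PREFIX):]
        dimensions = [d.strip() for d in value.split(",") if d.strip()]
        queries[f"azurerm_costs_{query_name}"] = dimensions
    return queries


example_env = {
    "COSTS_QUERY_by_resourcegroup": "ResourceGroupName",
    "COSTS_QUERY_by_meter_and_resourcegroup": "Meter,ResourceGroupName",
    "COSTS_REQUEST_DELAY": "30s",
}
print(parse_cost_queries(example_env))
```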

aloysioc commented 1 year ago

Hi Markus.

Thank you very much for your dedication to our requests.

We still have a problem with the tags. When using the "COSTS_QUERY_by" metrics, we were unable to get the tag metrics. Only the "owner" tag is displayed in queries. We put these settings in a "vars.env" configuration file and run the exporter with the following command:

"sudo docker run -d --name go1 --env-file vars.env -v go-volume:/go -p 8081:8080 webdevops/azure-resourcemanager-exporter:23.0.0-beta2 --azure.tenant="xxxx-xxxx-xxxx-xxxx" --log.debug --scrape.time=0 --scrape.time.costs=300s --scrape.time.resource=300s --azure.resource.tag=AMBIENTE"

Our last need is to get the cumulative values per tag, and all our problems are solved. :)

Thanks in advance.

aloysioc commented 1 year ago

Hi Markus.

We still have a problem with the tags. We're using the following command to start the Go container:

"sudo docker run -d --name go1 --env-file vars.env -v go-volume:/go -p 8081:8080 webdevops/azure-resourcemanager-exporter:23.0.0-beta2 --azure.tenant="xxxx-xxxx-xxxx-xxxx" --log.debug --scrape.time=0 --scrape.time.costs=300s --scrape.time.resource=300s --azure.resource.tag=AMBIENTE"

But we still don't see the tags on the resources.

Our last need is to get the cumulative values per tag, and all our problems are solved. :)

Thanks in advance.

aloysioc commented 1 year ago

Hi Markus. Please help us with the tag issue.

mblaschke commented 1 year ago

please post the env var which you're using for the cost query (starting with COSTS_QUERY_)

aloysioc commented 1 year ago

These vars are in a variable file (vars.env).

AZURE_TENANT_ID=xxxx-xxxx-xxxx-xxxx
AZURE_CLIENT_ID=xxxx-xxxx-xxxx-xxxx
AZURE_CLIENT_SECRET=xxxx-xxxx-xxxx-xxxx
AZURE_SUBSCRIPTION_ID=xxxx-xxxx-xxxx-xxxx
AZURE_RESOURCE_TAG=AMBIENTE DEPARTAMENTO
COSTS_QUERY_by_MT_RSID_RG=Meter,ResourceId,ResourceGroupName
COSTS_QUERY_by_RSID_RG=ResourceId,ResourceGroup
COSTS_QUERY_by_MTC_MTSC_RSID_RG=MeterCategory,MeterSubcategory,ResourceId,ResourceGroup

mblaschke commented 1 year ago

To summarize: you're running a costs query which e.g. uses ResourceGroupName/Resource as a dimension, and you expect the exporter to also add the selected tags from the ResourceGroup (AZURE_RESOURCEGROUP_TAG) and/or Resource (AZURE_RESOURCE_TAG) config?

aloysioc commented 1 year ago

Actually, I'd like to calculate the costs of resources that have a specific tag. For example: what is the cost of the PROD environment resources?

We tried to relate data from azure-resourcemanager-exporter and azure-resourcegraph-exporter (which includes tags in its output), using ResourceId as the key, but this field is spelled differently in the two exporters ("ResourceId" in azure-resourcemanager-exporter and "resourceID" in azure-resourcegraph-exporter), which causes confusion when we join them.
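One way to work around the differing spelling outside of the exporters is to normalise the key before joining. The sketch below is an assumption-laden illustration: the label name `tag_environment`, the sample resource IDs and both function names are made up, and it treats resource IDs as case-insensitive, which matches how Azure generally handles them but should be verified for your data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Labels = Dict[str, str]


def resource_key(labels: Labels) -> str:
    """Find the resource ID label regardless of key casing; lower-case its value."""
    for key, value in labels.items():
        if key.lower() == "resourceid":
            return value.lower()
    raise KeyError("no resource ID label found")


def cost_by_tag(costs: Iterable[Tuple[Labels, float]],
                resources: Iterable[Labels],
                tag_label: str) -> Dict[str, float]:
    """Sum cost samples per tag value, joining both sources on the resource ID."""
    tag_by_id = {resource_key(r): r.get(tag_label, "") for r in resources}
    totals: Dict[str, float] = defaultdict(float)
    for labels, value in costs:
        # Resources without a matching tag end up under the empty string.
        totals[tag_by_id.get(resource_key(labels), "")] += value
    return dict(totals)


costs = [({"ResourceId": "/subscriptions/X/rg/a"}, 10.0),
         ({"ResourceId": "/subscriptions/x/rg/b"}, 5.0)]
resources = [{"resourceID": "/subscriptions/x/rg/A", "tag_environment": "Prod"},
             {"resourceID": "/subscriptions/x/rg/b", "tag_environment": "Dev"}]
print(cost_by_tag(costs, resources, "tag_environment"))
```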

mblaschke commented 1 year ago

with :23.2.0-beta1 you can eg try COSTS_QUERY_by_owner=tag:owner which queries costs by tag owner from the Azure costs API.
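A rough sketch of how a `tag:` prefix could be translated into a grouping for the Cost Management query request. The `type`/`name` shape and the values `Dimension` and `TagKey` are assumptions based on the public Azure REST API documentation and are not confirmed by this thread; `build_grouping` is a hypothetical helper, not the exporter's function.

```python
from typing import Dict, List

TAG_PREFIX = "tag:"


def build_grouping(dimensions: List[str]) -> List[Dict[str, str]]:
    """Build a grouping list, treating 'tag:<name>' entries as tag keys."""
    grouping = []
    for dim in dimensions:
        if dim.startswith(TAG_PREFIX):
            # Assumed column type for tags; check the API version you target.
            grouping.append({"type": "TagKey", "name": dim[len(TAG_PREFIX):]})
        else:
            grouping.append({"type": "Dimension", "name": dim})
    return grouping


print(build_grouping(["ResourceGroupName", "tag:owner"]))
```

If a tag name is sent with the dimension type instead, the API has no way to know it is a tag and will compare it against its list of built-in dimensions.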

aloysioc commented 1 year ago

COSTS_QUERY_PRD_owner=tag:owner

I'm receiving the following error:

{"collector":"Costs","file":"/go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:205","func":"main.(*MetricsCollectorAzureRmCosts).collectSubscription","level":"info","msg":"fetching cost report for query prd_owner","subscriptionID":"xxxx-xxxx-xxxx-xxxx","subscriptionName":"Acesso ao Azure Active Directory(Converted to EA)"}
{"collector":"Costs","costreport":"ActualCost","file":"/go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:351","func":"main.(*MetricsCollectorAzureRmCosts).collectCostManagementMetrics","level":"panic","msg":"POST https://management.azure.com/subscriptions/xxxx-xxxx-xxxx-xxxx/providers/Microsoft.CostManagement/query\n--------------------------------------------------------------------------------\nRESPONSE 400: 400 Bad Request\nERROR CODE: BadRequest\n--------------------------------------------------------------------------------\n{\n \"error\": {\n \"code\": \"BadRequest\",\n \"message\": \"Invalid query definition: Invalid dataset grouping: 'owner'; valid values: 'ResourceGroup','ResourceGroupName','ResourceLocation','ConsumedService','ResourceType','ResourceId','MeterId','BillingMonth','MeterCategory','MeterSubcategory','Meter','AccountName','DepartmentName','SubscriptionId','SubscriptionName','ServiceName','ServiceTier','EnrollmentAccountName','BillingAccountId','ResourceGuid','BillingPeriod','InvoiceNumber','ChargeType','PublisherType','ReservationId','ReservationName','Frequency','PartNumber','CostAllocationRuleName','MarkupRuleName','PricingModel','BenefitId','BenefitName',''.\r\n\r\n (Request ID: e924ace7-6fdc-43e1-8368-49cdebcd26ae)\"\n }\n}\n--------------------------------------------------------------------------------\n","subscriptionID":"10d45ba9-1ad0-40f9-a0d3-e495ad07613a","subscriptionName":"Acesso ao Azure Active Directory(Converted to EA)"}
{"collector":"Costs","file":"/go/pkg/mod/github.com/webdevops/go-common@v0.0.0-20230211175655-c2cb48b1ad72/prometheus/collector/collector.go:193","func":"collector.(*Collector).collectRun.func1.1","level":"error","msg":"panic occurred (panic threshold 1 of 5): POST https://management.azure.com/subscriptions/xxxx-xxxx-xxxx-xxxx/providers/Microsoft.CostManagement/query\n--------------------------------------------------------------------------------\nRESPONSE 400: 400 Bad Request\nERROR CODE: BadRequest\n--------------------------------------------------------------------------------\n{\n \"error\": {\n \"code\": \"BadRequest\",\n \"message\": \"Invalid query definition: Invalid dataset grouping: 'owner'; valid values: 'ResourceGroup','ResourceGroupName','ResourceLocation','ConsumedService','ResourceType','ResourceId','MeterId','BillingMonth','MeterCategory','MeterSubcategory','Meter','AccountName','DepartmentName','SubscriptionId','SubscriptionName','ServiceName','ServiceTier','EnrollmentAccountName','BillingAccountId','ResourceGuid','BillingPeriod','InvoiceNumber','ChargeType','PublisherType','ReservationId','ReservationName','Frequency','PartNumber','CostAllocationRuleName','MarkupRuleName','PricingModel','BenefitId','BenefitName',''.\r\n\r\n (Request ID: e924ace7-6fdc-43e1-8368-49cdebcd26ae)\"\n }\n}\n--------------------------------------------------------------------------------\n"}

mblaschke commented 1 year ago

already working on a fix, the constant from the azure-sdk is not correct and the columns are not matching. will be released in some hours

mblaschke commented 1 year ago

please try :23.2.0-beta2

aloysioc commented 1 year ago

Hi, Markus.

Thank you very much for helping us. Just a detail: We have values in the tags that are written with an underscore "_", and the query ignored those records. Would it be possible to consider these cases as well?

Thanks in advance.

mblaschke commented 1 year ago

Do you have an example of the tags used? Feel free to post the JSON value of the tags from the resource's ARM definition. Are you talking about tag values or the tag name?

aloysioc commented 1 year ago

image

image

mblaschke commented 1 year ago

can you see these tags in the azure cost analysis when you select the group by tags? the exporter is not ignoring these tags and it will take some time to generate a cost reporting with a similar tag setup.

aloysioc commented 1 year ago

You are right. These tags do not appear in the portal either.

mblaschke commented 1 year ago

are these tags on resources or on resourcegroups?

what i could do: adding the resourcegroups tags to cost reports when using ResourceGroupName as grouping.

aloysioc commented 1 year ago

Yes, we have asked the customer to apply this method, but we are not sure whether it will be applied.

mblaschke commented 1 year ago

how are the tags currently applied? on what scope?

aloysioc commented 1 year ago

We've determined five mandatory tags and these are being applied to resources.

aloysioc commented 1 year ago

Hi, Markus.

We are very pleased with your help. What you did worked fine.

But, there's just one small detail I'd like to report to you: currently, we're only able to fetch one tag per metric. Would you be able to bring multiple tags in just one metric line?

For example: I have tags 1 and 2. If I want to fetch these tags, I need to create two metric calls:

COSTS_QUERY_by_MT_MTC_RSID_RG_1=Meter,MeterCategory,ResourceId,ResourceGroup,tag:1
COSTS_QUERY_by_MT_MTC_RSID_RG_2=Meter,MeterCategory,ResourceId,ResourceGroup,tag:2

Would it not be possible to use a single one?

COSTS_QUERY_by_MT_MTC_RSID_RG_TAGS=Meter,MeterCategory,ResourceId,ResourceGroup,tag:1,tag:2
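Until multiple tags per query are possible, the one-query-per-tag workaround can at least be generated instead of typed by hand. The helper `per_tag_queries` below is hypothetical and simply produces env-file entries in the `COSTS_QUERY_*` form used in this thread. Keep in mind that every generated query means additional requests against the rate-limited API.

```python
from typing import Dict, Iterable, List


def per_tag_queries(name: str, dimensions: List[str],
                    tags: Iterable[str]) -> Dict[str, str]:
    """Create one COSTS_QUERY_* entry per tag, sharing the same base dimensions."""
    return {
        f"COSTS_QUERY_{name}_{tag}": ",".join(dimensions + [f"tag:{tag}"])
        for tag in tags
    }


base = ["Meter", "MeterCategory", "ResourceId", "ResourceGroup"]
for key, value in per_tag_queries("by_MT_MTC_RSID_RG", base, ["1", "2"]).items():
    print(f"{key}={value}")
```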

aloysioc commented 1 year ago

Hi Markus. Do you have news for us about the tags extraction?

Thanks in advance.

mblaschke commented 1 year ago

@aloysioc the problem here is that this query is sent to the Azure REST API: https://learn.microsoft.com/en-us/rest/api/cost-management/query/usage?tabs=HTTP

So this query is not executed by the exporter but by the backend systems in Azure. You would need to raise this with Azure support.

But nevertheless I've added AZURE_RESOURCEGROUP_TAG for dimension ResourceGroup and AZURE_RESOURCE_TAG for dimension ResourceId. But that's all I can do here.

theok-nice commented 1 year ago

@mblaschke What is the latest working image tag?

I was using tag azure-resourcemanager-exporter:23.0.0-beta2 and this config:

        - name: AZURE_RESOURCEGROUP_TAG
          value: creator
        - name: AZURE_RESOURCE_TAG
          value: creator
        - name: COSTS_QUERY_by_billingmonth
          value: BillingMonth
        - name: COSTS_QUERY_by_consumedservice
          value: ConsumedService
        - name: COSTS_QUERY_by_resourcegroup
          value: ResourceGroupName
        - name: COSTS_QUERY_by_resourcelocation
          value: ResourceLocation
        - name: COSTS_QUERY_by_resourcetype
          value: ResourceGroupName,ResourceType
        - name: COSTS_QUERY_by_resourcetype_resourceid
          value: ResourceGroupName,ResourceType,ResourceId
        - name: COSTS_QUERY_by_servicename
          value: ServiceName
        - name: COSTS_QUERY_by_servicename_resourceid
          value: ServiceName,ResourceId
        - name: COSTS_REQUEST_DELAY
          value: 30s
        - name: SCRAPE_TIME_COSTS
          value: 1h
        - name: SCRAPE_TIME_DEFENDER
          value: "0"
        - name: SCRAPE_TIME_GRAPH
          value: "0"
        - name: SCRAPE_TIME_IAM
          value: "0"
        - name: SCRAPE_TIME_QUOTA
          value: "0"
        - name: SCRAPE_TIME_RESOURCEHEALTH
          value: "0"
        - name: SCRAPE_TIME_SECURITY
          value: "0"

It was working for a few days, but since yesterday I have been hitting some request (?) limits.

Today I started using image tag 23.2.0-beta2, configured my pod to use cache through PVC and set COSTS_REQUEST_DELAY to 5m, but I am still hitting the same error:

{
  "collector": "Costs",
  "file": "/go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:218",
  "func": "main.(*MetricsCollectorAzureRmCosts).collectSubscription",
  "level": "info",
  "msg": "fetching cost report for query by_resourcegroup",
  "subscriptionID": "*****",
  "subscriptionName": "******"
}
{
  "collector": "Costs",
  "costreport": "ActualCost",
  "file": "/go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:381",
  "func": "main.(*MetricsCollectorAzureRmCosts).collectCostManagementMetrics",
  "level": "panic",
  "msg": "POST https://management.azure.com/subscriptions/******/providers/Microsoft.CostManagement/query\n--------------------------------------------------------------------------------\nRESPONSE 429: 429 Too Many Requests\nERROR CODE: 429\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"429\",\n    \"message\": \"Too many requests. Please retry.\"\n  }\n}\n--------------------------------------------------------------------------------\n",
  "subscriptionID": "***",
  "subscriptionName": "****"
}
{
  "collector": "Costs",
  "file": "/go/pkg/mod/github.com/webdevops/go-common@v0.0.0-20230212164333-176c199fce96/prometheus/collector/collector.go:193",
  "func": "collector.(*Collector).collectRun.func1.1",
  "level": "error",
  "msg": "panic occurred (panic threshold 1 of 5): POST https://management.azure.com/subscriptions/******/providers/Microsoft.CostManagement/query\n--------------------------------------------------------------------------------\nRESPONSE 429: 429 Too Many Requests\nERROR CODE: 429\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"429\",\n    \"message\": \"Too many requests. Please retry.\"\n  }\n}\n--------------------------------------------------------------------------------\n"
}
{
  "collector": "Costs",
  "file": "/go/pkg/mod/github.com/webdevops/go-common@v0.0.0-20230212164333-176c199fce96/prometheus/collector/cache.go:160",
  "func": "collector.(*Collector).collectionSaveCache",
  "level": "info",
  "msg": "saved state to cache: file://data/costs.json (expiring 2023-03-28 10:47:35.939969587 +0000 UTC)"
}
{
  "collector": "Costs",
  "duration": 14.541016868,
  "file": "/go/pkg/mod/github.com/webdevops/go-common@v0.0.0-20230212164333-176c199fce96/prometheus/collector/collector.go:325",
  "func": "collector.(*Collector).collectionFinish",
  "level": "info",
  "msg": "finished metrics collection, next run in 1h0m0s",
  "nextRun": "2023-03-28T10:47:35.944520782Z"
}

None of the 23.3.0-beta images will even start for me.

PS Thank you for your contribution on this project.

aloysioc commented 1 year ago

Hi Theok-nice and Markus,

Thank you for your contribution. For the past two days, error 429 has been occurring with high frequency. Does anyone know whether Microsoft changed the data extraction rules of the Azure portal? This problem began after March 25th. I've tried reducing the number of metrics we're pulling per exporter, but the problem persists.

Thanks in advance,

aloysioc commented 1 year ago

Look at the behaviour of the data extraction after March 25th:

image

mblaschke commented 1 year ago

Costs/consumption rate limits are AzureAD tenant limits. The more requests you make, the higher the chance of hitting the tenant-wide limit.

Set the scrape time of costs to at least 12h or 24h, you might not need hourly cost reporting here if you're in a big AzureAD instance. Also increasing COSTS_REQUEST_DELAY is a good idea for big tenants.

This is an Azure limit; it's not easy for the exporter to tackle the situation without consuming more and more requests and blocking all other applications as well. Every view/request on the cost analysis dashboard in the Azure portal also "consumes" requests.
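For readers writing their own client against a rate-limited API, a common pattern is to honour a `Retry-After` header when the server sends one and to fall back to capped exponential backoff otherwise. The sketch below shows that general pattern only: it is not what the exporter does (the logs in this thread show a panic followed by the next scheduled run), and whether the cost API returns a `Retry-After` header is not confirmed here. The function name and default values are hypothetical.

```python
from typing import Mapping


def retry_delay_seconds(headers: Mapping[str, str], attempt: int,
                        base: float = 30.0, cap: float = 3600.0) -> float:
    """Seconds to wait after a 429 response.

    Prefer the server's Retry-After value (in seconds); otherwise use
    exponential backoff: base * 2**attempt, capped at `cap`.
    Note: a plain dict lookup is case-sensitive, unlike real HTTP headers.
    """
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # value may be an HTTP date; fall back to backoff
    return min(cap, base * (2 ** attempt))


print(retry_delay_seconds({"Retry-After": "120"}, attempt=0))  # 120.0
print(retry_delay_seconds({}, attempt=3))                      # 240.0
```

Retrying too eagerly would consume even more of the tenant-wide budget, so long base delays and a generous cap are the safer choice here.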

aloysioc commented 1 year ago

But I made no changes to my config file. The number of metrics we define is the same. I have now tried reducing these metrics to solve the problem, but without success. :(

mblaschke commented 1 year ago

If you're just below the rate limit, it could mean that you hit it tenant-wide as soon as someone uses the cost analysis module in the Azure portal or uses PowerShell to do some cost analysis. Azure doesn't care whether someone does this in a development subscription, as these limits are tenant-wide.

Requests also have a cost, which can affect how early you hit the rate limit.

mblaschke commented 1 year ago

Maybe as an idea: check the Azure/AzureAD logs to see who is consuming the requests, and if it is the exporter, scrape less often (i.e. increase the scrape time), for example.

theok-nice commented 1 year ago

I am just thinking:

I changed my config to have only one request

  COSTS_REQUEST_DELAY: "10m"
  COSTS_QUERY_by_billingmonth: "BillingMonth"

Then I deleted my PVC and the pod. When the exporter starts, I get 429 instantly (within a few seconds).

If this fails for the simplest query without even waiting the 10mins that COSTS_REQUEST_DELAY sets, I don't think that scrape time is the issue.

To my knowledge, my tests with the exporter are the only process that queries the cost API for the subscription. It doesn't make sense to hit any rate limit.