Improve cost metrics (was: Container crashing due to data collection error)

jorgesouzattech commented 1 year ago

Hello everyone,

We're using your GitHub project "webdevops/azure-resourcemanager-exporter" to collect cost data from Azure Cost Management. We previously used an older version of this project, in which the Go binary was not packaged as a Docker container. Now, running the exporter as a container, we receive the following error, specifically in the cost data collection: "RESPONSE 429: 429 Too Many Requests\nERROR CODE: 429". We have already increased the collection interval for this data to 5 minutes, but the problem persists. The container has stopped crashing, but the "panic" in the collection still occurs, and when it happens it creates a gap in our dashboards. We'd appreciate your help with this issue. Can we count on your help?

theok-nice commented 1 year ago

Is there a way to get more logs from the azure-resourcemanager-exporter pod?

According to this https://learn.microsoft.com/en-us/rest/api/cost-management/query/usage?tabs=HTTP#errorresponse

The error response should have indicated when we can retry.

Error responses:

    429 TooManyRequests - Request is throttled. Retry after waiting for the time specified in the "x-ms-ratelimit-microsoft.consumption-retry-after" header.
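To make the retry hint concrete, here is a minimal sketch of reading that header from a throttled response. It assumes the header value is a plain number of seconds, which the quoted documentation does not spell out; the function name and the 60 second fallback are hypothetical, and this is not how the exporter itself handles it.

```python
RETRY_AFTER_HEADER = "x-ms-ratelimit-microsoft.consumption-retry-after"


def retry_delay_seconds(headers, default=60.0):
    """Return how many seconds to wait before retrying a throttled request."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get(RETRY_AFTER_HEADER)
    if raw is None:
        # Header missing: fall back to the (assumed) default wait.
        return default
    try:
        delay = float(raw)  # assumption: the value is a number of seconds
    except ValueError:
        return default
    return max(delay, 0.0)
```

A caller would sleep for the returned number of seconds after a 429 before sending the cost query again.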

aloysioc commented 1 year ago

I only need to extract cost and resource data. My call is set up as follows:

sudo docker run -d --restart always --name go --network=host --env-file pars.env -v go:/go -p 8080:8080 webdevops/azure-resourcemanager-exporter:23.3.0-beta3 --azure.tenant="xxx" --log.debug --scrape.time=0 --scrape.time.costs=300s --scrape.time.resource=300s

But I'm also getting the 429 immediately. This was not happening until two days ago. :(

aloysioc commented 1 year ago

I ran a new test, and I think the problem is in the cost data collection. I disabled that collection and now it doesn't show any errors.

aloysioc commented 1 year ago

But if I do the opposite and enable only the cost collection, look at the log:

aloysioc commented 1 year ago

The problem is definitely only in the cost data collection. It started occurring on March 25th, even with versions that previously worked.

mblaschke commented 1 year ago

This is because Azure enforces a very strict tenant-wide rate limit which nobody (except Azure) can change. Unlike the Azure REST API rate limit (which is per application/user), this rate limit is global per tenant, and every user/client consumes requests from it. So this rate limit uses the same bucket across all of your subscriptions. It doesn't matter whether you have 5 or 10,000 subscriptions; in the end it's the same bucket. Every user who clicks on "cost analysis" in the Azure Portal reduces the requests available to the exporter.

If you fetch cost data more often than once per hour you will very likely hit the rate limit sooner or later, especially if you run "high cost" queries against the Azure Billing API.

see https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/manage-automation#data-latency-and-rate-limits

So I strongly recommend, especially for bigger Azure tenants, setting the cost metrics scrape time to at least 12h. With >100,000 AzureAD users you might even want to set it to 24h.

The azure-resourcemanager-exporter also publishes Azure rate limits in the metric azurerm_api_ratelimit (unless you disable the rate limit metrics), so you can see whether you are exhausting your rate limit: https://github.com/webdevops/go-common/blob/main/azuresdk/prometheus/tracing/policy.go#L139 and https://github.com/webdevops/go-common/blob/main/azuresdk/prometheus/tracing/policy.go#L145

aloysioc commented 1 year ago

So you are recommending that we change the scrape_time parameter to 12h, is that right? But if we change this parameter, will we still have data at smaller intervals, such as 5 minutes or half an hour? Will this data be cached? I notice that, even after increasing the parameter, I receive the 429 immediately on the first collection and it stops collecting data. :(

mblaschke commented 1 year ago

I'm not sure if Azure is providing you cost information in such small timeframes. You only have to set cost scrape time to 12h, not the rest of the collectors.

As long as Azure is not changing the rate limit the only thing you could do is to spin up one AzureAD tenant per subscription but that would be very expensive and not very handy.

Do you still have COSTS_REQUEST_DELAY set to 5m? This should not happen, so please check the rate limit metrics and check whether something else is also consuming cost requests.

see also the Azure documentation:

12 QPU per 10 seconds
60 QPU per 1 min
600 QPU per 1 hour

So you can easily exceed the hourly limit because the rate limit is very low.
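As a rough illustration of the three windows quoted above, the sketch below checks which of them a steady query schedule would exceed. The QPU cost of a single query is not stated in this thread, so `qpu_per_query` is an assumption, as are the function and variable names; it also only counts the exporter's own traffic, not other users of the same tenant.

```python
import math

# Limits quoted above: (window length in seconds, QPU allowed in that window).
QPU_LIMITS = [(10, 12), (60, 60), (3600, 600)]


def exceeded_windows(interval_seconds, qpu_per_query=1):
    """Return the window lengths whose QPU budget a steady schedule exceeds."""
    exceeded = []
    for window, budget in QPU_LIMITS:
        # Number of evenly spaced queries that fall into one window.
        queries = max(1, math.ceil(window / interval_seconds))
        if queries * qpu_per_query > budget:
            exceeded.append(window)
    return exceeded


# One query every 5 seconds at an assumed 1 QPU each fits the short
# windows but overruns the hourly budget (720 > 600).
print(exceeded_windows(5))  # [3600]
```

Under these assumptions the hourly window is the one that runs out first, which matches the point about the hourly limit being easy to exceed.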

But try :23.3.0-beta4 (the build will take some minutes); I've found a possible issue where the exporter was not waiting after queries (if you're using multiple queries).

aloysioc commented 1 year ago

Version 23.3.x throws an error when I try to disable the security metrics.

aloysioc commented 1 year ago

This is the setup of my config file:

When I try to run the container, I receive the error immediately. :(

mblaschke commented 1 year ago

SCRAPE_TIME_SECURITY was renamed to SCRAPE_TIME_DEFENDER as the security collector was refactored into the defender collector (and cleaned up). There should be a notification about that if you try to use SCRAPE_TIME_SECURITY.

For cost queries: if you only have one subscription in your AzureAD tenant these queries might work, but if you work in a corporate environment you're killing the cost reporting for the whole company with them.

COSTS_QUERY_PRD_MT_MTC_MTSC_RSID_RG queries the Cost Usage API at high cost and returns lots of results because you query by ResourceId (there is currently a limit of 1000 entries in the Azure SDK, but paging would make it worse, I guess).

You cannot scrape these metrics every 15 minutes because of Azure limitations in a medium or bigger Azure environment.

Check the rate limit metrics, they will tell you how close you are to the tenant rate limit.

theok-nice commented 1 year ago

So say we see the following numbers

# HELP azurerm_api_ratelimit AzureRM API ratelimit
# TYPE azurerm_api_ratelimit gauge
azurerm_api_ratelimit{apiEndpoint="management.azure.com",scope="costmanagement",subscriptionID="**********",tenantID="+++++++",type="entity-requests.DefaultQuota"} 0
azurerm_api_ratelimit{apiEndpoint="management.azure.com",scope="costmanagement",subscriptionID="**********",tenantID="+++++++",type="tenant-requests.DefaultQuota"} 16
azurerm_api_ratelimit{apiEndpoint="management.azure.com",scope="subscription",subscriptionID="**********",tenantID="+++++++",type="reads"} 11986
azurerm_api_ratelimit{apiEndpoint="management.azure.com",scope="subscription",subscriptionID="**********",tenantID="+++++++",type="resourceRequests"} 96
azurerm_api_ratelimit{apiEndpoint="management.azure.com",scope="tenant",subscriptionID="",tenantID="+++++++",type="reads"} 11992

What does that mean? Is the subscription consuming 11986 of the 11992 available for the tenant?

mblaschke commented 1 year ago

These are Azure REST API ratelimits, for cost ratelimits they will have scope costmanagement and for consumption they will have scope consumption.

These numbers tell you how many requests are still available.

Keep in mind that these metrics are reset after every metrics collection run because they are short lived, so check your prometheus instance for the collected metrics.
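To read the remaining requests for one scope out of a scrape, something like the following sketch could be used. It parses the text exposition format with two simple regular expressions (no escaped quotes inside label values are handled), and the function name and the shortened sample are hypothetical; the label names and values are taken from the output posted above.

```python
import re

SAMPLE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def remaining_requests(metrics_text, scope):
    """Map the rate limit type to the remaining requests for one scope."""
    result = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != "azurerm_api_ratelimit":
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if labels.get("scope") == scope:
            result[labels.get("type", "")] = float(match.group("value"))
    return result


SAMPLE = '''# TYPE azurerm_api_ratelimit gauge
azurerm_api_ratelimit{apiEndpoint="management.azure.com",scope="costmanagement",type="entity-requests.DefaultQuota"} 0
azurerm_api_ratelimit{apiEndpoint="management.azure.com",scope="costmanagement",type="tenant-requests.DefaultQuota"} 16
azurerm_api_ratelimit{apiEndpoint="management.azure.com",scope="tenant",type="reads"} 11992
'''

print(remaining_requests(SAMPLE, "costmanagement"))
```

Because the values are short lived, it is more useful to run this kind of check against the samples stored in Prometheus than against a single manual scrape.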

theok-nice commented 1 year ago

I changed my setup to use image tag "23.3.0-beta4"

and the following config

  SCRAPE_TIME_RESOURCEHEALTH: "0"
  SCRAPE_TIME_COSTS: "12h"
  SCRAPE_TIME_IAM: "0"
  # SCRAPE_TIME_SECURITY: "0"
  SCRAPE_TIME_DEFENDER: "0"
  SCRAPE_TIME_GRAPH: "0"
  SCRAPE_TIME_QUOTA: "0"
  SCRAPE_TIME_RESOURCE: "1h"
  SCRAPE_TIME_GENERAL: "1h"
  COSTS_REQUEST_DELAY: "10m"
  AZURE_RESOURCEGROUP_TAG: "creator"
  AZURE_RESOURCE_TAG: "creator"
  COSTS_QUERY_by_billingmonth: "BillingMonth"
  COSTS_QUERY_by_servicename: "ServiceName"
  COSTS_QUERY_by_consumedservice: "ConsumedService"
  COSTS_QUERY_by_resourcelocation: "ResourceLocation"
  COSTS_QUERY_by_resourcegroup: "ResourceGroupName"
  COSTS_QUERY_by_resourcetype: "ResourceGroupName,ResourceType"
  COSTS_QUERY_by_resourcetype_resourceid: "ResourceType,ResourceId"
  COSTS_QUERY_by_servicename_resourceid: "ServiceName,ResourceId"
  CACHE_PATH: "file://data"

I guess I will find out in 12hrs

But from the first run I see the following:

{
  "level": "info",
  "caller": "collector/collector.go:361",
  "msg": "starting metrics collection",
  "collector": "Costs"
}
{
  "level": "info",
  "caller": "azure-resourcemanager-exporter/metrics_azurerm_costs.go:262",
  "msg": "fetching cost report for query \"by_resourcelocation\" and timeframe \"MonthToDate\"",
  "collector": "Costs",
  "subscriptionID": "*******",
  "subscriptionName": "++++++++++++"
}
{
  "level": "panic",
  "caller": "azure-resourcemanager-exporter/metrics_azurerm_costs.go:415",
  "msg": "POST https://management.azure.com/subscriptions/*******/providers/Microsoft.CostManagement/query
--------------------------------------------------------------------------------
RESPONSE 429: 429 Too Many Requests
ERROR CODE: 429
--------------------------------------------------------------------------------
{
  \"error\": {
    \"code\": \"429\",
    \"message\": \"Too many requests. Please retry.\"
  }
}
--------------------------------------------------------------------------------
",
  "collector": "Costs",
  "subscriptionID": "*******",
  "subscriptionName": "++++++++++++",
  "costreport": "ActualCost",
  "stacktrace": "main.(*MetricsCollectorAzureRmCosts).collectCostManagementMetrics
   /go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:415
main.(*MetricsCollectorAzureRmCosts).collectSubscription
   /go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:263
main.(*MetricsCollectorAzureRmCosts).Collect.func1
   /go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:252
github.com/webdevops/go-common/azuresdk/armclient.(*SubscriptionsIterator).ForEach
   /go/pkg/mod/github.com/webdevops/go-common@v0.0.0-20230323215350-23bb4d4209c4/azuresdk/armclient/iterator.subscriptions.go:65
main.(*MetricsCollectorAzureRmCosts).Collect
   /go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:251
github.com/webdevops/go-common/prometheus/collector.(*Collector).collectRun.func1
   /go/pkg/mod/github.com/webdevops/go-common@v0.0.0-20230323215350-23bb4d4209c4/prometheus/collector/collector.go:246"
}

{
  "level": "error",
  "caller": "collector/collector.go:234",
  "msg": "panic occurred (panic threshold 1 of 5): POST https://management.azure.com/subscriptions/*******/providers/Microsoft.CostManagement/query
--------------------------------------------------------------------------------
RESPONSE 429: 429 Too Many Requests
ERROR CODE: 429
--------------------------------------------------------------------------------
{
  \"error\": {
    \"code\": \"429\",
    \"message\": \"Too many requests. Please retry.\"
  }
}
--------------------------------------------------------------------------------
",
  "collector": "Costs",
  "stacktrace": "github.com/webdevops/go-common/prometheus/collector.(*Collector).collectRun.func1.1
   /go/pkg/mod/github.com/webdevops/go-common@v0.0.0-20230323215350-23bb4d4209c4/prometheus/collector/collector.go:234
runtime.gopanic
   /usr/local/go/src/runtime/panic.go:884
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite
   /go/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:198
go.uber.org/zap/zapcore.(*CheckedEntry).Write
   /go/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:264
go.uber.org/zap.(*SugaredLogger).log
   /go/pkg/mod/go.uber.org/zap@v1.24.0/sugar.go:295
go.uber.org/zap.(*SugaredLogger).Panic
   /go/pkg/mod/go.uber.org/zap@v1.24.0/sugar.go:153
main.(*MetricsCollectorAzureRmCosts).collectCostManagementMetrics
   /go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:415
main.(*MetricsCollectorAzureRmCosts).collectSubscription
   /go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:263
main.(*MetricsCollectorAzureRmCosts).Collect.func1
   /go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:252
github.com/webdevops/go-common/azuresdk/armclient.(*SubscriptionsIterator).ForEach
   /go/pkg/mod/github.com/webdevops/go-common@v0.0.0-20230323215350-23bb4d4209c4/azuresdk/armclient/iterator.subscriptions.go:65
main.(*MetricsCollectorAzureRmCosts).Collect
   /go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:251
github.com/webdevops/go-common/prometheus/collector.(*Collector).collectRun.func1
   /go/pkg/mod/github.com/webdevops/go-common@v0.0.0-20230323215350-23bb4d4209c4/prometheus/collector/collector.go:246"
}
{
  "level": "info",
  "caller": "collector/cache.go:169",
  "msg": "saved state to cache: file://data/costs.json (expiring 2023-03-29 22:01:04.364161054 +0000 UTC)",
  "collector": "Costs"
}
{
  "level": "info",
  "caller": "collector/collector.go:378",
  "msg": "finished metrics collection, next run in 12h0m0s",
  "collector": "Costs",
  "duration": 15.59760094,
  "nextRun": "2023-03-29T22:01:04.364Z"
}

I am posting this error here because I want to make sure there isn't an error in my config, rather than finding out 12 hours later.

aloysioc commented 1 year ago

I'm not receiving any data when I run the azurerm_api_ratelimit query. :(

aloysioc commented 1 year ago

Why doesn't my exporter collect ratelimit data?

jkroepke commented 1 year ago

Is it possible to fetch the costs for all subscriptions through one call? Something like consolidated billing on the AWS side? That would lower the number of calls.

aloysioc commented 1 year ago

The errors stopped again. I don't know why, but they stopped.

theok-nice commented 1 year ago

Same experience here. But it's way slower: it took 90 min to return results for the following config. At least it's working.

  tag: "23.0.0-beta2"

  SCRAPE_TIME_COSTS: "12h"
  COSTS_REQUEST_DELAY: "5m"
  AZURE_RESOURCEGROUP_TAG: "creator"
  AZURE_RESOURCE_TAG: "creator"
  COSTS_QUERY_by_billingmonth: "BillingMonth"
  COSTS_QUERY_by_consumedservice: "ConsumedService"
  COSTS_QUERY_by_resourcegroup: "ResourceGroupName"
  COSTS_QUERY_by_resourcelocation: "ResourceLocation"
  COSTS_QUERY_by_resourcetype: "ResourceGroupName,ResourceType"
  COSTS_QUERY_by_resourcetype_resourceid: "ResourceGroupName,ResourceType,ResourceId"
  COSTS_QUERY_by_servicename: "ServiceName"
  COSTS_QUERY_by_servicename_resourceid: "ServiceName,ResourceId"

Latest tag 23.3.0-beta4 never returned any results for me.

aloysioc commented 1 year ago

Yeah! It's back up and running with no changes or explanation as to why this 429 error is occurring. However, I recommend that you remove "resourceId" from your queries. This can cause a problem with the amount of requests generated by your queries.

aloysioc commented 1 year ago

But, I also noticed this slowness in returning data.

mblaschke commented 1 year ago

Before 23.3.0-beta4 the COSTS_REQUEST_DELAY delay was applied for every timeframe run; now it's applied for every cost query run, so you can reduce this delay now if that's important to you. So every full run will take:

    8 * (request-time + 5m) * 2 * x

8 = your number of cost queries
request-time = time it takes to request the cost information
5m = your COSTS_REQUEST_DELAY
2 = timeframes (default: MonthToDate, YearToDate)
x = number of subscriptions
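As a worked example of that estimate, read as queries * (request-time + delay) * timeframes * subscriptions, here is a small sketch. The 30 second request time and the single subscription are assumed values for illustration, not measurements, and the function name is hypothetical.

```python
from datetime import timedelta


def full_run_duration(queries, request_time, request_delay, timeframes, subscriptions):
    """Estimate one full cost collection run from the factors listed above."""
    return queries * (request_time + request_delay) * timeframes * subscriptions


estimate = full_run_duration(
    queries=8,                           # number of COSTS_QUERY_* entries
    request_time=timedelta(seconds=30),  # assumed, not measured
    request_delay=timedelta(minutes=5),  # COSTS_REQUEST_DELAY
    timeframes=2,                        # MonthToDate and YearToDate
    subscriptions=1,                     # assumed
)
print(estimate)  # 1:28:00
```

With these assumptions a single subscription already needs roughly an hour and a half per run, which is in the same range as the 90 minutes reported earlier in this thread.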

Also, the Azure SDK is now configured with higher delay times for retries of cost queries because the default retry time is too short for this kind of strict rate limit.

Just a suggestion: if you use Prometheus, you can combine the following queries:

  COSTS_QUERY_by_resourcegroup: "ResourceGroupName"
  COSTS_QUERY_by_resourcetype: "ResourceGroupName,ResourceType"
  COSTS_QUERY_by_resourcetype_resourceid: "ResourceGroupName,ResourceType,ResourceId"

to

  COSTS_QUERY_by_resourcetype_resourceid: "ResourceGroupName,ResourceType,ResourceId"

  COSTS_QUERY_by_servicename: "ServiceName"
  COSTS_QUERY_by_servicename_resourceid: "ServiceName,ResourceId"

to

  COSTS_QUERY_by_servicename_resourceid: "ServiceName,ResourceId"

You can sum by labels to get the aggregated values as well; I don't see any reason why this should not be possible.
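To illustrate the idea of summing by labels outside of Prometheus, here is a minimal sketch that collapses fine-grained samples onto fewer labels, similar in spirit to a PromQL `sum by (...)`. The label names and cost values are made up for illustration and are not the exporter's actual metric labels.

```python
from collections import defaultdict


def sum_by(samples, keep_labels):
    """Sum sample values grouped by the given label names."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels.get(name, "") for name in keep_labels)
        totals[key] += value
    return dict(totals)


# Hypothetical samples as they might come from the combined ResourceId query.
samples = [
    ({"resourceGroupName": "rg-a", "resourceType": "vm", "resourceId": "vm1"}, 10.0),
    ({"resourceGroupName": "rg-a", "resourceType": "vm", "resourceId": "vm2"}, 5.0),
    ({"resourceGroupName": "rg-a", "resourceType": "disk", "resourceId": "d1"}, 2.5),
    ({"resourceGroupName": "rg-b", "resourceType": "vm", "resourceId": "vm3"}, 4.0),
]

by_group = sum_by(samples, ["resourceGroupName"])
by_type = sum_by(samples, ["resourceGroupName", "resourceType"])
print(by_group)  # {('rg-a',): 17.5, ('rg-b',): 4.0}
```

The per-resource-group and per-resource-type totals are recovered from the most detailed query, so the two coarser queries would not need their own API calls.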

Still, I recommend not using ResourceId because it fetches the cost for every deployed resource and generates lots of metrics.

For ratelimit metrics: these metrics are reset/removed after they have been collected (when someone accesses /metrics) because the values have a "short expiry": they are only valid for several seconds up to a minute. If you want to ignore the "short expiry" time you can disable autoreset: https://github.com/webdevops/azure-resourcemanager-exporter#settings

For an explanation of the rate limit values please consult the Azure documentation:

Azure REST API: https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling
Azure cost mgmt API: https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/manage-automation#data-latency-and-rate-limits

theok-nice commented 1 year ago

I created separate cost queries, to simplify my Prometheus(/Grafana) queries.

But generally speaking COSTS_QUERY_by_resourcetype_resourceid: "ResourceGroupName,ResourceType,ResourceId"

is crashing my pod with tag: "23.3.0-beta4". I couldn't figure out why though.

This one wouldn't crash my pod COSTS_QUERY_by_resourcetype_resourceid: "ResourceType,ResourceId"

aloysioc commented 1 year ago

When I tried to use this version "23.3.0-beta4", I got an error saying that it was not possible to fetch "resourceID", even though I don't have this metric in my settings. I went back to 23.2.0-beta2.

jkroepke commented 1 year ago

@aloysioc

I use 23.3.0-beta4 and I have no issues with COSTS_QUERY_resource_id: "ResourceGroup,ResourceLocation,ServiceFamily,ServiceName,PublisherType,ResourceType,ResourceId"

But before I could use this in production, I need to look into https://github.com/webdevops/azure-resourcemanager-exporter/issues/22

aloysioc commented 1 year ago

Now I'm using version 23.3.0-beta4. The error has stopped, but I notice that there is now a delay between the values shown in the portal and the data collected by the exporter and shown in Prometheus. Do you have any idea why this is happening? Previously, the information was provided practically in real time.

Data from Prometheus:

Data from the Portal:

mblaschke commented 1 year ago

The values are collected at the defined scrape time: if you define 12 hours, the metrics are updated every 12 hours. If you set it to 24h, your metrics are only updated once a day.

If you scrape every 15 minutes your values are up to date, but you might consume your tenant's entire request quota so nobody else can get cost information.

This all depends on your use case and the size of your tenant.

aloysioc commented 1 year ago

I'm scraping every 10 minutes (now without errors), but the values are not updated at the same frequency.

The only notable detail is this:

{"level":"warn","caller":"armclient/client.tags.go:286","msg":"unable to fetch resource tags for resource \"/subscriptions/10d45ba9-1ad0-40f9-a0d3-e495ad07613a/resourceGroups/\": unable to parse Azure resourceID \"/subscriptions/10d45ba9-1ad0-40f9-a0d3-e495ad07613a/resourcegroups/\"","component":"armClientTagManager"}
{"level":"warn","caller":"armclient/client.tags.go:286","msg":"unable to fetch resource tags for resource \"/subscriptions/10d45ba9-1ad0-40f9-a0d3-e495ad07613a/resourceGroups/\": unable to parse Azure resourceID \"/subscriptions/10d45ba9-1ad0-40f9-a0d3-e495ad07613a/resourcegroups/\"","component":"armClientTagManager"}
{"level":"warn","caller":"armclient/client.tags.go:286","msg":"unable to fetch resource tags for resource \"/subscriptions/10d45ba9-1ad0-40f9-a0d3-e495ad07613a/resourceGroups/\": unable to parse Azure resourceID \"/subscriptions/10d45ba9-1ad0-40f9-a0d3-e495ad07613a/resourcegroups/\"","component":"armClientTagManager"}
{"level":"warn","caller":"armclient/client.tags.go:286","msg":"unable to fetch resource tags for resource \"/subscriptions/10d45ba9-1ad0-40f9-a0d3-e495ad07613a/resourceGroups/\": unable to parse Azure resourceID \"/subscriptions/10d45ba9-1ad0-40f9-a0d3-e495ad07613a/resourcegroups/\"","component":"armClientTagManager"}

mblaschke commented 1 year ago

This is already fixed with #32 and released with :23.4.0-beta0, which now uses a config file instead of environment variables, so this is a breaking change. The whole cost reporting now uses paging and offers its own scopes (e.g. management groups), and you can define a list of subscriptions to be used instead of the global subscription list.

The original topic of this issue is already fixed, so I'm closing this issue now. Please create new issues for further problems or feature requests.

webdevops / azure-resourcemanager-exporter

Improve cost metrics (was: Container crashing due to data collection error) #20