webdevops / azure-resourcemanager-exporter

Prometheus exporter for Azure ResourceManager informations (infos, quotas, limits, usages, public IPs, portscanner)
MIT License
33 stars 17 forks source link

Do not panic on AzureRM Errors #44

Open jkroepke opened 1 year ago

jkroepke commented 1 year ago

Hi,

from time to time we are receiving error from Azure for cost management related errors.

Some of them are persistent and after five retries, the exporter will be panic and exited.

I would like to ask, if the behavior can be changed from panic level to error level. I do not need any benefit of letting the exporter terminate.

Example panic trace after 5 retries:

goroutine 1265 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0xc0001eac00?, {0x0?, 0x0?, 0xc00046e5c0?})
    /go/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x65
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0000fcd00, {0x0, 0x0, 0x0})
    /go/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
go.uber.org/zap.(*SugaredLogger).log(0xc0000b8258, 0x4, {0x0?, 0xc0000d1300?}, {0xc000265350?, 0xc0001d7600?, 0xc000375370?}, {0x0, 0x0, 0x0})
    /go/pkg/mod/go.uber.org/zap@v1.24.0/sugar.go:295 +0xee
go.uber.org/zap.(*SugaredLogger).Panic(...)
    /go/pkg/mod/go.uber.org/zap@v1.24.0/sugar.go:153
main.(*MetricsCollectorAzureRmCosts).sendCostQuery(0x23?, {0x17649b0, 0xc000048048}, 0xc0000b8258, {0xc0000d1300, 0x33}, {0xc0001d7600, 0xc000375370, 0xc000375230, 0x0}, ...)
    /go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:495 +0x451
main.(*MetricsCollectorAzureRmCosts).collectCostManagementMetrics(0xc0001d6480, 0x11?, 0xc000118c40, {0xc0000d1300, 0x33}, {0x15d1413, 0xa}, 0xc000265ef0, {0xc00033b090, 0xb}, ...)
    /go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:370 +0x868
main.(*MetricsCollectorAzureRmCosts).collectRunCostQuery.func1(0xc0002bd9e0, 0xc00042fb48?)
    /go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:225 +0x1db
github.com/webdevops/go-common/azuresdk/armclient.(*SubscriptionsIterator).ForEach(0xc0000b8220?, 0xc00042fd20?, 0xc000265d30)
    /go/pkg/mod/github.com/webdevops/go-common@v0.0.0-20230513212717-8a2d16f8bb01/azuresdk/armclient/iterator.subscriptions.go:65 +0x1e2
main.(*MetricsCollectorAzureRmCosts).collectRunCostQuery(0xc0001d6480, 0xc000265ef0, 0xc0001c5e10?)
    /go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:223 +0x2ed
main.(*MetricsCollectorAzureRmCosts).Collect(0xc0001d6480, 0x1753340?)
    /go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:183 +0xe5
github.com/webdevops/go-common/prometheus/collector.(*Collector).collectRun.func1()
    /go/pkg/mod/github.com/webdevops/go-common@v0.0.0-20230513212717-8a2d16f8bb01/prometheus/collector/collector.go:380 +0x98
created by github.com/webdevops/go-common/prometheus/collector.(*Collector).collectRun
    /go/pkg/mod/github.com/webdevops/go-common@v0.0.0-20230513212717-8a2d16f8bb01/prometheus/collector/collector.go:351 +0x13b
tesharp commented 1 year ago

Seeing the same. Creates gaps in all metrics gathered. Would prefer if it just wrote an error

HelenaSeidel commented 11 months ago

The underlying issue is that panic() is used to handle errors, further up in the callstack it is attempted to "catch" those with recover() which does not work

Recover is a built-in function that regains control of a panicking goroutine. Recover is only useful inside deferred functions. During normal execution, a call to recover will return nil and have no other effect. If the current goroutine is panicking, a call to recover will capture the value given to panic and resume normal execution. further details

This is basically what is being done here and putting it into the playground quickly shows that it fails.

package main

import (
    "fmt"
)

func foo() {
    panic("foobar")
}

func main() {
    foo()
    recover := recover()
    fmt.Println(recover)
}

The really BIG problem is that this incorrect treatment of errors is used for the back-off-retry mechanic in github.com/webdevops/go-common/prometheus/collector IMO this entire behavior should be broken

Currently I am working on putting a bandaid at least onto the cost collector to get rid of this issue. However an entire rework of error handling in (at least) this project, the azure-resourcegraph-exporter and go-common is required

Furthermore I don't understand why the prometheus go client library is not used more extensively here