yugabyte / yb-tools

Tools for YugabyteDB database maintenance and support
Apache License 2.0

promdump: improve error handling for retriable and fatal errors #98

Closed: ionthegeek closed this issue 2 months ago

ionthegeek commented 1 year ago

Prometheus returns 503 Service Unavailable when a PromQL query times out or aborts. This is particularly common in environments with very large numbers of nodes, where the tserver_export and node_export queries can return very large result sets.

The promdump utility should handle retriable errors, such as 503 responses, more gracefully. One option is to add retry logic, possibly with backoff, so that the retries themselves do not overload the Prometheus server.
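To make the retry idea concrete, here is a minimal sketch of retrying with exponential backoff. It is not promdump's actual code: `isRetriable`, `backoffDelay`, and `getWithRetry` are hypothetical names, the choice to treat 429, 502, and 504 as retriable alongside 503 is an assumption, and the test server in `main` only simulates a Prometheus endpoint that fails twice before succeeding.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// isRetriable reports whether an HTTP status is worth retrying.
// Only 503 is named in this issue; the other codes are an assumption.
func isRetriable(status int) bool {
	switch status {
	case http.StatusTooManyRequests, http.StatusBadGateway,
		http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		return true
	}
	return false
}

// backoffDelay returns base * 2^attempt, capped at maxDelay.
func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

// getWithRetry issues a GET and retries retriable failures with backoff.
// A non-retriable status is returned immediately as a fatal error.
func getWithRetry(client *http.Client, url string, maxAttempts int, base, maxDelay time.Duration) ([]byte, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if attempt > 0 {
			// Wait before every retry so the server gets room to recover.
			time.Sleep(backoffDelay(attempt-1, base, maxDelay))
		}
		resp, err := client.Get(url)
		if err != nil {
			lastErr = err // transport errors are treated as retriable (assumption)
			continue
		}
		body, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()
		if readErr != nil {
			lastErr = readErr
			continue
		}
		if resp.StatusCode == http.StatusOK {
			return body, nil
		}
		if !isRetriable(resp.StatusCode) {
			return nil, fmt.Errorf("fatal HTTP status %d", resp.StatusCode)
		}
		lastErr = fmt.Errorf("retriable HTTP status %d", resp.StatusCode)
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	// Simulated endpoint: two 503 responses, then a success.
	var calls int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&calls, 1) < 3 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		fmt.Fprint(w, `{"status":"success"}`)
	}))
	defer srv.Close()

	body, err := getWithRetry(srv.Client(), srv.URL, 5, 10*time.Millisecond, time.Second)
	fmt.Println(string(body), err)
}
```

Capping the delay keeps a long outage from producing unbounded waits. Adding random jitter to each delay would be a reasonable extension if several promdump runs could hit the same Prometheus server at once.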

Additionally, errors written to the logs are often ignored at runtime, so the run continues and produces an incomplete tarball. Error handling should be improved so that fatal errors abort the promdump run and give the end user a clear indication that something has gone wrong.
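As a rough illustration of the fail-fast behaviour requested here, the sketch below stops at the first failed export and exits non-zero instead of logging and continuing. It is not taken from promdump: `exportStep`, `runAll`, and `errSimulated` are hypothetical, and the step named `other_export` is invented purely for the example.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// exportStep is a hypothetical unit of work, e.g. one metric export.
type exportStep struct {
	name string
	run  func() error
}

// errSimulated stands in for a failure that survived all retries.
var errSimulated = errors.New("503 Service Unavailable after retries")

// runAll executes steps in order and stops at the first failure.
// It returns how many steps completed and the wrapped error, if any.
func runAll(steps []exportStep) (int, error) {
	completed := 0
	for _, s := range steps {
		if err := s.run(); err != nil {
			return completed, fmt.Errorf("export %q failed: %w", s.name, err)
		}
		completed++
	}
	return completed, nil
}

func main() {
	steps := []exportStep{
		{"node_export", func() error { return nil }},
		{"tserver_export", func() error { return errSimulated }},
		{"other_export", func() error { return nil }}, // hypothetical step, never reached
	}

	n, err := runAll(steps)
	if err != nil {
		// Abort loudly with a non-zero exit code so an incomplete
		// dump is never mistaken for a successful one.
		fmt.Fprintf(os.Stderr, "promdump: aborting after %d of %d exports: %v\n", n, len(steps), err)
		os.Exit(1)
	}
	fmt.Println("all exports completed")
}
```

Returning the error up to `main` rather than logging it at the point of failure keeps the abort decision in one place, and wrapping with `%w` preserves the underlying cause for callers that want to inspect it.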

eugeneckim commented 6 months ago

A recent Slack thread touched on this: https://yugabyte.slack.com/archives/C072YT48W57/p1715611566420589?thread_ts=1715500866.247639&cid=C072YT48W57