Prometheus returns 503 Service Unavailable when a PromQL query times out or aborts. This is particularly common in environments with a very large number of nodes, where the tserver_export and node_export queries can return a lot of data.
The promdump utility should be enhanced to handle retriable errors such as 503 responses, for example by adding retry logic (possibly with backoff, so that the retries themselves do not overload the Prometheus server).
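
As a rough illustration of the retry idea, the sketch below retries a query on 503 with capped exponential backoff and random jitter. It is not promdump's code: the names query_with_retry, QueryFailed and RETRIABLE_STATUSES, the default attempt count and the delay values are all assumptions chosen for the example, and fetch stands in for whatever function actually issues the PromQL request.

```python
import random
import time
from typing import Callable, Tuple

# Assumption: only 503 is treated as retriable in this sketch.
RETRIABLE_STATUSES = {503}


class QueryFailed(Exception):
    """Raised when a query does not succeed within the allowed attempts."""


def query_with_retry(
    fetch: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() until it returns 200, backing off between retriable failures."""
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status not in RETRIABLE_STATUSES:
            # Anything else is not expected to improve on retry.
            raise QueryFailed(f"non-retriable HTTP status {status}")
        if attempt == max_attempts:
            break
        # Exponential backoff capped at max_delay, with full jitter so that
        # concurrent retries do not reach the server at the same moment.
        ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, ceiling))
    raise QueryFailed(f"still failing after {max_attempts} attempts")
```

With the defaults above, a query that keeps returning 503 is attempted five times, waiting at most 1, 2, 4 and 8 seconds between attempts, and then raises QueryFailed rather than silently producing no data.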
Additionally, errors reported in the logs are often ignored at runtime, so the run continues and the resulting tarball is incomplete. Error handling should be improved so that fatal errors abort the promdump run, giving the end user a clear indication that something has gone wrong.
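
One possible shape for the abort-on-fatal-error behaviour is sketched below. It is an illustration only and does not reflect how promdump is structured: FatalExportError, run_exports and main are hypothetical names, each export is modelled as a plain callable, and writing the tarball is replaced by a print statement.

```python
import sys
from typing import Callable, Dict


class FatalExportError(Exception):
    """An error that would leave the dump incomplete (hypothetical name)."""


def run_exports(exports: Dict[str, Callable[[], bytes]]) -> Dict[str, bytes]:
    """Run every export, stopping at the first failure instead of skipping it."""
    results = {}
    for name, export in exports.items():
        try:
            results[name] = export()
        except Exception as err:
            # Re-raise with the export name so the user can see what failed.
            raise FatalExportError(f"export {name!r} failed: {err}") from err
    return results


def main(exports: Dict[str, Callable[[], bytes]]) -> int:
    """Return a process exit code: 0 on success, 1 if any export failed."""
    try:
        results = run_exports(exports)
    except FatalExportError as err:
        print(f"FATAL: {err}; no tarball written", file=sys.stderr)
        return 1
    # Stand-in for writing the tarball once every export has succeeded.
    print(f"collected {len(results)} exports")
    return 0
```

A caller would finish with sys.exit(main(exports)), so a failed run is visible both on stderr and through a non-zero exit status, and an incomplete tarball is never written.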