davejhahn closed this 9 months ago
Going to close this. I found the Grafana dashboard and although it doesn't give me everything (failed builds), I think there are plenty of examples that I can look at to figure it out. Thanks!
For anyone else: https://gist.githubusercontent.com/mblaschke/ef86524d9350d45143cbbc5d2da674b1/raw/bc1c850d49dc0e5240c041218eee02c73d9040eb/example-dashboard.json
I've deployed the metrics exporter to my cluster. It pulls the metrics, I can see them on the exporter's endpoint, and Prometheus is ingesting them.
That's where I kind of get stuck. I'm not really sure what these metrics are supposed to be representing, or maybe what I want to do and what it shows are not aligned.
Basically I would like to be able to:
e.g.
12:00  # failed builds: 3
12:10  # failed builds: 0
12:20  # failed builds: 0
12:30  # failed builds: 10
And other things such as this.
With that foundational information, I would then create Prometheus rules to trigger alerts based on thresholds that indicate a problem.
e.g. be able to look at the overall health of a large build system and, if there are patterns of build failures that are not normal, trigger an alert (which would then go to our on-call person via PagerDuty).
But I cannot work out how to get these basic numbers out of the metrics that are available. We have dozens of other exporters and never really had to think about it; they just worked as expected. I'm guessing the way the data is defined here is different.
It seems that the metrics are cumulative. For example, I can graph the number of failed builds, but the line just keeps increasing from start to end instead of showing a breakdown by period (hourly or whatever). I don't want the number of failed builds since the beginning of time; I want the number for any given window of time.
I've tried numerous ways of obtaining this with Prometheus functions and nothing seems to work.
Does anyone know if I can obtain this basic information or is my use case completely different? I mean this is the kind of data Prometheus is for, so that's why I am confused.
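For anyone landing here with the same question: if the series really is a cumulative counter, the per-period numbers are the differences between consecutive samples. That is roughly what PromQL's `increase()` computes over a range such as `[10m]`, although Prometheus also extrapolates to the window edges, so its results can be fractional. Below is a minimal Python sketch of the idea, not the exporter's or Prometheus's actual implementation. The sample values and the `per_window_increase` helper are made up for illustration, and whether this exporter exposes failed builds as a counter at all is an assumption.

```python
from typing import List, Tuple

# (timestamp label, cumulative counter value) -- hypothetical scrape results
Sample = Tuple[str, float]


def per_window_increase(samples: List[Sample]) -> List[Tuple[str, float]]:
    """Turn cumulative counter samples into per-window increases.

    Each result is labelled with the timestamp that ends the window.
    A drop in the value is treated as a counter reset (e.g. the exporter
    restarted), so the new value itself is taken as the increase.
    """
    deltas = []
    for (_, prev), (ts, curr) in zip(samples, samples[1:]):
        if curr < prev:
            delta = curr          # counter reset: count from zero again
        else:
            delta = curr - prev   # normal case: difference between scrapes
        deltas.append((ts, delta))
    return deltas


# Hypothetical cumulative "failed builds" counter sampled every 10 minutes.
samples = [
    ("11:50", 40),
    ("12:00", 43),
    ("12:10", 43),
    ("12:20", 43),
    ("12:30", 53),
]

for ts, failed in per_window_increase(samples):
    print(f"{ts}  # failed builds: {failed}")
```

With these made-up samples the sketch yields 3, 0, 0 and 10 failed builds for the windows ending at 12:00, 12:10, 12:20 and 12:30. An alert threshold would then be compared against these per-window values rather than against the raw, ever-growing total.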