zapier / prom-aggregation-gateway

An aggregating push gateway for Prometheus
Mozilla Public License 2.0
116 stars 26 forks source link

[Feature] Support replacing rather than summing gauges #65

Open Sinjo opened 1 year ago

Sinjo commented 1 year ago

Hey 👋🏻

I've been looking at this project as one option for situations where scraping metrics is awkward (short-lived cron jobs in particular), but ran into one snag that's documented right at the top of the README:

Gauges are also added up (but this may not make any sense)

In many situations, summing up gauges doesn't make sense. The one I'm specifically running into is use of gauges to record the last time a job ran. In that situation, the behaviour I'd like is last-write-wins (i.e. whatever is currently in the gauge gets overwritten by a push to the same grouping key).

I was looking at the code, and it looks like it'd be fairly simple to make this behaviour conditional.

I'd be happy to send over a PR to do that. What I want to check is 1) whether you'd accept a PR that added a CLI flag to toggle the behaviour and 2) whether you have any preferences on implementation (e.g. should it be a global setting for all gauges in a given instance of the aggregation gateway).

SpangleLabs commented 1 year ago

Not sure if this is relevant, but while browsing the code for #5 I did notice that the aggregate functionality actually says gauges would be cleared out when read: https://github.com/zapier/prom-aggregation-gateway/blame/main/metrics/merge.go#L86 Though, there's just a todo note in the code where that reset would be performed: https://github.com/zapier/prom-aggregation-gateway/blame/main/metrics/aggregate.go#L149

So it seems like maybe the original intent was something inbetween replacing and summing? (Though I would definitely prefer if it were one or the other, rather than reading the metrics being a write operation, which definitely seems like it would be odd behaviour.)

Also, as a bystander, I feel like having the setting you suggest in question 2 would be much nicer as a per-gauge one, rather than a global one. There are definitely situations where a gauge might like to have one behaviour or the other, but I'm not sure how you would best delineate that

SpangleLabs commented 1 year ago

Of note also, is that the standard prometheus push gateway always implements the replace mechanic, one could theoretically use both if you want some metrics to replace with latest, and some to aggregate?

Sinjo commented 1 year ago

Yeah, it's something I've considered, but I'd rather just run one extra binary alongside any jobs that need to push metrics somewhere.

I'll leave this open, but having spoken to the maintainers of Vector on this issue thread, I'm going to go that route instead.