zapier / prom-aggregation-gateway

An aggregating push gateway for Prometheus
Mozilla Public License 2.0
116 stars 26 forks source link

feat: flag to enable gauge replacement #77

Open mcfedr opened 1 year ago

mcfedr commented 1 year ago

Enable different modes for feature flags

fixes https://github.com/zapier/prom-aggregation-gateway/issues/65

mplachter commented 1 year ago

@mcfedr Thank you for the MR.

We have had a ton of feedback on the gauges. We need to do something different... 🤔

Currently (to your point), we add the gauges up (which does not make sense to do)

This approach of just having the new gauge as the value, I think, can be misleading, as well as having a randomly high or low value as it's not an aggregate from all the containers\functions\pods sending metrics into the gateway. I think this is better than currently summing the gauge, as it makes no sense to do so...

My ideal situation would be to make a running average or median of the metric gauge over a given interval. I do think this will take some time to implement I do have a draft MR https://github.com/zapier/prom-aggregation-gateway/pull/57 to add this but to keep it performant I had to duplicate the memory footprint which isn't ideal (we may have to rewrite the structs and move away from slices of metrics)

We could change this behavior based on a CLI flag so we can still implement your desired change but make it so it doesn't change the current behavior unless a CLI flag is passed.

@djeebus thoughts?

djeebus commented 1 year ago

Yeah, I agree w/ all your points. A flag, something like --gauge-behavior=sum|replace, would probably be a good start, which leaves us an opening to do something like median, max, etc.

mcfedr commented 1 year ago

Yes, in my use case the value is an calculated value from some data, so no matter which process (in my case lambda functions) sends it, its the latest value - so i certainly wouldnt want any manipulation of the value

but i also thought about a cli flag, might be helpful to have different options

SpangleLabs commented 1 year ago

People keep suggesting this: #65 #71 and it seems like if you want the gauge to replace, then you should push it to a prometheus push gateway, not an aggregation gateway. It feels like an aggregation gateway should aggregate.

I think there are cases where aggregating the gauges is exactly what's wanted, especially in summaries for example.

It seems weird to add a CLI flag that turns the aggregation gateway into a push gateway though. I can understand wanting some metrics to be aggregated and some to be latest, in which case, push some to an aggregation gateway, and some to a push gateway?

The idea of the aggregation gateway showing the average gauge value of the last N pushes or M minutes also seems odd? It would need to be synced to the fetching or the pushing, and liable to go out of sync with either. If you want a running total, that seems like something an aggregated summary could do, or a prometheus query over a latest value gauge?

mcfedr commented 1 year ago

@SpangleLabs Gauge are different from Counters though, thats clearly why this is happening

I could use both push gateway and aggregation gateway - but it does mean my app has to make two http requests to send all the metrics, and thats a big disadvantage

SpangleLabs commented 1 year ago

Yeah, I'm just quite hesitant on the idea that an aggregation gateway should also do the job of a push gateway?

I guess if you need to save on web requests then it would be helpful to have a gateway that can serve the purposes of both.. But even then that would seem best configured at a metric or job level, rather than a gateway level? Though I've no idea what syntax could enable that. Tying the operation of your application and monitoring, to the configuration of your aggregation gateway seems a weird choice, and doesn't seem to properly separate concerns.

I feel it boils down to an idea of "do one thing well"

(And certainly, making a breaking change to the behaviour seems unwise)

In fairness though, I've been digging through to find my concrete use-cases for aggregating gauges, and most of them are in places where we were rolling our own summaries, as the aggregation gateway did not support Summary metrics at the time. I might be suffering some anchoring based on that

mcfedr commented 1 year ago

So what about something like this? I'm not 100% sure its the best way to pass the flag though, but its functional like this.

pablote commented 1 year ago

It'd be great if this or something very similar got merged. Having two push gateways, one that aggregates and one that doesn't does not make sense.

SpangleLabs commented 1 year ago

Honestly, at this point, I'm growing kind of unsure when one would want a push gateway at all, alongside or instead of an aggregation gateway.

After much thought (and overthought), I think this solution looks great. Having a flag to allow backwards compatibility, but moving towards the better usage of the thing.

(Even the stuff I said before about summaries being simulated with a counter and a gauge is entirely wrong. A summary is 2 counters)

mcfedr commented 1 year ago

@mplachter maybe I can bump this for a review?

mplachter commented 1 year ago

@mcfedr This looks good if we're able to add a unit_test for this in aggregate_test.go we can get this merged in, If we add this functionality, I want to make sure we do not undo it without the proper knowledge + documentation during a future release :)

hxnir commented 8 months ago

Hey, it looks really good and would really help one of my usecases. Any updates on the progress? If the missing unit test is the problem I am more than happy to add it myself.

mcfedr commented 8 months ago

@hxnir thanks for bumping this, it kept slipping down my list of things to do. I've added a couple of tests to check the new behavoir, @mplachter hopefully good for a merge now.

also with a rebase on main

JDLK7 commented 6 months ago

Hey, do you plan on merging this anytime soon? Thanks!

xstephen95x commented 4 months ago

Also curious if this will be picked up.

KepptnKool commented 2 months ago

@djeebus @mplachter What are the chances of this PR getting merged? Do you need any support? I guess there are quite some users who would like this feature.