WIP Add batching for gauge metrics

mplachter commented 1 year ago

POC: adding batching for metrics on an interval

TODO:

Add AVG during the batch interval
Add TimeStamp Filtering for old metrics

faangbait commented 1 year ago

Hey, I see people are trying to solve the same problem I wandered over here from.

The solution I had was also "callback on GET /metrics that wipes after a scrape." If you're aggregating and squashing labels, that's the only way to prevent double counting.
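For illustration, here is a minimal Go sketch of the wipe-on-scrape idea, not the gateway's actual code. The `Aggregator` type, its `Add` and `Drain` methods, and the simplified output format are all hypothetical names introduced for this example. The point it shows is that reading and resetting happen under the same lock, so a pushed sample is reported by exactly one scrape.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"strings"
	"sync"
)

// Aggregator sums pushed samples per series.
// Hypothetical type for illustration, not the gateway's real data structure.
type Aggregator struct {
	mu     sync.Mutex
	totals map[string]float64
}

func NewAggregator() *Aggregator {
	return &Aggregator{totals: make(map[string]float64)}
}

// Add folds one pushed sample into the running total for its series.
func (a *Aggregator) Add(series string, value float64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.totals[series] += value
}

// Drain returns the current totals and resets the state in a single
// critical section, so no sample can be counted by two scrapes.
func (a *Aggregator) Drain() map[string]float64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	out := a.totals
	a.totals = make(map[string]float64)
	return out
}

// ServeHTTP is the "callback on GET /metrics": it drains the totals and
// writes them as simple "name value" lines, sorted for stable output.
func (a *Aggregator) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	snapshot := a.Drain()
	names := make([]string, 0, len(snapshot))
	for name := range snapshot {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	for _, name := range names {
		fmt.Fprintf(&b, "%s %g\n", name, snapshot[name])
	}
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, b.String())
}
```

Registering it with `http.Handle("/metrics", NewAggregator())` would be enough to try it out. Note that with this approach each scrape sees only what arrived since the previous scrape, so the exposed values behave like per-scrape deltas rather than ever-growing totals.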

As far as "what if a user touches the endpoint," that's easily solved with HTTP Basic Auth.
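A rough sketch of what that could look like in Go, assuming the metrics handler is a plain `http.Handler`. The `requireBasicAuth` and `credentialsMatch` names are made up for this example; `r.BasicAuth()` and `subtle.ConstantTimeCompare` are standard library calls.

```go
package main

import (
	"crypto/subtle"
	"net/http"
)

// credentialsMatch compares the supplied credentials with the expected ones
// using constant-time comparisons, so timing does not reveal how many
// leading bytes matched.
func credentialsMatch(gotUser, gotPass, wantUser, wantPass string) bool {
	userOK := subtle.ConstantTimeCompare([]byte(gotUser), []byte(wantUser)) == 1
	passOK := subtle.ConstantTimeCompare([]byte(gotPass), []byte(wantPass)) == 1
	return userOK && passOK
}

// requireBasicAuth wraps next so that only requests carrying the expected
// HTTP Basic Auth credentials reach it. Anything else gets a 401 and never
// triggers the wrapped handler (and therefore never triggers a wipe).
func requireBasicAuth(user, pass string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		if !ok || !credentialsMatch(u, p, user, pass) {
			w.Header().Set("WWW-Authenticate", `Basic realm="metrics"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

The scraper would then need the same credentials in its scrape configuration, and Basic Auth should only be used over TLS since the credentials are sent base64-encoded rather than encrypted.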

I'm sure everybody's in the same boat with "they don't build computers with the amount of memory we need, so let's aggregate some stuff." Wipe-on-scrape has the added benefit of being homeostatic.

Specifically, as the fleet that's being aggregated horizontally scales, memory usage increases at the aggregator level. You could tune memory usage by scraping more often, causing the aggregators to wipe more often. But this would increase memory requirements at the central server.

But then you just aggregate again through another layer of aggregators. From there, it's aggregators all the way down and the SREs are happy!

mplachter commented 1 year ago

@faangbait, thanks for your insight here. I agree, I don't think duplicating all the metrics is valuable.

We do have a few issues with gauges, though. Since we don't want to add them up, we need to calculate a floating average for a given probed period. We also can't just keep averaging them between scrapes, as that depends on when the endpoint is scraped, which could lead to averages over different time durations.
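To make the gauge concern concrete, here is a hedged Go sketch of one way to decouple the averaging window from the scrape: samples are averaged over a fixed batch interval driven by a ticker, and a scrape only reads the last completed batch. `GaugeBatcher`, `Observe`, `Flush`, `Snapshot`, and `Run` are hypothetical names for this example and are not taken from the PR.

```go
package main

import (
	"sync"
	"time"
)

// gaugeAcc accumulates the samples of one series within the current batch.
type gaugeAcc struct {
	sum   float64
	count int
}

// GaugeBatcher averages gauge samples over fixed batch intervals.
// Hypothetical type for illustration only.
type GaugeBatcher struct {
	mu        sync.Mutex
	pending   map[string]gaugeAcc // samples of the batch still being filled
	published map[string]float64  // averages of the last completed batch
}

func NewGaugeBatcher() *GaugeBatcher {
	return &GaugeBatcher{
		pending:   make(map[string]gaugeAcc),
		published: make(map[string]float64),
	}
}

// Observe records one pushed gauge sample in the current batch.
func (g *GaugeBatcher) Observe(series string, value float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	acc := g.pending[series]
	acc.sum += value
	acc.count++
	g.pending[series] = acc
}

// Flush closes the current batch: it publishes the mean of each series and
// starts a new, empty batch. Series with no samples in the batch are dropped.
func (g *GaugeBatcher) Flush() {
	g.mu.Lock()
	defer g.mu.Unlock()
	published := make(map[string]float64, len(g.pending))
	for series, acc := range g.pending {
		published[series] = acc.sum / float64(acc.count)
	}
	g.published = published
	g.pending = make(map[string]gaugeAcc)
}

// Snapshot returns a copy of the last completed batch. Calling it does not
// change any state, so scrape timing cannot affect the averaging window.
func (g *GaugeBatcher) Snapshot() map[string]float64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	out := make(map[string]float64, len(g.published))
	for series, avg := range g.published {
		out[series] = avg
	}
	return out
}

// Run flushes on a fixed interval until stop is closed.
func (g *GaugeBatcher) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			g.Flush()
		case <-stop:
			return
		}
	}
}
```

In this sketch a background `go batcher.Run(15*time.Second, stop)` would own the window (the 15 second value is arbitrary), while the `/metrics` handler only calls `Snapshot`. Whether stale series should be dropped on flush, as done here, or carried forward is a design choice the PR's timestamp-filtering TODO would have to settle.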

zapier / prom-aggregation-gateway

WIP Add batching for gauge metrics #57