Closed botimer closed 3 years ago
Wow! Thank you for very detailed explanation!
My bad, I wasn't tracking Prometheus Ruby client development closely and missed addition of new :most_recent
aggregation mode in version 2.1.0.
I believe that it should fix #8
Please try version 0.6.0 of this gem that has your PR merged and share your experience.
TLDR: all of the puma gauges should use
aggregation: :most_recent
to be accurate.When using the DirectFileStore to support multi-process mode (aka clustered, multiple workers), the pid labels included by default lead to incorrect and stale information. This is related to #8, yabeda-prometheus#10, and others.
As-is, each process reports info about all processes, giving some multiplicative data, and workers that are terminated are not pruned from the metrics (at least until a total restart cleans all of the bin files). As an example, say you run with 4 workers and 8 threads, and three have crashed/restarted. With the pid labels, these will appear as 8x4 threads for every worker, giving 7x32 (4 live workers, and 3 expired), or 224 threads, rather than the actual 32.
The main issue is that prometheus-client cannot predict the correct type of aggregation for gauge metrics, so they do the safe default of including the metric with a pid label. In the clustered puma scenario, this is not helpful, as each process gets all of the data, yielding some cross-talk. The pid is volatile, as well as irrelevant. What we want is only the index label to identify each worker and to export only the most recent data.
Fortunately, the design of prometheus-client allows these aggregation settings to pass through and no-op for the Synchronized and SingleThreaded stores, and the MOST_RECENT aggregation is appropriate for all 7 of the gauges. This makes the patch simple and unconditional.