Open fritshoogland-yugabyte opened 2 years ago
Added caveat: some customers scrape the prometheus endpoints with additional tools such as Datadog, on top of the scraping already done by prometheus for Platform/YBAnywhere; each extra scrape then resets ("empties") the statistics. In my opinion that is not how this is supposed to behave, and it can lead to huge confusion.
Jira Link: [DB-349]
Description
For observability, we provide different statistics. Two simple ones are counters and gauges; in addition, we generate a metric type that we internally call a 'coarse_histogram', such as:
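(The original example did not survive in this copy of the issue; purely as an illustration of the shape of such a statistic, with a made-up metric name and made-up values, a 'coarse_histogram' entry could look roughly like:)

```
{
  "name": "example_operation_latency",
  "total_count": 1000,
  "min": 12,
  "mean": 418.5,
  "percentile_75": 220,
  "percentile_95": 6100,
  "percentile_99": 8900,
  "max": 9800,
  "total_sum": 418500
}
```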
The purpose of this seems to be to expose the statistics in a way suitable for prometheus to scrape; the equivalent prometheus metric for the above is:
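(Again, the original example is missing here; as an illustration only, with the same made-up name and values, the prometheus-format rendering would look roughly like:)

```
example_operation_latency_count 1000
example_operation_latency_sum 418500
example_operation_latency{quantile="0.75"} 220
example_operation_latency{quantile="0.95"} 6100
example_operation_latency{quantile="0.99"} 8900
example_operation_latency_max 9800
```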
The first thing to notice is that the 'min' figure is not exposed in the prometheus format.
However, the first actual topic of this issue is that this is not a histogram, neither by the general definition nor by the prometheus definition: it is what prometheus calls a 'summary'. See: https://prometheus.io/docs/practices/histograms/
The second topic of this issue is that currently the statistics for (as far as I know) min, max, mean and the percentiles are reset when the metrics are read. This makes some sense, in that a single extreme outlier, whether high or low latency, does not pollute these statistics for the lifetime of the statistic. However, this is not how these metrics are supposed to work. Reset-on-read is non-standard, is easy to misunderstand, and can lead to all sorts of other errors, such as wrong measurements when multiple prometheus hosts scrape the same endpoint (for example in a high-availability setup), or when the statistics are also fetched manually.
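The multi-scraper problem above can be sketched in a few lines of Python. The class below is hypothetical (it is not YugabyteDB code), but it mimics the reset-on-read behavior described: whichever scraper reads first consumes the window, so a second scraper (or a manual fetch) sees different, misleading values.

```python
class ResetOnReadStat:
    """Hypothetical sketch of a statistic that resets its samples
    when read, mimicking the reset-on-read behavior described above."""

    def __init__(self):
        self.samples = []

    def record(self, value):
        self.samples.append(value)

    def scrape(self):
        """Return the max over the accumulated window, then reset."""
        result = max(self.samples) if self.samples else None
        self.samples = []  # the reset: the next reader starts empty
        return result


stat = ResetOnReadStat()
# One extreme outlier among otherwise low latencies.
for v in [1, 2, 3, 1000, 2, 1]:
    stat.record(v)

seen_by_scraper_a = stat.scrape()  # first scraper consumes the outlier
stat.record(2)
seen_by_scraper_b = stat.scrape()  # second scraper never sees it
```

Here scraper A reports a max of 1000 while scraper B reports 2 for what is conceptually the same time range, which is exactly the kind of inconsistency a high-availability prometheus pair would produce.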
The way this is supposed to work is documented in the prometheus documentation: counters, and the _count and _sum components of summaries and histograms, are meant to be cumulative over the lifetime of the process, resetting only when the process restarts. Any windowing or rate calculation is done by the prometheus server at query time (for example with rate()), not by resetting the statistics on scrape.
The third topic is histograms proper: counts of events stored in time/latency buckets. The document linked earlier about prometheus summaries and histograms explains that summaries can be created from histograms, but histograms cannot be created from summaries.
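For comparison, a real prometheus histogram is exposed as cumulative bucket counters with an le ("less than or equal") label, plus _sum and _count; an illustrative example with made-up names and values:

```
example_operation_latency_bucket{le="100"} 700
example_operation_latency_bucket{le="1000"} 720
example_operation_latency_bucket{le="10000"} 1000
example_operation_latency_bucket{le="+Inf"} 1000
example_operation_latency_sum 418500
example_operation_latency_count 1000
```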
This means that if we change the provided statistics to be actual histograms, we can still provide the same statistics we provide today, and additionally obtain time-based histograms. If the buckets are chosen well, we can spot multiple latency modes in the histograms and gain more understanding. This is useful for understanding IO latencies, for example, where an IO might be served from cache or escalate to a physical IO: with time buckets, both latencies can be observed.
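The "summary from histogram" direction can be demonstrated with a simplified sketch of what prometheus's histogram_quantile() does: estimate a quantile from cumulative buckets by linear interpolation within the bucket containing the target rank. The bucket values below are made up and model a bimodal IO latency distribution (a cached mode around 100 and a physical-IO mode in the 1000-10000 range); the bucket counts make both modes visible, while a single summary quantile would hide that structure.

```python
def quantile_from_buckets(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    Simplified sketch of prometheus's histogram_quantile():
    buckets is a sorted list of (upper_bound, cumulative_count)
    pairs, ending with an upper bound of float('inf').
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                # cannot interpolate into the +Inf bucket
                return prev_bound
            # linear interpolation within the matching bucket
            return prev_bound + (bound - prev_bound) * \
                (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count


# Made-up bimodal latency buckets: 700 fast (cached) events under 100,
# and most of the rest between 1000 and 10000 (physical IO).
buckets = [(100, 700), (1000, 720), (10000, 1000), (float('inf'), 1000)]
p50 = quantile_from_buckets(0.50, buckets)  # lands in the cached mode
p95 = quantile_from_buckets(0.95, buckets)  # lands in the physical-IO mode
```

The reverse derivation is impossible: given only p50 and p95 from a summary, there is no way to recover the bucket counts, which is the point made in the linked prometheus document.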