yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB] Define coarse_histograms as summaries and create actual histograms #12513

Open fritshoogland-yugabyte opened 2 years ago

fritshoogland-yugabyte commented 2 years ago

Jira Link: [DB-349]

Description

For observability, we provide different statistics. Two simple ones are counters and gauges; additionally, we generate a metric that we internally call a 'coarse_histogram', such as:

            {
                "name": "handler_latency_yb_tserver_RemoteBootstrapService_BeginRemoteBootstrapSession",
                "total_count": 0,
                "min": 0,
                "mean": 0.0,
                "percentile_75": 0,
                "percentile_95": 0,
                "percentile_99": 0,
                "percentile_99_9": 0,
                "percentile_99_99": 0,
                "max": 0,
                "total_sum": 0
            }

The purpose of this metric appears to be to expose the statistics in a form suitable for Prometheus to scrape; the Prometheus equivalent of the above metric is:

handler_latency_yb_tserver_RemoteBootstrapService_BeginRemoteBootstrapSession_sum{metric_id="yb.tabletserver",metric_type="server",exported_instance="yb-1.local:9000"} 0 1652528737628
handler_latency_yb_tserver_RemoteBootstrapService_BeginRemoteBootstrapSession_count{metric_id="yb.tabletserver",metric_type="server",exported_instance="yb-1.local:9000"} 0 1652528737628
handler_latency_yb_tserver_RemoteBootstrapService_BeginRemoteBootstrapSession{quantile="p50",metric_id="yb.tabletserver",metric_type="server",exported_instance="yb-1.local:9000"} 0 1652528737628
handler_latency_yb_tserver_RemoteBootstrapService_BeginRemoteBootstrapSession{quantile="p95",metric_id="yb.tabletserver",metric_type="server",exported_instance="yb-1.local:9000"} 0 1652528737628
handler_latency_yb_tserver_RemoteBootstrapService_BeginRemoteBootstrapSession{quantile="p99",metric_id="yb.tabletserver",metric_type="server",exported_instance="yb-1.local:9000"} 0 1652528737628
handler_latency_yb_tserver_RemoteBootstrapService_BeginRemoteBootstrapSession{quantile="mean",metric_id="yb.tabletserver",metric_type="server",exported_instance="yb-1.local:9000"} 0 1652528737628
handler_latency_yb_tserver_RemoteBootstrapService_BeginRemoteBootstrapSession{quantile="max",metric_id="yb.tabletserver",metric_type="server",exported_instance="yb-1.local:9000"} 0 1652528737628

The first thing to notice is that the 'min' figure is not exposed in the Prometheus format.

However, the first actual topic of this issue is that this metric is not a histogram, neither by the general definition nor by the Prometheus definition: it is what Prometheus calls a 'summary'. See: https://prometheus.io/docs/practices/histograms/

The second topic of this issue is that currently the statistics for (as far as I know) min, max, mean and the percentiles are reset on read of the metrics. This makes sense insofar as a single extreme outlier, whether a high or a low latency, does not pollute these statistics for the lifetime of the statistic. However, this is not how it is supposed to work. Resetting on read is non-standard behavior, it is easy to misunderstand, and it can lead to all sorts of other errors, such as wrong measurements in a setup with multiple Prometheus hosts (for example a high-availability setup), or when statistics are fetched manually.
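To illustrate the problem with reset-on-read, here is a toy model (not YugabyteDB code) of a summary whose statistics are computed over, and cleared by, each read. With two independent scrapers, whichever one reads first consumes the data, and the other sees nothing:

```python
class ResetOnReadSummary:
    """Toy model of a summary whose statistics are reset on read."""

    def __init__(self):
        self._samples = []

    def observe(self, value):
        self._samples.append(value)

    def read_max(self):
        """Return the max over samples since the last read, then reset."""
        result = max(self._samples) if self._samples else 0
        self._samples.clear()
        return result


s = ResetOnReadSummary()
for latency_ms in [5, 7, 900, 3]:   # one 900 ms outlier
    s.observe(latency_ms)

first_scraper = s.read_max()    # sees the outlier: 900
second_scraper = s.read_max()   # sees nothing: the first read reset the data
```

This is exactly the multi-scraper scenario described above: the two scrapers report contradictory values for the same time range, and a manual fetch of the endpoint silently disturbs whatever monitoring system scrapes next.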

The way this is supposed to be working is documented in the prometheus documentation:

"Pick desired φ-quantiles and sliding window. Other φ-quantiles and sliding windows cannot be calculated later."

This means that in order to get accurate, representative quantiles and min/mean/max values, measurements should expire based on a sliding (time) window.
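A minimal sketch of the sliding-time-window approach the Prometheus documentation describes: only observations from the last `window_seconds` count toward the quantiles, and reads never reset anything. (The class and parameter names are hypothetical, for illustration only.)

```python
from collections import deque


class SlidingWindowQuantiles:
    """Sketch: quantiles over a sliding time window, non-destructive reads."""

    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self._samples = deque()          # (timestamp, value), oldest first

    def observe(self, value, now):
        self._samples.append((now, value))

    def _expire(self, now):
        # Drop observations that fell out of the window.
        while self._samples and self._samples[0][0] <= now - self.window:
            self._samples.popleft()

    def quantile(self, phi, now):
        """phi-quantile (e.g. 0.99) over the current window; None if empty."""
        self._expire(now)
        if not self._samples:
            return None
        values = sorted(v for _, v in self._samples)
        idx = min(int(phi * len(values)), len(values) - 1)
        return values[idx]


q = SlidingWindowQuantiles(window_seconds=600)
q.observe(900, now=0)                # one outlier at t=0
for t in range(1, 200):
    q.observe(5, now=t)              # steady 5 ms afterwards

p_max_early = q.quantile(1.0, now=100)   # outlier still inside the window
p_max_late = q.quantile(1.0, now=700)    # the t=0 outlier has expired
```

The outlier influences the quantiles only while it is inside the window, and it ages out on its own rather than being destroyed by whoever happens to scrape first.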

The third topic is histograms: counts of observations stored in time/latency buckets. The document linked earlier about Prometheus summaries and histograms explains that summaries can be derived from histograms, but histograms cannot be derived from summaries.

This means that if we change the provided statistics to be actual histograms, we can still provide the same statistics as we do today, and additionally can create time-based histograms. If the buckets are chosen well, we can spot multiple latency modes in the histograms and gain more understanding. This is useful for understanding IO latencies, for example, where an IO might be served from cache or escalate to a physical IO; with time buckets, both latencies can be observed.
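A sketch of a Prometheus-style cumulative histogram showing how a bimodal latency distribution becomes visible. The bucket bounds here are hypothetical, not actual YugabyteDB defaults:

```python
import bisect

# Hypothetical bucket upper bounds in ms (Prometheus 'le' labels).
BOUNDS = [0.1, 0.25, 0.5, 1, 2.5, 5, 10, float("inf")]


def bucket_counts(latencies_ms):
    """Cumulative per-bucket counts, as Prometheus exposes them."""
    counts = [0] * len(BOUNDS)
    for v in latencies_ms:
        # Find the first bound >= v and count the observation there.
        counts[bisect.bisect_left(BOUNDS, v)] += 1
    # Make cumulative: each 'le' bucket includes everything below it.
    for i in range(1, len(counts)):
        counts[i] += counts[i - 1]
    return counts


# Bimodal IO latency: cached reads ~0.2 ms, physical reads ~4 ms.
samples = [0.2] * 80 + [4.0] * 20
counts = bucket_counts(samples)
# counts[1] (le=0.25) already holds the 80 cached reads, and
# counts[5] (le=5) holds all 100; both modes show up as jumps.
```

A summary collapses this into a handful of quantiles, from which the two modes can no longer be recovered; the buckets preserve the shape of the distribution, and the existing quantiles can still be estimated from them.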

fritshoogland-yugabyte commented 2 years ago

Added caveat: there are customers who want to scrape the Prometheus endpoints with additional tools such as Datadog, on top of the Prometheus scraping done for Platform/YBAnywhere; each extra scraper then empties the statistics. In my opinion that is not how this should behave, and it can lead to huge confusion.