yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB] Stop exporting quantile metrics for auto-generated RPC metrics that are not used by YBA or YBM #19917

Open bmatican opened 11 months ago

bmatican commented 11 months ago

Jira Link: DB-8860

Description

Discussed internally. Right now, every auto-generated RPC metric is exported as a set of series like the following:

handler_latency_yb_tserver_PgClientService_ListLiveTabletServers_sum{metric_id="yb.tabletserver",metric_type="server",exported_instance="..."} 0 1695742840217
handler_latency_yb_tserver_PgClientService_ListLiveTabletServers_count{metric_id="yb.tabletserver",metric_type="server",exported_instance="..."} 0 1695742840217
handler_latency_yb_tserver_PgClientService_ListLiveTabletServers{quantile="p50",metric_id="yb.tabletserver",metric_type="server",exported_instance="..."} 0 1695742840217
handler_latency_yb_tserver_PgClientService_ListLiveTabletServers{quantile="p95",metric_id="yb.tabletserver",metric_type="server",exported_instance="..."} 0 1695742840217
handler_latency_yb_tserver_PgClientService_ListLiveTabletServers{quantile="p99",metric_id="yb.tabletserver",metric_type="server",exported_instance="..."} 0 1695742840217
handler_latency_yb_tserver_PgClientService_ListLiveTabletServers{quantile="mean",metric_id="yb.tabletserver",metric_type="server",exported_instance="..."} 0 1695742840217
handler_latency_yb_tserver_PgClientService_ListLiveTabletServers{quantile="max",metric_id="yb.tabletserver",metric_type="server",exported_instance="..."} 0 1695742840217
handler_latency_yb_tserver_PgClientService_ListLiveTabletServers{quantile="min",metric_id="yb.tabletserver",metric_type="server",exported_instance="..."} 0 1695742840217

However, for the vast majority of these RPCs, the quantile series (6 of the 8 series exported per metric) are not necessary for YBA or YBM. We should get rid of those, as they needlessly increase the total number of metrics each node exports. Currently, we would only need to retain them for the top-level YSQL/YCQL/YEDIS operations.

To keep the quantiles for the metrics we do want, it would be nice if we had a way to tag the relevant RPC methods. One interesting way could be a custom protobuf option (see an example in Kudu: https://github.com/apache/kudu/commit/cef7b10239a1cf860bfcb526d503b07503442a49). This would let us assume that only tagged RPC methods require quantiles, making it a very explicit dev choice that is cleanly documented in the .proto files themselves.
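As a rough illustration of the intended effect (not the actual DocDB implementation, which would live in the C++ metrics code), here is a minimal Python sketch that filters Prometheus exposition lines. It always keeps the _sum and _count series and keeps quantile-labelled series only for metrics in an allowlist. The allowlist stands in for "RPC methods tagged in the .proto file"; the names TAGGED_METRICS, keep_line, filter_metrics, and the tagged metric name are all hypothetical.

```python
import re

# Hypothetical allowlist standing in for RPC methods tagged via a proto option.
# The metric name is illustrative only, not the real set YBA/YBM needs.
TAGGED_METRICS = {"handler_latency_yb_example_ExampleService_TaggedMethod"}

# A sample line: metric name, optional {labels}, then whitespace and the value.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s'
)


def keep_line(line: str) -> bool:
    """Return True if this exposition line should still be exported."""
    m = LINE_RE.match(line)
    if m is None:
        return True  # not a sample line (comment or blank): leave untouched
    labels = m.group("labels") or ""
    has_quantile = re.search(r'(?:^|,)quantile="', labels) is not None
    if not has_quantile:
        return True  # _sum and _count carry no quantile label: always kept
    # Quantile series survive only for explicitly tagged metrics.
    return m.group("name") in TAGGED_METRICS


def filter_metrics(lines):
    """Drop quantile series for every metric that is not tagged."""
    return [line for line in lines if keep_line(line)]
```

Applied to the eight sample lines above for ListLiveTabletServers, this would keep only the _sum and _count lines, while a tagged metric would keep all eight.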

cc @es1024 @yusong-yan

Issue Type

kind/bug


mdbridge commented 11 months ago

Okay, @bogdan wanted to know how many time series we have from handler_latency metrics. I ran the following Prometheus query:

count by (quantile) ( label_replace( {saved_name =~ "handler_.*", exported_instance = "yb-dev-mlillibridge-core2-3000-tbl-30pct-8123044812968211792-n1"}, "quantile", "$1", "saved_name", ".*(sum|count)" ) )

This counts how many time series of this type there are for each quantile/sum/count. (The fancy label_replace part converts the sum and count parts into fake quantiles.)
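To make the label_replace trick concrete, here is a small Python sketch that emulates the query on an in-memory list of label sets. It is only an approximation of PromQL semantics (Python's re.fullmatch stands in for Prometheus's fully anchored RE2 match). The label name saved_name and the helper names are assumptions made for illustration.

```python
import re
from collections import Counter

NAME_LABEL = "saved_name"  # assumed label that holds the original metric name


def label_replace(labels, dst, replacement, src, regex):
    """Emulate PromQL label_replace for one series' label dict.

    If regex fully matches the value of src, set dst to the replacement
    (only "$1" is supported here); otherwise return the labels unchanged.
    """
    m = re.fullmatch(regex, labels.get(src, ""))
    if m is None:
        return labels
    new_labels = dict(labels)
    new_labels[dst] = replacement.replace("$1", m.group(1))
    return new_labels


def count_by_quantile(series):
    """Emulate: count by (quantile) (label_replace(..., ".*(sum|count)"))."""
    relabeled = (
        label_replace(s, "quantile", "$1", NAME_LABEL, ".*(sum|count)")
        for s in series
    )
    return Counter(s.get("quantile", "") for s in relabeled)


# One handler's eight series: _sum and _count have no quantile label,
# so label_replace gives them the fake quantiles "sum" and "count".
base = "handler_latency_yb_tserver_PgClientService_ListLiveTabletServers"
sample = [{NAME_LABEL: base + "_sum"}, {NAME_LABEL: base + "_count"}]
sample += [
    {NAME_LABEL: base, "quantile": q}
    for q in ("p50", "p95", "p99", "mean", "max", "min")
]
counts = count_by_quantile(sample)
```

For this single handler, each of the eight groups (sum, count, and the six quantiles) ends up with a count of one. On a real node, each group's count would be the number of handler metrics exporting that series.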

Running this against a 3000-tablet box after some extensive sysbench stress tests and summing over all groups (just take out the "by (quantile)" part) gives 3,352 time series.
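For a back-of-the-envelope sense of what dropping the quantiles could save on this node, the sketch below assumes every handler metric exports exactly the eight series shown in the issue description (_sum, _count, and six quantiles). That assumption, and the helper name estimate_savings, are mine rather than something measured here.

```python
SERIES_PER_HANDLER = 8     # _sum, _count, plus six quantile-labelled series
QUANTILES_PER_HANDLER = 6  # p50, p95, p99, mean, max, min
TOTAL_HANDLER_SERIES = 3352  # figure reported for this node


def estimate_savings(total_series, kept_handlers=0):
    """Return (handlers, dropped_series, remaining_series).

    kept_handlers is the number of handler metrics that keep their
    quantiles (for example, the top-level YSQL/YCQL/YEDIS operations).
    """
    handlers, remainder = divmod(total_series, SERIES_PER_HANDLER)
    if remainder:
        raise ValueError("total is not a multiple of the per-handler count")
    if not 0 <= kept_handlers <= handlers:
        raise ValueError("kept_handlers out of range")
    dropped = (handlers - kept_handlers) * QUANTILES_PER_HANDLER
    return handlers, dropped, total_series - dropped


handlers, dropped, remaining = estimate_savings(TOTAL_HANDLER_SERIES)
```

Under that assumption, 3,352 series correspond to 419 handler metrics, and dropping all six quantiles from every one of them would remove at most 2,514 series, leaving 838. The real saving would be somewhat smaller, since some handlers would keep their quantiles.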

count ( {saved_name =~ "(proxy|service)_(request|response)_.*", exported_instance = "yb-dev-mlillibridge-core2-3000-tbl-30pct-8123044812968211792-n1"} ) gives 1,538 time series on this box...