zhmcclient / zhmc-prometheus-exporter

A Prometheus exporter for the IBM Z HMC
Apache License 2.0

Support for additional labels at metric level #224

Closed andy-maier closed 2 years ago

andy-maier commented 2 years ago

Right now, the "linecord-XXX-name" metric is commented out in the sample metric definition file, leading to warnings such as:

.../zhmc_prometheus_exporter/zhmc_prometheus_exporter.py:727: UserWarning: Skipping metric 'linecord-one-name' of metric group 'environmental-power-status' returned by the HMC that is not defined in the 'metrics' section of metric definition file /opt/zhmc-prometheus-exporter/metrics.yaml

The reason this metric is commented out is that its value is an identifier and not an actual metric. It should actually be set as a label on some of the other line cord metrics. However, the ability to add labels only exists at the metric group level.

This item is to add support for labels at the metric level.

This could be used not only for the linecord name metrics, but also for representing other data as labels, such as the status as a string on the has-acceptable-status metric.
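To make the idea concrete, here is a minimal sketch of how an identifier-style metric could be turned into a label on a sibling metric. It is not the exporter's implementation: the function name `attach_metric_labels`, the shape of `label_sources`, the metric name `linecord-one-power` and the label name `linecord` are all assumptions made for illustration. Only `linecord-one-name` comes from the warning above.

```python
def attach_metric_labels(values, label_sources, group_labels):
    """Turn raw metric values into (name, labels, value) samples.

    values:        raw metric name -> value for one resource
    label_sources: numeric metric name -> {label name: identifier metric name}
                   (hypothetical config shape, not the real metrics.yaml format)
    group_labels:  labels that already apply at the metric group level
    """
    samples = []
    for name, label_map in label_sources.items():
        if name not in values:
            continue  # the HMC did not return this metric
        labels = dict(group_labels)  # start from the group-level labels
        for label_name, source_metric in label_map.items():
            if source_metric in values:
                # The identifier becomes a label value instead of a metric.
                labels[label_name] = str(values[source_metric])
        samples.append((name, labels, float(values[name])))
    return samples


# Hypothetical input; 'linecord-one-power' and the values are made up.
raw = {"linecord-one-name": "A0-P1", "linecord-one-power": 1250}
config = {"linecord-one-power": {"linecord": "linecord-one-name"}}
samples = attach_metric_labels(raw, config, {"cpc": "CPCA", "hmc": "XYZ"})
```

With a mapping like this, the name metric never has to be exported as a value of its own, so it would not need to be commented out of the metric definition file.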

andy-maier commented 2 years ago

Proposal for representing a mapped value on a metric as an additional label/tag:

Examples:

# HELP zhmc_cpc_status_int Status as integer (0=active/=operating, 1=degraded, 2=service-required, 10=service, 11=exceptions, 12=not-communicating, 13=status-check, 14=not-operating, 15=no-power, 99=unsupported value)
# TYPE zhmc_cpc_status_int gauge
zhmc_cpc_status_int{cpc="CPCA",hmc="XYZ",value="service-required"} 2.0
-> interpreted as string "service-required"

# HELP zhmc_cpc_has_unacceptable_status Boolean indicating whether the CPC has an unacceptable status (0=false, 1=true)
# TYPE zhmc_cpc_has_unacceptable_status gauge
zhmc_cpc_has_unacceptable_status{cpc="CPCA",hmc="XYZ",type="bool"} 1.0
-> interpreted as boolean True

# HELP zhmc_partition_ifl_processor_count Number of IFL processors allocated to the active partition
# TYPE zhmc_partition_ifl_processor_count gauge
zhmc_partition_ifl_processor_count{cpc="CPCA",hmc="XYZ",partition="part1",type="int"} 4.0
-> interpreted as integer 4

This proposal allows implementing generic support for interpreting the mapped values without special-casing each metric.
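A consumer of these samples could apply the "value" and "type" labels generically, roughly as in the sketch below. The function name `interpret_sample` and the precedence of "value" over "type" are assumptions for illustration, not part of the proposal text.

```python
def interpret_sample(value, labels):
    """Interpret a Prometheus sample using the proposed 'value'/'type' labels.

    value:  the floating point sample value
    labels: the sample's labels as a dict
    """
    if "value" in labels:
        # The mapped string is carried in the label; the float is only its code.
        return labels["value"]
    kind = labels.get("type")
    if kind == "bool":
        return value != 0.0
    if kind == "int":
        return int(value)
    # No hint given: keep the plain floating point value.
    return value


status = interpret_sample(
    2.0, {"cpc": "CPCA", "hmc": "XYZ", "value": "service-required"})
unacceptable = interpret_sample(
    1.0, {"cpc": "CPCA", "hmc": "XYZ", "type": "bool"})
ifl_count = interpret_sample(
    4.0, {"cpc": "CPCA", "hmc": "XYZ", "partition": "part1", "type": "int"})
```

The same function handles all three examples, with no per-metric branches.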

I am not in favor of adding labels to metric values when the label actually represents an additional, separate metric value. That makes it harder to interpret the result as two values, and requires special-casing the metric.

We could expand the proposal in the future to use "value" and "type" together to represent more complex values, such as lists of strings (e.g. for the "acceptable-status" property). In that case, the Prometheus floating point value could be a dummy value if the actual value cannot be represented as a single floating point number. This could be done with type="json" and value being the JSON representation, but that introduces a lot of quote escaping. Maybe we just use type="list" and value="a,b,c" or similar, where the strings must not contain commas or double quotes. Anyway, that's for a future expansion of the proposal.
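As a rough illustration of the type="list" variant, the encoding and decoding could look like the sketch below. The helper names `encode_list` and `decode_list` are hypothetical, and rejecting empty strings is an extra assumption added here so that an empty list stays unambiguous.

```python
def encode_list(items):
    """Encode a list of strings as the proposed 'type'/'value' labels."""
    for item in items:
        # Commas and double quotes are excluded so that no escaping is needed.
        if "," in item or '"' in item:
            raise ValueError(f"item must not contain ',' or '\"': {item!r}")
        if item == "":
            # Assumption: disallow empty items, otherwise [""] and [] collide.
            raise ValueError("empty strings are not supported")
    return {"type": "list", "value": ",".join(items)}


def decode_list(labels):
    """Recover the list of strings from the 'type'/'value' labels."""
    if labels.get("type") != "list":
        raise ValueError("labels do not describe a list value")
    text = labels["value"]
    return text.split(",") if text else []


labels = encode_list(["active", "operating"])
roundtrip = decode_list(labels)
```

The floating point sample value would still have to be filled with a dummy, as described above; the sketch only covers the label part.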

xiqing-zhang commented 2 years ago

Hi Andy,

Per the Prometheus data model, the suffix of the metric name describes the unit, e.g. http_request_duration_seconds, node_memory_usage_bytes, or http_requests_total (for a unit-less accumulating count).

In my opinion, a metric is usually shown in a panel or dashboard, where labels can be used to group the metric data. If a label value is JSON or a list, it is hard to understand its content at a glance for grouping purposes.