weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0
6.62k stars 670 forks source link

Improve and extend Prometheus metrics #2557

Open brb opened 8 years ago

brb commented 8 years ago

2547 has introduced Prometheus metrics for weave-net. Some metrics can be improved though. Suggestions include in no particular order:

  1. weave_{packet,bytes}_total: remove the "flow" label.
  2. weave_{packet,bytes}_total: add the "overlay" label.
  3. weave_{packet,bytes}_total{overlay="fastdp"}: restore and change calculation to flushedFlowsTotal + sum(activeFlows) as flows get flushed periodically . flushedFlowsTotal should be updated before flushing flows.
  4. Include residual metrics https://github.com/weaveworks/weave/issues/2535#issuecomment-255731008 as appropriate
  5. Write tests for metrics (?)
  6. Distribute the metrics code (prog/weaver/metrics.go) across parts in the code which metrics are instrumenting (same as logging). Keep only those parts from prog/weave/metrics.go which are necessary for actual serving of metrics to Prometheus. The change helps with debugging. Keep in mind, that it changes the internal metrics collection model from pull to push.
  7. Initialize labeled metrics if all label values are known in advance. That prevents from subtle bugs, e.g. dividing two metrics when one value is missing.
  8. When exposing a metric with bunch of labels, create a static counter with all labels and collect the metric with only a reference label. When querying, you can join two metrics in order to access the labels.
  9. Split up metrics containing a "total" label into two separate metrics, as it simplifies querying and prevents from matching mistakes. E.g. weave_dns_entries{state="local"} / ignoring(state) weave_dns_entries{state="total"} vs weave_local_dns_entries / weave_dns_entries.
juliusv commented 7 years ago

Additionally, weave_connection_termination_count should be weave_connection_terminations_total.

brb commented 7 years ago

Additionally, weave_connection_termination_count should be weave_connection_terminations_total.

This is addressed by https://github.com/weaveworks/weave/pull/2568

frittentheke commented 6 years ago

@brb @bboreham I am unsure whether I am supposed to open a new issue. This one is quite old and seems to be "resolved" mostly?

Anyways: At KubeCon 2018 in Kopenhagen we spoke briefly about the automatic cleanup (rmpeer) of peers when running inside Kubernetes on nodes that are part of an AWS ASG. It would be very nice if you could add a metric / count of the number of peers which were removed due to this mechanism to actively monitor this. @bboreham you suggested to simply raise an issue to request this feature.

bboreham commented 6 years ago

@frittentheke generally it's best to open a new issue; it's easier to deal with accidental duplicates than the other way round.

Since this particular issue is a random set of "stuff" I think that's an even better reason to open a new one.