zalando / skipper

An HTTP router and reverse proxy for service composition, including use cases like Kubernetes Ingress
https://opensource.zalando.com/skipper/

metrics route-ID string should be labels for PromQL querying #1779

Open universam1 opened 3 years ago

universam1 commented 3 years ago

Is your feature request related to a problem? Please describe. Prometheus metrics contain one central route label that becomes a single concatenated string of several independent fields, such as the examples shown below.

This implementation is problematic as it stands.

Describe alternatives you've considered (optional) The above properties should become distinct Prometheus labels; the current routeID should be used internally only.

RouteID could become a struct with a Stringer() method to render the current concatenated string where it is still needed internally, while providing a map of individual labels for Prometheus.

Idea:
current: {route="kube_appsvcs__docs__techdocs_myservice_org_____appdocs"}
new: {namespace="appsvcs", ingress="docs", host="techdocs.myservice.org", path="/", backend="appdocs"}
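
A minimal sketch of that idea. The type, package name, and the simplified String() format are hypothetical; Skipper's actual route ID generation escapes characters differently and is an implementation detail (see the references later in this thread):

package routeid // hypothetical package name for this sketch

import (
    "fmt"
    "strings"
)

// RouteID holds the parts that are currently concatenated into one route string.
type RouteID struct {
    Namespace, Ingress, Host, Path, Backend string
}

// String renders an approximation of the current concatenated form for internal
// use only; the real generation differs per data client.
func (r RouteID) String() string {
    esc := func(s string) string {
        return strings.NewReplacer(".", "_", "/", "_").Replace(s)
    }
    return fmt.Sprintf("kube_%s__%s__%s__%s__%s",
        esc(r.Namespace), esc(r.Ingress), esc(r.Host), esc(r.Path), esc(r.Backend))
}

// Labels exposes the individual fields as Prometheus labels.
func (r RouteID) Labels() map[string]string {
    return map[string]string{
        "namespace": r.Namespace,
        "ingress":   r.Ingress,
        "host":      r.Host,
        "path":      r.Path,
        "backend":   r.Backend,
    }
}

With something along these lines, the metrics collector could register namespace, ingress, host, path, and backend as separate label names instead of the single route label.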

Additional context (optional) Examples where developers struggle to understand their usage:

route="kube_appsvcs__docs__techdocs_myservice_org_____docs"
route="kube_rg__redmine__ldsc__all__0_0"

Note, other projects try to disassemble the route ID: https://github.com/fluxcd/flagger/blob/e6c740d917bdb3da5163bdef41b66ce8d20f31bf/pkg/metrics/observers/skipper.go#L30

Would you like to work on it? Possibly.

AlexanderYastrebov commented 3 years ago

Hi, thank you for the feature request, I'll add a couple of notes.

The ingress properties (namespace, name, path, etc.) used to construct the route ID are not preserved up to the point where metrics are collected. Metrics collection tries to fit two implementations (Coda Hale and Prometheus) under the same interface and therefore has no labels concept, which would complicate the implementation of this feature.

Note, other projects try to disassemble

This is a slippery slope and will most probably break down the line, as route ID generation is an implementation detail and is not even public in the library.

Route ID generation implementation:
https://github.com/zalando/skipper/blob/960ef409f56d38b13c2af2fb44f95751cc783d1e/dataclients/kubernetes/ingress.go#L86-L101
https://github.com/zalando/skipper/blob/960ef409f56d38b13c2af2fb44f95751cc783d1e/dataclients/kubernetes/eastwest.go#L10-L12
https://github.com/zalando/skipper/blob/960ef409f56d38b13c2af2fb44f95751cc783d1e/dataclients/kubernetes/routegroup.go#L123-L146

universam1 commented 3 years ago

Thank you for the feedback and the references!

This is a slippery road and most probably will break down the line

Fully agree - that is essentially the core problem described in this issue: there is currently neither a way via standard patterns such as Prometheus labels nor a published API to address this need. My preference would certainly be to extend the labels, following the canonical concept.

attempts to fit two implementations (coda hale and prometheus) under the same interface and thus lacks labels concept which would complicate the feature implementation

Looking through the sources, one idea could be to replace the routeId string with an interface like this:

type RouteIDer interface {
    // implements the current signatures from rgRouteID, eastWestRouteID, routeID
    String() string
    // native field exporters
    PrometheusLabels() []string
    CodaHaleLabels() xyz // return type left open in this proposal
}

Route groups, east-west, and ingress routes can implement this via a struct fulfilling the above interface. It looks like the labels could then be extended at this point, for instance: https://github.com/zalando/skipper/blob/master/metrics/prometheus.go#L421

szuecs commented 3 years ago

@universam1, first of all I like the idea. The route label was meant just for operations, so they can figure things out when they need to, without creating a metrics explosion in cardinality. I would expect a bit of math on how much the mentioned dimensions (labels) will cost in memory. We should think about it with different patterns of input. One of ours is a cluster with 40k ingresses (created by 400 fabric gateways, each of which on average creates 100 ingresses; these are internal requirements of some kind), and I would like to keep that level of scalability.
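
As a back-of-envelope sketch of that math (all numbers below are hypothetical assumptions, not measurements): decomposing the single route label into several labels derived from the same route does not by itself multiply the number of series, since the number of distinct label combinations stays one per route; it mainly grows the per-series label payload.

package main

import "fmt"

func main() {
    // Hypothetical inputs for the 40k-ingress cluster mentioned above.
    routes := 40000        // distinct routes, i.e. distinct label combinations
    histogramBuckets := 15 // assumed latency histogram buckets per route metric
    codeClasses := 5       // assumed response code classes (1xx..5xx)
    bytesPerSeries := 4096 // rough assumed per-series memory cost in Prometheus

    // Splitting route into namespace/ingress/host/path/backend keeps one label
    // combination per route, so the series count is unchanged; only the label
    // payload attached to each series grows.
    series := routes * histogramBuckets * codeClasses
    fmt.Printf("~%d series, ~%d MiB\n", series, series*bytesPerSeries/(1<<20))
}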

aryszka commented 3 years ago

I would approach this slightly differently regarding the implementation details:

Ideally, we will be able to drop the mandatory and unique route ID from the eskip.Route object (and therefore also from the eskip format) in the future. To pass in values for monitoring, I would rather create symbolic filters, which can be used when converting the ingress objects, but also in free-form configuration. On the other hand, we can add new methods to the routing.Route object, which, in contrast to eskip.Route, is meant to be more of an internal "runtime" representation of the route. The PrometheusLabels() (or simply Labels()) idea is nice and fits the routing.Route type.
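
Purely as an illustration of that last point (a hypothetical shape, not Skipper's actual routing package), the runtime route could carry labels populated during preprocessing, e.g. by such a symbolic filter:

package routing // hypothetical stand-in, not Skipper's real routing package

// Route sketches an internal "runtime" representation of a route that carries
// monitoring labels collected while converting ingress objects or set by a
// symbolic filter in free-form configuration.
type Route struct {
    // ... other runtime fields elided ...
    labels map[string]string
}

// Labels returns the monitoring labels attached to this route; the metrics
// collector could pass them to Prometheus instead of the concatenated route ID.
func (r *Route) Labels() map[string]string {
    return r.labels
}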

AlexanderYastrebov commented 3 years ago

I would rather create symbolic filters, which can be used during converting the ingress objects, but also in free-form configuration.

I think (not 100% sure though) Prometheus requires the set of labels to be known upfront, so we may come up with a config key like -prometheus-route-labels=namespace,service,whitelabel and a "marker" filter like metricsLabels("namespace=whatever", "service=nginx", "whitelabel=mycustomlabelforthisroute"). The ingress/routegroup processor may inject them unconditionally or based on some annotation. Also, not all routes come from k8s objects, so some labels may not be applicable and would, I guess, have empty values.
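
For illustration, with client_golang the label names do indeed have to be declared when the metric vector is created, which fits the idea of a fixed, configured label set. The package, flag, metric name, and function below are hypothetical sketches, not Skipper's actual metrics code:

package metrics // hypothetical sketch package

import "github.com/prometheus/client_golang/prometheus"

// The label names would come from the configured set, e.g. the hypothetical
// -prometheus-route-labels=namespace,service,whitelabel flag proposed above.
var routeRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "skipper_route_requests_total", // hypothetical metric name
        Help: "Requests per route, split by the configured route labels.",
    },
    []string{"namespace", "service", "whitelabel"},
)

func countRequest(namespace, service, whitelabel string) {
    // Routes that do not originate from Kubernetes objects simply pass ""
    // for labels that do not apply to them.
    routeRequests.WithLabelValues(namespace, service, whitelabel).Inc()
}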

I also saw a case recently where users would like to get metrics for a specific stack version created by https://github.com/zalando-incubator/stackset-controller

I would expect a bit of math how much the mentioned dimensions (labels) will cost in memory.

Looking at the subject (success rate and latency for a backend service), it may turn out that the absolute minimum set of labels is namespace and service (or endpoint), i.e. the ingress name is not needed.

universam1 commented 3 years ago

Prometheus requires a set of labels to be known upfront

That's my conclusion after trying a couple of ideas for making the labels dynamic; as you stated, the common denominator of the route sources is rather small. Their early initialization makes supporting variadic labels hard, as far as I can tell.

absolute minimum set of labels is namespace and service (or endpoint)

We'd be highly interested in path, or maybe its matcher. Think of a microservice architecture where we need to measure the individual services mounted at different paths, bringing light into a monolithic black box.

we can add new methods to the routing.Route objects, which in contrary to the eskip.Route

Great detail, thanks. I'm trying quite a bit to wrap my head around the abstractions and internal concepts; I'd love to contribute here on this issue, but I guess I'm struggling to grasp where and how the generators are used... 🤔

aryszka commented 3 years ago

Indeed, I hadn't thought of the fixed label keys of Prometheus. We may need to do something 'ugly-ish', like:

I am a proponent of the first option.

universam1 commented 3 years ago

It would be great if Skipper could be added to the list of supported ingresses for the SLO operator:

https://github.com/slok/go-http-metrics

szuecs commented 3 years ago

I am not sure I understand what you mean, @universam1. The linked project does not look really useful for Skipper; it looks more like a project to expose metrics of your backend application.

I don't understand what the mentioned SLO operator is. Seems like another project, right?

universam1 commented 3 years ago

Pardon for not providing enough context. We are leveraging the SLO operator Sloth (https://github.com/slok/sloth), where we define the SLIs mostly on Skipper route metrics. As initially stated, these are currently challenging for developers to put together.

I linked the plugin SDK of this operator which, once this issue is implemented, would enable creating a plugin tailored to Skipper, as has been done for other ingress projects.

Basically just a teaser or reason for this to happen 😁

szuecs commented 3 years ago

Sounds great to me!

Btw. path metrics: for the path metrics, it might also make sense to have a config option to enable them, like @jonathanbeber did for other metrics, to reduce the number of keys. We had issues with Prometheus before.