theRealWardo opened this issue 6 years ago
what does zalando do for postgres monitoring with any databases run via this operator?
I was thinking of building https://github.com/wrouesnel/postgres_exporter into the database container and having that be monitored via our prometheus operator.
are there any existing plans to add monitoring directly into this project in some way? if not, is there a need for a more detailed discussion/approach prior to contribution, or shall I do as the contribution guidelines say and just hack away and send a PR?
Quick answer is no, there is no intent to make the operator "monitor" anything. Ideally the operator focuses on "operation" and more specifically on the provisioning and modifying part. The "ops" part we largely leave to Patroni which is very well suited for taking care of the cluster itself.
The operator however does contain a very slim API to allow monitoring it from the outside.
At Zalando we use ZMON (zmon.io) for all monitoring. But there are other options here, like Prometheus.
We are running Postgres with the bg_mon extension, which exposes a lot of Postgres data via a REST API on port 8080, so this helps a lot I think.
thanks for the quick reply! to be clear I'm not proposing monitoring the operator itself but rather the database it is operating on. if there is something in the operator that you monitor and feel others should monitor please do let me know! otherwise our system will probably just be monitoring that the pod is up and running.
what I'd like to add to this operator to facilitate that is a flag that would add a simple named monitoring port on the ServiceSpec. that would enable me to have a ServiceMonitor (custom resource) which my Prometheus operator would then be able to turn into scrape targets for my Prometheus instance. does that sound reasonable?
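roughly, the ServiceMonitor I have in mind would look something like this (just a sketch: the monitoring port name is the one this proposal would add, and I'm assuming the operator-created services keep the application: spilo label):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: postgres-clusters
  labels:
    release: prometheus          # whatever label your Prometheus uses to pick up ServiceMonitors
spec:
  selector:
    matchLabels:
      application: spilo         # services created by the operator carry the cluster labels
  namespaceSelector:
    any: true
  endpoints:
    - port: monitoring           # the named port this proposal would add to the ServiceSpec
      interval: 30s
```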
I forgot one tool here, just teasing it as we have not released it yet, but teams rely on our new pgview web interface to monitor their DBs too and it has proven very useful.
for that kind of web dashboard thing we've been running https://github.com/ankane/pghero which has definitely helped us a couple times but it doesn't hook into our alerting systems which is what I'm really trying to achieve here.
Operator monitoring: We have not figured this out completely. One part here is definitely user experience: making sure the operator is quick to provision new clusters and to apply changes triggered by the user. Other than that we more or less monitor that the pod is running, which is not that helpful or informative.
Database monitoring: We don't consider this a task of the operator, and our operator is not required once the database is "deployed": Patroni does all the magic for high availability and failover, which makes the operator itself much smaller in scope and much less important.
To monitor clusters, as said above, both Postgres and Patroni have REST APIs that are easy to query.
I adapted the operator to deploy the postgres exporter as a sidecar container (instead of running it inside the Spilo container). With this we can get metrics into Prometheus. So the operator is not monitoring anything, it just helps with the deployment. What do you guys think?
We had the discussion once about arbitrary sidecar definition support, but shelved it until the need arises. Feel free to PR this or frame it in an issue, as this could become anything from simple to very generic.
Maybe we can also go for a "prometheus" sidecar, similarly static as the Scalyr sidecar. Can you dump your sidecar definition here so we can have a look?
I am closing this.
The sidecar feature, which we currently use for Scalyr only in a hard-coded way, may see some improvements and become more generic, and then also serve the purpose of adding e.g. the postgres exporter as a sidecar via the operator.
how about we keep this open and I send you a PR? I'll try to get you one this week which will add a monitoring sidecar option if you are okay with that.
Sure, PRs or idea sketches are very welcome. Maybe you can outline your idea briefly, as we have some ongoing discussions internally on what sidecars should look like: from toggled, hard-coded examples like Scalyr today to a very generic approach.
@Jan-M would be great to see that discussion here in the Open Source project, so others can comment/join.
sure! so if I were to bring up the most important things for adding monitoring to this project:
I think we should start by focusing on 2 common use cases, documenting them, and changing the project's current language of "Monitoring of clusters is not in scope, for this good tools already exist from ZMON to Prometheus and more Postgres specific options."

a bit more technical details of what I am proposing for monitoring sidecars specifically:

- setting monitoring_docker_image to whatever image should be run as a sidecar should just work, assuming:
  - the sidecar gets POSTGRES_USER, POSTGRES_PASSWORD (and it obviously is configured to use them correctly)
  - POSTGRES_USER is granted the correct permissions

going to sketch some code and share it shortly to get a bit more specific and hopefully keep the discussion going. thoughts here though?
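as a first rough illustration of the monitoring_docker_image idea, the operator ConfigMap could grow a single key along these lines (purely hypothetical, nothing like this option exists in the operator today):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
data:
  # hypothetical option: image to run as a monitoring sidecar in every cluster pod
  monitoring_docker_image: "wrouesnel/postgres_exporter:v0.4.7"
```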
Just a very quick remark: imho monitoring is still not in scope of the operator, even though sidecars should be supported and are a good idea.
For me the essence is that the operator should itself not start to "monitor" metrics or become e.g. a metric gateway/proxy.
Hi @theRealWardo,
I had some similar thoughts along the lines of supporting any sidecar, not necessarily monitoring ones (for instance, ours is doing log exporting, and others may do something like regular manual vacuuming, index rebuilds or backups, or even run 3rd-party applications that, e.g., export the data somewhere else). Most of them, in general, need access to the PGDATA/logs, and many also need access to the database itself.
The set of parameters you came up with looks good to me. We could also pass the name of a role that should be defined among the infrastructure roles, and the operator would perform the job of passing the role name and the password from there to the cluster. However, in some cases it might be necessary to connect as a superuser, whose password is per-cluster.
Another idea is to expose the unix socket inside the volume mount of github.com/zalando/spilo, so that other containers running in the same pod can connect over the unix socket as user postgres without a password.
In order to fully support this, we would also need something along the lines of pod_environment_configmap (custom environment variables injected into every pod) to be propagated to the sidecar, and also a similar option for passing a global secret object (as in many cases values like external API keys cannot be trusted to mere configmaps), exposing its secrets to each container as environment variables.
I am not sure about the labels. It is not possible to apply labels to individual containers within the pod; what we could do is apply a sidecar label with the name of the sidecar. However, it looks redundant to me, since one can always instruct monitoring to look for pods with the set of cluster_labels configured in the operator.
I'll look into your PR and will also do the global secrets when I have time.
so I modified my PR to add generic sidecar support. it allows users to add as many sidecars as they like to each of the pods running their clusters. this is sufficient to meet our use cases, and could be used by your team in place of the current Scalyr specific stuff.
we are going to try and run 2 sidecar containers actually. we'll be running one that does log shipping via Filebeat and another that does monitoring via Postgres Exporter.
hopefully this PR will enable other interesting uses too.
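to give a concrete picture, a cluster manifest with those two sidecars could look roughly like this (a sketch only: the images are placeholders rather than our exact setup, and it assumes the operator injects POSTGRES_USER / POSTGRES_PASSWORD into sidecars so the $(...) references resolve):

```yaml
spec:
  sidecars:
    # log shipping; assumes the Filebeat config and the Postgres log path are baked into the image
    - name: filebeat
      image: docker.elastic.co/beats/filebeat:6.6.1
    # monitoring via postgres_exporter, connecting to the local Postgres over loopback
    - name: postgres-exporter
      image: wrouesnel/postgres_exporter:v0.4.7
      env:
        - name: DATA_SOURCE_NAME
          value: "postgresql://$(POSTGRES_USER):$(POSTGRES_PASSWORD)@127.0.0.1:5432/postgres?sslmode=disable"
      ports:
        - name: exporter
          containerPort: 9187
          protocol: TCP
```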
@theRealWardo how are you passing in the env vars to Postgres Exporter like DATA_SOURCE_NAME, since the ones available from the postgres operator are different (i.e. POSTGRES_*)? Or do you create another container based on the available postgres exporter image for inclusion as a sidecar?
right @pitabwire - we use a sidecar, 2 of them actually. one that ships logs and one that does monitoring.
@theRealWardo could you guide me on this? I tried to pass in the environment variables, but for some reason they are not being picked up in the postgres exporter container, and I get the error below:
```
kubectl logs -n datastore -f tester-events-cluster-0 pg-exporter
time="2019-03-07T07:13:56Z" level=info msg="Established new database connection." source="postgres_exporter.go:1035"
time="2019-03-07T07:13:56Z" level=info msg="Error while closing non-pinging DB connection: <nil>" source="postgres_exporter.go:1041"
time="2019-03-07T07:13:56Z" level=info msg="Error opening connection to database (postgresql://:PASSWORD_REMOVED@127.0.0.1:5432/postgres?sslmode=disable): pq: Could not detect default username. Please provide one explicitly" source="postgres_exporter.go:1070"
time="2019-03-07T07:13:56Z" level=info msg="Starting Server: :9187" source="postgres_exporter.go:1178"
```
my Dockerfile is shown below:
```dockerfile
FROM ubuntu:18.04 as builder

ENV PG_EXPORTER_VERSION=v0.4.7
RUN apt-get update && apt-get install -y curl \
    && curl -sL https://github.com/wrouesnel/postgres_exporter/releases/download/${PG_EXPORTER_VERSION}/postgres_exporter_${PG_EXPORTER_VERSION}_linux-amd64.tar.gz \
    | tar -xz

FROM scratch

ENV PG_EXPORTER_VERSION=v0.4.7
ENV POSTGRES_USER=""
ENV POSTGRES_PASSWORD=""
ENV DATA_SOURCE_NAME="postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@127.0.0.1:5432/postgres?sslmode=disable"

COPY --from=builder /postgres_exporter_${PG_EXPORTER_VERSION}_linux-amd64/postgres_exporter /postgres_exporter

EXPOSE 9187

ENTRYPOINT [ "/postgres_exporter" ]
```
I'm using a sidecar to run postgres_exporter. The config looks like this:
apiVersion: "acid.zalan.do/v1"
kind: postgresql
spec:
...
sidecars:
- name: "prometheus-postgres-exporter"
image: "wrouesnel/postgres_exporter:v0.4.7"
env:
- name: "PG_EXPORTER_EXTEND_QUERY_PATH"
value: "/etc/config.yaml"
- name: "DATA_SOURCE_NAME"
value: "postgresql://postgres_exporter:password@localhost:5432/postgres?sslmode=disable"
ports:
- name: http
containerPort: 9187
protocol: TCP
...
Unfortunately, the endpoints don't expose the sidecar's port (9187 in this case)
@tritruong the challenge with doing it this way is that you have to do it for every cluster definition. I would like to do it globally and in an automated way, so that any new cluster definitions are automatically picked up by the Prometheus monitoring and alerting system.
And don't put the password into env vars like this.
I am in general in favor of having a global generic sidecar definition for whatever you need.
For monitoring though, or other tooling, the K8s API gives you a nice way to discover the services and clusters you want to monitor, and one exporter or tool per cluster may not be the best idea anymore. But this arguably depends.
@Jan-M Yes, I could use a mounted secret file. Is there any way to disable the default environment variables that are always passed to sidecars (POSTGRES_USER and POSTGRES_PASSWORD)? https://github.com/zalando/postgres-operator/blob/31e568157b336592debbb37f2c44c1ca1769c00d/docs/user.md#sidecar-support
@tritruong Maybe using a trust configuration with role-mapping in pg_hba.conf could grant the exporter sidecar just the required read-only access, potentially even without password-based authentication?
And yes @Jan-M, I believe @tritruong does have a point. Giving every little sidecar containing just a piece of monitoring software full on admin rights to the database might not be desired :-)
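For illustration, with the postgresql CRD's patroni.pg_hba option that could look something like this (a sketch: postgres_exporter is a hypothetical read-only monitoring role, and since this list replaces the generated pg_hba.conf, the usual default entries have to be repeated as well):

```yaml
spec:
  patroni:
    pg_hba:
      # let the exporter sidecar connect over loopback without a password
      - host all postgres_exporter 127.0.0.1/32 trust
      # ...plus the remaining default hostssl/replication entries from the
      # generated pg_hba.conf, otherwise regular access and replication break
      - hostssl all all all md5
      - hostssl replication standby all md5
```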
Unfortunately, the endpoints don't expose the sidecar's port (9187 in this case)
@tritruong I created a separate service for the exporter to work around that fact.
If anyone is interested in monitoring Patroni itself, I've written a patroni-exporter for Prometheus that scrapes the Patroni API. Someone might find it useful :) https://github.com/Showmax/patroni-exporter
Here is a complete example we use internally to enable the Prometheus exporter:
```yaml
---
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: postgres
spec:
  teamId: "myteam"
  numberOfInstances: 1
  enableMasterLoadBalancer: false
  volume:
    size: 200Mi
  users:
    user_database: ["superuser", "createdb"]
  databases:
    database: user_database
  postgresql:
    version: "11"
  sidecars:
    - name: "exporter"
      image: "wrouesnel/postgres_exporter"
      ports:
        - name: exporter
          containerPort: 9187
          protocol: TCP
      resources:
        limits:
          cpu: 500m
          memory: 256M
        requests:
          cpu: 100m
          memory: 200M
      env:
        - name: "DATA_SOURCE_URI"
          value: "postgres/database?sslmode=disable"
        - name: "DATA_SOURCE_USER"
          valueFrom:
            secretKeyRef:
              name: postgres.postgres.credentials
              key: username
        - name: "DATA_SOURCE_PASS"
          valueFrom:
            secretKeyRef:
              name: postgres.postgres.credentials
              key: password
---
apiVersion: v1
kind: Service
metadata:
  name: pg-exporter
  labels:
    app: pg-exporter
spec:
  ports:
    - name: postgres
      port: 5432
      targetPort: 5432
    - name: exporter
      port: 9187
      targetPort: exporter
  selector:
    application: spilo
    team: myteam
```
I opted for baking postgres_exporter into a custom-built Spilo image and having the supervisord in the Spilo image automatically start it up. Then I tweaked the Prometheus job rules to add a custom scrape target that scrapes the postgres_exporter metrics on all application=spilo pods - it seems to work quite well and lets me configure monitoring as an operator-wide feature instead of having each cluster define this itself.
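The scrape job boils down to something like this (a sketch of the kubernetes_sd_configs relabeling, assuming the exporter listens on 9187 in every pod labeled application=spilo):

```yaml
scrape_configs:
  - job_name: spilo-postgres-exporter
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only pods created by the postgres operator
      - source_labels: [__meta_kubernetes_pod_label_application]
        regex: spilo
        action: keep
      # point the scrape at the postgres_exporter port inside the pod
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        replacement: "$1:9187"
        target_label: __address__
      # carry the cluster name over as a metric label
      - source_labels: [__meta_kubernetes_pod_label_cluster_name]
        target_label: cluster
```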
When we upgraded our Kubernetes cluster to 1.16 the postgres-operator (1.2.0, #674) was not able to find the existing StatefulSets anymore (because of the API changes between 1.15 and 1.16).
This led to a situation where all postgres clusters were marked as SyncFailed.

```
Status:
  Postgres Cluster Status:  SyncFailed
```
I think it would be very helpful if the operator exposed a /metrics endpoint for Prometheus, which would make it possible to alert on such things. This is not an issue of the database cluster but of the operator, so monitoring the database does not surface this kind of problem.
@theRealWardo there are two PRs open that, combined, should allow most monitoring / log-shipping use cases to be configured:
awesome thanks @frittentheke!
@Yannig Hi! Can you suggest a Grafana dashboard that works with your config? Thanks!
Has anyone tried that with the OperatorConfiguration?
```yaml
apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
  name: postgresql-operator-configuration
configuration:
  sidecars:
    - name: exporter
      image: prometheuscommunity/postgres-exporter:v0.9.0
      ports:
        - name: exporter
          containerPort: 9187
          protocol: TCP
      resources:
        requests:
          cpu: 50m
          memory: 200M
      env:
        - name: "DATA_SOURCE_URI"
          value: "$(POD_NAME)/postgres?sslmode=disable"
        - name: "DATA_SOURCE_USER"
          value: "$(POSTGRES_USER)"
        - name: "DATA_SOURCE_PASS"
          value: "$(POSTGRES_PASSWORD)"
        - name: "PG_EXPORTER_AUTO_DISCOVER_DATABASES"
          value: "true"
```
Is a Service for port 9187 required? Any disadvantage to using a PodMonitor? Recently I used a PodMonitor for our Kafka operator monitoring setup, too.
@jkroepke that works - but not via configmaps in the later versions of the operator. But yes, you will need to add *Monitor resources to activate scraping
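For reference, a PodMonitor along these lines should be enough (a sketch, assuming the sidecar port keeps the exporter name from the config above and the pods carry the default application: spilo label):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: postgres-exporter
spec:
  selector:
    matchLabels:
      application: spilo     # default label on pods created by the operator
  namespaceSelector:
    any: true
  podMetricsEndpoints:
    - port: exporter         # named container port of the exporter sidecar
      interval: 30s
```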
For anyone looking for a Grafana Dashboard to get started with Yannig's config, try this: https://grafana.com/grafana/dashboards/9628
Simply set up a Prometheus target to scrape /metrics from the pg-exporter service, import the Grafana dashboard, and voilà!
@davidkarlsen, in which way/why wouldn't that work via configmaps?
that works - but not via configmaps in the later versions of the operator.
I think it would be very helpful if the operator exposed a /metrics endpoint for Prometheus, which would make it possible to alert on such things. This is not an issue of the database cluster but of the operator, so monitoring the database does not surface this kind of problem.
@ekeih Take a look at https://github.com/zalando/postgres-operator/pull/1529
I added a config like @jkroepke's, but it connects over the unix socket. I feel this way is more refined.
```yaml
sidecars:
  - name: exporter
    image: postgres_exporter:v0.10.1
    ports:
      - name: pg-exporter
        containerPort: 9187
        protocol: TCP
    resources:
      requests:
        cpu: 50m
        memory: 200M
    env:
      - name: CLUSTER_NAME
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: metadata.labels['cluster-name']
      - name: DATA_SOURCE_NAME
        value: >-
          host=/var/run/postgresql user=postgres
          application_name=postgres_exporter
      - name: PG_EXPORTER_CONSTANT_LABELS
        value: 'release=$(CLUSTER_NAME),namespace=$(POD_NAMESPACE)'
```
You can add a volume in CRD like this:
```yaml
additionalVolumes:
  - name: socket-directory
    mountPath: /var/run/postgresql
    targetContainers:
      - all
    volumeSource:
      emptyDir: {}
```
I'm using this helm chart: https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-postgres-exporter
It is connected to the pooler-replica service and works fine.
How does it work for you?
if you create a new database, how will the new exporter be deployed? Running helm install after applying the CR is not the idea of an operator.
@jkroepke you can configure a sidecar in the operator configuration that gets applied to all postgres pods the operator starts.
I'm using this helm chart: https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-postgres-exporter
It is connected to the pooler-replica service and works fine.
@vitargelo to properly monitor servers you need to connect directly to each postgres server, not just to a random one, because you can run into issues on one of the replicas but not on another. As a result, metrics MUST be taken from each server, and a sidecar fits best there.