zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Network loss or container kill against the master pod does not show the expected behaviour #1619

Open oumkale opened 3 years ago

oumkale commented 3 years ago

While running an HA setup with a master and a replica, we expected the metrics to show clearly that queries in the HA cluster keep working properly. But the Grafana dashboard graph is confusing. Both images are attached.

After injecting network loss against the master, both pods report the master role, leader and follower alike. Once the network loss ends, the original roles are restored.

NAME                     READY   STATUS    RESTARTS   AGE    SPILO-ROLE
postgres-application-0   2/2     Running   0          156m   master
postgres-application-1   2/2     Running   0          13m    master

After the network loss ends there is a sudden spike in the metrics. The exporter runs as a sidecar; Prometheus query:

sum(rate(pg_stat_database_tup_inserted{datname=~"$datname", instance="postgres-pg-exporter.postgres.svc.cluster.local:9187"}[90s])) 

Screenshot from 2021-09-16 11-27-50

I also found that sometimes neither master nor replica exports metrics. Screenshot from 2021-09-15 11-57-35

Manifest:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: postgres-application
  namespace: postgres
spec:
  teamId: "postgres"
  volume:
    size: 3Gi
  numberOfInstances: 2
  users:
    zalando:  # database owner
    - superuser
    - createdb
    - inherit
    - login
    - createrole
    - replication
    - bypassrls
    foo_user: 
    - superuser
    - createdb
    - inherit
    - login
    - createrole
    - replication
    - bypassrls # role for application foo
  databases:
    foo: zalando  # dbname: owner
  preparedDatabases:
    bar: {}
  postgresql:
    version: "13"
  sidecars:
    - name: "exporter"
      image: "wrouesnel/postgres_exporter"
      ports:
        - name: exporter
          containerPort: 9187
          protocol: TCP
      resources:
        limits:
          cpu: 500m
          memory: 256M
        requests:
          cpu: 100m
          memory: 200M
      env:
        - name: "DATA_SOURCE_URI"
          value: "localhost/postgres?sslmode=disable"
        - name: "DATA_SOURCE_USER"
          valueFrom: 
            secretKeyRef: 
              key: username
              name: zalando.postgres-application.credentials
        - name: "DATA_SOURCE_PASS"
          valueFrom: 
            secretKeyRef: 
              key: password
              name: zalando.postgres-application.credentials
  # additionalVolumes:
  # - name: data
  #   mountPath: /exporter
  #   targetContainers: ["exporter"]
  #   volumeSource: 
  #     persistentVolumeClaim: 
  #       claimName: "exporter-pvc"
#
---
apiVersion: v1
kind: Service
metadata:
  name: postgres-pg-exporter
  labels:
    app: pg-exporter
spec:
  type: NodePort
  ports:
    - name: postgres
      port: 5432
      targetPort: 5432
    - name: exporter
      port: 9187
      targetPort: 9187
  selector:
    application: spilo
    team: postgres
CyberDem0n commented 3 years ago

Having a stale role label is absolutely normal in this case. Since you cut the network, Patroni can't update its own labels, but at the same time I am 100% sure that it demoted postgres to read-only.
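
For example, to confirm the actual role behind a stale label (a sketch, assuming the pod and container names from the manifest above and local trust access inside the spilo container), you can ask Patroni or Postgres directly:

# Patroni's own view of the cluster, independent of the pod labels
kubectl -n postgres exec postgres-application-0 -c postgres -- patronictl list

# Ask Postgres itself: "t" means the instance is running in read-only recovery mode
kubectl -n postgres exec postgres-application-0 -c postgres -- \
  psql -U postgres -tAc "SELECT pg_is_in_recovery();"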

oumkale commented 3 years ago

There are 2 instances, master and replica. After injecting network loss for a couple of seconds, psql throws this error:

psql: could not connect to server: Connection refused
        Is the server running on host "localhost" and accepting TCP/IP connections on port 5432?

But the exporter service does not export metrics for nearly 90 seconds. And what causes the sudden spike in the dashboard afterwards? @CyberDem0n

And why are there sometimes no metrics from either server?
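
For what it is worth, one way to rule out the Service would be to scrape each exporter sidecar directly (a sketch; port 9187 as in the manifest above):

# Forward the exporter port of one pod and scrape it directly
kubectl -n postgres port-forward pod/postgres-application-0 9187:9187 &
curl -s localhost:9187/metrics | grep pg_stat_database_tup_inserted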

CyberDem0n commented 3 years ago

I can't add anything more. There is something weird with your config. For example, your postgres-pg-exporter Service will grab all spilo pods that belong to the team postgres.
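
You can see this for yourself (a sketch, using the labels from the Service manifest above): the selector matches both the master and the replica, so scrapes through that Service can land on either pod.

# The selector application=spilo,team=postgres matches every pod of the cluster
kubectl -n postgres get pods -l application=spilo,team=postgres -L spilo-role

# Hence the Service has both pod IPs as endpoints
kubectl -n postgres get endpoints postgres-pg-exporter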

oumkale commented 3 years ago

Most of the time metrics are exported, but after multiple rounds of network loss it is possible that they stop. I still don't understand why there is a spike in the metrics after the original role is regained. Please take a look at the config:

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
data:
  api_port: "8080"
  aws_region: eu-central-1
  cluster_domain: cluster.local
  cluster_history_entries: "1000"
  cluster_labels: application:spilo
  cluster_name_label: cluster-name
  connection_pooler_image: "registry.opensource.zalan.do/acid/pgbouncer:master-18"
  db_hosted_zone: db.example.com
  debug_logging: "true"
  enable_sidecars: "true"
  docker_image: registry.opensource.zalan.do/acid/spilo-13:2.1-p1
  enable_ebs_gp3_migration: "false"
  enable_master_load_balancer: "false"
  enable_pgversion_env_var: "true"
  enable_replica_load_balancer: "false"
  enable_spilo_wal_path_compat: "true"
  enable_team_member_deprecation: "false"
  enable_teams_api: "false"
  external_traffic_policy: "Cluster"
  logical_backup_docker_image: "registry.opensource.zalan.do/acid/logical-backup:v1.6.3"
  logical_backup_job_prefix: "logical-backup-"
  logical_backup_provider: "s3"
  logical_backup_s3_bucket: "my-bucket-url"
  logical_backup_s3_sse: "AES256"
  logical_backup_schedule: "30 00 * * *"
  major_version_upgrade_mode: "manual"
  master_dns_name_format: "{cluster}.{team}.{hostedzone}"
  pdb_name_format: "postgres-{cluster}-pdb"
  pod_deletion_wait_timeout: 10m
  pod_label_wait_timeout: 10m
  pod_management_policy: "ordered_ready"
  pod_role_label: spilo-role
  pod_service_account_name: "postgres-pod"
  pod_terminate_grace_period: 10m
  ready_wait_interval: 3s
  ready_wait_timeout: 30s
  repair_period: 5m
  replica_dns_name_format: "{cluster}-repl.{team}.{hostedzone}"
  replication_username: standby
  resource_check_interval: 3s
  resource_check_timeout: 10m
  resync_period: 30m
  ring_log_lines: "100"
  role_deletion_suffix: "_deleted"
  secret_name_template: "{username}.{cluster}.credentials"
  spilo_allow_privilege_escalation: "true"
  spilo_privileged: "false"
  storage_resize_mode: "pvc"
  super_username: postgres
  watched_namespace: "*"  
  workers: "8"

@CyberDem0n

oumkale commented 3 years ago

Also, I found many weird things while killing the postgres container or injecting network loss multiple times: sometimes no promotion to the other role happens. The master stays master and the replica stays replica even though there is a disruption.

CyberDem0n commented 3 years ago

The failure detection isn't instant. Please learn how Patroni works: https://github.com/patroni-training/2019 (there is also plenty of material available on the internet on this topic)

oumkale commented 3 years ago

Yes, sure, I got it, but still:

  1. Could you please explain the spike in the graph in more detail, so that we have a story around it and understand how it works? The dip occurs during the network loss on the master:

Screenshot from 2021-09-16 11-27-50

  2. Many times I see the following: say app-0 is master and app-1 is replica, and I inject network loss on the master (during this the metrics stay non-zero). After some time app-1 is master and app-0 is replica. If I now inject network loss on the new master (app-1), it is possible that no metrics come through at all. Why?

It seems the application is sometimes not resilient: the dip corresponds to the network loss on the new master. Screenshot from 2021-09-15 11-57-35

CyberDem0n commented 3 years ago

I can't explain your graph because I don't understand what you are doing. Also, I don't really want to spend my time trying to understand it. Most of your questions will be answered when you learn how Patroni works (and K8s Services).

oumkale commented 3 years ago

Hey @CyberDem0n, I know how K8s Services work and how Patroni works. I also know your time is valuable, and thank you for replying, but I found a couple of bugs that I wanted to share with you.

I have been doing chaos testing with LitmusChaos for our demo. We wanted to see that in a resilient HA setup, when the master fails, a standby takes over, like what Patroni does. And we wanted to see in the monitoring whether that actually happens, but during this I sometimes found different behaviour.

During these tests the application sometimes behaves poorly, so if you are available for some time we could have a Google Meet or something similar where I can explain the bugs to you. @CyberDem0n

CyberDem0n commented 3 years ago

We wanted to see that in a resilient HA setup, when the master fails, a standby takes over, like what Patroni does

It is not "like", there is Patroni inside.

If you think that you found a bug, you should:

  1. Check the logs of all pods of the spilo cluster (spilo pods!, not the operator pod)
  2. Check the postgres logs in every spilo pod
  3. Prepare a reproducible test case

The first two points are very important: they are about confirming that Patroni really isn't doing what it is supposed to do.
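
For reference, collecting that information could look roughly like this (a sketch, assuming the pod names from the manifest above and the default spilo log location):

# Patroni logs are simply the container logs of each spilo pod (not the operator pod)
kubectl -n postgres logs postgres-application-0 -c postgres
kubectl -n postgres logs postgres-application-1 -c postgres

# Postgres logs live inside the spilo container, by default under pgroot/pg_log
kubectl -n postgres exec postgres-application-0 -c postgres -- \
  ls /home/postgres/pgdata/pgroot/pg_log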

oumkale commented 3 years ago

Yes, you are right @CyberDem0n, I went through it.

  1. My point is that it should behave the same every time I inject network loss chaos for a couple of seconds. Sometimes it works fine and sometimes it does not. If I kill the container multiple times, then every time they should update themselves.

  2. There are two Services, master and replica. I have a load generator that keeps generating load every second, without failure, during the network loss. But why is that not reflected in the metrics?

Please understand what exactly I wanted to share. I went through the Zalando docs and am fairly hands-on with the concepts.

CyberDem0n commented 3 years ago

it should behave the same every time I inject network loss chaos

Welcome to the distributed world. It is not supposed to always behave the same.

I inject network loss chaos for a couple of seconds.

First of all, it is not clear what "chaos" means here; second, if you literally mean two seconds, nothing should ever happen. Patroni is configured to be resilient to network problems shorter than 10s by default.
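
For context, the relevant timeouts can be inspected on the running cluster (a sketch, using the pod names above; the usual Patroni defaults are ttl: 30, loop_wait: 10, retry_timeout: 10):

# Show the dynamic Patroni configuration, including ttl, loop_wait and retry_timeout
kubectl -n postgres exec postgres-application-0 -c postgres -- patronictl show-config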

If I kill the container multiple times, then every time they should update themselves.

I don't get it: who should update what?

We are running in circles; please provide a reproducible test case.