zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Can't failover due to replica lag #1476

Open davidkarlsen opened 3 years ago

davidkarlsen commented 3 years ago


I am cordoning the node which hosts the master pod, and the operator reports:

time="2021-04-29T12:03:36Z" level=info msg="moving pods: node \"/alt-eos-g-c01oco03\" became unschedulable and does not have a ready label: map[]" pkg=controller
time="2021-04-29T12:03:36Z" level=info msg="starting process to migrate master pod \"adc-dev/adc-batchinator-db-1\"" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="Waiting for any replica pod to become ready" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="Found 1 running replica pods" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=info msg="check failed: pod \"adc-dev/adc-batchinator-db-0\" is already on a live node" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="switching over from \"adc-batchinator-db-1\" to \"adc-dev/adc-batchinator-db-0\"" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="making POST http request: http://10.200.12.5:8008/failover" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:36Z" level=debug msg="subscribing to pod \"adc-dev/adc-batchinator-db-0\"" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:37Z" level=debug msg="unsubscribing from pod \"adc-dev/adc-batchinator-db-0\" events" cluster-name=adc-dev/adc-batchinator-db pkg=cluster
time="2021-04-29T12:03:37Z" level=error msg="could not move master pod \"adc-dev/adc-batchinator-db-1\": could not failover to pod \"adc-dev/adc-batchinator-db-0\": could not switch over from \"adc-batchinator-db-1\" to \"adc-dev/adc-batchinator-db-0\": patroni returned 'Failover failed'" pkg=controller
time="2021-04-29T12:03:37Z" level=info msg="0/1 master pods have been moved out from the \"/alt-eos-g-c01oco03\" node" pkg=controller
time="2021-04-29T12:03:37Z" level=warning msg="failed to move master pods from the node \"alt-eos-g-c01oco03\": could not move master 1/1 pods from the \"/alt-eos-g-c01oco03\" node" pkg=controller

The reason for it failing seems to be replication lag:

2021-04-29 12:03:36,505 INFO: received failover request with leader=adc-batchinator-db-1 candidate=adc-batchinator-db-0 scheduled_at=None
2021-04-29 12:03:36,515 INFO: Got response from adc-batchinator-db-0 http://10.200.16.5:8008/patroni: {"state": "running", "postmaster_start_time": "2021-04-29 09:13:56.578 UTC", "role": "replica", "server_version": 130001, "cluster_unlocked": false, "xlog": {"received_location": 5100273664, "replayed_location": 5100273664, "replayed_timestamp": null, "paused": false}, "timeline": 205, "database_system_identifier": "6929937442175926342", "patroni": {"version": "2.0.1", "scope": "adc-batchinator-db"}}
2021-04-29 12:03:36,653 INFO: Lock owner: adc-batchinator-db-1; I am adc-batchinator-db-1
2021-04-29 12:03:36,711 INFO: Got response from adc-batchinator-db-0 http://10.200.16.5:8008/patroni: {"state": "running", "postmaster_start_time": "2021-04-29 09:13:56.578 UTC", "role": "replica", "server_version": 130001, "cluster_unlocked": false, "xlog": {"received_location": 5100273664, "replayed_location": 5100273664, "replayed_timestamp": null, "paused": false}, "timeline": 205, "database_system_identifier": "6929937442175926342", "patroni": {"version": "2.0.1", "scope": "adc-batchinator-db"}}
2021-04-29 12:03:36,801 INFO: Member adc-batchinator-db-0 exceeds maximum replication lag
2021-04-29 12:03:36,801 WARNING: manual failover: no healthy members found, failover is not possible
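
For context: Patroni refuses to promote a candidate whose WAL position trails the last known leader position by more than maximum_lag_on_failover bytes (1048576, i.e. 1 MiB, by default), which is what the "exceeds maximum replication lag" message above indicates. If a larger window is acceptable for the workload, the threshold can be set through the patroni section of the operator manifest; a minimal sketch with a purely illustrative value, not a recommendation:

# Sketch only: raising Patroni's failover lag threshold via the postgresql manifest.
# 33554432 bytes (32 MiB) is illustrative; Patroni's default is 1048576 (1 MiB).
spec:
  patroni:
    maximum_lag_on_failover: 33554432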

my manifest is:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  namespace: "adc-dev"
  name: "adc-batchinator-db"
spec:
  teamId: "adc"
  volume:
    storageClass: "openebs-local"
    size: "2Gi"
  numberOfInstances: 2
  users:
    batchinator:
    - superuser
    - createdb
    batchinator_user: []
  databases:
    # name: owner
    batchinator: batchinator
  postgresql:
    version: "13"
  patroni:
    pg_hba:
      - "local    all all trust"
      - "host     all all locahost trust"
      - "host     postgres all localhost ident"
      - "hostssl  replication standby all md5"
      - "hostssl  all all 0.0.0.0/0 md5"
      - "host     all all 0.0.0.0/0 md5"
      - "hostssl  all +pamrole all pam"

What could cause the replication lag, and why is it not catching up? The database is basically idle. Is there a metric one can track in order to raise alerts when the lag gets too large?
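
One possible way to track it, assuming postgres_exporter (or a similar exporter sidecar) is already scraping the cluster: on the primary, pg_stat_replication exposes each replica's replay position, and pg_wal_lsn_diff() against pg_current_wal_lsn() gives the lag in bytes. A sketch of a custom-query entry in postgres_exporter's queries format; the metric namespace and names below are illustrative, not an existing metric:

# Hypothetical postgres_exporter custom query exposing per-replica replay lag
# in bytes; only meaningful on the primary, hence master: true.
pg_replication_lag:
  query: "SELECT application_name, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes FROM pg_stat_replication"
  master: true
  metrics:
    - application_name:
        usage: "LABEL"
        description: "Replica name as reported to the primary"
    - lag_bytes:
        usage: "GAUGE"
        description: "Replay lag of the replica in bytes"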

davidkarlsen commented 3 years ago

ok, I did a spilo-resync on the replica, and now it failed over just fine.

davidkarlsen commented 3 years ago

So this is not an operator problem, but if anyone knows a good metric to check when WAL sync is lagging, we can be on top of it.
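
Not an authoritative answer, but if a lag gauge like the one sketched above ends up in Prometheus, an alerting rule along these lines would fire when a replica stays behind Patroni's default failover threshold; the metric name, threshold and duration are assumptions:

# Hypothetical Prometheus alerting rule; pg_replication_lag_lag_bytes is the
# metric the custom query above would produce, and 1048576 matches Patroni's
# default maximum_lag_on_failover.
groups:
  - name: postgres-replication
    rules:
      - alert: PostgresReplicaLagging
        expr: pg_replication_lag_lag_bytes > 1048576
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replica {{ $labels.application_name }} is more than 1 MiB behind the primary"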

wzrdtales commented 1 year ago

@davidkarlsen how did you execute spilo-resync on the replica? There seems to be no documentation about this.

davidkarlsen commented 1 year ago

I no longer work at that place and don't have my notes, but: exec into one of the pods, run ps xa to see which users the processes run as, su to the spilo one (I think it was a separate user, patroni or the like, or it was the postgres user), then run spilo / patroni with --help.

wzrdtales commented 1 year ago

Thanks, I have fallen back to simply bootstrapping the node again by renaming the data dir. Will check out your comments though.