Replica node never recover after rewind

DiegoDAF commented 3 years ago

Hi all, I am running a POC. In my test I killed the primary node and another node took over the primary role. The new replica ran a rewind, but then it died with these messages:


2021-03-15 17:29:58,156 INFO: Lock owner: poc-db-test-22-1tb-1; I am poc-db-test-22-1tb-3
2021-03-15 17:29:58,156 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2021-03-15 17:30:08,156 INFO: Lock owner: poc-db-test-22-1tb-1; I am poc-db-test-22-1tb-3
2021-03-15 17:30:08,156 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2021-03-15 17:30:18,156 INFO: Lock owner: poc-db-test-22-1tb-1; I am poc-db-test-22-1tb-3
2021-03-15 17:30:18,156 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2021-03-15 17:30:28,156 INFO: Lock owner: poc-db-test-22-1tb-1; I am poc-db-test-22-1tb-3
2021-03-15 17:30:28,156 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2021-03-15 17:30:28,928 INFO: Lock owner: poc-db-test-22-1tb-1; I am poc-db-test-22-1tb-3
2021-03-15 17:30:28,929 INFO: Still starting up as a standby.
2021-03-15 17:30:28,929 INFO: Lock owner: poc-db-test-22-1tb-1; I am poc-db-test-22-1tb-3
2021-03-15 17:30:28,929 INFO: does not have lock
2021-03-15 17:30:28,929 INFO: establishing a new patroni connection to the postgres cluster
2021-03-15 17:30:29,250 INFO: establishing a new patroni connection to the postgres cluster
2021-03-15 17:30:29,252 WARNING: Retry got exception: 'connection problems'
2021-03-15 17:30:29,252 INFO: Error communicating with PostgreSQL. Will try again later
/var/run/postgresql:5432 - rejecting connections
2021-03-15 17:30:38,929 INFO: Lock owner: poc-db-test-22-1tb-1; I am poc-db-test-22-1tb-3
2021-03-15 17:30:38,929 INFO: Still starting up as a standby.
2021-03-15 17:30:38,930 INFO: Lock owner: poc-db-test-22-1tb-1; I am poc-db-test-22-1tb-3
2021-03-15 17:30:38,930 INFO: does not have lock
2021-03-15 17:30:38,930 INFO: establishing a new patroni connection to the postgres cluster
2021-03-15 17:30:39,451 INFO: establishing a new patroni connection to the postgres cluster
2021-03-15 17:30:39,452 WARNING: Retry got exception: 'connection problems'
2021-03-15 17:30:39,453 INFO: Error communicating with PostgreSQL. Will try again later
/var/run/postgresql:5432 - rejecting connections
2021-03-15 17:30:48,929 INFO: Lock owner: poc-db-test-22-1tb-1; I am poc-db-test-22-1tb-3
2021-03-15 17:30:48,929 INFO: Still starting up as a standby.
2021-03-15 17:30:48,930 INFO: Lock owner: poc-db-test-22-1tb-1; I am poc-db-test-22-1tb-3
2021-03-15 17:30:48,930 INFO: does not have lock
2021-03-15 17:30:48,930 INFO: establishing a new patroni connection to the postgres cluster
2021-03-15 17:30:49,571 INFO: establishing a new patroni connection to the postgres cluster
2021-03-15 17:30:49,573 WARNING: Retry got exception: 'connection problems'
2021-03-15 17:30:49,573 INFO: Error communicating with PostgreSQL. Will try again later
/var/run/postgresql:5432 - rejecting connections

The deployment YAML:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: poc-db-test-22-1tb
  namespace: mdbs
spec:
  teamId: mdbs
  volume:
    size: 1Ti
  numberOfInstances: 4
  enableConnectionPooler: true  # enable/disable connection pooler deployment
  enableReplicaConnectionPooler: true # set to enable connectionPooler for replica service

  enableMasterLoadBalancer: true
  enableReplicaLoadBalancer: true
  allowedSourceRanges:  # load balancers' source ranges for both master and replica services
  - 127.0.0.1/32
  - 0.0.0.0/0

  users:
    daf:  # database owner
    - superuser
    - createdb
    app_user: #
    other_app: []
    one_app: #

  databases:
    someappdb1: app_user
    someappdb2: daf
    someappdb3: other_app

  postgresql:
    version: "12"

Can anyone help me understand what happened?

DiegoDAF commented 3 years ago

After checking, I have 2 nodes down with the same problem:

kubectl get pods -l application=spilo -L spilo-role -n mdbs -o wide

NAME                   READY   STATUS    RESTARTS   AGE   IP               NODE      NOMINATED NODE   READINESS GATES   SPILO-ROLE
poc-db-test-22-1tb-0   1/1     Running   0          30m   10.237.192.216   r13-u15   <none>           <none>
poc-db-test-22-1tb-1   1/1     Running   0          56m   10.237.197.90    r11-u27   <none>           <none>            master
poc-db-test-22-1tb-2   1/1     Running   0          56m   10.237.197.91    r11-u26   <none>           <none>            replica
poc-db-test-22-1tb-3   1/1     Running   0          82s   10.237.197.82    r12-u24   <none>           <none>

DiegoDAF commented 3 years ago

At this moment, I don't know how to recover these dead nodes.

CyberDem0n commented 3 years ago

Running an HA system is like flying a modern airplane: mostly autopilot, but you have to know how to fly manually if something goes wrong. Specifically, you can reinitialise the node by running patronictl reinit. But you have to understand how it got into such a state. That would require analysing the logs from before the issue started. All logs, including the Postgres ones.
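
For readers who land here with the same symptom, a minimal sketch of what that could look like on this cluster follows. It assumes patronictl is available inside the Spilo pod and uses the namespace, cluster name and pod names from this thread. The helper function names (in_pod, show_members, pod_logs, reinit_member) and the DRY_RUN switch are made up for illustration and are not part of Patroni or the operator. Note that reinit discards the member's data directory and re-clones it from the current leader, so collect the logs first.

```shell
#!/usr/bin/env bash
# Sketch only: names below come from this thread; adjust them to your setup.
NAMESPACE="${NAMESPACE:-mdbs}"
CLUSTER="${CLUSTER:-poc-db-test-22-1tb}"

# Run a command inside a pod. With DRY_RUN=1, only print what would be run.
in_pod() {
  local pod="$1"; shift
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "kubectl exec -n ${NAMESPACE} ${pod} -- $*"
  else
    kubectl exec -n "${NAMESPACE}" "${pod}" -- "$@"
  fi
}

# Show the cluster members as Patroni sees them (role, state, timeline, lag).
show_members() {
  in_pod "$1" patronictl list "${CLUSTER}"
}

# Fetch the container log of a pod, including the previous container
# instance if the pod was restarted (that one often holds the root cause).
pod_logs() {
  kubectl logs -n "${NAMESPACE}" "$1"
  kubectl logs -n "${NAMESPACE}" "$1" --previous 2>/dev/null || true
}

# Reinitialise one member: its data directory is wiped and re-cloned
# from the current leader. --force skips the interactive confirmation.
reinit_member() {
  local member="$1"
  in_pod "${member}" patronictl reinit --force "${CLUSTER}" "${member}"
}
```

A possible session: run show_members poc-db-test-22-1tb-1 to see the member states, save pod_logs poc-db-test-22-1tb-3 to a file, then call reinit_member poc-db-test-22-1tb-3. Setting DRY_RUN=1 first prints the kubectl exec commands instead of running them. The Postgres server logs are separate from the container log. In Spilo images they are usually written under the data volume (commonly /home/postgres/pgdata/pgroot/pg_log, but treat that path as an assumption and check your image).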

DiegoDAF commented 3 years ago

Many thanks!!! Totally agree!!! I owe you a beer !!

zalando / postgres-operator

Replica node never recover after rewind #1406