christiaangoossens opened this issue 2 years ago
2022-07-28 07:02:08,735 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2022-07-28 07:02:09,417 - bootstrapping - INFO - Looks like your running aws
2022-07-28 07:02:09,480 - bootstrapping - INFO - Configuring bootstrap
2022-07-28 07:02:09,480 - bootstrapping - INFO - Configuring crontab
2022-07-28 07:02:09,480 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2022-07-28 07:02:09,481 - bootstrapping - INFO - Configuring standby-cluster
2022-07-28 07:02:09,481 - bootstrapping - INFO - Configuring log
2022-07-28 07:02:09,481 - bootstrapping - INFO - Configuring pam-oauth2
2022-07-28 07:02:09,482 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2022-07-28 07:02:09,482 - bootstrapping - INFO - Configuring patroni
2022-07-28 07:02:09,492 - bootstrapping - INFO - Writing to file /run/postgres.yml
2022-07-28 07:02:09,493 - bootstrapping - INFO - Configuring pgbouncer
2022-07-28 07:02:09,493 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2022-07-28 07:02:09,494 - bootstrapping - INFO - Configuring wal-e
2022-07-28 07:02:09,494 - bootstrapping - INFO - Configuring certificate
2022-07-28 07:02:09,494 - bootstrapping - INFO - Generating ssl self-signed certificate
2022-07-28 07:02:09,544 - bootstrapping - INFO - Configuring pgqd
2022-07-28 07:02:10,970 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-07-28 07:02:11,042 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-07-28 07:02:11,093 WARNING: Postgresql is not running.
2022-07-28 07:02:11,094 INFO: Lock owner: pgsql-cluster-1; I am pgsql-cluster-0
2022-07-28 07:02:11,103 INFO: pg_controldata:
pg_control version number: 1300
Catalog version number: 202107181
Database system identifier: 7112788016677834822
Database cluster state: shut down in recovery
pg_control last modified: Thu Jul 28 07:01:48 2022
Latest checkpoint location: 0/130000C8
Latest checkpoint's REDO location: 0/13000058
Latest checkpoint's REDO WAL file: 000000350000000000000013
Latest checkpoint's TimeLineID: 53
Latest checkpoint's PrevTimeLineID: 53
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:17155
Latest checkpoint's NextOID: 425984
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 727
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 17155
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint: Thu Jul 21 01:20:50 2022
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/1303B698
Min recovery ending loc's timeline: 53
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: 286bbf5272614f592b9176b6f045c1204056bf7dc3fa2afac433183494486afc
2022-07-28 07:02:11,148 INFO: Lock owner: pgsql-cluster-1; I am pgsql-cluster-0
2022-07-28 07:02:11,353 INFO: starting as a secondary
2022-07-28 07:02:11,671 INFO: postmaster pid=60
/var/run/postgresql:5432 - no response
2022-07-28 07:02:11 GMT [60]: [1-1] 62e23473.3c 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2022-07-28 07:02:11 GMT [60]: [2-1] 62e23473.3c 0 LOG: pg_stat_kcache.linux_hz is set to 1000000
2022-07-28 07:02:12 GMT [60]: [3-1] 62e23473.3c 0 LOG: redirecting log output to logging collector process
2022-07-28 07:02:12 GMT [60]: [4-1] 62e23473.3c 0 HINT: Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:02:21,588 INFO: Lock owner: pgsql-cluster-1; I am pgsql-cluster-0
2022-07-28 07:02:21,653 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:02:31,587 INFO: Lock owner: pgsql-cluster-1; I am pgsql-cluster-0
2022-07-28 07:02:31,588 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:02:41,586 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:02:41,587 INFO: not healthy enough for leader race
2022-07-28 07:02:41,587 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:02:51,587 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:02:51,587 INFO: not healthy enough for leader race
2022-07-28 07:02:51,587 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:03:01,587 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:03:01,588 INFO: not healthy enough for leader race
2022-07-28 07:03:01,589 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:03:11,587 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:03:11,588 INFO: not healthy enough for leader race
2022-07-28 07:03:11,588 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:03:21,606 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:03:21,606 INFO: Still starting up as a standby.
2022-07-28 07:03:21,609 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:22,142 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:22,145 WARNING: Retry got exception: 'connection problems'
2022-07-28 07:03:22,146 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
2022-07-28 07:03:22,147 INFO: following a different leader because i am not the healthiest node
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:03:31,601 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:03:31,602 INFO: Still starting up as a standby.
2022-07-28 07:03:31,604 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:31,847 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:31,850 WARNING: Retry got exception: 'connection problems'
2022-07-28 07:03:31,850 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
2022-07-28 07:03:31,851 INFO: following a different leader because i am not the healthiest node
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:03:41,606 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:03:41,607 INFO: Still starting up as a standby.
2022-07-28 07:03:41,609 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:41,762 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:41,764 WARNING: Retry got exception: 'connection problems'
2022-07-28 07:03:41,765 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
2022-07-28 07:03:41,766 INFO: following a different leader because i am not the healthiest node
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:03:51,605 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:03:51,606 INFO: Still starting up as a standby.
2022-07-28 07:03:51,607 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:52,491 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:52,494 WARNING: Retry got exception: 'connection problems'
2022-07-28 07:03:52,494 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
2022-07-28 07:03:52,496 INFO: following a different leader because i am not the healthiest node
So the above toggle SHOULD have fixed it for v1.8? I seem to have something like this problem: https://github.com/zalando/postgres-operator/issues/2190
@KlavsKlavsen Kubernetes will never destroy pod 0 of a StatefulSet if you asked for one or more pods; the pods are managed in order.
Removing the readiness probe does not solve this issue, it just introduces the problem that you have, where pod 0 will never be restarted.
The only real solution is using Parallel pod management with proper liveness and readiness probes (although, as the PR says, we cannot change that for existing clusters).
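To check which policy an existing cluster's StatefulSet was created with, something like the following works (a minimal sketch; the StatefulSet name pgsql-cluster matches the pods in the logs above, adjust the name and namespace for your cluster):

kubectl get statefulset pgsql-cluster -o jsonpath='{.spec.podManagementPolicy}{"\n"}'
# Prints either OrderedReady (the Kubernetes default) or Parallel.
# The field is immutable on an existing StatefulSet, which is why the
# operator cannot simply flip it for clusters that already exist.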
I tried to kill the 0 pod, but the new pod fails the same way - so something else seems to also be an issue.
I think your issue is actually with Patroni not properly fixing that node at startup, not with the k8s part, but not having a readiness probe will cause your app to behave erratically when this happens instead of just staying up.
me too
2023-06-20 08:27:37,251 INFO: establishing a new patroni connection to the postgres cluster
2023-06-20 08:27:37,555 INFO: establishing a new patroni connection to the postgres cluster
2023-06-20 08:27:37,559 WARNING: Retry got exception: 'connection problems'
2023-06-20 08:27:37,560 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
How should I solve this problem?
FYI @christiaangoossens @KlavsKlavsen @jicki,
I faced this issue in one of our clusters. The solution is to use the patronictl reinit command to re-initialize the failed members (it removes the data directory and then copies data from the running leader to rebuild the replica).
Here are the action logs:
root@infra-shared-0:/home/postgres# patronictl -c postgres.yml list
+ Cluster: infra-shared (7205134707613937727) ---------+-----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------------+------------+---------+--------------+-----+-----------+
| infra-shared-0 | 10.42.3.32 | Replica | start failed | | unknown |
| infra-shared-1 | 10.42.5.81 | Leader | running | 109 | |
| infra-shared-2 | 10.42.4.5 | Replica | starting | | unknown |
+----------------+------------+---------+--------------+-----+-----------+
root@infra-shared-0:/home/postgres# patronictl -c postgres.yml reinit infra-shared infra-shared-0
+ Cluster: infra-shared (7205134707613937727) ---------+-----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------------+------------+---------+--------------+-----+-----------+
| infra-shared-0 | 10.42.3.32 | Replica | start failed | | unknown |
| infra-shared-1 | 10.42.5.81 | Leader | running | 109 | |
| infra-shared-2 | 10.42.4.5 | Replica | starting | | unknown |
+----------------+------------+---------+--------------+-----+-----------+
Are you sure you want to reinitialize members infra-shared-0? [y/N]: y
Failed: reinitialize for member infra-shared-0, status code=503, (restarting after failure already in progress)
Do you want to cancel it and reinitialize anyway? [y/N]: y
Success: reinitialize for member infra-shared-0
root@infra-shared-0:/home/postgres# patronictl -c postgres.yml list
+ Cluster: infra-shared (7205134707613937727) ----+-----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------------+------------+---------+---------+-----+-----------+
| infra-shared-0 | 10.42.3.32 | Replica | running | 109 | 0 |
| infra-shared-1 | 10.42.5.81 | Leader | running | 109 | |
| infra-shared-2 | 10.42.4.5 | Replica | running | 109 | 0 |
+----------------+------------+---------+---------+-----+-----------+
Hope it will be helpful for you!
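If you prefer to drive this from outside the pod, the same re-init can be done non-interactively (a sketch; the pod, cluster, and member names are the ones from my cluster above and will differ in yours):

kubectl exec infra-shared-1 -- patronictl -c /home/postgres/postgres.yml reinit infra-shared infra-shared-0 --force
# --force skips the confirmation prompts shown in the interactive session above.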
Thank you very much, that's very helpful.
I tried this, which fixed the replica, but a few hours later the cluster went back into the same state, with the replica rejecting connections. I am just running a cluster with one leader and one replica. Maybe I need to force the first pod in the StatefulSet to be the leader to prevent this from recurring?
The pods are running the image: registry.opensource.zalan.do/acid/spilo-14:2.1-p6.
I also noticed that the replica got OOMKilled and restarted, so I should probably look into preventing that specifically.
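One way to do that is to raise the memory requests/limits on the postgresql custom resource so the operator propagates them to the pods (a sketch; the manifest name pgsql-cluster and the 1Gi figure are assumptions, pick values that fit your workload):

kubectl patch postgresql pgsql-cluster --type merge -p '
{"spec": {"resources": {
  "requests": {"cpu": "250m", "memory": "1Gi"},
  "limits":   {"cpu": "1",    "memory": "1Gi"}}}}'
# The operator should then roll the StatefulSet pods with the new resources.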
Hi all, this issue possibly still exists, but I have moved on to the CloudNativePG operator instead. It's easier to restore with, in my opinion, and it has fewer components that can break.
Also note that this issue does not exist there, as CloudNativePG manages pods directly instead of using StatefulSets.
We've had to switch most of our setup to the CloudNativePG operator too, for the same reason.
Fwiw we've switched to the crunchydata postgres-operator.
The cluster fails to restart if the lock is held by pod-1 while the StatefulSet is restarting. Pod-0 will start, recognise that pod-1 holds the lock, and wait; it eventually falls into a state where it knows it is not up to date and therefore waits forever.
Because podManagementPolicy is OrderedReady instead of Parallel, pod-1 will never be started, and thus the cluster stays broken.
In my case this happened when a node failed: there was a failover to pod-1 and then both pods died.
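To confirm this state on a live cluster, the following checks can help (a sketch; the leader lock location assumes Patroni's Kubernetes DCS backed by Endpoints as used by Spilo here, and the patroni config path /run/postgres.yml is taken from the bootstrap log above):

# Who currently holds the Patroni leader lock (annotation on the cluster's leader Endpoints object):
kubectl get endpoints pgsql-cluster -o jsonpath='{.metadata.annotations.leader}{"\n"}'

# What Patroni on the stuck replica reports for each member:
kubectl exec pgsql-cluster-0 -- patronictl -c /run/postgres.yml list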