christiaangoossens opened this issue 2 years ago
2022-07-28 07:02:08,735 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2022-07-28 07:02:09,417 - bootstrapping - INFO - Looks like your running aws
2022-07-28 07:02:09,480 - bootstrapping - INFO - Configuring bootstrap
2022-07-28 07:02:09,480 - bootstrapping - INFO - Configuring crontab
2022-07-28 07:02:09,480 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2022-07-28 07:02:09,481 - bootstrapping - INFO - Configuring standby-cluster
2022-07-28 07:02:09,481 - bootstrapping - INFO - Configuring log
2022-07-28 07:02:09,481 - bootstrapping - INFO - Configuring pam-oauth2
2022-07-28 07:02:09,482 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2022-07-28 07:02:09,482 - bootstrapping - INFO - Configuring patroni
2022-07-28 07:02:09,492 - bootstrapping - INFO - Writing to file /run/postgres.yml
2022-07-28 07:02:09,493 - bootstrapping - INFO - Configuring pgbouncer
2022-07-28 07:02:09,493 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2022-07-28 07:02:09,494 - bootstrapping - INFO - Configuring wal-e
2022-07-28 07:02:09,494 - bootstrapping - INFO - Configuring certificate
2022-07-28 07:02:09,494 - bootstrapping - INFO - Generating ssl self-signed certificate
2022-07-28 07:02:09,544 - bootstrapping - INFO - Configuring pgqd
2022-07-28 07:02:10,970 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-07-28 07:02:11,042 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-07-28 07:02:11,093 WARNING: Postgresql is not running.
2022-07-28 07:02:11,094 INFO: Lock owner: pgsql-cluster-1; I am pgsql-cluster-0
2022-07-28 07:02:11,103 INFO: pg_controldata:
pg_control version number: 1300
Catalog version number: 202107181
Database system identifier: 7112788016677834822
Database cluster state: shut down in recovery
pg_control last modified: Thu Jul 28 07:01:48 2022
Latest checkpoint location: 0/130000C8
Latest checkpoint's REDO location: 0/13000058
Latest checkpoint's REDO WAL file: 000000350000000000000013
Latest checkpoint's TimeLineID: 53
Latest checkpoint's PrevTimeLineID: 53
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:17155
Latest checkpoint's NextOID: 425984
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 727
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 17155
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint: Thu Jul 21 01:20:50 2022
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/1303B698
Min recovery ending loc's timeline: 53
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: 286bbf5272614f592b9176b6f045c1204056bf7dc3fa2afac433183494486afc
2022-07-28 07:02:11,148 INFO: Lock owner: pgsql-cluster-1; I am pgsql-cluster-0
2022-07-28 07:02:11,353 INFO: starting as a secondary
2022-07-28 07:02:11,671 INFO: postmaster pid=60
/var/run/postgresql:5432 - no response
2022-07-28 07:02:11 GMT [60]: [1-1] 62e23473.3c 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2022-07-28 07:02:11 GMT [60]: [2-1] 62e23473.3c 0 LOG: pg_stat_kcache.linux_hz is set to 1000000
2022-07-28 07:02:12 GMT [60]: [3-1] 62e23473.3c 0 LOG: redirecting log output to logging collector process
2022-07-28 07:02:12 GMT [60]: [4-1] 62e23473.3c 0 HINT: Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:02:21,588 INFO: Lock owner: pgsql-cluster-1; I am pgsql-cluster-0
2022-07-28 07:02:21,653 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:02:31,587 INFO: Lock owner: pgsql-cluster-1; I am pgsql-cluster-0
2022-07-28 07:02:31,588 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:02:41,586 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:02:41,587 INFO: not healthy enough for leader race
2022-07-28 07:02:41,587 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:02:51,587 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:02:51,587 INFO: not healthy enough for leader race
2022-07-28 07:02:51,587 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:03:01,587 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:03:01,588 INFO: not healthy enough for leader race
2022-07-28 07:03:01,589 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:03:11,587 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:03:11,588 INFO: not healthy enough for leader race
2022-07-28 07:03:11,588 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:03:21,606 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:03:21,606 INFO: Still starting up as a standby.
2022-07-28 07:03:21,609 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:22,142 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:22,145 WARNING: Retry got exception: 'connection problems'
2022-07-28 07:03:22,146 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
2022-07-28 07:03:22,147 INFO: following a different leader because i am not the healthiest node
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:03:31,601 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:03:31,602 INFO: Still starting up as a standby.
2022-07-28 07:03:31,604 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:31,847 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:31,850 WARNING: Retry got exception: 'connection problems'
2022-07-28 07:03:31,850 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
2022-07-28 07:03:31,851 INFO: following a different leader because i am not the healthiest node
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:03:41,606 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:03:41,607 INFO: Still starting up as a standby.
2022-07-28 07:03:41,609 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:41,762 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:41,764 WARNING: Retry got exception: 'connection problems'
2022-07-28 07:03:41,765 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
2022-07-28 07:03:41,766 INFO: following a different leader because i am not the healthiest node
/var/run/postgresql:5432 - rejecting connections
2022-07-28 07:03:51,605 INFO: Lock owner: None; I am pgsql-cluster-0
2022-07-28 07:03:51,606 INFO: Still starting up as a standby.
2022-07-28 07:03:51,607 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:52,491 INFO: establishing a new patroni connection to the postgres cluster
2022-07-28 07:03:52,494 WARNING: Retry got exception: 'connection problems'
2022-07-28 07:03:52,494 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
2022-07-28 07:03:52,496 INFO: following a different leader because i am not the healthiest node
So the above toggle SHOULD have fixed it for v1.8? I seem to have something like this problem: https://github.com/zalando/postgres-operator/issues/2190
@KlavsKlavsen Kubernetes will never destroy pod 0 of a StatefulSet if you asked for one or more pods; the pods are managed in order.
Removing the readiness probe does not solve this issue, it just introduces the problem that you have, where pod 0 will never be restarted.
The only real solution is using Parallel pod management with proper liveness and readiness probes (although, as the PR says, we cannot change that for existing clusters).
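To check which policy an existing cluster's StatefulSet was created with, something like the following works (a minimal sketch; the StatefulSet name pgsql-cluster matches the pods in the logs above, adjust the name and namespace for your cluster):

kubectl get statefulset pgsql-cluster -o jsonpath='{.spec.podManagementPolicy}{"\n"}'
# Prints either OrderedReady (the Kubernetes default) or Parallel.
# The field is immutable on an existing StatefulSet, which is why the
# operator cannot simply flip it for clusters that already exist.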
I tried to kill the 0 pod, but the new pod fails the same way - so something else seems to also be an issue.
I think your issue is actually with Patroni not properly fixing that node at startup, not with the k8s part, but not having a readiness probe will cause your app to behave erratically when this happens instead of just staying up.
me too
2023-06-20 08:27:37,251 INFO: establishing a new patroni connection to the postgres cluster
2023-06-20 08:27:37,555 INFO: establishing a new patroni connection to the postgres cluster
2023-06-20 08:27:37,559 WARNING: Retry got exception: 'connection problems'
2023-06-20 08:27:37,560 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
How should I solve this problem?
FYI @christiaangoossens @KlavsKlavsen @jicki,
I faced this issue in one of our clusters. The solution is to use the patronictl reinit command to re-initialize the failed members (it removes the data directory and then copies data from the running leader to rebuild the replica).
Here are the action logs:
root@infra-shared-0:/home/postgres# patronictl -c postgres.yml list
+ Cluster: infra-shared (7205134707613937727) ---------+-----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------------+------------+---------+--------------+-----+-----------+
| infra-shared-0 | 10.42.3.32 | Replica | start failed | | unknown |
| infra-shared-1 | 10.42.5.81 | Leader | running | 109 | |
| infra-shared-2 | 10.42.4.5 | Replica | starting | | unknown |
+----------------+------------+---------+--------------+-----+-----------+
root@infra-shared-0:/home/postgres# patronictl -c postgres.yml reinit infra-shared infra-shared-0
+ Cluster: infra-shared (7205134707613937727) ---------+-----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------------+------------+---------+--------------+-----+-----------+
| infra-shared-0 | 10.42.3.32 | Replica | start failed | | unknown |
| infra-shared-1 | 10.42.5.81 | Leader | running | 109 | |
| infra-shared-2 | 10.42.4.5 | Replica | starting | | unknown |
+----------------+------------+---------+--------------+-----+-----------+
Are you sure you want to reinitialize members infra-shared-0? [y/N]: y
Failed: reinitialize for member infra-shared-0, status code=503, (restarting after failure already in progress)
Do you want to cancel it and reinitialize anyway? [y/N]: y
Success: reinitialize for member infra-shared-0
root@infra-shared-0:/home/postgres# patronictl -c postgres.yml list
+ Cluster: infra-shared (7205134707613937727) ----+-----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------------+------------+---------+---------+-----+-----------+
| infra-shared-0 | 10.42.3.32 | Replica | running | 109 | 0 |
| infra-shared-1 | 10.42.5.81 | Leader | running | 109 | |
| infra-shared-2 | 10.42.4.5 | Replica | running | 109 | 0 |
+----------------+------------+---------+---------+-----+-----------+
Hope it will be helpful for you!
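If you prefer to drive this from outside the pod, the same re-init can be done non-interactively (a sketch; the pod, cluster, and member names are the ones from my cluster above and will differ in yours):

kubectl exec infra-shared-1 -- patronictl -c /home/postgres/postgres.yml reinit infra-shared infra-shared-0 --force
# --force skips the confirmation prompts shown in the interactive session above.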
Thank you very much, that's very helpful.
I tried this, which fixed the replica, but a few hours later the cluster went back into the same state, with the replica rejecting connections. I am just running a cluster with one leader and one replica. Maybe I need to force the first pod in the StatefulSet to be the leader to prevent this from recurring?
The pods are running the image: registry.opensource.zalan.do/acid/spilo-14:2.1-p6.
I also noticed that the replica got OOMKilled and restarted, so I should probably look into preventing that specifically.
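One way to do that is to raise the memory requests/limits on the postgresql custom resource so the operator propagates them to the pods (a sketch; the manifest name pgsql-cluster and the 1Gi figure are assumptions, pick values that fit your workload):

kubectl patch postgresql pgsql-cluster --type merge -p '
{"spec": {"resources": {
  "requests": {"cpu": "250m", "memory": "1Gi"},
  "limits":   {"cpu": "1",    "memory": "1Gi"}}}}'
# The operator should then roll the StatefulSet pods with the new resources.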
Hi all, this issue possibly still exists, but I have moved on to the CloudNativePG operator instead. It's easier to restore with, in my opinion, and it has fewer components that can break.
Also note that this issue does not exist there, as CloudNativePG manages pods directly instead of using StatefulSets.
We've had to switch most of our setup to the CloudNativePG operator too, for the same reason.
Fwiw we've switched to the crunchydata postgres-operator.
The cluster fails to restart if the lock is held by pod-1 while the StatefulSet is restarting. Pod-0 will start, recognise that pod-1 holds the lock, and wait; it eventually falls into a state where it knows it is not up to date and therefore waits forever.
Because podManagementPolicy is OrderedReady instead of Parallel, pod-1 will never be started, and thus the cluster stays broken.
In my case this happened when a node failed: there was a failover to pod-1 and then both pods died.
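To confirm this state on a live cluster, the following checks can help (a sketch; the leader lock location assumes Patroni's Kubernetes DCS backed by Endpoints as used by Spilo here, and the patroni config path /run/postgres.yml is taken from the bootstrap log above):

# Who currently holds the Patroni leader lock (annotation on the cluster's leader Endpoints object):
kubectl get endpoints pgsql-cluster -o jsonpath='{.metadata.annotations.leader}{"\n"}'

# What Patroni on the stuck replica reports for each member:
kubectl exec pgsql-cluster-0 -- patronictl -c /run/postgres.yml list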