zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License
4.29k stars, 974 forks

Postgres pod starts as secondary node instead of primary after Kubernetes cluster upgrade #2288

Open nihaldivyam opened 1 year ago

nihaldivyam commented 1 year ago

Description: We upgraded our Kubernetes cluster from version 1.24.8 to 1.26.3 by following the official docs. When we drained the node running the Postgres pod, the pod respawned on another node without any issues. However, it started as a secondary node instead of the primary, even though we had configured it to run in 1-master mode. As a result, the database rejected all connections from the application. Please see the log below for more information:

2023-04-14 10:26:25,574 INFO: Lock owner: None; I am mattermost-pgsql-0
2023-04-14 10:26:25,574 INFO: starting as a secondary
2023-04-14 10:26:25,664 INFO: postmaster pid=85678
2023-04-12 10:26:25 UTC [15597]: [8-1] 6436b473.3ced 0     HINT:  Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - no response

FxKu commented 1 year ago

So you're saying the respawned pod is not elected as the leader and remains the replica? Or does it happen automatically after some short time? Can you check what's in the Postgres logs?

nihaldivyam commented 1 year ago

@FxKu Yes, the respawned pod is not elected as the leader and remains a replica. It happened when I drained the node where Postgres was running: the pod respawned and came up as a replica, not as the leader.

yuvraj-vansure commented 1 year ago

Hi @nihaldivyam, were you able to solve the issue? I'm also facing a similar issue, but with crunchydata/postgres-operator.

nihaldivyam commented 1 year ago

Hi @yuvraj-vansure, unfortunately we couldn't resolve the Postgres Operator issue. We deployed a new instance and restored it from backup. This caused some downtime, but the data was restored successfully.

Feel free to share any solutions you might find with crunchydata/postgres-operator.

Thanks!

nihaldivyam commented 9 months ago

@FxKu, the problem persists. Even after we restart the PostgreSQL pod, it consistently starts as a secondary, even when there is only one pod. We tried scaling up so that at least one pod would run as primary, but this did not resolve the issue.

kubectl logs api-postgresql-0 -n random --tail=15
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: 1eae14c3d9ef8fdbdcde04620b17425bf9b9434efe03cc3cd1c1ecf4d0fa9eb3

2023-12-27 13:09:17,926 INFO: Lock owner: None; I am api-postgresql-0
2023-12-27 13:09:17,927 INFO: starting as a secondary
2023-12-27 13:09:18,033 INFO: postmaster pid=167017
/var/run/postgresql:5432 - no response
2023-12-27 13:09:18 UTC [167017]: [1-1] 658c21fe.28c69 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2023-12-27 13:09:18 UTC [167017]: [2-1] 658c21fe.28c69 0     LOG:  pg_stat_kcache.linux_hz is set to 333333
2023-12-27 13:09:18 UTC [167017]: [3-1] 658c21fe.28c69 0     LOG:  redirecting log output to logging collector process
2023-12-27 13:09:18 UTC [167017]: [4-1] 658c21fe.28c69 0     HINT:  Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
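
The pattern shared by the logs in this thread is a Patroni line reporting `Lock owner: None` followed by `starting as a secondary`. As a rough way to spot it in a longer log, here is a minimal sketch; the function name `detect_leaderless_start` and its two output strings are made up for illustration and are not part of Patroni or the operator. It only matches the two log messages quoted above and says nothing about the cause.

```shell
# Hypothetical helper: reads log lines on stdin and reports whether both
# "Lock owner: None" and "starting as a secondary" appear in the input.
detect_leaderless_start() {
  awk '
    /Lock owner: None/        { no_owner = 1 }
    /starting as a secondary/ { secondary = 1 }
    END {
      if (no_owner && secondary) print "leaderless-secondary"
      else print "ok"
    }'
}

# Example usage against the pod from the comment above (needs cluster access):
# kubectl logs api-postgresql-0 -n random --tail=15 | detect_leaderless_start
```

The check is deliberately loose: it does not verify that the two lines are adjacent or come from the same start attempt, so it is best used on a short tail of the log.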

bumarcell commented 5 months ago

We're having the same problem. No idea why the only pod is starting as a secondary even though there's no lock owner 😖

2024-04-26 09:33:54,713 INFO: Lock owner: None; I am zeus-postgres-0
2024-04-26 09:33:54,714 INFO: starting as a secondary
2024-04-26 09:33:54,813 INFO: postmaster pid=26709
/var/run/postgresql:5432 - no response
2024-04-26 09:33:54 UTC [26709]: [1-1] 662b7502.6855 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2024-04-26 09:33:54 UTC [26709]: [2-1] 662b7502.6855 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
2024-04-26 09:33:54 UTC [26709]: [3-1] 662b7502.6855 0     LOG:  redirecting log output to logging collector process
2024-04-26 09:33:54 UTC [26709]: [4-1] 662b7502.6855 0     HINT:  Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - no response

bumarcell commented 5 months ago

Update: the problem seems to have been caused by old data in the S3 bucket. After we removed it, the master was promoted.
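
For anyone who wants to check whether leftover archive data is present before deleting anything, a minimal sketch using the AWS CLI follows. The function name `list_wal_archive`, the `DRY_RUN` switch, and the example prefix are all hypothetical; the real bucket and prefix depend on how backups are configured for the cluster and are not given in this thread.

```shell
# Hypothetical helper: list everything stored under an S3 prefix.
# With DRY_RUN=1 the command is printed for review instead of executed.
list_wal_archive() {
  prefix=$1
  case "$prefix" in
    s3://*) ;;  # looks like an S3 URI, continue
    *) echo "expected an s3:// prefix, got: $prefix" >&2; return 1 ;;
  esac
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "aws s3 ls --recursive $prefix"
  else
    aws s3 ls --recursive "$prefix"
  fi
}

# Example (the bucket and path are placeholders, not values from this issue):
# DRY_RUN=1 list_wal_archive s3://example-wal-bucket/spilo/zeus-postgres/
```

Listing is read-only, so it is a safe first step. Whether the objects found are stale, and whether they can be removed, still has to be judged against the cluster's own backups.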