zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Issue with failover #993

Open boopathykpm opened 4 years ago

boopathykpm commented 4 years ago

I have installed the latest Zalando PostgreSQL operator (1.5.0) and created a PostgreSQL cluster using the updated manifest.

I was trying to test failover with the latest Docker image registry.opensource.zalan.do/acid/spilo-12:1.6-p3. Unfortunately, the failover is not happening as expected.

I have 2 replicas running for my DB. I deleted the pod that was acting as master, but the failover didn't happen; instead I got the error below. The same does not happen with the registry.opensource.zalan.do/acid/spilo-12:1.6-p2 Docker image.

With the older Docker image, when the master pod goes down, the standby pod takes over the master role (see the test sketch after the log).

/var/run/postgresql:5432 - rejecting connections
2020-05-26 13:25:53,222 INFO: Lock owner: None; I am test-backup-test-test-postgresql-0
2020-05-26 13:25:53,222 INFO: Still starting up as a standby.
2020-05-26 13:25:53,223 INFO: establishing a new patroni connection to the postgres cluster
2020-05-26 13:25:53,776 INFO: establishing a new patroni connection to the postgres cluster
2020-05-26 13:25:53,821 WARNING: Retry got exception: 'connection problems'
2020-05-26 13:25:53,821 INFO: Error communicating with PostgreSQL. Will try again later
/var/run/postgresql:5432 - rejecting connections
2020-05-26 13:26:03,221 INFO: Lock owner: None; I am test-backup-test-test-postgresql-0
2020-05-26 13:26:03,222 INFO: Still starting up as a standby.
2020-05-26 13:26:03,223 INFO: establishing a new patroni connection to the postgres cluster
2020-05-26 13:26:04,016 INFO: establishing a new patroni connection to the postgres cluster
2020-05-26 13:26:04,017 WARNING: Retry got exception: 'connection problems'
2020-05-26 13:26:04,021 INFO: Error communicating with PostgreSQL. Will try again later
/var/run/postgresql:5432 - rejecting connections
2020-05-26 13:26:13,221 INFO: Lock owner: None; I am test-backup-test-test-postgresql-0
2020-05-26 13:26:13,221 INFO: Still starting up as a standby.
2020-05-26 13:26:13,222 INFO: establishing a new patroni connection to the postgres cluster
2020-05-26 13:26:13,424 INFO: establishing a new patroni connection to the postgres cluster
2020-05-26 13:26:13,426 WARNING: Retry got exception: 'connection problems'
2020-05-26 13:26:13,426 INFO: Error communicating with PostgreSQL. Will try again later
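
For anyone reproducing this, the test boils down to deleting the master pod and watching which pod picks up the master role. A rough sketch (the pod name is the one from this report; the spilo-role label assumes the operator's defaults):

$ kubectl delete pod test-backup-test-test-postgresql-0   # remove the current master
$ kubectl get pods -L spilo-role -w                       # watch for a replica to take over as master
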
Jan-M commented 4 years ago

At a quick look, it seems your replica pods are not healthy. Can you provide some Postgres logs, ideally of the startup sequence?

boopathykpm commented 4 years ago

On both master and replica I'm getting the same error. The scenario I tested: I manually deleted the master pod; the replica pod should then take over as master, and the broken master should recover as a replica.

The scenario works perfectly with the image registry.opensource.zalan.do/acid/spilo-12:1.6-p2 but fails with the latest image.

boopathykpm commented 4 years ago

2020-06-03 09:38:21,397 INFO: Lock owner: vmn-boopaths-dev-pgsql-0; I am vmn-boopaths-dev-pgsql-0
2020-06-03 09:38:21,444 WARNING: manual failover: members list is empty
2020-06-03 09:38:21,445 INFO: establishing a new patroni connection to the postgres cluster
2020-06-03 09:38:22,057 INFO: establishing a new patroni connection to the postgres cluster
2020-06-03 09:38:22,058 WARNING: Retry got exception: 'connection problems'
2020-06-03 09:38:22,059 INFO: master start has timed out, but continuing to wait because failover is not possible
/var/run/postgresql:5432 - rejecting connections
2020-06-03 09:38:31,397 INFO: Lock owner: vmn-boopaths-dev-pgsql-0; I am vmn-boopaths-dev-pgsql-0
2020-06-03 09:38:31,446 WARNING: manual failover: members list is empty
2020-06-03 09:38:31,446 INFO: establishing a new patroni connection to the postgres cluster
2020-06-03 09:38:31,748 INFO: establishing a new patroni connection to the postgres cluster
2020-06-03 09:38:31,749 WARNING: Retry got exception: 'connection problems'
2020-06-03 09:38:31,750 INFO: master start has timed out, but continuing to wait because failover is not possible

This happens whenever the pod goes down; it never starts up cleanly and fails every time.

Jan-M commented 4 years ago

Can you look into your Postgres log files to see what is happening in terms of startup or connection errors?
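
For reference, in the default Spilo image Patroni itself logs to the container's stdout, while Postgres writes its own daily-rotated logs under the data directory. A sketch of pulling both (paths assume the default Spilo layout and may differ between versions; the file name rotates per weekday):

$ kubectl logs test-backup-test-test-postgresql-0   # Patroni log (container stdout)
$ kubectl exec test-backup-test-test-postgresql-0 -- ls /home/postgres/pgdata/pgroot/pg_log
$ kubectl exec test-backup-test-test-postgresql-0 -- tail -n 50 /home/postgres/pgdata/pgroot/pg_log/postgresql-1.csv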

vijayakumar-fluke commented 3 years ago

This issue also happens to me. Has it been fixed?

toabi commented 2 years ago

I am experiencing the same problem. I had two instances happily running, but after a recent PVC resize, one couldn't join as standby anymore.

toabi commented 2 years ago

Fun fact: executing patronictl reinit on the pod that could not connect anymore fixed the issue!
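
For the record, a sketch of that recovery step with placeholder cluster and member names; reinit wipes the member's data directory and rebuilds it from the leader, so only point it at the broken replica:

$ kubectl exec -it acid-minimal-cluster-1 -- patronictl reinit acid-minimal-cluster acid-minimal-cluster-1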

CyberDem0n commented 2 years ago

It is nearly impossible to implement 100% reliable automatic recovery from every faulty situation on replicas. Depending on how Postgres is configured, rebuilding the replica (reinit) may be the only solution, but in our experience this happens rarely. For example, we run a few thousand clusters, and in 2021 we did only about 10 reinits, even though K8s worker nodes are regularly replaced (at least once a month).

There is always a reason why a pod (Postgres) ends up in one or another unhealthy state, and the Postgres logs nearly always contain a hint as to why. Using postgres-operator or any kind of Database-as-a-Service doesn't free you from learning the basics of PostgreSQL administration. If you identify a situation that can be recovered from without reinit, we would happily implement a solution in Patroni/Spilo/Postgres-operator.

Automatically triggering reinit is off the table. There is always a chance it could destroy the last copy of your precious data, and we won't take that responsibility.

renzodf commented 1 year ago

Hi, I'm experiencing the same issue. I have a Kubernetes cluster with 3 nodes running a single-replica Postgres cluster deployed with the operator. When I mark the node on which Postgres is running as unschedulable (kubectl cordon), the operator tries to move the pod to another node but fails. The Postgres pod shows the same logs as https://github.com/zalando/postgres-operator/issues/993#issuecomment-638086093.
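
A reproduction sketch of that scenario (the node name is a placeholder):

$ kubectl cordon worker-1       # mark the node unschedulable
$ kubectl get pods -o wide -w   # the operator moves the Postgres pod, which then loops with the logs linked above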

Any suggestion?

xvilo commented 3 months ago

I'm also having this issue; running patronictl reinit db-cluster doesn't seem to work:

$ patronictl reinit db-cluster
+ Cluster: db-cluster --------+---------+----------+----+-----------+
| Member       | Host         | Role    | State    | TL | Lag in MB |
+--------------+--------------+---------+----------+----+-----------+
| db-cluster-0 | 10.42.3.12   | Leader  | running  | 20 |           |
| db-cluster-1 | 10.42.15.156 | Replica | starting |    |   unknown |
+--------------+--------------+---------+----------+----+-----------+
Which member do you want to reinitialize [db-cluster-1]? []: 
Error:  is not a member of cluster
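
One thing worth trying here (an assumption on my side, not confirmed in this thread): pass the member name on the command line so that the interactive prompt, which here resolves to an empty member, is bypassed:

$ patronictl reinit db-cluster db-cluster-1 --force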