boopathykpm opened 4 years ago
At a quick look, it seems your replica pods are not healthy. Can you provide some Postgres logs, ideally of the startup sequence?
On both master and replica, I'm getting the same error. The scenario I tested: I manually deleted the master pod; the replica pod should take over as the master, and the broken master should recover as a replica.
The scenario works perfectly with the image registry.opensource.zalan.do/acid/spilo-12:1.6-p2 but fails with the latest image.
2020-06-03 09:38:21,397 INFO: Lock owner: vmn-boopaths-dev-pgsql-0; I am vmn-boopaths-dev-pgsql-0
2020-06-03 09:38:21,444 WARNING: manual failover: members list is empty
2020-06-03 09:38:21,445 INFO: establishing a new patroni connection to the postgres cluster
2020-06-03 09:38:22,057 INFO: establishing a new patroni connection to the postgres cluster
2020-06-03 09:38:22,058 WARNING: Retry got exception: 'connection problems'
2020-06-03 09:38:22,059 INFO: master start has timed out, but continuing to wait because failover is not possible
/var/run/postgresql:5432 - rejecting connections
2020-06-03 09:38:31,397 INFO: Lock owner: vmn-boopaths-dev-pgsql-0; I am vmn-boopaths-dev-pgsql-0
2020-06-03 09:38:31,446 WARNING: manual failover: members list is empty
2020-06-03 09:38:31,446 INFO: establishing a new patroni connection to the postgres cluster
2020-06-03 09:38:31,748 INFO: establishing a new patroni connection to the postgres cluster
2020-06-03 09:38:31,749 WARNING: Retry got exception: 'connection problems'
2020-06-03 09:38:31,750 INFO: master start has timed out, but continuing to wait because failover is not possible
This is happening when the pod goes down, and it never starts smoothly. It's failing all the time.
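When triaging a retry loop like the one above, it can help to count how often Patroni logs the connection-retry warning and the startup timeout over a saved copy of the pod log. A minimal sketch, using sample lines taken from this thread (the /tmp path is just for illustration):

```shell
# Save a copy of the Patroni/pod log (sample lines from this thread).
cat > /tmp/patroni.log <<'EOF'
2020-06-03 09:38:22,058 WARNING: Retry got exception: 'connection problems'
2020-06-03 09:38:22,059 INFO: master start has timed out, but continuing to wait because failover is not possible
2020-06-03 09:38:31,749 WARNING: Retry got exception: 'connection problems'
2020-06-03 09:38:31,750 INFO: master start has timed out, but continuing to wait because failover is not possible
EOF

# Count the connection-retry warnings; a steadily growing count means
# Patroni still cannot reach the local Postgres.
grep -c "Retry got exception" /tmp/patroni.log
```

With a live cluster you would feed this from `kubectl logs <pod>` instead of the saved sample.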
Can you look into your Postgres log files to see what is happening in terms of startup or connection errors?
This issue also happens to me. Has it been fixed?
I am experiencing the same problem. I had two instances happily running, but after a recent PVC resize, one couldn't join as standby anymore.
Fun fact: executing a patronictl reinit on the pod that could no longer connect fixed the issue!
It is nearly impossible to implement 100% reliability and automatic recovery from all faulty situations on replicas. Depending on how Postgres is configured, rebuilding the replica (reinit) may be the only solution, but in our experience this happens rarely. For example, we run a few thousand clusters, and in 2021 we did only about 10 reinits. This is despite the fact that K8s worker nodes are regularly replaced (at least once a month).
There is always a reason why a pod (Postgres) ends up in one unhealthy situation or another, and the Postgres logs nearly always contain a hint as to why. Using postgres-operator or any kind of Database-as-a-Service doesn't free you from learning the basics of PostgreSQL administration. If you identify a situation that can be recovered from without reinit, we would happily implement a solution in Patroni/Spilo/postgres-operator.
Automatically triggering reinit is off the table. There is always a chance that it could destroy the last copy of your precious data, and we won't take that responsibility.
Hi, I experience the same issue. I have a kubernetes cluster with 3 nodes running a single replica postgres cluster deployed with the operator. When I mark the node on which postgres is running as unschedulable (kubectl cordon), the operator tries to move the pod to another node but it fails. The postgres pod shows the same logs as https://github.com/zalando/postgres-operator/issues/993#issuecomment-638086093.
Any suggestion?
I'm also having the issue; running patronictl reinit db-cluster doesn't seem to work:
$ patronictl reinit db-cluster
+ Cluster: db-cluster --------+---------+----------+----+-----------+
|    Member    |     Host     |  Role   |  State   | TL | Lag in MB |
+--------------+--------------+---------+----------+----+-----------+
| db-cluster-0 | 10.42.3.12   | Leader  | running  | 20 |           |
| db-cluster-1 | 10.42.15.156 | Replica | starting |    |   unknown |
+--------------+--------------+---------+----------+----+-----------+
Which member do you want to reinitialize [db-cluster-1]? []:
Error: is not a member of cluster
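The error above appears when no member name is given at the prompt; patronictl reinit also accepts the member name as an argument (and a --force flag to skip the confirmation). As a sketch, the member stuck in the "starting" state can be pulled out of a saved `patronictl list` table and fed into the command (the saved-file path and table rows below are illustrative copies of the output in this thread):

```shell
# Save the data rows of the patronictl list table (copied from this thread).
cat > /tmp/members.txt <<'EOF'
| db-cluster-0 | 10.42.3.12   | Leader  | running  | 20 |           |
| db-cluster-1 | 10.42.15.156 | Replica | starting |    |   unknown |
EOF

# Column 5 is State; pick the member whose state is "starting" and
# strip the padding spaces from the member name (column 2).
member=$(awk -F'|' '$5 ~ /starting/ { gsub(/ /, "", $2); print $2 }' /tmp/members.txt)

# Print the non-interactive reinit invocation instead of running it here.
echo "patronictl reinit --force db-cluster $member"
```

Running the printed command inside the pod (e.g. via kubectl exec) avoids the interactive prompt entirely.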
I have installed the latest Zalando Postgresql operator (1.5.0) and tried to create the Postgresql DB cluster using the updated manifest.
I was trying to test failover with the latest docker image registry.opensource.zalan.do/acid/spilo-12:1.6-p3. Unfortunately, failover is not happening as expected. I have 2 replicas running for my DB; when I deleted the pod that was acting as master, the failover didn't happen and I got the error below instead. The same does not happen with the registry.opensource.zalan.do/acid/spilo-12:1.6-p2 docker image: with the old image, when the master pod goes down, the standby pod takes the master role.