Please, answer some short questions which should help us to understand your problem / question better?
Which image of the operator are you using? e.g. ghcr.io/zalando/postgres-operator:v1.12.2
Where do you run it - cloud or metal? Kubernetes or OpenShift? OVH Cloud
Are you running Postgres Operator in production? yes
Type of issue? Bug report
I'm experiencing synchronization problems in a PostgreSQL cluster managed by the Zalando Postgres Operator. The cluster comprises two instances: pgsql-p3m01ob6-0 as the leader and pgsql-p3m01ob6-1 as the replica. When I increase the persistentVolume size of the instances, a race occurs between the postgres-operator controller and Patroni, which causes the update to fail.
Steps to reproduce the issue:
1 - Trigger a volume size increase on the postgresqls.acid.zalan.do resource.
2 - Resize the persistentVolume of the replica pgsql-p3m01ob6-1 and perform a rolling update to mount the resized volume.
3 - Once the rolling update of the replica is done, but before Patroni has rejoined it to the cluster, Patroni tries to switch over from the leader pgsql-p3m01ob6-0 to the replica so that the leader can go through its own rolling update and mount its resized volume. The switchover fails with the message "no switchover candidate found" because the replica has not yet caught up with the primary.
The issue does not happen in a cluster of 3 instances, which is expected, because there is always at least one replica available for switchover.
Solution:
Introduce a retry with exponential backoff when getting the SwitchoverCandidate. I noticed there is already a //TODO for this, so I will try to address this issue in the coming days.