Closed ThibaultMontaufray closed 7 months ago
We're having exactly the same issue
I am experiencing the same issue. Could we possibly receive a response or investigate this regression? Please note, this problem occurs even without a significant volume of read/write operations in the database.
Do you have any news about this issue? Do you need more details?
spilo doesn't run the election once again.
Leader election is performed by Patroni. So this issue is created in the wrong repo in the first place.
But we need to understand why we received the Fast Shutdown command in the first place
│ postgres 2023-12-18 17:40:28,863 INFO: updated leader lock during manual failover: demote │
because you/the Operator performed a manual failover
To understand the situation better, more Patroni/PG logs are needed. But again, not in this repo.
TL;DR: The leader is lost after a Fast Shutdown command is received, and spilo doesn't run the election once again.
Details:
After reading this issue, we're trying to increase the ttl and retry_timeout values. For the moment we have the default values: ttl: 30, loop_wait: 10, retry_timeout: 10
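For anyone tuning these values: the Patroni documentation states that the three settings should satisfy loop_wait + 2 * retry_timeout <= ttl. Below is a minimal Python sketch of that check. The function name is hypothetical (it is not part of Patroni or spilo), and the second set of values is only an illustration, not a recommendation.

```python
def timings_are_consistent(ttl: int, loop_wait: int, retry_timeout: int) -> bool:
    """Return True if the values satisfy loop_wait + 2 * retry_timeout <= ttl."""
    return loop_wait + 2 * retry_timeout <= ttl


# Defaults quoted in this issue: exactly on the boundary (10 + 2 * 10 == 30).
print(timings_are_consistent(ttl=30, loop_wait=10, retry_timeout=10))

# Hypothetical case: raising retry_timeout alone, without raising ttl.
print(timings_are_consistent(ttl=30, loop_wait=10, retry_timeout=20))
```

With the defaults the rule holds with no margin, so increasing retry_timeout without also increasing ttl would break it.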
But we need to understand why we received the Fast Shutdown command in the first place, and prevent the cluster from getting locked in the ERROR connection error: PQconnectPoll on the db leader state:
And the leader is stuck in the stopping state:
All I can see is only a restart of the replica pod:
It cannot be a resource issue since we oversize the staging cluster:
And here are the parameters used:
For now the pod has been restarted and I don't have more logs to provide. I'll try to add everything I have; don't hesitate to ask me if you need more information.
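If the situation recurs, capturing the cluster state before restarting the pod may help. The sketch below assumes the JSON returned by Patroni's REST API GET /cluster endpoint has a "members" list whose entries carry "name", "role" and "state" fields. The helper name and the sample payload (including the member names) are hypothetical and are not taken from this cluster.

```python
from typing import Any, Dict, List


def summarize_cluster(cluster: Dict[str, Any]) -> Dict[str, List[str]]:
    """List members that are a running leader and members stuck in 'stopping'."""
    members = cluster.get("members", [])
    running_leaders = [
        m["name"]
        for m in members
        if m.get("role") == "leader" and m.get("state") == "running"
    ]
    stopping = [m["name"] for m in members if m.get("state") == "stopping"]
    return {"running_leaders": running_leaders, "stopping": stopping}


# Hypothetical payload shaped like the situation described in this issue:
# the former leader never leaves 'stopping' and nobody takes over.
sample = {
    "members": [
        {"name": "pg-0", "role": "leader", "state": "stopping"},
        {"name": "pg-1", "role": "replica", "state": "running"},
    ]
}

print(summarize_cluster(sample))
```

An empty running_leaders list together with a non-empty stopping list would match the symptom reported here.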