zalando / spilo

Highly available elephant herd: HA PostgreSQL cluster using Docker
Apache License 2.0

Leader lost: PQconnectPoll on the db leader #957

Closed ThibaultMontaufray closed 7 months ago

ThibaultMontaufray commented 9 months ago

TL;DR: The leader is lost after a Fast Shutdown command is received, and Spilo doesn't run the election again.

Details:

After reading this issue, we are trying a configuration that increases the ttl and retry_timeout delays. For the moment we have the default values: ttl: 30, loop_wait: 10, retry_timeout: 10.
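
For context, this is roughly the kind of change we are experimenting with, applied from inside the Spilo pod with patronictl; the numbers below are purely illustrative, not values we have validated:

    # illustrative values only; Patroni suggests keeping loop_wait + 2 * retry_timeout <= ttl
    kubectl exec -ti mycluster-staging-db-0 -- patronictl edit-config \
      --set 'ttl=60' \
      --set 'loop_wait=10' \
      --set 'retry_timeout=20' \
      --force

(patronictl should pick up its configuration from the environment inside the Spilo container; if not, point it at the Patroni config file with -c.)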

But we need to understand why we received the Fast Shutdown command in the first place, and keep the cluster from staying locked in the ERROR connection error: PQconnectPoll state on the db leader:

postgres 2023-12-18 17:40:28,837 INFO: Lock owner: mycluster-staging-db-0; I am mycluster-staging-db-0
postgres 2023-12-18 17:40:28,863 INFO: updated leader lock during manual failover: demote
postgres 2023-12-18 17:40:38,806 INFO: Lock owner: mycluster-staging-db-0; I am mycluster-staging-db-0
postgres 2023-12-18 17:40:38,828 INFO: updated leader lock during manual failover: demote
postgres 2023-12-18 17:40:40.670 36 ERROR connection error: PQconnectPoll
postgres 2023-12-18 17:40:40.670 36 ERROR
postgres 2023-12-18 17:40:40.670 36 WARNING mdm: default timeout
postgres 2023-12-18 17:40:40.671 36 ERROR connection error: PQconnectPoll
postgres 2023-12-18 17:40:40.671 36 ERROR
postgres 2023-12-18 17:40:40.671 36 ERROR connection error: PQconnectPoll
postgres 2023-12-18 17:40:40.671 36 ERROR
postgres 2023-12-18 17:40:40.671 36 WARNING postgres: default timeout

And the leader is stuck in the stopping state:

Cluster: mycluster-staging-db (7244940027248418891) --------+----+-----------+
| Member                 | Host        | Role    | State    | TL | Lag in MB |
+------------------------+-------------+---------+----------+----+-----------+
| mycluster-staging-db-0 | 10.2.11.44  | Leader  | stopping |    |           |
| mycluster-staging-db-1 | 10.2.12.157 | Replica | running  | 27 |         0 |
+------------------------+-------------+---------+----------+----+-----------+

All I can see is a restart of the replica pod:

Running              mycluster-staging-pool-1-node-2e5ac4         3d2h
Running              mycluster-staging-pool-1-node-eea2a4         77m

It cannot be a resource issue, since we oversized the staging cluster:

  resources:
    limits:
      cpu: 7000m
      memory: 30Gi
    requests:
      cpu: 2000m
      memory: 10Gi

And here are the parameters used:

    parameters:
      datestyle: iso, mdy
      deadlock_timeout: 10s
      default_text_search_config: pg_catalog.english
      dynamic_shared_memory_type: posix
      idle_in_transaction_session_timeout: "60000"
      lc_messages: en_US.utf8
      lc_monetary: en_US.utf8
      lc_numeric: en_US.utf8
      lc_time: en_US.utf8
      listen_addresses: '*'
      log_lock_waits: "on"
      log_timezone: Etc/UTC
      max_connections: "500"
      max_prepared_transactions: "100"
      shared_buffers: 7GB
      superuser_reserved_connections: "3"
      timezone: Etc/UTC
      wal_level: logical

For now the pod has been restarted and I don't have more logs to provide. I'll try to add everything I have; don't hesitate to ask if you need more information.

devlifealways commented 9 months ago

We're having exactly the same issue

mbfmbf commented 9 months ago

I am experiencing the same issue. Could we possibly receive a response or investigate this regression? Please note, this problem occurs even without a significant volume of read/write operations in the database.

ThibaultMontaufray commented 8 months ago

Do you have any news about this issue? Do you need more details?

hughcapet commented 7 months ago

Spilo doesn't run the election again.

Leader election is performed by Patroni, so this issue was created in the wrong repo in the first place.

But we need to understand why we received the Fast Shutdown command in the first place

postgres 2023-12-18 17:40:28,863 INFO: updated leader lock during manual failover: demote

because you/the Operator performed a manual failover

To understand the situation better, more Patroni/PG logs are needed. But again, not in this repo.
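
If you do collect more, something along these lines usually captures what is relevant (pod name taken from your output above; adjust to your setup):

    kubectl logs mycluster-staging-db-0                            # Patroni logs go to the pod's stdout in Spilo
    kubectl logs mycluster-staging-db-0 --previous                 # logs of the container instance before a restart
    kubectl exec -ti mycluster-staging-db-0 -- patronictl list     # current view of the cluster members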