zalando/postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Postgres master/replica keep failing for no reason! They never recover! #1251

Open MohamedMSaeed opened 3 years ago

MohamedMSaeed commented 3 years ago

ISSUE

Almost every fortnight, one of the PG cluster pods fails for no apparent reason: the database goes down and I get several error messages. I don't know what the cause is, and it is starting to become very annoying.

I will post the logs sorted by date, and I will copy only one line when the same message was written many times.

Logs from coredb-0, which was the master at the time:

2020-12-11 00:41:36,001 INFO: Lock owner: coredb-0; I am coredb-0
2020-12-11 00:41:36,048 INFO: no action.  i am the leader with the lock
2020-12-11 00:41:43.825 24 LOG {ticks: 0, maint: 0, retry: 0}
2020-12-11 00:41:46,001 INFO: Lock owner: coredb-0; I am coredb-0

2020-12-11 00:41:49,594 WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc8380edeb8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /api/v1/namespaces/XXX/endpoints/coredb
....
2020-12-11 00:41:54,638 ERROR: failed to update leader lock
2020-12-11 00:41:54,638 WARNING: Retry got exception: HTTPSConnectionPool(host='10.63.240.1', port=443): Max retries exceeded with url: /api/v1/namespaces/XXX/endpoints/coredb (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc838115c88>: Failed to establish a new connection: [Errno 111] Connection refused',))
....
2020-12-11 00:41:54,762 INFO: demoted self because failed to update leader lock in DCS
2020-12-11 00:41:54,764 INFO: closed patroni connection to the postgresql cluster
2020-12-11 00:41:55 UTC [176259]: [1-1] 5fd2c053.2b083 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2020-12-11 00:41:55 UTC [176259]: [2-1] 5fd2c053.2b083 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
2020-12-11 00:41:55 UTC [176259]: [3-1] 5fd2c053.2b083 0     LOG:  listening on IPv4 address "0.0.0.0", port 5432
2020-12-11 00:41:55 UTC [176259]: [4-1] 5fd2c053.2b083 0     LOG:  listening on IPv6 address "::", port 5432
2020-12-11 00:41:55 UTC [176259]: [5-1] 5fd2c053.2b083 0     LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-12-11 00:41:55,151 INFO: postmaster pid=176259
2020-12-11 00:41:55 UTC [176259]: [6-1] 5fd2c053.2b083 0     LOG:  redirecting log output to logging collector process
2020-12-11 00:41:55 UTC [176259]: [7-1] 5fd2c053.2b083 0     HINT:  Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2020-12-11 00:41:56,001 INFO: Lock owner: coredb-0; I am coredb-0
2020-12-11 00:41:56,004 WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc8380adbe0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /api/v1/namespaces/XXX/endpoints/coredb
2020-12-11 00:42:05,251 INFO: starting after demotion in progress
2020-12-11 00:42:05,253 INFO: establishing a new patroni connection to the postgres cluster
2020-12-11 00:42:05,253 INFO: Lock owner: coredb-0; I am coredb-0
2020-12-11 00:42:05,262 WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc8381150b8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /api/v1/namespaces/XXX/endpoints/coredb
2020-12-11 00:42:14,848 ERROR: failed to update leader lock
2020-12-11 00:42:14,852 ERROR: Failed to drop replication slot 'coredb_1'
2020-12-11 00:42:14,865 INFO: not promoting because failed to update leader lock in DCS
2020-12-11 00:42:34,347 INFO: following new leader after trying and failing to obtain lock
2020-12-11 00:42:36,414 INFO: Got response from coredb-1 http://10.60.6.176:8008/patroni: {"state": "starting", "role": "replica", "database_system_identifier": "XXX", "patroni": {"version": "1.6.5", "scope": "coredb"}}
2020-12-11 00:42:36,415 WARNING: Request failed to coredb-1: GET http://10.60.6.176:8008/patroni ('xlog')
2020-12-11 00:42:45,808 INFO: Could not take out TTL lock
2020-12-11 00:42:46,037 INFO: following new leader after trying and failing to obtain lock
2020-12-11 00:42:46,038 WARNING: Loop time exceeded, rescheduling immediately.
2020-12-11 00:42:46,055 INFO: Got response from coredb-1 http://10.60.6.176:8008/patroni: {"state": "running", "postmaster_start_time": "2020-12-11 00:42:41.009 UTC", "role": "replica", "server_version": 110007, "cluster_unlocked": true, "xlog": {"received_location": 28018091536, "replayed_location": 28018091536, "replayed_timestamp": "2020-12-11 00:40:49.412 UTC", "paused": false}, "timeline": 23, "database_system_identifier": "XXX", "patroni": {"version": "1.6.5", "scope": "coredb"}}
2020-12-11 00:45:15,469 ERROR: get_cluster
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 498, in _load_cluster
    self._wait_caches()
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 492, in _wait_caches
    raise RetryFailedError('Exceeded retry deadline')
patroni.utils.RetryFailedError: 'Exceeded retry deadline'
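
From the errors above it looks like Patroni suddenly cannot reach the Kubernetes API (10.63.240.1:443), so it cannot renew the leader lock and demotes itself. This is a minimal check I now run from inside the failing pod with kubectl exec, assuming Python 3 is available in the spilo container; the addresses are the ones from my logs and will differ per cluster:

```python
# Quick connectivity probe: can this pod reach the Kubernetes API service and
# its own Patroni REST API? Host/port values are taken from the errors above.
import json
import socket
import urllib.request

KUBE_API = ("10.63.240.1", 443)          # kubernetes service IP from the errors
PATRONI_API = "http://localhost:8008/patroni"

def tcp_check(host, port, timeout=3):
    """Return True if a TCP connection succeeds, printing the error otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:               # ECONNREFUSED / EHOSTUNREACH land here
        print(f"{host}:{port} -> {exc}")
        return False

print("kube API reachable:", tcp_check(*KUBE_API))

try:
    with urllib.request.urlopen(PATRONI_API, timeout=3) as resp:
        print(json.dumps(json.loads(resp.read()), indent=2))
except OSError as exc:
    print("patroni REST API ->", exc)
```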

Then I keep getting the following logs over and over:

2020-12-13 07:57:28,633 INFO: Lock owner: coredb-1; I am coredb-0
2020-12-13 07:57:28,633 INFO: Lock owner: coredb-1; I am coredb-0
2020-12-13 07:57:28,634 INFO: starting as a secondary
2020-12-13 07:57:28 UTC [525363]: [1-1] 5fd5c968.80433 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2020-12-13 07:57:28 UTC [525363]: [2-1] 5fd5c968.80433 0     LOG:  pg_stat_kcache.linux_hz is set to 333333
2020-12-13 07:57:28 UTC [525363]: [3-1] 5fd5c968.80433 0     LOG:  listening on IPv4 address "0.0.0.0", port 5432
2020-12-13 07:57:28 UTC [525363]: [4-1] 5fd5c968.80433 0     LOG:  listening on IPv6 address "::", port 5432
2020-12-13 07:57:28 UTC [525363]: [5-1] 5fd5c968.80433 0     LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-12-13 07:57:28,939 INFO: postmaster pid=525363
2020-12-13 07:57:28 UTC [525363]: [6-1] 5fd5c968.80433 0     LOG:  redirecting log output to logging collector process
2020-12-13 07:57:28 UTC [525363]: [7-1] 5fd5c968.80433 0     HINT:  Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2020-12-13 07:57:29.295 24 LOG {ticks: 0, maint: 0, retry: 0}
/var/run/postgresql:5432 - no response
2020-12-13 07:57:30,010 INFO: Lock owner: coredb-1; I am coredb-0
2020-12-13 07:57:30,010 INFO: failed to start postgres
2020-12-13 07:57:40,010 WARNING: Postgresql is not running.
2020-12-13 07:57:40,010 INFO: Lock owner: coredb-1; I am coredb-0
2020-12-13 07:57:40,015 INFO: pg_controldata:
  pg_control version number: 1100
  Catalog version number: 201809051
  Database system identifier: XXX
  Database cluster state: in archive recovery
  pg_control last modified: Sun Dec 13 07:57:29 2020
  Latest checkpoint location: 6/850825C0
  Latest checkpoint's REDO location: 6/85082588
  Latest checkpoint's REDO WAL file: 000000170000000600000085
  Latest checkpoint's TimeLineID: 23
  Latest checkpoint's PrevTimeLineID: 23
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:13427
  Latest checkpoint's NextOID: 139538
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 562
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 13427
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Fri Dec 11 00:33:21 2020
  Fake LSN counter for unlogged rels: 0/1
  Minimum recovery ending location: 6/86022610
  Min recovery ending loc's timeline: 23
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 500
  max_worker_processes setting: 8
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float4 argument passing: by value
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: XXX
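
The interesting line for me is "Database cluster state: in archive recovery": the data directory never leaves recovery, so Patroni keeps reporting "failed to start postgres". This is a small helper I use to watch that state while the pod is looping; the PGDATA path is the spilo default on my pods and is an assumption, adjust it if your layout differs:

```python
# Print the "Database cluster state" field from pg_controldata, e.g.
# "in archive recovery" while the replica is stuck as in the logs above.
import subprocess

PGDATA = "/home/postgres/pgdata/pgroot/data"   # assumption: default spilo layout

def cluster_state(pgdata=PGDATA):
    out = subprocess.check_output(["pg_controldata", pgdata]).decode()
    for line in out.splitlines():
        if line.startswith("Database cluster state"):
            return line.split(":", 1)[1].strip()
    return "unknown"

if __name__ == "__main__":
    print(cluster_state())
```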

How I currently fix this issue:

  1. Delete the pod and let it be re-created. => this may fix the issue.
  2. Most of the time deleting the pod doesn't solve the issue, so I delete the disk (PVC) and recreate it, which is a very boring process. => this fixes the issue and the replica works again (a scripted version of both steps is sketched below).
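
This is what I do by hand with kubectl, sketched with the official Kubernetes Python client (pip install kubernetes). The namespace, pod and PVC names are from my cluster and only examples; deleting the PVC wipes that member's data, so I only do it when the other member is a healthy leader:

```python
# Recreate a stuck replica: delete the pod, and if it keeps looping, also
# delete its volume claim so the new pod re-initializes from the leader.
from kubernetes import client, config

NAMESPACE = "XXX"              # redacted above; use your own namespace
POD = "coredb-0"
PVC = "pgdata-coredb-0"        # assumption: default "pgdata-<pod>" claim naming

def recreate_replica(drop_volume=False):
    config.load_kube_config()  # or config.load_incluster_config()
    v1 = client.CoreV1Api()
    # Step 1: delete the stuck pod; the StatefulSet recreates it.
    v1.delete_namespaced_pod(name=POD, namespace=NAMESPACE)
    if drop_volume:
        # Step 2: drop the PVC so the replacement pod gets a fresh volume and
        # re-clones from the current leader instead of reusing the broken datadir.
        v1.delete_namespaced_persistent_volume_claim(name=PVC, namespace=NAMESPACE)

if __name__ == "__main__":
    recreate_replica(drop_volume=True)
```
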
DrissiReda commented 3 years ago

I also have the same issue, but I am not sure why it happens. I also noticed that when the cluster is working well and I force delete the pod, it keeps failing with this message:

2020-12-18 10:07:03,463 ERROR: ObjectCache.run MaxRetryError("HTTPSConnectionPool(host='10.96.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/postgresql/pods?labelSelector=application%3Dspilo%2Ccluster-name%3Dkast-default-postgresql (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 113] No route to host',))",)
2020-12-18 10:07:04,422 ERROR: get_cluster
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 528, in _load_cluster
    self._wait_caches()
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 522, in _wait_caches
    raise RetryFailedError('Exceeded retry deadline')
patroni.utils.RetryFailedError: 'Exceeded retry deadline'
2020-12-18 10:07:04,422 ERROR: get_cluster
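
As far as I understand, with the Kubernetes DCS the leader lock that Patroni keeps failing to update is stored as annotations on the cluster's Endpoints object, so it can be inspected directly. A rough sketch with the Kubernetes Python client; the namespace and cluster name are taken from the URL in my error and may need adjusting:

```python
# Dump the annotations Patroni uses as its leader lock (leader, renewTime,
# ttl, ...) from the <cluster-name> Endpoints object.
from kubernetes import client, config

NAMESPACE = "postgresql"
CLUSTER = "kast-default-postgresql"

config.load_kube_config()
v1 = client.CoreV1Api()
ep = v1.read_namespaced_endpoints(name=CLUSTER, namespace=NAMESPACE)
for key, value in (ep.metadata.annotations or {}).items():
    print(f"{key}: {value}")
```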