zalando/postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Postgres master/replica keep failing for no reason! They never recover! #1251

Open MohamedMSaeed opened 3 years ago

MohamedMSaeed commented 3 years ago

ISSUE

Almost every fortnight, one of the PG cluster pods fails for no apparent reason: the database goes down and I get several error messages. I don't know what the cause is, and it is starting to become very annoying.

I will post the logs sorted by date, and I will copy only one line when the same message was written many times.

Logs from coredb-0, which was the master at the time:

2020-12-11 00:41:36,001 INFO: Lock owner: coredb-0; I am coredb-0
2020-12-11 00:41:36,048 INFO: no action.  i am the leader with the lock
2020-12-11 00:41:43.825 24 LOG {ticks: 0, maint: 0, retry: 0}
2020-12-11 00:41:46,001 INFO: Lock owner: coredb-0; I am coredb-0

2020-12-11 00:41:49,594 WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc8380edeb8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /api/v1/namespaces/XXX/endpoints/coredb
....
2020-12-11 00:41:54,638 ERROR: failed to update leader lock
2020-12-11 00:41:54,638 WARNING: Retry got exception: HTTPSConnectionPool(host='10.63.240.1', port=443): Max retries exceeded with url: /api/v1/namespaces/XXX/endpoints/coredb (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc838115c88>: Failed to establish a new connection: [Errno 111] Connection refused',))
....
2020-12-11 00:41:54,762 INFO: demoted self because failed to update leader lock in DCS
2020-12-11 00:41:54,764 INFO: closed patroni connection to the postgresql cluster
2020-12-11 00:41:55 UTC [176259]: [1-1] 5fd2c053.2b083 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2020-12-11 00:41:55 UTC [176259]: [2-1] 5fd2c053.2b083 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
2020-12-11 00:41:55 UTC [176259]: [3-1] 5fd2c053.2b083 0     LOG:  listening on IPv4 address "0.0.0.0", port 5432
2020-12-11 00:41:55 UTC [176259]: [4-1] 5fd2c053.2b083 0     LOG:  listening on IPv6 address "::", port 5432
2020-12-11 00:41:55 UTC [176259]: [5-1] 5fd2c053.2b083 0     LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-12-11 00:41:55,151 INFO: postmaster pid=176259
2020-12-11 00:41:55 UTC [176259]: [6-1] 5fd2c053.2b083 0     LOG:  redirecting log output to logging collector process
2020-12-11 00:41:55 UTC [176259]: [7-1] 5fd2c053.2b083 0     HINT:  Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2020-12-11 00:41:56,001 INFO: Lock owner: coredb-0; I am coredb-0
2020-12-11 00:41:56,004 WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc8380adbe0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /api/v1/namespaces/XXX/endpoints/coredb
2020-12-11 00:42:05,251 INFO: starting after demotion in progress
2020-12-11 00:42:05,253 INFO: establishing a new patroni connection to the postgres cluster
2020-12-11 00:42:05,253 INFO: Lock owner: coredb-0; I am coredb-0
2020-12-11 00:42:05,262 WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc8381150b8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /api/v1/namespaces/XXX/endpoints/coredb
2020-12-11 00:42:14,848 ERROR: failed to update leader lock
2020-12-11 00:42:14,852 ERROR: Failed to drop replication slot 'coredb_1'
2020-12-11 00:42:14,865 INFO: not promoting because failed to update leader lock in DCS
2020-12-11 00:42:34,347 INFO: following new leader after trying and failing to obtain lock
2020-12-11 00:42:36,414 INFO: Got response from coredb-1 http://10.60.6.176:8008/patroni: {"state": "starting", "role": "replica", "database_system_identifier": "XXX", "patroni": {"version": "1.6.5", "scope": "coredb"}}
2020-12-11 00:42:36,415 WARNING: Request failed to coredb-1: GET http://10.60.6.176:8008/patroni ('xlog')
2020-12-11 00:42:45,808 INFO: Could not take out TTL lock
2020-12-11 00:42:46,037 INFO: following new leader after trying and failing to obtain lock
2020-12-11 00:42:46,038 WARNING: Loop time exceeded, rescheduling immediately.
2020-12-11 00:42:46,055 INFO: Got response from coredb-1 http://10.60.6.176:8008/patroni: {"state": "running", "postmaster_start_time": "2020-12-11 00:42:41.009 UTC", "role": "replica", "server_version": 110007, "cluster_unlocked": true, "xlog": {"received_location": 28018091536, "replayed_location": 28018091536, "replayed_timestamp": "2020-12-11 00:40:49.412 UTC", "paused": false}, "timeline": 23, "database_system_identifier": "XXX", "patroni": {"version": "1.6.5", "scope": "coredb"}}
2020-12-11 00:45:15,469 ERROR: get_cluster
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 498, in _load_cluster
    self._wait_caches()
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 492, in _wait_caches
    raise RetryFailedError('Exceeded retry deadline')
patroni.utils.RetryFailedError: 'Exceeded retry deadline'
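
From the errors above it looks like Patroni suddenly cannot reach the Kubernetes API (10.63.240.1:443), so it cannot renew the leader lock and demotes itself. This is a minimal check I now run from inside the failing pod with kubectl exec, assuming Python 3 is available in the spilo container; the addresses are the ones from my logs and will differ per cluster:

```python
# Quick connectivity probe: can this pod reach the Kubernetes API service and
# its own Patroni REST API? Host/port values are taken from the errors above.
import json
import socket
import urllib.request

KUBE_API = ("10.63.240.1", 443)          # kubernetes service IP from the errors
PATRONI_API = "http://localhost:8008/patroni"

def tcp_check(host, port, timeout=3):
    """Return True if a TCP connection succeeds, printing the error otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:               # ECONNREFUSED / EHOSTUNREACH land here
        print(f"{host}:{port} -> {exc}")
        return False

print("kube API reachable:", tcp_check(*KUBE_API))

try:
    with urllib.request.urlopen(PATRONI_API, timeout=3) as resp:
        print(json.dumps(json.loads(resp.read()), indent=2))
except OSError as exc:
    print("patroni REST API ->", exc)
```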

Then I keep getting the following logs over and over:

2020-12-13 07:57:28,633 INFO: Lock owner: coredb-1; I am coredb-0
2020-12-13 07:57:28,633 INFO: Lock owner: coredb-1; I am coredb-0
2020-12-13 07:57:28,634 INFO: starting as a secondary
2020-12-13 07:57:28 UTC [525363]: [1-1] 5fd5c968.80433 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2020-12-13 07:57:28 UTC [525363]: [2-1] 5fd5c968.80433 0     LOG:  pg_stat_kcache.linux_hz is set to 333333
2020-12-13 07:57:28 UTC [525363]: [3-1] 5fd5c968.80433 0     LOG:  listening on IPv4 address "0.0.0.0", port 5432
2020-12-13 07:57:28 UTC [525363]: [4-1] 5fd5c968.80433 0     LOG:  listening on IPv6 address "::", port 5432
2020-12-13 07:57:28 UTC [525363]: [5-1] 5fd5c968.80433 0     LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-12-13 07:57:28,939 INFO: postmaster pid=525363
2020-12-13 07:57:28 UTC [525363]: [6-1] 5fd5c968.80433 0     LOG:  redirecting log output to logging collector process
2020-12-13 07:57:28 UTC [525363]: [7-1] 5fd5c968.80433 0     HINT:  Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2020-12-13 07:57:29.295 24 LOG {ticks: 0, maint: 0, retry: 0}
/var/run/postgresql:5432 - no response
2020-12-13 07:57:30,010 INFO: Lock owner: coredb-1; I am coredb-0
2020-12-13 07:57:30,010 INFO: failed to start postgres
2020-12-13 07:57:40,010 WARNING: Postgresql is not running.
2020-12-13 07:57:40,010 INFO: Lock owner: coredb-1; I am coredb-0
2020-12-13 07:57:40,015 INFO: pg_controldata:
  pg_control version number: 1100
  Catalog version number: 201809051
  Database system identifier: XXX
  Database cluster state: in archive recovery
  pg_control last modified: Sun Dec 13 07:57:29 2020
  Latest checkpoint location: 6/850825C0
  Latest checkpoint's REDO location: 6/85082588
  Latest checkpoint's REDO WAL file: 000000170000000600000085
  Latest checkpoint's TimeLineID: 23
  Latest checkpoint's PrevTimeLineID: 23
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:13427
  Latest checkpoint's NextOID: 139538
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 562
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 13427
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Fri Dec 11 00:33:21 2020
  Fake LSN counter for unlogged rels: 0/1
  Minimum recovery ending location: 6/86022610
  Min recovery ending loc's timeline: 23
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 500
  max_worker_processes setting: 8
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float4 argument passing: by value
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: XXX
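
The interesting line for me is "Database cluster state: in archive recovery": the data directory never leaves recovery, so Patroni keeps reporting "failed to start postgres". This is a small helper I use to watch that state while the pod is looping; the PGDATA path is the spilo default on my pods and is an assumption, adjust it if your layout differs:

```python
# Print the "Database cluster state" field from pg_controldata, e.g.
# "in archive recovery" while the replica is stuck as in the logs above.
import subprocess

PGDATA = "/home/postgres/pgdata/pgroot/data"   # assumption: default spilo layout

def cluster_state(pgdata=PGDATA):
    out = subprocess.check_output(["pg_controldata", pgdata]).decode()
    for line in out.splitlines():
        if line.startswith("Database cluster state"):
            return line.split(":", 1)[1].strip()
    return "unknown"

if __name__ == "__main__":
    print(cluster_state())
```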

How I currently fix this issue:

  1. Delete the pod and let it be re-created. => this may fix the issue.
  2. Most of the time deleting the pod doesn't solve the issue, so I delete the disk (PVC) and recreate it, which is a very boring process. => this fixes the issue and the replica works again (a scripted version of both steps is sketched below).
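
This is what I do by hand with kubectl, sketched with the official Kubernetes Python client (pip install kubernetes). The namespace, pod and PVC names are from my cluster and only examples; deleting the PVC wipes that member's data, so I only do it when the other member is a healthy leader:

```python
# Recreate a stuck replica: delete the pod, and if it keeps looping, also
# delete its volume claim so the new pod re-initializes from the leader.
from kubernetes import client, config

NAMESPACE = "XXX"              # redacted above; use your own namespace
POD = "coredb-0"
PVC = "pgdata-coredb-0"        # assumption: default "pgdata-<pod>" claim naming

def recreate_replica(drop_volume=False):
    config.load_kube_config()  # or config.load_incluster_config()
    v1 = client.CoreV1Api()
    # Step 1: delete the stuck pod; the StatefulSet recreates it.
    v1.delete_namespaced_pod(name=POD, namespace=NAMESPACE)
    if drop_volume:
        # Step 2: drop the PVC so the replacement pod gets a fresh volume and
        # re-clones from the current leader instead of reusing the broken datadir.
        v1.delete_namespaced_persistent_volume_claim(name=PVC, namespace=NAMESPACE)

if __name__ == "__main__":
    recreate_replica(drop_volume=True)
```
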
DrissiReda commented 3 years ago

I also have the same issue, but I am not sure why it happens. I also noticed that when the cluster is working well and I force delete the pod, it keeps failing with this message:

2020-12-18 10:07:03,463 ERROR: ObjectCache.run MaxRetryError("HTTPSConnectionPool(host='10.96.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/postgresql/pods?labelSelector=application%3Dspilo%2Ccluster-name%3Dkast-default-postgresql (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 113] No route to host',))",)
2020-12-18 10:07:04,422 ERROR: get_cluster
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 528, in _load_cluster
    self._wait_caches()
  File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 522, in _wait_caches
    raise RetryFailedError('Exceeded retry deadline')
patroni.utils.RetryFailedError: 'Exceeded retry deadline'
2020-12-18 10:07:04,422 ERROR: get_cluster
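
As far as I understand, with the Kubernetes DCS the leader lock that Patroni keeps failing to update is stored as annotations on the cluster's Endpoints object, so it can be inspected directly. A rough sketch with the Kubernetes Python client; the namespace and cluster name are taken from the URL in my error and may need adjusting:

```python
# Dump the annotations Patroni uses as its leader lock (leader, renewTime,
# ttl, ...) from the <cluster-name> Endpoints object.
from kubernetes import client, config

NAMESPACE = "postgresql"
CLUSTER = "kast-default-postgresql"

config.load_kube_config()
v1 = client.CoreV1Api()
ep = v1.read_namespaced_endpoints(name=CLUSTER, namespace=NAMESPACE)
for key, value in (ep.metadata.annotations or {}).items():
    print(f"{key}: {value}")
```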