zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Standby-Cluster - WARNING: Retry got exception: 'connection problems' #1691

Open Sonic-Y3k opened 2 years ago

Sonic-Y3k commented 2 years ago


Hi,

we are trying to create a standby cluster but are stuck in an infinite loop in which the pod keeps reporting "connection problems". This appears right after the bootstrap_standby_leader stage has finished successfully. For context: we successfully configured S3 with WAL-E in the main cluster, and cloning directly from S3 works flawlessly. The only thing that is currently not working is deploying a standby cluster.

❯ kubectl logs infra-demo-0 -f
2021-11-16 14:08:07,744 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2021-11-16 14:08:09,750 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup
2021-11-16 14:08:09,751 - bootstrapping - INFO - No meta-data available for this provider
2021-11-16 14:08:09,752 - bootstrapping - INFO - Looks like your running local
2021-11-16 14:08:09,783 - bootstrapping - INFO - Configuring wal-e
2021-11-16 14:08:09,784 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2021-11-16 14:08:09,784 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2021-11-16 14:08:09,785 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2021-11-16 14:08:09,785 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2021-11-16 14:08:09,785 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2021-11-16 14:08:09,785 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2021-11-16 14:08:09,785 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2021-11-16 14:08:09,785 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2021-11-16 14:08:09,786 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2021-11-16 14:08:09,786 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2021-11-16 14:08:09,786 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2021-11-16 14:08:09,786 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2021-11-16 14:08:09,786 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2021-11-16 14:08:09,787 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2021-11-16 14:08:09,787 - bootstrapping - INFO - Configuring bootstrap
2021-11-16 14:08:09,787 - bootstrapping - INFO - Configuring crontab
2021-11-16 14:08:09,787 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2021-11-16 14:08:09,797 - bootstrapping - INFO - Configuring certificate
2021-11-16 14:08:09,798 - bootstrapping - INFO - Generating ssl self-signed certificate
2021-11-16 14:08:09,834 - bootstrapping - INFO - Configuring pgbouncer
2021-11-16 14:08:09,835 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2021-11-16 14:08:09,835 - bootstrapping - INFO - Configuring standby-cluster
2021-11-16 14:08:09,835 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALE_S3_PREFIX
2021-11-16 14:08:09,836 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALG_S3_PREFIX
2021-11-16 14:08:09,836 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/AWS_ACCESS_KEY_ID
2021-11-16 14:08:09,836 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/AWS_SECRET_ACCESS_KEY
2021-11-16 14:08:09,836 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALE_S3_ENDPOINT
2021-11-16 14:08:09,837 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/AWS_ENDPOINT
2021-11-16 14:08:09,837 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALG_DISABLE_S3_SSE
2021-11-16 14:08:09,837 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/AWS_S3_FORCE_PATH_STYLE
2021-11-16 14:08:09,838 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/USE_WALG_BACKUP
2021-11-16 14:08:09,838 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/WALE_LOG_DESTINATION
2021-11-16 14:08:09,838 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-standby/TMPDIR
2021-11-16 14:08:09,838 - bootstrapping - INFO - Configuring log
2021-11-16 14:08:09,838 - bootstrapping - INFO - Writing to file /run/etc/log.d/env/LOG_TMPDIR
2021-11-16 14:08:09,838 - bootstrapping - INFO - Writing to file /run/etc/log.d/env/LOG_AWS_HOST
2021-11-16 14:08:09,838 - bootstrapping - INFO - Writing to file /run/etc/log.d/env/LOG_S3_KEY
2021-11-16 14:08:09,839 - bootstrapping - INFO - Writing to file /run/etc/log.d/env/LOG_S3_BUCKET
2021-11-16 14:08:09,839 - bootstrapping - INFO - Writing to file /run/etc/log.d/env/PGLOG
2021-11-16 14:08:09,839 - bootstrapping - INFO - Configuring patroni
2021-11-16 14:08:09,848 - bootstrapping - INFO - Writing to file /run/postgres.yml
2021-11-16 14:08:09,848 - bootstrapping - INFO - Configuring pgqd
2021-11-16 14:08:09,848 - bootstrapping - INFO - Configuring pam-oauth2
2021-11-16 14:08:09,849 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2021-11-16 14:08:11,108 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2021-11-16 14:08:11,122 INFO: No PostgreSQL configuration items changed, nothing to reload.
2021-11-16 14:08:11,124 INFO: Lock owner: None; I am infra-demo-0
2021-11-16 14:08:11,191 INFO: trying to bootstrap a new standby leader
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "backup-fetch".
        STRUCTURED: time=2021-11-16T14:08:17.557794-00 pid=80
2021-11-16 14:08:21,624 INFO: Lock owner: None; I am infra-demo-0
2021-11-16 14:08:21,625 INFO: not healthy enough for leader race
2021-11-16 14:08:21,643 INFO: bootstrap_standby_leader in progress
2021-11-16 14:08:31,624 INFO: Lock owner: None; I am infra-demo-0
2021-11-16 14:08:31,624 INFO: not healthy enough for leader race
2021-11-16 14:08:31,624 INFO: bootstrap_standby_leader in progress
wal_e.worker.s3.s3_worker INFO     MSG: beginning partition download
        DETAIL: The partition being downloaded is part_00000000.tar.lzo.
        HINT: The absolute S3 key is dbtestpostgres/spilo/infra-demo/3f5f7482-c9f8-4654-97dd-fb0c644caa80/wal/13/basebackups_005/base_00000004000000000000006D_00000040/tar_partitions/part_00000000.tar.lzo.
        STRUCTURED: time=2021-11-16T14:08:36.697552-00 pid=80
2021-11-16 14:08:41,624 INFO: Lock owner: None; I am infra-demo-0
2021-11-16 14:08:41,624 INFO: not healthy enough for leader race
2021-11-16 14:08:41,624 INFO: bootstrap_standby_leader in progress
2021-11-16 14:08:48,845 INFO: replica has been created using bootstrap_standby_with_wale
2021-11-16 14:08:48,847 INFO: bootstrapped clone from remote master None
2021-11-16 14:08:49,043 INFO: postmaster pid=97
2021-11-16 14:08:49 UTC [97]: [1-1] 6193bb71.61 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2021-11-16 14:08:49 UTC [97]: [2-1] 6193bb71.61 0     LOG:  pg_stat_kcache.linux_hz is set to 500000
/var/run/postgresql:5432 - no response
2021-11-16 14:08:49 UTC [97]: [3-1] 6193bb71.61 0     LOG:  redirecting log output to logging collector process
2021-11-16 14:08:49 UTC [97]: [4-1] 6193bb71.61 0     HINT:  Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2021-11-16 14:08:51,625 INFO: Lock owner: None; I am infra-demo-0
2021-11-16 14:08:51,625 INFO: not healthy enough for leader race
2021-11-16 14:08:51,643 INFO: bootstrap_standby_leader in progress
/var/run/postgresql:5432 - rejecting connections
[...]
2021-11-16 14:09:01,625 INFO: Lock owner: None; I am infra-demo-0
2021-11-16 14:09:01,625 INFO: not healthy enough for leader race
2021-11-16 14:09:01,625 INFO: bootstrap_standby_leader in progress
2021-11-16 14:09:49,797 INFO: Lock owner: None; I am infra-demo-0
2021-11-16 14:09:49,797 INFO: Still starting up as a standby.
2021-11-16 14:09:49,798 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2021-11-16 14:09:49,856 INFO: establishing a new patroni connection to the postgres cluster
2021-11-16 14:09:50,529 INFO: establishing a new patroni connection to the postgres cluster
2021-11-16 14:09:50,531 WARNING: Retry got exception: 'connection problems'
2021-11-16 14:09:50,584 INFO: initialized a new cluster
/var/run/postgresql:5432 - rejecting connections
[...]
2021-11-16 14:26:11,097 INFO: Lock owner: infra-demo-0; I am infra-demo-0
2021-11-16 14:26:11,146 WARNING: manual failover: members list is empty
2021-11-16 14:26:11,146 INFO: establishing a new patroni connection to the postgres cluster
2021-11-16 14:26:11,429 INFO: establishing a new patroni connection to the postgres cluster
2021-11-16 14:26:11,430 WARNING: Retry got exception: 'connection problems'
2021-11-16 14:26:11,431 INFO: master start has timed out, but continuing to wait because failover is not possible

The last four messages repeat over and over again until the timeout is reached.

Additionally, this is the current pod_environment_configmap:

---
apiVersion: v1
data:
  AWS_ACCESS_KEY_ID: C34...
  AWS_ENDPOINT: https://someurl
  AWS_S3_FORCE_PATH_STYLE: "true"
  AWS_SECRET_ACCESS_KEY: ZD...
  STANDBY_AWS_ACCESS_KEY_ID: C34...
  STANDBY_AWS_ENDPOINT: https://someurl
  STANDBY_AWS_S3_FORCE_PATH_STYLE: "true"
  STANDBY_AWS_SECRET_ACCESS_KEY: ZD...
  STANDBY_USE_WALG_RESTORE: "false"
  USE_WALG_BACKUP: "false"
  USE_WALG_RESTORE: "false"
kind: ConfigMap
metadata:
  name: postgres-pod-config
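
For completeness, this ConfigMap only takes effect if the operator is pointed at it via pod_environment_configmap. A minimal sketch of that wiring, assuming the CRD-based OperatorConfiguration and that the ConfigMap lives in the default namespace:

---
apiVersion: acid.zalan.do/v1
kind: OperatorConfiguration
metadata:
  name: postgres-operator
configuration:
  kubernetes:
    # namespace/name of the ConfigMap shown above (adjust to where it actually lives)
    pod_environment_configmap: default/postgres-pod-config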

What are we doing wrong?

Thanks in advance, Greetings, Sonic-Y3k

FxKu commented 2 years ago

Config looks fine, I think. So the problem is with bootstrapping a new standby cluster? What do the Postgres logs inside the pod tell you? Depending on the size of the source cluster, it might take a while for the standby to build and to show the typical Patroni heartbeat messages in the logs.
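
One way to pull those logs, assuming the default Spilo log location inside the pod (file names rotate by weekday):

kubectl exec -it infra-demo-0 -- ls -lh /home/postgres/pgdata/pgroot/pg_log
kubectl exec -it infra-demo-0 -- tail -n 100 /home/postgres/pgdata/pgroot/pg_log/postgresql-1.log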

Sonic-Y3k commented 2 years ago

Hi @FxKu ,

yeah, that's correct. The issue appears while bootstrapping a new standby cluster. The really strange thing in this context is that cloning a cluster via S3 works flawlessly.

The Postgres cluster isn't that big; a full clone from S3 takes about 5 minutes until it's up and running.
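
For comparison, the clone-from-S3 setup that does work is configured through a clone section in the manifest; a rough sketch with placeholder values (uid taken from the source cluster's WAL path, timestamp arbitrary), not the exact manifest used here:

# under spec: of the postgresql manifest
clone:
  cluster: infra-demo
  uid: "3f5f7482-c9f8-4654-97dd-fb0c644caa80"   # uid of the source cluster (placeholder)
  timestamp: "2021-11-15T12:00:00+00:00"        # point in time to restore to (placeholder)
  s3_wal_path: "s3://replaced_bucket_name/dbtestpostgres/spilo/infra-demo/3f5f7482-c9f8-4654-97dd-fb0c644caa80/wal/13"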

Here are the logs:

Looking at the Postgres log from the standby pod, it seems like /scripts/wal-e-wal-fetch.sh wal-fetch is called with invalid arguments.
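
The exact restore command Patroni ended up with can be checked in the configuration Spilo writes into the pod (see the "Writing to file /run/postgres.yml" line in the bootstrap log); for example:

kubectl exec -it infra-demo-0 -- grep -B 2 -A 2 restore_command /run/postgres.yml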

Greetings, Sonic-Y3k

FxKu commented 2 years ago

Can you also check the postgresql-*.csv log files? The *.log files only show backup logs. And please show the standby section of your postgres manifest.

Sonic-Y3k commented 2 years ago

Hi @FxKu,

absolutely. I just created a new standby cluster to reduce the log output. Here's a gist with the only .log and .csv files that contain anything.

root@infra-demo-0:/home/postgres/pgdata/pgroot/pg_log# ls -lh
total 152K
-rw-r--r--. 1 postgres postgres    0 Nov 22 08:13 postgresql-0.csv
-rw-r--r--. 1 postgres postgres 103K Nov 22 08:26 postgresql-1.csv
-rw-r--r--. 1 postgres postgres  37K Nov 22 08:26 postgresql-1.log
-rw-r--r--. 1 postgres postgres    0 Nov 22 08:13 postgresql-2.csv
-rw-r--r--. 1 postgres postgres    0 Nov 22 08:13 postgresql-3.csv
-rw-r--r--. 1 postgres postgres    0 Nov 22 08:13 postgresql-4.csv
-rw-r--r--. 1 postgres postgres    0 Nov 22 08:13 postgresql-5.csv
-rw-r--r--. 1 postgres postgres    0 Nov 22 08:13 postgresql-6.csv
-rw-r--r--. 1 postgres postgres    0 Nov 22 08:13 postgresql-7.csv

Also here's the complete manifest of the standby cluster:

---
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: infra-demo
spec:
  databases:
    demo: demo
  numberOfInstances: 1
  patroni:
    pg_hba:
      - local     all             all       trust
      - host      all             all       10.0.0.0/8         md5
      - hostssl   all             all       10.0.0.0/8         pam
      - hostssl   all             +zalandos 127.0.0.1/32       pam
      - host      all             all       127.0.0.1/32       md5
      - hostssl   all             +zalandos ::1/128            pam
      - host      all             all       ::1/128            md5
      - local     replication     standby   trust
      - hostssl   replication     standby   all                md5
      - hostnossl all             all       all                reject
      - hostssl   all             +zalandos all                pam
      - hostssl   all             all       all                md5
  postgresql:
    version: "13"
  resources:
    limits:
      cpu: 2000m
      memory: 4Gi
    requests:
      cpu: 2000m
      memory: 4Gi
  teamId: infra
  users:
    demo:
    - superuser
    - createdb
  volume:
    size: 2Gi
  standby:
    s3_wal_path: "s3://replaced_bucket_name/dbtestpostgres/spilo/infra-demo/6f903fc4-2754-4111-993e-1c1e5f2b093f/wal/13"

huetterma commented 2 years ago

Having the same issue, help appreciated @FxKu

Jasper-Ben commented 2 years ago

We are running into the same issue and can confirm that it is related to the WAL-E configuration.

Without configuring WAL-E, everything works as expected. But as soon as we add the corresponding WAL-E config, the connection issues appear. After reverting the WAL-E config (we are using Terraform, so the changes are predictable), everything works as expected again.

Environment: k8s Server: v1.22.6-eks-7d68063 Postgres-Operator: v1.7.1

thedatabaseme commented 2 years ago

We have been hitting this issue recently too, without any obvious changes on our side. For us the following applies:

I logged into the problematic replica pod and noticed that the following process tree is running. This takes about 10-15 minutes in our case; then, magically, the processes are gone and the replica finally gets bootstrapped:

postgres      77      75  0 11:59 ?        00:00:00 sh -c envdir "/run/etc/wal-e.d/env" timeout "0" /scripts/restore_command.sh "00000006.history" "pg_wal/RECOVERYHISTORY"
postgres      80      77  0 11:59 ?        00:00:00 timeout 0 /scripts/restore_command.sh 00000006.history pg_wal/RECOVERYHISTORY
postgres      81      80  0 11:59 ?        00:00:00 wal-g wal-fetch 00000006.history pg_wal/RECOVERYHISTORY

What makes me suspicious here is that there is no 00000006.history on our S3 storage, so I'm not sure what it is trying to fetch there. The cluster is on timeline 5, and therefore only a 00000005.history exists.
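
To see what the storage actually returns for that request, the stuck fetch can be reproduced by hand inside the replica pod, reusing the same envdir the restore command uses (pod name is a placeholder):

kubectl exec -it <replica-pod> -- envdir /run/etc/wal-e.d/env wal-g wal-fetch 00000006.history /tmp/RECOVERYHISTORY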

Any help is appreciated. I'm a bit stuck here.

thedatabaseme commented 2 years ago

An update from my side: in our case, the issue was on the S3 storage side. The object storage incorrectly replied with error code 500 instead of 404 when an object was requested that does not exist. This seems to lead to an infinite loop during cluster bootstrapping, because wal-g keeps looking for an object that does not exist.

I'm still unsure why wal-g does this.
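
A quick way to verify this from outside is to request a key that is known not to exist and look at the HTTP status code; a well-behaved S3 endpoint answers 404 (or 403, depending on list permissions), not 500. A rough sketch against the placeholder endpoint and bucket from above:

curl -s -o /dev/null -w '%{http_code}\n' "https://someurl/replaced_bucket_name/this-key-does-not-exist"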

eigood commented 2 years ago

> An update from my side: in our case, the issue was on the S3 storage side. The object storage incorrectly replied with error code 500 instead of 404 when an object was requested that does not exist. This seems to lead to an infinite loop during cluster bootstrapping, because wal-g keeps looking for an object that does not exist.
>
> I'm still unsure why wal-g does this.

Are you using minio? I am, and my restore process seems to be stuck too. I'm still in the evaluation phase of all this.

thedatabaseme commented 2 years ago

> An update from my side: in our case, the issue was on the S3 storage side. The object storage incorrectly replied with error code 500 instead of 404 when an object was requested that does not exist. This seems to lead to an infinite loop during cluster bootstrapping, because wal-g keeps looking for an object that does not exist. I'm still unsure why wal-g does this.
>
> Are you using minio? I am, and my restore process seems to be stuck too. I'm still in the evaluation phase of all this.

No, we're not using Minio here.