Sonic-Y3k opened 2 years ago
Config looks fine, I think. So the problem is with bootstrapping a new standby cluster? What do the Postgres logs inside the pod say? Depending on the size of the source cluster, it might take a while for the standby to build and show the typical Patroni heartbeat messages in the logs.
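If it helps, something like this should show both the Patroni output and the Postgres server log from inside the standby pod (the pod name below is only a placeholder, adjust it to yours):

# Patroni / Spilo output of the standby pod
kubectl logs infra-demo-standby-0

# Postgres server logs live inside the pod under pgroot/pg_log; list them and tail the newest one
kubectl exec -it infra-demo-standby-0 -- bash -c 'ls -lt /home/postgres/pgdata/pgroot/pg_log/ && tail -n 100 "$(ls -t /home/postgres/pgdata/pgroot/pg_log/*.log | head -n 1)"'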
Hi @FxKu ,
yeah, that's correct. The issue appears while bootstrapping a new standby cluster. The really strange thing in this context is that cloning a cluster via S3 works flawlessly.
The Postgres cluster isn't that big; a full clone from S3 takes about 5 minutes until it's up and running.
Here are the logs:
Looking at the Postgres log from the standby pod, it seems like /scripts/wal-e-wal-fetch.sh wal-fetch is called with invalid arguments.
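To see the exact invocation and the full error, one thing that can be tried inside the standby pod is replaying the restore command by hand. This is only a sketch: the envdir path below is what a default Spilo image uses (take the real one from the ps output), and the WAL segment name is a placeholder.

# See how Patroni actually invokes the restore / wal-fetch command
ps -ef | grep -E 'restore_command|wal-fetch'

# Re-run the same fetch manually to get the full error message
# (replace the segment name with one that really exists in the bucket)
envdir /run/etc/wal-e.d/env /scripts/restore_command.sh 000000010000000000000001 /tmp/test_segment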
Greetings, Sonic-Y3k
Can you also check the postgresql-.log files? The .csv files only show backup logs. And please show the standby section of your postgres manifest.
Hi @FxKu,
absolutely. I just created a new standby cluster to reduce the log output. Here's a gist with the only .log and .csv files that contain anything.
root@infra-demo-0:/home/postgres/pgdata/pgroot/pg_log# ls -lh
total 152K
-rw-r--r--. 1 postgres postgres 0 Nov 22 08:13 postgresql-0.csv
-rw-r--r--. 1 postgres postgres 103K Nov 22 08:26 postgresql-1.csv
-rw-r--r--. 1 postgres postgres 37K Nov 22 08:26 postgresql-1.log
-rw-r--r--. 1 postgres postgres 0 Nov 22 08:13 postgresql-2.csv
-rw-r--r--. 1 postgres postgres 0 Nov 22 08:13 postgresql-3.csv
-rw-r--r--. 1 postgres postgres 0 Nov 22 08:13 postgresql-4.csv
-rw-r--r--. 1 postgres postgres 0 Nov 22 08:13 postgresql-5.csv
-rw-r--r--. 1 postgres postgres 0 Nov 22 08:13 postgresql-6.csv
-rw-r--r--. 1 postgres postgres 0 Nov 22 08:13 postgresql-7.csv
Also here's the complete manifest of the standby cluster:
---
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: infra-demo
spec:
  databases:
    demo: demo
  numberOfInstances: 1
  patroni:
    pg_hba:
      - local all all trust
      - host all all 10.0.0.0/8 md5
      - hostssl all all 10.0.0.0/8 pam
      - hostssl all +zalandos 127.0.0.1/32 pam
      - host all all 127.0.0.1/32 md5
      - hostssl all +zalandos ::1/128 pam
      - host all all ::1/128 md5
      - local replication standby trust
      - hostssl replication standby all md5
      - hostnossl all all all reject
      - hostssl all +zalandos all pam
      - hostssl all all all md5
  postgresql:
    version: "13"
  resources:
    limits:
      cpu: 2000m
      memory: 4Gi
    requests:
      cpu: 2000m
      memory: 4Gi
  teamId: infra
  users:
    demo:
      - superuser
      - createdb
  volume:
    size: 2Gi
  standby:
    s3_wal_path: "s3://replaced_bucket_name/dbtestpostgres/spilo/infra-demo/6f903fc4-2754-4111-993e-1c1e5f2b093f/wal/13"
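For completeness, this is roughly how the wal-e/wal-g environment that Spilo generates from the standby section can be inspected inside the pod. A sketch only: the envdir location and variable names are the usual Spilo defaults and may differ in other setups.

# List the generated wal-e/wal-g env directories and the variables they contain
kubectl exec -it infra-demo-0 -- bash -c 'for d in /run/etc/wal-e.d/*/; do echo "== $d"; ls "$d"; done'

# Show only the configured WAL prefixes (avoids dumping credentials)
kubectl exec -it infra-demo-0 -- bash -c 'grep -s . /run/etc/wal-e.d/*/WAL*_S3_PREFIX'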
Having the same issue, help appreciated @FxKu
We are running into the same issue and can confirm that this is an issue with WAL-E configuration.
Without configuring WAL-E everything works as expected. But as soon as we try adding the appropriate config for WAL-E, connection issues appear. After reverting the WAL-E config (we are using terraform, so changes are predictable), things again work as expected.
Environment:
- k8s server: v1.22.6-eks-7d68063
- Postgres-Operator: v1.7.1
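One quick way to check whether the WAL-E/WAL-G configuration can reach the object storage at all is to run a backup listing with the same envdir the archiving/restore uses. The path below is the usual Spilo location and may differ in other setups.

# From inside an affected pod: list backups with the generated environment
envdir /run/etc/wal-e.d/env wal-g backup-list

# or, if wal-e is the configured tool
envdir /run/etc/wal-e.d/env wal-e backup-list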
We have been hitting this issue recently too, with no obvious changes on our side. For us the following applies:
WARNING: Retry got exception: 'connection problems'
I've logged into the problematic replica pod and noticed that the following process tree is running. This takes about 10 to 15 minutes in our case. Then, magically, the processes are gone and the replica finally gets bootstrapped:
postgres 77 75 0 11:59 ? 00:00:00 sh -c envdir "/run/etc/wal-e.d/env" timeout "0" /scripts/restore_command.sh "00000006.history" "pg_wal/RECOVERYHISTORY"
postgres 80 77 0 11:59 ? 00:00:00 timeout 0 /scripts/restore_command.sh 00000006.history pg_wal/RECOVERYHISTORY
postgres 81 80 0 11:59 ? 00:00:00 wal-g wal-fetch 00000006.history pg_wal/RECOVERYHISTORY
What makes me suspicious here is that there is no 00000006.history on our S3 storage. I'm not sure what it is trying to do here. The cluster is on timeline 5, so only a 00000005.history exists.
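To narrow this down, the same fetches can be replayed by hand with the envdir from the process tree above (the timeline numbers match our cluster, adjust as needed):

# History file of the current timeline - this object exists and should download
envdir /run/etc/wal-e.d/env wal-g wal-fetch 00000005.history /tmp/00000005.history

# History file of the next timeline - this object does not exist; a clean
# "not found" should make wal-g give up quickly instead of retrying for minutes
envdir /run/etc/wal-e.d/env wal-g wal-fetch 00000006.history /tmp/00000006.history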
Any help is appreciated. I'm a bit stuck here.
An update from my side. In our case, the issue was on the S3 storage side. The object storage falsely replied with error code 500 instead of 404 when an object was requested that does not exist. This seems to lead to an infinite loop during cluster bootstrapping, because wal-g keeps looking for an object that does not exist. I'm still unsure why wal-g does so.
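For anyone who wants to rule this out on their own storage, a head request on a key that is known not to exist shows how the backend answers. Endpoint, bucket and key below are placeholders; the response should be 404 (or 403 without list permission), but never 500.

aws s3api head-object \
  --endpoint-url https://s3.example.internal \
  --bucket replaced_bucket_name \
  --key definitely/does/not/exist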
Are you using minio? I am, and my restore process seems to be stuck too. I'm still in the evaluation phase of all this.
No, we're not using Minio here.
Hi,
we are trying to create a standby cluster but we are stuck in an infinite loop in which the pod always states "connection error". This seems to appear after the bootstrap_standby_leader stage has successfully finished. For context: we successfully configured S3 with WAL-E in the main cluster. Also, cloning directly from S3 appears to work flawlessly. The only thing currently not working is deploying a standby cluster.
The last four messages repeat over and over again until the timeout is reached.
Additionally, this is the current pod_environment_configmap config:
What are we doing wrong?
Thanks in advance, Greetings, Sonic-Y3k