olivier-derom opened this issue 2 months ago
Oh no! This doesn't sound nice. Can you share some snippets of your operator configuration and service account so we can try to replicate? Our setup is not that different, but our backups continue to run.
Quite a few things have changed and your case may require a different configuration. Spilo has some config options for this, but they are likely not manageable by the operator yet.
@FxKu Sure! Here are some snippets:
These are the config files of v1.12.2, but as stated earlier, the only thing I then changed is the helm chart version; I also manually updated the spilo image and the logical backup image, as we use a private repo pull-through.
Hope this can help!
What if you change the docker image to the previous one, ghcr.io/zalando/spilo-16:3.2-p3? Does v1.13.0 continue to work then?
@FxKu I can confirm that the issue lies with the spilo image: using chart 1.13.0 but spilo 3.2-p3, the WAL archiving works correctly.
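For reference, pinning the previous image in the operator helm values looks roughly like this for us; the key path is from the chart version we run and may differ in yours, and I'm showing the upstream image rather than our mirrored one:

# operator helm values - pin the spilo image used for the database pods
configGeneral:
  docker_image: ghcr.io/zalando/spilo-16:3.2-p3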
Sorry for chiming in; my question is not directly related, but the snippets are very useful for it. What is the reason for providing the wal_s3_bucket multiple times, once in the operator config (configAwsOrGcp.WAL_S3_BUCKET) and once in the postgresql resource env (WAL_S3_BUCKET)? There would be a third way via the pod_environment_secret/configmap as well. I tried providing it only in the pod_environment_secret to keep all the archiving-related S3 config in one place, but then /run/etc/wal-e.d/env is missing, so I assume the backup is not working either.
In this dummy example there is indeed no real benefit to defining it twice, as both values are the same. The reason I define it twice is the order of priority: all our postgres clusters use the S3 path provided in the operatorconfig as the default, but for some clusters (e.g. ones you want to share to create standby replicas) we want to overrule that S3 location with another bucket or path.
This way, if for some reason we want to change the default S3 path, it can be done on a single line, while we still have a way to overrule that default per cluster.
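Trimmed down to the relevant keys, the layering looks roughly like this for us (bucket and cluster names are placeholders, and the exact key casing may differ between chart versions):

# operator helm values - cluster-wide default bucket
configAwsOrGcp:
  wal_s3_bucket: default-backup-bucket

# postgresql manifest - per-cluster override via env, which takes priority
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-example-cluster
spec:
  env:
    - name: WAL_S3_BUCKET
      value: standby-share-bucket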
Not sure why your /run/etc/wal-e.d/env is missing.
Thanks for the quick reply. It makes sense with the override; I'm just not sure why it must be set in the operatorconfig. That seems to be the culprit of the missing /run/etc/wal-e.d/env.
It is caused by wal-g changes: https://github.com/wal-g/wal-g/pull/1377
Last working wal-g version: https://github.com/wal-g/wal-g/releases/tag/v2.0.1
Last working postgres-operator version is indeed: https://github.com/zalando/postgres-operator/releases/tag/v1.12.2
Last working spilo image version: https://github.com/zalando/spilo/releases/tag/3.2-p3
I'm still investigating the wal-g bug: it now expects both AWS_ROLE_ARN and AWS_ROLE_SESSION_NAME to be provided, but at the same time it does not accept an IAM IRSA session name containing ':' characters (see the transcript below and the configmap sketch after it):
root@temporal-postgresql-0:/home/postgres# wal-g --version
wal-g version v3.0.3 3f88f3c 2024.08.08_17:53:40 PostgreSQL
root@temporal-postgresql-0:/home/postgres# wal-g-v2.0.1 --version
wal-g version v2.0.1 b7d53dd 2022.08.25_09:34:20 PostgreSQL
root@temporal-postgresql-0:/home/postgres# export AWS_ROLE_SESSION_NAME=system:serviceaccount:automation-service:postgres-pod-sa
root@temporal-postgresql-0:/home/postgres# echo $AWS_ROLE_ARN
arn:aws:iam::111111111111:role/postgres-backup-role
root@temporal-postgresql-0:/home/postgres# envdir /run/etc/wal-e.d/env/ wal-g backup-list
ERROR: 2024/10/11 15:52:43.441470 configure primary storage: configure storage with prefix "s3://postgres-backup/spilo/temporal-postgresql/12075954-67d5-4764-a7ea-df5925ca27fc/wal/15": create S3 storage: create new AWS session: configure session: assume role by ARN: WebIdentityErr: failed to retrieve credentials
caused by: ValidationError: 1 validation error detected: Value 'system:serviceaccount:automation-service:postgres-pod-sa' at 'roleSessionName' failed to satisfy constraint: Member must satisfy regular expression pattern: [\w+=,.@-]*
status code: 400, request id: ce6c3656-6228-4d5d-94a4-9ea1670d1cf6
root@temporal-postgresql-0:/home/postgres# unset AWS_ROLE_SESSION_NAME
root@temporal-postgresql-0:/home/postgres# envdir /run/etc/wal-e.d/env/ wal-g-v2.0.1 backup-list
name modified wal_segment_backup_start
base_000000010000000000000004 2024-09-20T11:34:19Z 000000010000000000000004
base_000000010000000000000006 2024-09-20T12:00:03Z 000000010000000000000006
base_00000001000000000000001F 2024-09-21T00:00:03Z 00000001000000000000001F
base_000000010000000000000038 2024-09-21T12:00:03Z 000000010000000000000038
base_000000010000000000000051 2024-09-22T00:00:03Z 000000010000000000000051
base_00000001000000000000006A 2024-09-22T12:00:03Z 00000001000000000000006A
base_000000010000000000000083 2024-09-23T00:00:03Z 000000010000000000000083
base_00000001000000000000009C 2024-09-23T12:00:03Z 00000001000000000000009C
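As a possible workaround I'm still evaluating (untested, so treat the names and the approach as assumptions on my side), the operator's pod_environment_configmap could be used to inject a session name that satisfies the [\w+=,.@-]* pattern instead of the colon-separated service account identity:

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-env          # referenced by the operator's pod_environment_configmap option
  namespace: automation-service
data:
  AWS_ROLE_SESSION_NAME: postgres-pod-sa   # no ':' characters, so it matches the allowed pattern

I have not verified yet whether wal-g v3 then completes the IRSA web identity flow.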
We create logical backups (pg_dump) and WAL-G+basebackups for our clusters. We use a k8s service account which is bound to an IAM role for S3 access. postgres-operator is deployed using helm.
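For context, a minimal sketch of that binding as an IRSA-style annotation on the pods' service account (name, namespace, and account ID are placeholders):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: postgres-pod                        # placeholder, matches the operator's pod_service_account_name
  namespace: example-namespace
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/example-postgres-backup-role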
When running operator version 1.12.2 (and spilo 16:3.2-p3), both the logical backup cronjob and the WAL-G+basebackups work as intended. I validate the WAL backup using:
PGUSER=postgres
envdir "/run/etc/wal-e.d/env" /scripts/postgres_backup.sh "/home/postgres/pgdata/pgroot/data"
When I update the postgres-operator to 1.13.0 (and spilo to 16:3.3-p1), the logical backups still work, but the WAL+basebackups do not work anymore. When manually trying to create a basebackup with the same command, I get an error.
It seems to be an error specific to using a service account that assumes an IAM role to access S3, specifically when running a basebackup. Logical backups are able to put the pg_dump on S3 via the same authentication method.
No other values were changed other than the spilo image and helm chart version.
Let me know if you need additional information.