zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

WAL-G backup broken since 1.13.0, works in 1.12.2 #2747

Open · olivier-derom opened this issue 2 months ago

olivier-derom commented 2 months ago

Please answer some short questions which should help us understand your problem / question better.

We create logical backups (pg_dump) and WAL-G base backups for our clusters. We use a Kubernetes service account that is bound to an IAM role for S3 access. postgres-operator is deployed using Helm.

When running operator version 1.12.2 (and spilo 16:3.2-p3), both the logical backup cronjob and the WAL-G base backups work as intended. I validate the WAL backup using `PGUSER=postgres envdir "/run/etc/wal-e.d/env" /scripts/postgres_backup.sh "/home/postgres/pgdata/pgroot/data"`.

When I update postgres-operator to 1.13.0 (and spilo to 16:3.3-p1), the logical backups still work, but WAL archiving and base backups no longer do. When I manually try to create a base backup with the same command, I get this error:

```
create S3 storage: create new AWS session: configure session: assume role by ARN: InvalidParameter: 1 validation error(s) found.
- minimum field size of 2, AssumeRoleInput.RoleSessionName.
```

The error seems specific to using a service account that assumes an IAM role to access S3, and only when running base backups. The logical backup jobs are able to put the pg_dump on S3 via the same authentication method.
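(For context: with IRSA, the EKS pod identity webhook injects the role ARN and a projected web identity token into pods that use the annotated service account. An illustrative sketch only, values are placeholders; WAL-G and the logical backup job both authenticate to S3 through these.)

```yaml
# Sketch of the webhook-injected credentials inside the spilo container
# (not something configured by hand in the manifests shown here).
env:
  - name: AWS_ROLE_ARN
    value: arn:aws:iam::0123456789:role/my-IAM-role-w-S3-access
  - name: AWS_WEB_IDENTITY_TOKEN_FILE
    value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
```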

No other values were changed apart from the spilo image and the Helm chart version.

Let me know if you need additional information.

FxKu commented 2 months ago

Oh no! This doesn't sound nice. Can you share some snippets of your operator configuration and service account so we can try to replicate it? Our setup is not that different, but our backups continue to run.

Quite a few things have changed, and in your case they might require a different configuration. Spilo has some configuration options for this, but they are likely not manageable by the operator yet.

olivier-derom commented 2 months ago

@FxKu Sure! Here are some snippets:

ServiceAccount YAML (manually deployed as an additional resource, not part of the Zalando postgres-operator Helm chart):

```yaml
apiVersion: v1
automountServiceAccountToken: true
imagePullSecrets:
  - name: mysecret
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::0123456789:role/my-IAM-role-w-S3-access
  labels:
    app.kubernetes.io/instance: postgres-operator
  name: postgres-operator
  namespace: dataplatform-prod
```
postgres-operator Helm values:

```yaml
postgres-operator:
  image:
    registry: privaterepo.com
    repository: zalando/postgres-operator
  imagePullSecrets:
    - name: mysecret
  configGeneral:
    # Spilo docker image, update manually when updating the operator
    # Tag and registry are not split, so we must update this manually and cannot rely on Helm default values
    docker_image: privaterepo.com/zalando/spilo-16:3.2-p3
  configKubernetes:
    app.kubernetes.io/managed-by: postgres-operator
    enable_secrets_deletion: true
    watched_namespace: dataplatform-dev
    cluster_labels:
      application: spilo
    cluster_name_label: cluster-name
    pod_environment_configmap: postgres-pod-config
    enable_pod_antiaffinity: true
    enable_readiness_probe: true
  configPostgresPodResources:
    default_cpu_limit: "1"
    default_cpu_request: 100m
    default_memory_limit: 500Mi
    default_memory_request: 100Mi
    min_cpu_limit: 250m
    min_memory_limit: 250Mi
  configDebug:
    debug_logging: true
    enable_database_access: true
  configAwsOrGcp:
    AWS_REGION: eu-west-1
    WAL_S3_BUCKET: mybucket/postgres-operator/WAL
  configLogicalBackup:
    # prefix for the backup job name
    logical_backup_job_prefix: "logical-backup-"
    logical_backup_provider: "s3"
    logical_backup_s3_region: "eu-west-1"
    logical_backup_s3_sse: "AES256"
    logical_backup_cronjob_environment_secret: ""
    # S3 retention time for stored backups for example "2 week" or "7 days"
    # recommended to also put S3 lifecycle policy on the bucket
    logical_backup_s3_retention_time: ""
    logical_backup_schedule: "30 00 * * *" # daily at 00.30 AM
    # Image for pods of the logical backup job (default pg_dumpall), update manually when updating the operator
    # Tag and registry are not split, so we must update this manually and cannot rely on Helm default values
    logical_backup_docker_image: privaterepo.com/zalando/postgres-operator/logical-backup:v1.12.2
    logical_backup_s3_bucket: mybucket/postgres-operator/logical-backups
  serviceAccount:
    create: false
    # The name of the ServiceAccount to use.
    name: postgres-operator
  podServiceAccount:
    name: postgres-operator
```
postgres cluster YAML:

```yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: nessie-metastore
spec:
  postgresql:
    version: "16"
  teamId: "dataplatform"
  volume:
    size: 10Gi
  numberOfInstances: 1
  users:
    nessie:
      - superuser
      - createdb
    nessiegc:
      - superuser
      - createdb
  databases:
    nessiegc: nessiegc
    metastore: nessie
  enableLogicalBackup: true
  env:
    - name: AWS_REGION
      value: eu-west-1
    - name: WAL_S3_BUCKET
      value: mybucket/postgres-operator/WAL
    - name: USE_WALG_BACKUP
      value: "true"
    - name: USE_WALG_RESTORE
      value: "true"
    - name: BACKUP_SCHEDULE
      value: "00 * * * *"
    - name: BACKUP_NUM_TO_RETAIN
      value: "96" # For 1 backup per hour, keep 4 days of base backups
```

These are the config files for v1.12.2, but as stated earlier, the only things I then changed were the Helm chart version and, manually, the spilo and logical backup images, since we use a private repo as a pull-through cache.

Hope this can help!

FxKu commented 2 months ago

What if you change the docker image to the previous one, `ghcr.io/zalando/spilo-16:3.2-p3`? Does v1.13.0 continue to work then?

olivier-derom commented 2 months ago

@FxKu I can confirm that the issue lies with the spilo image: when using chart 1.13.0 but keeping spilo on 3.2-p3, WAL archiving works correctly.
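In Helm values terms that pin looks roughly like the sketch below (mirroring the values shared above; with a pull-through cache the registry prefix would differ):

```yaml
# Possible temporary mitigation: run operator chart 1.13.0 but keep the last
# working spilo image until the wal-g regression is resolved.
configGeneral:
  docker_image: ghcr.io/zalando/spilo-16:3.2-p3
```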

nrobert13 commented 1 month ago

Sorry for chiming in; my question is not directly related, but the snippets above are very useful for it. What is the reason for providing the WAL S3 bucket multiple times, once in the operator config (configAwsOrGcp.WAL_S3_BUCKET) and once in the postgresql resource env (WAL_S3_BUCKET)? There would be a third way via pod_environment_secret/configmap as well. I tried providing it only in the pod_environment_secret to keep all the archiving-related S3 config in one place, but /run/etc/wal-e.d/env is missing, so I assume the backup is not working either.

olivier-derom commented 1 month ago

> Sorry for chiming in; my question is not directly related, but the snippets above are very useful for it. What is the reason for providing the WAL S3 bucket multiple times, once in the operator config (configAwsOrGcp.WAL_S3_BUCKET) and once in the postgresql resource env (WAL_S3_BUCKET)? There would be a third way via pod_environment_secret/configmap as well. I tried providing it only in the pod_environment_secret to keep all the archiving-related S3 config in one place, but /run/etc/wal-e.d/env is missing, so I assume the backup is not working either.

In this dummy example there is indeed no real benefit in defining it twice, since both values are the same. The reason I define it twice is the order of precedence: all our postgres clusters use the S3 path provided in the operator config as a default, but for some clusters (e.g. ones you want to share to create standby replicas) we want to override that S3 location with another bucket or path. This way, if we ever want to change the default S3 path, it can be done on a single line, while we still have a way to override the default per cluster. Not sure why your /run/etc/wal-e.d/env is missing.
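For completeness, the third place mentioned above (the pod environment ConfigMap, set to postgres-pod-config in the operator values earlier) would look roughly like this sketch; note that, as reported in this thread, setting the bucket only there apparently does not result in /run/etc/wal-e.d/env being created:

```yaml
# Hypothetical pod environment ConfigMap; its keys are exposed as environment
# variables in the spilo pods.
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-config
data:
  WAL_S3_BUCKET: mybucket/postgres-operator/WAL
```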

nrobert13 commented 1 month ago

Thanks for the quick reply. The override makes sense; I'm just not sure why it must be set in the operator config. That seems to be the culprit behind the missing /run/etc/wal-e.d/env.

moss2k13 commented 1 month ago

It is caused by wal-g changes: https://github.com/wal-g/wal-g/pull/1377

- last working wal-g version: https://github.com/wal-g/wal-g/releases/tag/v2.0.1
- last working postgres-operator version is indeed: https://github.com/zalando/postgres-operator/releases/tag/v1.12.2
- last working spilo image version: https://github.com/zalando/spilo/releases/tag/3.2-p3

I'm still investigating the wal-g bug: it now expects both AWS_ROLE_ARN and AWS_ROLE_SESSION_NAME to be provided, but at the same time it does not accept an IAM IRSA session name that includes `:`:


```
root@temporal-postgresql-0:/home/postgres# wal-g --version
wal-g version v3.0.3    3f88f3c 2024.08.08_17:53:40 PostgreSQL

root@temporal-postgresql-0:/home/postgres# wal-g-v2.0.1 --version
wal-g version v2.0.1    b7d53dd 2022.08.25_09:34:20 PostgreSQL

root@temporal-postgresql-0:/home/postgres# export AWS_ROLE_SESSION_NAME=system:serviceaccount:automation-service:postgres-pod-sa

root@temporal-postgresql-0:/home/postgres# echo $AWS_ROLE_ARN
arn:aws:iam::111111111111:role/postgres-backup-role

root@temporal-postgresql-0:/home/postgres# envdir /run/etc/wal-e.d/env/ wal-g backup-list
ERROR: 2024/10/11 15:52:43.441470 configure primary storage: configure storage with prefix "s3://postgres-backup/spilo/temporal-postgresql/12075954-67d5-4764-a7ea-df5925ca27fc/wal/15": create S3 storage: create new AWS session: configure session: assume role by ARN: WebIdentityErr: failed to retrieve credentials
caused by: ValidationError: 1 validation error detected: Value 'system:serviceaccount:automation-service:postgres-pod-sa' at 'roleSessionName' failed to satisfy constraint: Member must satisfy regular expression pattern: [\w+=,.@-]*
    status code: 400, request id: ce6c3656-6228-4d5d-94a4-9ea1670d1cf6

root@temporal-postgresql-0:/home/postgres# unset AWS_ROLE_SESSION_NAME

root@temporal-postgresql-0:/home/postgres# envdir /run/etc/wal-e.d/env/ wal-g-v2.0.1 backup-list
name                          modified             wal_segment_backup_start
base_000000010000000000000004 2024-09-20T11:34:19Z 000000010000000000000004
base_000000010000000000000006 2024-09-20T12:00:03Z 000000010000000000000006
base_00000001000000000000001F 2024-09-21T00:00:03Z 00000001000000000000001F
base_000000010000000000000038 2024-09-21T12:00:03Z 000000010000000000000038
base_000000010000000000000051 2024-09-22T00:00:03Z 000000010000000000000051
base_00000001000000000000006A 2024-09-22T12:00:03Z 00000001000000000000006A
base_000000010000000000000083 2024-09-23T00:00:03Z 000000010000000000000083
base_00000001000000000000009C 2024-09-23T12:00:03Z 00000001000000000000009C
```
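Given the STS constraint shown above (roleSessionName must be non-empty and match `[\w+=,.@-]*`), one untested workaround sketch would be to hand wal-g an explicit, colon-free session name, for example via the cluster manifest env section shown earlier:

```yaml
# Hypothetical workaround, not verified: provide an explicit session name
# without colons so the new wal-g AssumeRole path passes STS validation.
env:
  - name: AWS_ROLE_SESSION_NAME
    value: postgres-pod-sa
```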