zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

S3 WAL archiving not working as expected #2067

Open · samox73 opened 2 years ago

samox73 commented 2 years ago


I've followed https://postgres-operator.readthedocs.io/en/latest/administrator/#wal-archiving-and-physical-basebackups to set up continuous backups. When I run the command to produce a backup manually, I get:

# envdir "/run/etc/wal-e.d/env" /scripts/postgres_backup.sh "/home/postgres/pgdata/pgroot/data"
2022-10-06 09:31:09.065 - /scripts/postgres_backup.sh - I was called as: /scripts/postgres_backup.sh /home/postgres/pgdata/pgroot/data
psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  database "root" does not exist
2022-10-06 09:31:09.096 - /scripts/postgres_backup.sh - ERROR: Recovery state unknown: 

and in the logs I spot these lines after attempting a manual backup:

postgresql-4.csv:

```csv
2022-10-06 09:37:44.036 UTC,,,872,"[local]",633ea1e8.368,1,"",2022-10-06 09:37:44 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,"","not initialized",,0
2022-10-06 09:37:44.036 UTC,"root","root",872,"[local]",633ea1e8.368,2,"authentication",2022-10-06 09:37:44 UTC,7/137,0,LOG,00000,"connection authorized: user=root database=root application_name=psql",,,,,,,,,"","client backend",,0
2022-10-06 09:37:44.036 UTC,"root","root",872,"[local]",633ea1e8.368,3,"startup",2022-10-06 09:37:44 UTC,7/137,0,FATAL,3D000,"database ""root"" does not exist",,,,,,,,,"","client backend",,0
2022-10-06 09:37:47.557 UTC,,,292,,633e9c7f.124,37,,2022-10-06 09:14:39 UTC,,0,LOG,00000,"archive command failed with exit code 2","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-e wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
...
2022-10-06 09:41:26.238 UTC,"postgres","postgres",406,"[local]",633ea2c6.196,2,"authentication",2022-10-06 09:41:26 UTC,7/18,0,LOG,00000,"connection authorized: user=postgres database=postgres application_name=psql",,,,,,,,,"","client backend",,0
2022-10-06 09:41:26.238 UTC,"postgres","postgres",406,"[local]",633ea2c6.196,3,"startup",2022-10-06 09:41:26 UTC,7/18,0,FATAL,28000,"role ""postgres"" does not exist",,,,,,,,,"","client backend",,0
2022-10-06 09:41:27.211 UTC,,,411,"[local]",633ea2c7.19b,1,"",2022-10-06 09:41:27 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,"","not initialized",,0
2022-10-06 09:41:27.211 UTC,"postgres","postgres",411,"[local]",633ea2c7.19b,2,"authentication",2022-10-06 09:41:27 UTC,7/19,0,LOG,00000,"connection authorized: user=postgres database=postgres application_name=pgq ticker",,,,,,,,,"","client backend",,0
2022-10-06 09:41:27.211 UTC,"postgres","postgres",411,"[local]",633ea2c7.19b,3,"startup",2022-10-06 09:41:27 UTC,7/19,0,FATAL,28000,"role ""postgres"" does not exist",,,,,,,,,"","client backend",,0
2022-10-06 09:42:08.556 UTC,,,323,,633ea2a5.143,5,,2022-10-06 09:40:53 UTC,,0,LOG,00000,"archive command failed with exit code 1","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-g wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
2022-10-06 09:42:12.793 UTC,,,323,,633ea2a5.143,6,,2022-10-06 09:40:53 UTC,,0,LOG,00000,"archive command failed with exit code 1","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-g wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
2022-10-06 09:42:17.031 UTC,,,323,,633ea2a5.143,7,,2022-10-06 09:40:53 UTC,,0,LOG,00000,"archive command failed with exit code 1","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-g wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
2022-10-06 09:42:17.031 UTC,,,323,,633ea2a5.143,8,,2022-10-06 09:40:53 UTC,,0,WARNING,01000,"archiving write-ahead log file ""00000002.history"" failed too many times, will try again later",,,,,,,,,"","archiver",,0
```
Environment variables of the pod:

```bash
AWS_DEFAULT_REGION:eu-central-1
AWS_REGION:eu-central-1
AWS_ROLE_ARN:arn:aws:iam::************:role/postgres-pod-role
AWS_STS_REGIONAL_ENDPOINTS:regional
AWS_WEB_IDENTITY_TOKEN_FILE:/var/run/secrets/eks.amazonaws.com/serviceaccount/token
BACKUP_NUM_TO_RETAIN:5
BACKUP_SCHEDULE:*/5 * * * *
CLONE_USE_WALG_RESTORE:true
DCS_ENABLE_KUBERNETES_API:true
HUMAN_ROLE:zalandos
KUBERNETES_LABELS:{"application":"spilo"}
KUBERNETES_ROLE_LABEL:spilo-role
KUBERNETES_SCOPE_LABEL:*******
KUBE_IAM_ROLE:postgres-pod-role
PAM_OAUTH2:https://info.example.com/oauth2/tokeninfo?access_token= uid realm=/employees
PGPASSWORD_STANDBY:secretKeyRef(standby.*******-postgres-development.postgres.credentials.password)
PGPASSWORD_SUPERUSER:secretKeyRef(root.*******-postgres-development.postgres.credentials.password)
PGROOT:/home/postgres/pgdata/pgroot
PGUSER_STANDBY:standby
PGUSER_SUPERUSER:root
PGVERSION:14
SCOPE:*******-postgres-development
SPILO_CONFIGURATION:{"postgresql":{},"bootstrap":{"initdb":[{"auth-host":"md5"},{"auth-local":"trust"}],"users":{"zalandos":{"password":"","options":["CREATEDB","NOLOGIN"]}},"dcs":{}}}
USE_WALG_BACKUP:true
USE_WALG_RESTORE:true
WAL_BUCKET_SCOPE_PREFIX:
WAL_BUCKET_SCOPE_SUFFIX:
WAL_S3_BUCKET:*******-postgres
```

The cron job also does not seem to be executed every 5 minutes, as there are no log statements for it.

samox73 commented 2 years ago

Running the wal-g upload command manually works without any errors:

```
# envdir /run/etc/wal-e.d/env wal-g wal-push ../data/pg_wal/00000002.history
INFO: 2022/10/06 12:23:12.494843 FILE PATH: 00000002.history.lz4
INFO: 2022/10/06 12:23:12.586170 FILE PATH: 000000010000000000000002.partial.lz4
INFO: 2022/10/06 12:23:12.613509 FILE PATH: 000000020000000000000003.partial.lz4
INFO: 2022/10/06 12:23:12.640760 FILE PATH: 000000020000000000000002.lz4
```

and I can see the data in the S3 bucket. However, now I get these errors in postgresql-4.log:

```
ERROR: 2022/10/06 12:23:39.465899 Couldn't check whether there is an overwrite attempt due to inner error: failed to read object: 'spilo/powerbot-postgres-development/wal/14/wal_005/00000002.history.lz4' from S3: AccessDenied: Access Denied
        status code: 403, request id: EANDRVDGZDMMEBA9, host id: 0lVxvS3DilpYcvJS9Zf1RNbRezcr17JWvKsNaMkZ9yGahqOiQ2DMi+rWY3ZnsVOBQqEk3bMdnkA=
```

This happens even though the service account is permitted to perform all S3 actions (s3:*) on all resources of this bucket.
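For reference, a policy granting this would look roughly like the sketch below (bucket name is a placeholder). One common cause of a 403 despite s3:* is granting only the bucket ARN: object reads and writes also need the /* object ARN.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-wal-bucket",
        "arn:aws:s3:::my-wal-bucket/*"
      ]
    }
  ]
}
```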

Listing backups still does not work:

```
# envdir /run/etc/wal-e.d/env wal-g backup-list
INFO: 2022/10/06 12:26:28.955900 No backups found
```

samox73 commented 2 years ago

So I figured out that for automatic backups to work, the env vars AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY need to be set. I could not get it to work with the more secure approach of a Service Account linked to an AWS Role with an attached policy.
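A minimal sketch of that workaround is to put the keys in a Secret and inject it via the operator's pod_environment_secret option (used later in this thread); all names and values here are placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: postgres-pod-aws-creds  # placeholder name
  namespace: my-namespace       # placeholder
stringData:                     # stringData avoids manual base64 encoding
  AWS_ACCESS_KEY_ID: AKIAXXXXXXXXXXXXXXXX      # placeholder
  AWS_SECRET_ACCESS_KEY: xxxxxxxxxxxxxxxxxxxx  # placeholder
```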

Another thing I noticed is that, for the manual backup to work, one has to set PGUSER to the correct superuser of the Postgres DB. In my case this user is postgres, while the script was trying user root.
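For example, this is the invocation from above with the superuser overridden (a sketch, assuming the superuser is postgres):

```bash
# envdir only overrides variables present in /run/etc/wal-e.d/env,
# so the PGUSER set here is still visible to psql inside the script.
PGUSER=postgres envdir "/run/etc/wal-e.d/env" /scripts/postgres_backup.sh "/home/postgres/pgdata/pgroot/data"
```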

angelbarrera92 commented 1 year ago

Any news on this? I'm facing the same issue.

oleksiytsyban commented 1 year ago

After this change (https://github.com/zalando/spilo/pull/769), WAL archiving works with a Service Account linked to an AWS Role, if you annotate the postgres-pod service account with the proper role ARN manually; the operator cannot do that for you. Just add these variables to the postgres pod:

```yaml
AWS_REGION: <bucket region>
WAL_S3_BUCKET: <backup bucket>
BACKUP_NUM_TO_RETAIN: "7"
USE_WALG_BACKUP: "true" # wal-e won't work
USE_WALG_RESTORE: "true"
WAL_BUCKET_SCOPE_PREFIX: ""
WAL_BUCKET_SCOPE_SUFFIX: ""
```

To be able to restore from that S3 archive, you would need to add a few more variables to the CLONE part:

```yaml
CLONE_AWS_REGION: <bucket region>
CLONE_WAL_S3_BUCKET: <backup bucket>
CLONE_USE_WALG_RESTORE: "true"
CLONE_AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
CLONE_AWS_ROLE_ARN: <same ARN as in the postgres-pod service account annotation>
```
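For reference, the manual service account annotation mentioned above would look roughly like this (the standard EKS IRSA annotation; account ID, role name, and namespace are placeholders):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: postgres-pod
  namespace: my-namespace  # placeholder
  annotations:
    # IRSA: ties the service account to the IAM role assumed via the web identity token
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/postgres-pod-role
```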

bck01215 commented 1 year ago

Don't run the backups as the root user. Run them as the postgres user, or whatever user the service is executing as.

dmotte commented 7 months ago

Posting my solution here for anyone who may face the same issue.

I created these two Kubernetes resources:

```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-env-configmap
  namespace: postgres-operator
data:
  CLONE_AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
  CLONE_USE_WALG_RESTORE: "true"
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
---
apiVersion: v1
kind: Secret
metadata:
  name: pod-env-secret
  namespace: my-namespace
data:
  CLONE_AWS_ROLE_ARN: UHV0IHlvdXIgQVdTIHJvbGUgQVJOIGhlcmUgOikgdGhpcyBpcyBqdXN0IGFuIGV4YW1wbGU=
```
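Secret data values are base64-encoded; the placeholder above can be generated like this (with a hypothetical ARN):

```bash
echo -n 'arn:aws:iam::123456789012:role/my-role' | base64
```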

And then I set the following values in the operator's Helm chart release:

```yaml
configKubernetes:
  pod_environment_configmap: postgres-operator/pod-env-configmap
  pod_environment_secret: pod-env-secret
```

Note: I set the CLONE_AWS_ROLE_ARN env var in the Secret instead of the ConfigMap just because, in my scenario, I have a different role (and thus a different Secret) for each namespace.

rishiraj88 commented 7 months ago

Thanks, @dmotte. I'll look into declaring and observing resources in this pattern when applicable.