samox73 opened this issue 2 years ago
Running the wal-g upload command manually works without any errors:

```
# envdir /run/etc/wal-e.d/env wal-g wal-push ../data/pg_wal/00000002.history
INFO: 2022/10/06 12:23:12.494843 FILE PATH: 00000002.history.lz4
INFO: 2022/10/06 12:23:12.586170 FILE PATH: 000000010000000000000002.partial.lz4
INFO: 2022/10/06 12:23:12.613509 FILE PATH: 000000020000000000000003.partial.lz4
INFO: 2022/10/06 12:23:12.640760 FILE PATH: 000000020000000000000002.lz4
```
and I can see the data in the S3 bucket. However, now I get these errors in `postgresql-4.log`:

```
ERROR: 2022/10/06 12:23:39.465899 Couldn't check whether there is an overwrite attempt due to inner error: failed to read object: 'spilo/powerbot-postgres-development/wal/14/wal_005/00000002.history.lz4' from S3: AccessDenied: Access Denied
	status code: 403, request id: EANDRVDGZDMMEBA9, host id: 0lVxvS3DilpYcvJS9Zf1RNbRezcr17JWvKsNaMkZ9yGahqOiQ2DMi+rWY3ZnsVOBQqEk3bMdnkA=
```
This happens even though the service account is permitted to perform all S3 actions (`s3:*`) on this bucket.
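A common cause of `AccessDenied` on object reads despite an `s3:*` policy is the policy's `Resource` list: object-level actions such as `s3:GetObject` match `arn:aws:s3:::bucket/*`, while bucket-level actions such as `s3:ListBucket` match the bucket ARN itself. A hedged sketch of a policy covering both (the bucket name is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-backup-bucket",
        "arn:aws:s3:::my-backup-bucket/*"
      ]
    }
  ]
}
```

If only the bucket ARN is listed, `ListBucket` succeeds but `GetObject`/`PutObject` on keys inside it are denied.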
Listing backups still does not work either:

```
# envdir /run/etc/wal-e.d/env wal-g backup-list
INFO: 2022/10/06 12:26:28.955900 No backups found
```
So I figured out that for automatic backups to work, the env vars `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` need to be set. I could not get it to work with the more secure approach of a Service Account linked to an AWS Role with an attached policy.
Another thing I noticed: for the manual backup to work, one has to set `PGUSER` to the correct superuser of the Postgres DB. In my case this user is `postgres`, while the script was trying user `root`.
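A minimal sketch of the manual invocation with the superuser set explicitly (the data directory path is an assumption derived from the pod's `PGROOT`; adjust to your layout):

```shell
# Run as the postgres OS user; PGUSER must name an existing superuser role,
# otherwise the connection fails with "role ... does not exist".
su postgres -c 'PGUSER=postgres envdir /run/etc/wal-e.d/env \
  wal-g backup-push /home/postgres/pgdata/pgroot/data'
```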
Any news on this? I'm facing the same issue.
After this change https://github.com/zalando/spilo/pull/769, WAL archiving works with a Service Account linked to an AWS Role, if you alter the postgres-pod service account with the proper role ARN manually. The operator cannot do that for you. Just add these variables to the postgres pod:

- `AWS_REGION`: bucket region
- `WAL_S3_BUCKET`: backup bucket
- `BACKUP_NUM_TO_RETAIN: "7"`
- `USE_WALG_BACKUP: "true"` (wal-e won't work)
- `USE_WALG_RESTORE: "true"`
- `WAL_BUCKET_SCOPE_PREFIX: ""`
- `WAL_BUCKET_SCOPE_SUFFIX: ""`

To be able to restore from that S3 archive, you would need to add these variables to the clone part:

- `CLONE_AWS_REGION`
- `CLONE_WAL_S3_BUCKET`
- `CLONE_USE_WALG_RESTORE: "true"`
- `CLONE_AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token`
- `CLONE_AWS_ROLE_ARN`: same ARN as in the postgres-pod service account annotation
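For the Web Identity setup mentioned above, the link between the service account and the AWS Role is an annotation on the pod's ServiceAccount (IRSA on EKS); a minimal sketch, with placeholder names and ARN:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: postgres-pod          # the service account used by the Spilo pods
  namespace: my-namespace     # placeholder namespace
  annotations:
    # With this annotation, EKS injects AWS_ROLE_ARN and
    # AWS_WEB_IDENTITY_TOKEN_FILE into pods using this service account.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/postgres-pod-role
```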
Don't run the backups as the root user. Run them as the postgres user, or whatever user the svc is executing as.
Posting my solution here for anyone who may face the same issue.
I created these two Kubernetes resources:
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-env-configmap
  namespace: postgres-operator
data:
  CLONE_AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
  CLONE_USE_WALG_RESTORE: "true"
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
---
apiVersion: v1
kind: Secret
metadata:
  name: pod-env-secret
  namespace: my-namespace
data:
  CLONE_AWS_ROLE_ARN: UHV0IHlvdXIgQVdTIHJvbGUgQVJOIGhlcmUgOikgdGhpcyBpcyBqdXN0IGFuIGV4YW1wbGU=
```
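Values under a Secret's `data` field must be base64-encoded (the string above decodes to a placeholder note, not a real ARN). One way to produce the encoded value, with a placeholder ARN:

```shell
# printf %s avoids a trailing newline sneaking into the encoded value.
ROLE_ARN='arn:aws:iam::123456789012:role/my-role'  # placeholder ARN
printf %s "$ROLE_ARN" | base64
```

Alternatively, a Secret's `stringData` field accepts the plain value and lets Kubernetes do the encoding for you.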
And then I set the following values in the operator's Helm chart release:
```yaml
configKubernetes:
  pod_environment_configmap: postgres-operator/pod-env-configmap
  pod_environment_secret: pod-env-secret
```
Note: I set the `CLONE_AWS_ROLE_ARN` env var in the Secret instead of the ConfigMap just because in my scenario I have a different role (and therefore a different secret) for each namespace.
Thanks, @dmotte. I'll look into declaring and observing resources in this pattern where applicable.
I've followed https://postgres-operator.readthedocs.io/en/latest/administrator/#wal-archiving-and-physical-basebackups to set up continuous backups. When entering the command to produce a backup manually I get
and in the logs I spot these lines after trying to do a manual backup:
postgresql-4.csv
```csv
2022-10-06 09:37:44.036 UTC,,,872,"[local]",633ea1e8.368,1,"",2022-10-06 09:37:44 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,"","not initialized",,0
2022-10-06 09:37:44.036 UTC,"root","root",872,"[local]",633ea1e8.368,2,"authentication",2022-10-06 09:37:44 UTC,7/137,0,LOG,00000,"connection authorized: user=root database=root application_name=psql",,,,,,,,,"","client backend",,0
2022-10-06 09:37:44.036 UTC,"root","root",872,"[local]",633ea1e8.368,3,"startup",2022-10-06 09:37:44 UTC,7/137,0,FATAL,3D000,"database ""root"" does not exist",,,,,,,,,"","client backend",,0
2022-10-06 09:37:47.557 UTC,,,292,,633e9c7f.124,37,,2022-10-06 09:14:39 UTC,,0,LOG,00000,"archive command failed with exit code 2","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-e wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
...
2022-10-06 09:41:26.238 UTC,"postgres","postgres",406,"[local]",633ea2c6.196,2,"authentication",2022-10-06 09:41:26 UTC,7/18,0,LOG,00000,"connection authorized: user=postgres database=postgres application_name=psql",,,,,,,,,"","client backend",,0
2022-10-06 09:41:26.238 UTC,"postgres","postgres",406,"[local]",633ea2c6.196,3,"startup",2022-10-06 09:41:26 UTC,7/18,0,FATAL,28000,"role ""postgres"" does not exist",,,,,,,,,"","client backend",,0
2022-10-06 09:41:27.211 UTC,,,411,"[local]",633ea2c7.19b,1,"",2022-10-06 09:41:27 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,"","not initialized",,0
2022-10-06 09:41:27.211 UTC,"postgres","postgres",411,"[local]",633ea2c7.19b,2,"authentication",2022-10-06 09:41:27 UTC,7/19,0,LOG,00000,"connection authorized: user=postgres database=postgres application_name=pgq ticker",,,,,,,,,"","client backend",,0
2022-10-06 09:41:27.211 UTC,"postgres","postgres",411,"[local]",633ea2c7.19b,3,"startup",2022-10-06 09:41:27 UTC,7/19,0,FATAL,28000,"role ""postgres"" does not exist",,,,,,,,,"","client backend",,0
2022-10-06 09:42:08.556 UTC,,,323,,633ea2a5.143,5,,2022-10-06 09:40:53 UTC,,0,LOG,00000,"archive command failed with exit code 1","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-g wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
2022-10-06 09:42:12.793 UTC,,,323,,633ea2a5.143,6,,2022-10-06 09:40:53 UTC,,0,LOG,00000,"archive command failed with exit code 1","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-g wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
2022-10-06 09:42:17.031 UTC,,,323,,633ea2a5.143,7,,2022-10-06 09:40:53 UTC,,0,LOG,00000,"archive command failed with exit code 1","The failed archive command was: envdir ""/run/etc/wal-e.d/env"" wal-g wal-push ""pg_wal/00000002.history""",,,,,,,,"","archiver",,0
2022-10-06 09:42:17.031 UTC,,,323,,633ea2a5.143,8,,2022-10-06 09:40:53 UTC,,0,WARNING,01000,"archiving write-ahead log file ""00000002.history"" failed too many times, will try again later",,,,,,,,,"","archiver",,0
```

Environment variables of the pod:
```bash
AWS_DEFAULT_REGION:eu-central-1
AWS_REGION:eu-central-1
AWS_ROLE_ARN:arn:aws:iam::************:role/postgres-pod-role
AWS_STS_REGIONAL_ENDPOINTS:regional
AWS_WEB_IDENTITY_TOKEN_FILE:/var/run/secrets/eks.amazonaws.com/serviceaccount/token
BACKUP_NUM_TO_RETAIN:5
BACKUP_SCHEDULE:*/5 * * * *
CLONE_USE_WALG_RESTORE:true
DCS_ENABLE_KUBERNETES_API:true
HUMAN_ROLE:zalandos
KUBERNETES_LABELS:{"application":"spilo"}
KUBERNETES_ROLE_LABEL:spilo-role
KUBERNETES_SCOPE_LABEL:*******
KUBE_IAM_ROLE:postgres-pod-role
PAM_OAUTH2:https://info.example.com/oauth2/tokeninfo?access_token= uid realm=/employees
PGPASSWORD_STANDBY:secretKeyRef(standby.*******-postgres-development.postgres.credentials.password)
PGPASSWORD_SUPERUSER:secretKeyRef(root.*******-postgres-development.postgres.credentials.password)
PGROOT:/home/postgres/pgdata/pgroot
PGUSER_STANDBY:standby
PGUSER_SUPERUSER:root
PGVERSION:14
SCOPE:*******-postgres-development
SPILO_CONFIGURATION:{"postgresql":{},"bootstrap":{"initdb":[{"auth-host":"md5"},{"auth-local":"trust"}],"users":{"zalandos":{"password":"","options":["CREATEDB","NOLOGIN"]}},"dcs":{}}}
USE_WALG_BACKUP:true
USE_WALG_RESTORE:true
WAL_BUCKET_SCOPE_PREFIX:
WAL_BUCKET_SCOPE_SUFFIX:
WAL_S3_BUCKET:*******-postgres
```

The cron job also does not seem to be executed every 5 minutes, as there are no log statements for it.