zalando / spilo

Highly available elephant herd: HA PostgreSQL cluster using Docker
Apache License 2.0

wal-g with S3 AssumeRole broken #1009

Open bootc opened 3 months ago

bootc commented 3 months ago

I have a couple of setups using the Zalando Postgres Operator, configured to upload WALs to an S3 bucket using IAM Roles for Service Accounts (IRSA). When upgrading from spilo-16 3.2-p3 to 3.3-p1, wal-g breaks with the following errors:

ERROR: 2024/07/31 21:34:51.694882 Failed to configure multi-storage: configure primary storage: configure storage with prefix "s3://[...]": create S3 storage: create new AWS session: configure session: assume role by ARN: InvalidParameter: 1 validation error(s) found.
- minimum field size of 2, AssumeRoleInput.RoleSessionName.

I have redacted the bucket name.

I believe this is related to the wal-g upgrade from 2.0.1 to 3.0.0, and the bug is probably in there. I expect it is trying to AssumeRole with the RoleSessionName being the empty string, as no AWS_ROLE_SESSION_NAME is being supplied. Unfortunately I can't test this theory easily as AWS_ROLE_SESSION_NAME is not passed through to wal-g via configure_spilo.py.
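A minimal sketch of the suspected failure mode, assuming the hypothesis above is right: STS rejects a `RoleSessionName` shorter than 2 characters (the documented constraint is 2 to 64 characters from `[\w+=,.@-]`), so an unset `AWS_ROLE_SESSION_NAME` that gets forwarded as an empty string would fail validation exactly as in the log. The helper names and the `spilo-wal-g` default below are hypothetical and are not part of wal-g or spilo; this only illustrates the kind of fallback that would avoid the error.

```python
import os
import re

# STS AssumeRole constraint on RoleSessionName: 2-64 characters from [\w+=,.@-].
# re.ASCII keeps \w limited to [A-Za-z0-9_].
_SESSION_NAME_RE = re.compile(r"[\w+=,.@-]{2,64}", re.ASCII)


def is_valid_session_name(name: str) -> bool:
    """Return True if `name` satisfies the STS RoleSessionName constraints."""
    return _SESSION_NAME_RE.fullmatch(name) is not None


def resolve_session_name(env=None, default="spilo-wal-g"):
    """Pick a session name from AWS_ROLE_SESSION_NAME, else a fallback.

    `default` is a hypothetical placeholder, not a value used by wal-g.
    """
    env = os.environ if env is None else env
    name = env.get("AWS_ROLE_SESSION_NAME", "")
    # An unset or empty variable fails the minimum length of 2,
    # which matches the "minimum field size of 2" error in the log.
    if is_valid_session_name(name):
        return name
    return default


if __name__ == "__main__":
    print(is_valid_session_name(""))   # empty name is rejected
    print(resolve_session_name({}))    # falls back to the default
```

If the variable were passed through to wal-g, setting it to any valid value would be a quick way to confirm or rule out this theory.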

I have reverted those clusters to 3.2-p3 for now.

danavatavu commented 2 months ago

This issue is indeed blocking. To use the Timescale license we have to build spilo images with the parameter TIMESCALEDB_APACHE_ONLY=false, see issue. All rebuilds of spilo 3.0 through 3.2 fail because the libsodium 1.0.17 release (https://github.com/jedisct1/libsodium/releases) used in the dependencies.sh file is missing, and spilo image tags starting with 3.3 have the issue described above, so backups are not being saved to AWS S3.