zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

S3 WAL path differs from logical backups path if bucket name is set in pod configmap instead of the operator's configuration #1279

Open vitobotta opened 3 years ago

vitobotta commented 3 years ago

Hi,

I have installed 1.6.0 in a new cluster to try it (since, as explained in another issue, I am having trouble updating existing installations from 1.5.0). Everything is working fine, but I noticed something odd. Up to 1.5.0, WAL archiving and logical backups were writing to subdirectories of /spilo in the bucket. With 1.6.0 they are writing to different parent directories.

For example, I have a cluster called "postgres-dynablogger-dev". WAL is being archived to /spilo/postgres-dynablogger-dev-postgres-dynablogger-dev (not sure why the duplicated name), while logical backups write to /spilo/postgres-dynablogger-dev//logical_backups. Is this a known bug?

Thanks!

FxKu commented 3 years ago

The logical backup path looks fine. That hasn't changed, right? /spilo/postgres-dynablogger-dev-postgres-dynablogger-dev, however, looks suspicious. Is it also set like this in the pod env vars? What has changed for WAL is that the major version number is now added to the path.
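
For reference, the expected layout should look roughly like this (illustrative only; the exact segments depend on the Spilo version):

s3://{wal-s3-bucket}/spilo/{cluster-name}/{cluster-uid}/wal/{pg-major-version}/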

vitobotta commented 3 years ago

Hi. Yeah, the path of the logical backups was the same. I used the same config I used for 1.5.0, so I didn't do anything different that might explain the weird path name for the WAL. I am using a managed DB at the moment as it's easier, but will try again when I have a minute. Any suggestions on what I could try? Thanks!

vitobotta commented 3 years ago

@FxKu has this issue (I think it's an issue?) been fixed in 1.6.1? The behavior in 1.6.0 prevents you from having PG clusters with the same name in different Kubernetes clusters, which was possible before because the ID was used in the path. If it has been fixed, what happens when you upgrade from 1.6.0 to 1.6.1, given that the WAL paths would be different? Thanks

FxKu commented 3 years ago

@vitobotta nothing has changed in comparison to v1.6.0. We don't see this behavior in our clusters, and I'm not aware that anybody else has raised it. I would assume that the UID is always used in the WAL path to make the path unique. What's your config and which Spilo version are you using now?

vitobotta commented 3 years ago

Hi @FxKu, I am using Wasabi as an S3-compatible service and use this pod configmap to configure WAL archiving:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-config
  namespace: postgres-test
data:
  # WAL-G base backup schedule and retention
  BACKUP_SCHEDULE: "0 */12 * * *"
  USE_WALG_BACKUP: "true"
  BACKUP_NUM_TO_RETAIN: "14"
  # credentials, endpoint and bucket for the S3-compatible (Wasabi) storage
  AWS_ACCESS_KEY_ID: "..."
  AWS_SECRET_ACCESS_KEY: "..."
  AWS_ENDPOINT: "..."
  AWS_REGION: "..."
  WAL_S3_BUCKET: "..."
  WALG_DISABLE_S3_SSE: "true"
EOF

This is the folder structure that gets created. As you can see, the path for the logical backups includes the cluster ID, while the path for the WAL archive does not:

[screenshot of the bucket folder structure]

What am I doing wrong? Everything works fine apart from the path of the WAL archive, which doesn't include the cluster ID.

Thanks!

vitobotta commented 3 years ago

Forgot to mention that the Spilo image used with 1.6.0 is currently registry.opensource.zalan.do/acid/spilo-13:2.0-p2. When I upgraded the operator to 1.6.1 in a test K8s cluster, it did a rolling restart of the PG cluster pods without changing the image to the one I expected (I think), registry.opensource.zalan.do/acid/spilo-13:2.0-p4.

FxKu commented 3 years ago

Just saw you have the variables AWS_REGION and WAL_S3_BUCKET in your pod environment config map. There are already fields for these in the operator config, so it's better to set them only there. Maybe your additional config map with redundant variables leads to the strange folder name in your case. Not sure why it worked before 1.6.0; it seems some behavior has changed in Spilo since then.
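
Roughly, moving those two settings looks like this if you configure the operator via its ConfigMap (a minimal sketch with placeholder values; merge the keys into your existing operator ConfigMap rather than applying this standalone, or use the equivalent fields in the aws_or_gcp section if you use the OperatorConfiguration CRD):

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  # assumes the ConfigMap-based operator configuration; adjust name/namespace to your install
  name: postgres-operator
  namespace: default
data:
  # bucket and region move here, out of the pod environment configmap
  wal_s3_bucket: "..."
  aws_region: "..."
EOF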

vitobotta commented 3 years ago

@FxKu I thought that the settings in the operator were only for the case when you use AWS S3, not S3-compatible services? I can try of course, but it seems weird that the region and bucket name settings could affect the path, which includes a different name for the pg cluster. I will try anyway, thanks.

vitobotta commented 3 years ago

@FxKu To my surprise, it did make a difference :D I set the region and bucket name in the operator config instead of the pod configmap, and the WAL is now stored under spilo/{cluster-name}/{cluster-id} as expected.

Now the question is: if I upgrade the operator in my production cluster from 1.6.0 to 1.6.1, should I try to change this configuration, or should I just leave things as they are for now? I want to avoid downtime.

Is there any way I can "fix" my current setup while upgrading the operator, without affecting the running cluster, so that the WAL archive is written to the correct location? Thanks!

vitobotta commented 3 years ago

@FxKu

I found out what's happening: when you set the bucket name in the operator config, WAL_BUCKET_SCOPE_PREFIX is set to an empty string and WAL_BUCKET_SCOPE_SUFFIX is set to "/{cluster-id}", so the path in the bucket is /spilo/{cluster-name}/{cluster-id}/wal as expected.

If you don't set the bucket name in the operator config but in the pod configmap, WAL_BUCKET_SCOPE_PREFIX is set to "{cluster-name}-{cluster-namespace}" (see this line in Spilo) and WAL_BUCKET_SCOPE_SUFFIX remains unset, so the final path is /spilo/{cluster-name}-{cluster-namespace}/wal, which is what I have now.
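
To compare the two cases side by side (a sketch of my understanding; the actual logic is in the Spilo script linked above):

# bucket name set in the operator config:
WAL_BUCKET_SCOPE_PREFIX=""
WAL_BUCKET_SCOPE_SUFFIX="/{cluster-id}"
# resulting path: /spilo/{cluster-name}/{cluster-id}/wal/...

# bucket name set only in the pod environment configmap:
WAL_BUCKET_SCOPE_PREFIX="{cluster-name}-{cluster-namespace}"
# WAL_BUCKET_SCOPE_SUFFIX stays unset
# resulting path: /spilo/{cluster-name}-{cluster-namespace}/wal/...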

I am pretty sure that in the past I got the expected behavior even with my usual configuration, and I found confirmation that this was changed 7 months ago - see this.

I am still testing what happens if I upgrade the operator with my existing configuration, and it seems that the cluster just continues to work as before. So I will just upgrade and leave things as they are in my production cluster to avoid downtime, keeping notes about this for the next time I install the operator in a new K8s cluster.

OlleLarsson commented 3 years ago

I ran into this problem as well. Maybe it could be clarified in the docs that it is recommended to set the WAL bucket through the operator config and not via a configmap/secret.